ARMv8 Memory Ordering and Speculative Store Issues in Lock Acquisition Code


The core issue is how the ARMv8 memory ordering model treats speculative stores in lock acquisition code. The concern arises from the example in the ARM Architecture Reference Manual (ARM DDI 0487D.a), section K11.3.1, which acquires a lock using the LDAEX and STREXEQ instructions. The worry is that a store placed after the BNE instruction might be performed speculatively, ahead of the STREXEQ, so that another observer sees the store before the lock has actually been acquired. If that were possible, it would open a data race on the protected data in multi-threaded or multi-core environments.

The ARMv8 architecture employs a weakly-ordered memory model, which means that the processor may reorder memory operations to improve performance. This reordering must not violate the semantics of the program, especially in critical sections where locks enforce mutual exclusion. The example code uses LDAEX (Load-Acquire Exclusive) and STREXEQ (a Store-Exclusive, STREX, executed only when the EQ condition holds) to implement a spinlock. LDAEX loads the value of the lock into a register and tags the address in the exclusive monitor of the current processor. STREXEQ then attempts to store the new value to the lock location; the store takes effect only if the monitor still holds that tag, and the instruction writes 0 to its status register on success and 1 on failure. If the store succeeds, the lock is acquired; otherwise, the code loops back and retries.
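The load-exclusive/store-exclusive handshake described above can be illustrated with a toy, single-threaded model in C. This is only a sketch under simplifying assumptions: the `excl_monitor` type and the `model_*` functions are hypothetical names invented for this illustration, and the model tracks one tagged address per observer while ignoring details of real monitors such as reservation granules.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one observer's exclusive monitor (hypothetical type). */
typedef struct {
    const uint32_t *tagged; /* address tagged by the last load-exclusive, or NULL */
} excl_monitor;

/* Models LDAEX: read the location and tag it for exclusive access. */
uint32_t model_load_exclusive(excl_monitor *m, const uint32_t *addr)
{
    m->tagged = addr;
    return *addr;
}

/* Models a store by another observer: it clears a tag on the same address. */
void model_remote_store(excl_monitor *m, uint32_t *addr, uint32_t value)
{
    *addr = value;
    if (m->tagged == addr)
        m->tagged = NULL;
}

/* Models STREX: returns 0 and writes memory on success,
 * returns 1 and leaves memory untouched on failure.
 * The tag is cleared in both cases. */
uint32_t model_store_exclusive(excl_monitor *m, uint32_t *addr, uint32_t value)
{
    bool still_exclusive = (m->tagged == addr);
    m->tagged = NULL;
    if (still_exclusive)
        *addr = value;
    return still_exclusive ? 0u : 1u;
}
```

In this model, a store-exclusive that follows an undisturbed load-exclusive returns 0, while one that follows an intervening `model_remote_store` returns 1 and leaves the other observer's value in place, which is the case in which the assembly loop retries.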

The concern is that the ARMv8 memory model might allow speculative execution to proceed past the BNE instruction, potentially executing stores that are meant to be within the critical section before the lock is actually acquired. This could lead to a situation where another processor or thread observes these speculative stores, violating the mutual exclusion property that the lock is supposed to enforce.
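For comparison, the same acquisition pattern can be written with C11 atomics. The sketch below is not the manual's code: `demo_lock` and the `demo_*` functions are hypothetical names, and the exact instructions a compiler emits for them on ARMv8 depend on the toolchain and target options. The point it illustrates is that the acquire ordering on the successful exchange is what forbids critical-section accesses from being observed before the lock is taken.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = free, 1 = held (the value tested by CMP R5, #0). */
typedef struct {
    atomic_uint state;
} demo_lock;

/* One acquisition attempt. It succeeds only if the lock was observed as 0
 * and no other observer changed it before the new value was stored. */
bool demo_try_lock(demo_lock *l)
{
    unsigned expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->state, &expected, 1u,
        memory_order_acquire,  /* ordering applied when the exchange succeeds */
        memory_order_relaxed); /* ordering applied when it fails */
}

/* Spin until the lock is taken, like the BNE Loop retry. */
void demo_lock_acquire(demo_lock *l)
{
    while (!demo_try_lock(l)) {
        /* retry */
    }
}

/* Free the lock with release ordering so that critical-section
 * accesses are observed before the lock appears free. */
void demo_unlock(demo_lock *l)
{
    atomic_store_explicit(&l->state, 0u, memory_order_release);
}
```

A second `demo_try_lock` on a held lock returns false without modifying it, and the lock can be taken again after `demo_unlock`.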

Memory Barrier Omission and Cache Invalidation Timing

The root cause of this issue lies in the interaction between speculative execution and the ARMv8 memory ordering model. Speculative execution is a technique used by modern processors to execute instructions ahead of time, based on predictions about the program’s control flow. While speculative execution can significantly improve performance, it can also lead to subtle issues if not properly managed, especially in multi-threaded or multi-core environments.

In the context of the lock acquisition code, the processor might speculatively execute instructions that follow the BNE instruction, even before the condition of the branch is resolved. This speculative execution could include stores to memory locations that are meant to be protected by the lock. If these speculative stores are propagated to other observers (e.g., other processors or threads) before the branch condition is resolved, it could lead to a violation of the mutual exclusion property.

The ARMv8 architecture provides mechanisms to control speculative execution and ensure that memory operations are properly ordered. However, the documentation in the ARM Architecture Reference Manual does not explicitly state whether speculative stores are allowed to be observed by other observers before the branch condition is resolved. This lack of clarity has led to concerns about the correctness of the lock acquisition code, especially in safety-critical systems where incorrect behavior could have severe consequences.

One possible cause of this issue is the omission of memory barriers or other synchronization primitives that would prevent stores in the critical section from being observed before the lock is acquired. Memory barriers, such as the DMB (Data Memory Barrier) instruction, enforce ordering constraints on memory operations. A DMB inserted after the STREXEQ instruction requires that memory accesses before the barrier are observed before memory accesses after it. Note that DMB constrains ordering only; it does not stall the processor until earlier accesses complete, which is what DSB (Data Synchronization Barrier) does. With the barrier in place, a critical-section store cannot become visible ahead of the store that takes the lock.

Another potential cause is the timing of cache invalidation. In a multi-core system, each processor has its own cache, and changes to memory must be propagated to all caches to ensure consistency. If a speculative store is executed and propagated to the cache before the lock is acquired, it could be observed by another processor, leading to incorrect behavior. Proper cache management, including the use of cache invalidation instructions, can help prevent this issue by ensuring that changes to memory are not visible to other processors until the lock is acquired.

Implementing Data Memory Barriers and Cache Management

To address the issue of speculative stores in the lock acquisition code, several steps can be taken to ensure that the critical section is properly protected. These steps involve data memory barriers, cache management techniques, and careful consideration of the ARMv8 memory ordering model.

Data Memory Barriers

The first step is to insert a DMB (Data Memory Barrier) instruction after the STREXEQ instruction. A DMB ensures that memory accesses appearing before the barrier in program order are observed before memory accesses appearing after it. It orders accesses rather than waiting for them to complete, which is the role of DSB (Data Synchronization Barrier). Placed after the STREXEQ, the barrier keeps critical-section stores from being observed ahead of the store that takes the lock.

The modified lock acquisition code would look like this:

AArch32 Px
    PLDW [R1]                ; preload into cache in unique state
Loop
    LDAEX R5, [R1]           ; read lock with acquire
    CMP R5, #0               ; check if 0
    STREXEQ R5, R0, [R1]     ; attempt to store new value
    DMB                      ; data memory barrier
    CMPEQ R5, #0             ; test if store succeeded
    BNE Loop                 ; retry if not
    ; loads and stores in the critical region can now be performed

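The DMB instruction ensures that the store to the lock location is observed before any memory access that follows the barrier in program order. Because DMB does not modify the condition flags, the CMPEQ and BNE that follow it still test the result of the STREXEQ. In this placement the barrier executes on every pass through the loop, including failed attempts; moving it after the BNE would apply it only once the lock has been taken.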
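The barrier-based variant can also be sketched in C11, where `atomic_thread_fence(memory_order_seq_cst)` plays the role of a full DMB. The names `lock_word`, `protected_value`, `try_lock_with_fence`, `unlock_with_fence`, and `locked_store` are hypothetical, and the fence here is issued only on the success path, which corresponds to placing the DMB after the BNE rather than inside the loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint lock_word;    /* hypothetical lock: 0 = free, 1 = held */
static unsigned protected_value; /* data guarded by the lock */

/* Take the lock with a relaxed exchange followed by an explicit full fence. */
bool try_lock_with_fence(void)
{
    unsigned expected = 0;
    if (!atomic_compare_exchange_strong_explicit(&lock_word, &expected, 1u,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed))
        return false;
    /* Accesses after this fence are ordered after the lock update. */
    atomic_thread_fence(memory_order_seq_cst);
    return true;
}

/* Free the lock: fence first, so critical-section accesses are
 * ordered before the store that marks the lock free. */
void unlock_with_fence(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&lock_word, 0u, memory_order_relaxed);
}

/* Example critical section: update the guarded data only while holding the lock. */
bool locked_store(unsigned value)
{
    if (!try_lock_with_fence())
        return false;
    protected_value = value;
    unlock_with_fence();
    return true;
}
```

Compared with an acquire exchange, the explicit fence is a stronger and usually more expensive ordering, so this form mainly serves to mirror the DMB-based listing.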

Cache Management

In addition to using memory barriers, cache management can be applied so that speculative stores are not observed by other processors before the lock is acquired. This involves cache invalidation instructions such as DC IVAC (Data Cache Invalidate by Virtual Address to PoC). DC IVAC is the AArch64 name for this operation; in AArch32 state the equivalent operation is DCIMVAC, issued through an MCR to the system control coprocessor (p15), and it is available only to privileged code.

The cache invalidation instruction should be executed after the STREXEQ instruction and before the DMB instruction. This ensures that any speculative stores are invalidated in the cache before the memory barrier is executed, preventing them from being observed by other processors.

The modified lock acquisition code with cache invalidation would look like this:

AArch32 Px
    PLDW [R1]                ; preload into cache in unique state
Loop
    LDAEX R5, [R1]           ; read lock with acquire
    CMP R5, #0               ; check if 0
    STREXEQ R5, R0, [R1]     ; attempt to store new value
    MCR p15, 0, R1, c7, c6, 1 ; DCIMVAC (AArch32 form of DC IVAC): invalidate cache line, privileged only
    DMB                      ; data memory barrier
    CMPEQ R5, #0             ; test if store succeeded
    BNE Loop                 ; retry if not
    ; loads and stores in the critical region can now be performed

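The invalidate operation (DC IVAC, or DCIMVAC in AArch32 state) acts on the whole cache line containing the lock location, and the DMB that follows orders it against subsequent memory operations. Two points need care. An invalidate, unlike a clean and invalidate (DCCIMVAC), may discard dirty data in that line, which can include the newly written lock value if it has not yet reached memory. Also, any other data sharing the line with the lock is affected in the same way.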
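Because maintenance by address works at cache-line granularity, it helps to see which lines a given object touches. The C sketch below only computes the affected line addresses and hands each one to a callback; it performs no cache maintenance itself. The 64-byte line size (`ASSUMED_LINE_SIZE`), the `line_op` callback type, and the function names are assumptions made for illustration; on real hardware the line size has to be read from the cache type registers.

```c
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_LINE_SIZE 64u /* assumption: must be a power of two */

/* Hypothetical callback invoked once per affected cache line. */
typedef void (*line_op)(uintptr_t line_addr, void *ctx);

/* Stand-in operation that just counts lines; a real one would issue
 * the maintenance instruction for line_addr. */
void count_line(uintptr_t line_addr, void *ctx)
{
    (void)line_addr;
    ++*(size_t *)ctx;
}

/* Call op for every cache line overlapped by [addr, addr + len).
 * Returns the number of lines visited. */
size_t for_each_cache_line(uintptr_t addr, size_t len, line_op op, void *ctx)
{
    if (len == 0)
        return 0;

    uintptr_t mask = (uintptr_t)ASSUMED_LINE_SIZE - 1u;
    uintptr_t first = addr & ~mask;             /* line holding the first byte */
    uintptr_t last = (addr + len - 1u) & ~mask; /* line holding the last byte */
    size_t visited = 0;

    for (uintptr_t line = first;; line += ASSUMED_LINE_SIZE) {
        op(line, ctx);
        visited++;
        if (line == last)
            break;
    }
    return visited;
}
```

Under these assumptions a 4-byte lock word at 0x1000 touches a single line, while an 8-byte object starting at 0x103C straddles two, so maintaining it would also affect unrelated data in both lines.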

ARMv8 Memory Ordering Model

Finally, these measures need to be read against the ARMv8 memory ordering model itself. Because the model is weakly ordered, a lock implementation must rely on the ordering rules the architecture defines explicitly rather than on program order.

The ARMv8 memory model provides several mechanisms to control memory ordering, including acquire and release semantics. LDAEX (Load-Acquire Exclusive) has acquire semantics: memory accesses that follow it in program order cannot be observed before the load itself. It places no constraint on earlier accesses. STREXEQ is a plain Store-Exclusive and has no release semantics; the release counterpart is STLEX (Store-Release Exclusive), or STL for an ordinary store, which ensures that all earlier accesses are observed before the store. In a lock, acquire semantics are used when taking the lock and release semantics when freeing it.
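The acquire and release pairing can be shown with a small C11 message-passing sketch. The names `payload`, `ready`, `publish`, and `try_consume` are hypothetical; the release store stands in for a store-release such as STL, and the acquire load for a load-acquire such as LDA.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;       /* ordinary, non-atomic data */
static atomic_bool ready; /* flag that publishes the data */

/* Writer: the release store orders the payload write before the flag,
 * so an observer that sees the flag also sees the payload. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: the acquire load orders the payload read after the flag read. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

If the flag were written or read with relaxed ordering instead, a weakly-ordered machine would be allowed to let the reader see the flag set while still reading a stale payload.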

By using these mechanisms, along with memory barriers and cache management techniques where they apply, it is possible to ensure that the lock acquisition code is correct and that speculative stores are not observed by other processors before the lock is acquired.

Conclusion

In conclusion, the issue of speculative stores in the ARMv8 lock acquisition code can be addressed through memory barriers, cache management techniques, and careful consideration of the ARMv8 memory ordering model. Inserting a DMB instruction after the STREXEQ instruction orders critical-section accesses after the store that takes the lock, and cache maintenance on the line containing the lock location can be added where other observers require it. Together these measures keep the critical section protected and preserve the mutual exclusion property of the lock.

In safety-critical systems, where incorrect behavior could have severe consequences, it is essential to carefully consider the implications of speculative execution and to use the appropriate mechanisms to ensure that memory operations are properly ordered. By following the steps outlined in this guide, developers can ensure that their lock acquisition code is correct and that their systems are reliable and safe.
