ARM Cortex-A35 128-bit Atomic Access Limitations
The ARM Cortex-A35, which implements the ARMv8.0-A architecture, is a highly efficient processor designed for low-power applications. One of its limitations is that ordinary load and store instructions cannot read or write a 128-bit value atomically between cores. The limitation comes from the ARMv8-A single-copy atomicity model, which guarantees atomicity only for naturally aligned accesses of up to 64 bits. Section B2.2 of the ARMv8-A Architecture Reference Manual, "Atomicity in the ARM architecture," treats a load or store pair of two 64-bit registers (LDP/STP) as two separate 64-bit single-copy atomic accesses. A plain 128-bit read or write to memory shared between cores is therefore not inherently atomic: without additional synchronization, another core can observe one updated half next to one stale half.
The Cortex-A35’s external cluster bus is 128 bits wide, which might suggest that 128-bit atomic accesses are possible. However, bus width does not translate into an atomicity guarantee at the architectural level. Atomicity is a property of the instruction set architecture (ISA) that the processor’s memory subsystem must enforce. The bus can carry 128 bits in one transfer, but ARMv8.0-A does not define any plain load or store instruction that is guaranteed to be single-copy atomic at 128 bits, so software cannot rely on the two halves being observed together.
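The effect of a torn access can be illustrated with a minimal C sketch. It is a single-threaded model, not a measurement on real hardware: the type pair128_t, the invariant hi == ~lo, and the helper names are hypothetical and exist only to show the window between the two 64-bit stores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit record; the writer's invariant is hi == ~lo. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} pair128_t;

static_assert(sizeof(pair128_t) == 16, "record must be exactly 128 bits");

/* True when both halves belong to the same logical update. */
static bool pair_is_consistent(const pair128_t *p)
{
    return p->hi == ~p->lo;
}

/* First half of a plain 128-bit update: one 64-bit store. */
static void write_low_half(pair128_t *p, uint64_t value)
{
    p->lo = value;
}

/* Second half of the same update: a separate 64-bit store. */
static void write_high_half(pair128_t *p, uint64_t value)
{
    p->hi = ~value;
}
```

A reader that samples the record after write_low_half() but before write_high_half() sees a value the writer never intended, which is exactly the state another core may observe between the two halves of a plain 128-bit store.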
Memory Subsystem Constraints and Cache Coherency Implications
The lack of plain 128-bit atomic accesses on the Cortex-A35 has to be understood together with the memory subsystem’s design and its cache coherency protocol. In a multi-core system, each core has its own L1 caches, and keeping them coherent is critical for correct behavior. The architecture does not mandate a particular protocol; implementations such as the Cortex-A35 typically use a MOESI-style (Modified, Owned, Exclusive, Shared, Invalid) protocol so that all cores see a consistent view of memory. Coherency works on whole cache lines, though, and only guarantees that cores agree on the contents of a line. It does not make two separate 64-bit accesses indivisible: ownership of the line can move to another core between the first and the second access.
For a plain 128-bit access to be atomic, the core would have to perform it as a single access to the cache line, and the architecture would have to guarantee that behavior. ARMv8.0-A gives no such guarantee, so software cannot depend on it on the Cortex-A35 even where an aligned access happens to complete in one piece. Disabling the cache for a specific memory region does not help either: the 64-bit single-copy atomicity limit is architectural and applies to Non-cacheable memory as well. Bypassing the cache removes the coherency traffic, but the 128-bit access is still treated as two 64-bit accesses.
Memory barriers do not close this gap. The ARMv8-A architecture provides the Data Memory Barrier (DMB) and Data Synchronization Barrier (DSB) instructions to enforce the ordering and completion of memory operations, but ordering is not atomicity: a barrier controls when the two 64-bit accesses become visible relative to other accesses, and it does not merge them into one indivisible access. Barriers and cache maintenance alone therefore cannot extend atomicity beyond 64 bits. A 128-bit atomic update must be built from the exclusive access instructions or from a lock.
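What barriers can do is order the two halves against a separate flag. The following C11 sketch assumes a hypothetical mailbox_t that is published once by one producer and polled by one consumer; the release store and acquire load stand in for the barrier instructions. It is safe only for that single hand-off, and it does not make repeated in-place updates of the same 128-bit value atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-shot mailbox: payload halves plus a ready flag. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
    atomic_bool ready;
} mailbox_t;

/* Producer: write both halves, then publish them with a release store
 * so that neither half can become visible after the flag. */
static void mailbox_publish(mailbox_t *m, uint64_t lo, uint64_t hi)
{
    m->lo = lo;
    m->hi = hi;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer: read the halves only after an acquire load saw the flag. */
static bool mailbox_try_read(mailbox_t *m, uint64_t *lo, uint64_t *hi)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;               /* nothing published yet */
    *lo = m->lo;
    *hi = m->hi;
    return true;
}
```

If the producer later overwrote lo and hi in place, the consumer could again read one new half and one old half, which is why ordering alone is not enough for a value that changes repeatedly.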
Implementing 128-bit Atomicity with Exclusive Monitors and Retry Loops
Given these limitations, 128-bit atomicity on the Cortex-A35 has to be built in software on top of the ARMv8-A exclusive monitor mechanism. The mechanism pairs a load-exclusive instruction with a store-exclusive instruction: the load marks the location in the monitor, and the store succeeds only if no other observer has written the location in between. LDXR and STXR operate on a single register, while the pair forms LDXP and STXP operate on two 64-bit registers and therefore cover a 128-bit location. Wrapped in a retry loop, the pair forms provide atomic 128-bit updates.
The basic idea is to access both 64-bit halves with a single exclusive pair rather than with two separate exclusive accesses. Two independent LDXR/STXR sequences cannot work, because each core’s local monitor tracks only one marked location at a time and a store-exclusive clears the monitor whether or not it succeeds. Instead, the producer core executes one LDXP that loads both halves and marks the 128-bit location, computes the new value, and then executes one STXP that attempts to store both halves. If the STXP reports failure, the whole sequence is retried from the LDXP. When the STXP succeeds, the architecture treats the update of the entire 128-bit location as single-copy atomic.
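The consumer side needs more than a load. A load-exclusive never fails, and an LDXP on its own is only atomic for each 64-bit half, so a completed load does not prove that the two halves belong together. To read a consistent value, the consumer core executes LDXP and then uses STXP to write the same two values back to the location. If the STXP succeeds, no other core wrote the location in between and the loaded pair is a consistent 128-bit snapshot; if it fails, the consumer retries. As a consequence, the consumer needs write permission to the location, and readers contend with writers for the cache line.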
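The producer and consumer loops described here can be sketched in C with GCC/Clang inline assembly. This is a sketch under stated assumptions rather than a reference implementation: pair128_t, pair128_store and pair128_load are hypothetical names, the acquire/release forms LDAXP/STLXP are chosen so that no separate barrier is needed, and the spinlock branch is an assumption added only so that the code also builds on hosts that are not AArch64.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared slot; a 64-bit exclusive pair needs 16-byte alignment. */
typedef struct {
    alignas(16) uint64_t lo;
    uint64_t hi;
} pair128_t;

#if !defined(__aarch64__)
/* Assumption: off-target builds serialize through one global spinlock. */
static atomic_flag pair128_lock = ATOMIC_FLAG_INIT;
#endif

/* Producer: replace both halves as one atomic 128-bit update. */
static void pair128_store(pair128_t *p, uint64_t lo, uint64_t hi)
{
#if defined(__aarch64__)
    uint64_t old_lo, old_hi;
    uint32_t failed;
    do {
        __asm__ volatile(
            "ldaxp %0, %1, %3\n\t"     /* load pair, mark the location  */
            "stlxp %w2, %4, %5, %3"    /* %w2 is 0 if the pair stored   */
            : "=&r"(old_lo), "=&r"(old_hi), "=&r"(failed), "+Q"(p->lo)
            : "r"(lo), "r"(hi)
            : "memory");
    } while (failed);                  /* reservation lost: retry */
    (void)old_lo;
    (void)old_hi;
#else
    while (atomic_flag_test_and_set_explicit(&pair128_lock, memory_order_acquire))
        ;                              /* spin until the lock is free */
    p->lo = lo;
    p->hi = hi;
    atomic_flag_clear_explicit(&pair128_lock, memory_order_release);
#endif
}

/* Consumer: read both halves as one consistent 128-bit snapshot. */
static pair128_t pair128_load(pair128_t *p)
{
    uint64_t lo, hi;
#if defined(__aarch64__)
    uint32_t failed;
    do {
        __asm__ volatile(
            "ldaxp %0, %1, %3\n\t"     /* load pair, mark the location  */
            "stlxp %w2, %0, %1, %3"    /* write the same pair back      */
            : "=&r"(lo), "=&r"(hi), "=&r"(failed), "+Q"(p->lo)
            :
            : "memory");
    } while (failed);                  /* a writer intervened: retry */
#else
    while (atomic_flag_test_and_set_explicit(&pair128_lock, memory_order_acquire))
        ;
    lo = p->lo;
    hi = p->hi;
    atomic_flag_clear_explicit(&pair128_lock, memory_order_release);
#endif
    pair128_t out = { lo, hi };
    return out;
}
```

A producer would call pair128_store(&slot, new_lo, new_hi) and a consumer pair128_load(&slot) on the same slot. Because the load path ends in a store-exclusive, the slot must live in writable memory even for cores that only read it.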
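The retry loop adds some overhead, but it provides a way to achieve 128-bit atomicity on the Cortex-A35. Two conditions must hold. First, the location must be mapped with memory attributes for which exclusive accesses are guaranteed to work, which in practice means Normal, Shareable, Write-Back cacheable memory. Second, the update must be ordered against the surrounding memory operations, either with the acquire and release forms of the exclusive pair instructions (LDAXP and STLXP) or with explicit DMB instructions. In addition, the location must be aligned to a 128-bit (16-byte) boundary. This is a requirement, not a tuning hint: an exclusive pair of 64-bit registers at a misaligned address generates an alignment fault.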
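The alignment requirement can be expressed at compile time and checked at run time. The sketch below assumes a C11 compiler; aligned_pair128_t and is_aligned_16 are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit slot forced onto a 16-byte boundary. */
typedef struct {
    alignas(16) uint64_t lo;
    uint64_t hi;
} aligned_pair128_t;

/* Fail the build if the layout ever stops matching the requirement. */
static_assert(sizeof(aligned_pair128_t) == 16, "slot must be 128 bits");
static_assert(alignof(aligned_pair128_t) == 16, "slot must be 16-byte aligned");

/* Run-time check for pointers that arrive from an allocator or a cast. */
static bool is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

The static assertions cover objects declared with this type, while the run-time check is useful for memory obtained from an allocator or a device mapping, where the compiler cannot see the alignment.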
Another consideration is how cache coherency affects the performance of the retry loop. When the memory region is cached, a successful store-exclusive requires the core to own the cache line, so the coherency hardware must invalidate the copies held by other cores. Under heavy contention the line moves between cores repeatedly, more store-exclusives fail, and latency grows with the number of retries. Disabling caching for the region is sometimes suggested as a remedy, but it increases memory access latency, and exclusive accesses to Non-cacheable memory work only if the system provides a global exclusive monitor for that address range, which depends on the interconnect and memory controller.
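One way to limit avoidable retries is to give the 128-bit slot a cache line of its own, so that writes to neighbouring data do not disturb the reservation held on it. The sketch below assumes a 64-byte granule, which is an assumption chosen to match a 64-byte cache line; the actual Exclusives Reservation Granule is implementation defined and should be checked for the target. The names ISOLATION_BYTES and isolated_pair128_t are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte granule; verify this value for the real target. */
#define ISOLATION_BYTES 64

/* Hypothetical slot that occupies a whole granule by itself. */
typedef struct {
    alignas(ISOLATION_BYTES) uint64_t lo;
    uint64_t hi;
    /* The compiler pads the rest of the granule; nothing else lives here. */
} isolated_pair128_t;

static_assert(sizeof(isolated_pair128_t) == ISOLATION_BYTES,
              "slot must fill exactly one granule");
static_assert(offsetof(isolated_pair128_t, hi) == 8,
              "halves must stay adjacent for the exclusive pair");
```

The cost is 48 bytes of padding per slot, which is usually acceptable for a small number of heavily shared values.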
In summary, the ARM Cortex-A35 does not support 128-bit atomic accesses with plain loads and stores, but a workaround is available using the exclusive monitors and a retry loop built on the exclusive pair instructions. The approach requires correct memory attributes, 16-byte alignment, and attention to contention, and in return it provides a viable solution for applications that need 128-bit atomicity. By understanding the architectural limits and using the hardware features that are available, developers can achieve the required level of atomicity on the Cortex-A35.