ARM Cortex-A72 ACP Deadlock During GDMA and Prefetch Read Operations
The ARM Cortex-A72 can deadlock when a Generic Direct Memory Access (GDMA) controller drives traffic through the Accelerator Coherency Port (ACP) while the core issues speculative prefetch reads. Two effects combine: GDMA backpressure, and prefetch reads from the A72 core that block arbitration for ACP writes. The sequence runs as follows. GDMA writes to the ACP are backpressured, which in turn backpressures GDMA reads. GDMA reads and A72 reads share one memory channel, so the stalled GDMA reads block the A72 reads. The ACP writes cannot be released until those A72 reads complete, and the GDMA writes cannot drain until the ACP writes are released. The dependency is circular, and the system deadlocks.
The A72 core’s speculative prefetch reads to non-CPU memory regions (for example, addresses 0xAE0 and 0x2B2) make the deadlock worse. These reads monopolize ACP arbitration and keep ACP writes off the memory channel. The program counter (PC) stays stuck at 0x51C7DC, which shows that the core cannot advance past a sequence of instructions involving speculative reads and arithmetic operations. The on-chip memory in this scenario is mapped with a Device attribute, which does not support cacheable operations, so the prefetch reads cannot be served from cache.
Memory Channel Contention and Prefetch Read Arbitration
The root cause of the deadlock lies in the interaction between the GDMA controller, the A72 core’s prefetch mechanism, and the ACP arbitration logic. The GDMA controller, when writing to the ACP port, can experience backpressure due to high memory traffic or contention on the shared memory channel. This backpressure propagates to GDMA read operations, which are then blocked. Since the A72 core shares the same memory channel for its read operations, these reads are also blocked. The blocked A72 reads prevent the ACP writes from being released, creating a deadlock.
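The dependency chain described above can be written down as a small wait-for graph. The following is a minimal sketch, not a model of the actual RTL: the agent names and the single "waits on" edge per agent are assumptions made for illustration. It only shows that the four dependencies close into a cycle, and that removing one edge (for example, letting the ACP write proceed) breaks it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Agents in the toy model; illustrative names, not RTL signal names. */
enum agent { GDMA_WRITE, GDMA_READ, A72_READ, ACP_WRITE, AGENT_COUNT };

/* waits_on[a] is the agent that must make progress before a can proceed,
 * or AGENT_COUNT if a is not blocked on anyone. */
const enum agent deadlock_waits_on[AGENT_COUNT] = {
    [GDMA_WRITE] = ACP_WRITE,   /* GDMA write needs the ACP write path */
    [GDMA_READ]  = GDMA_WRITE,  /* backpressure propagates to GDMA reads */
    [A72_READ]   = GDMA_READ,   /* shared memory channel is occupied */
    [ACP_WRITE]  = A72_READ,    /* writes are held until A72 reads finish */
};

/* Follow the wait-for chain from start; true if it leads back to start. */
bool has_wait_cycle(const enum agent waits_on[AGENT_COUNT], enum agent start)
{
    assert(start < AGENT_COUNT);
    enum agent cur = start;
    for (size_t hops = 0; hops < AGENT_COUNT; hops++) {
        cur = waits_on[cur];
        if (cur == AGENT_COUNT)
            return false;       /* chain ends at an unblocked agent */
        if (cur == start)
            return true;        /* came back around: circular wait */
    }
    return false;
}
```

Calling has_wait_cycle(deadlock_waits_on, ACP_WRITE) reports a cycle. Replacing the ACP_WRITE entry with AGENT_COUNT, which corresponds to either fix discussed later, makes every chain terminate.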
The A72 core’s speculative prefetch reads aggravate the situation. The core issues prefetch reads to non-CPU memory regions (for example, 0xAE0 and 0x2B2), which are not cacheable because the on-chip memory is mapped with a Device attribute. These reads continuously request arbitration on the ACP and keep ACP writes from being granted. The program counter (PC) is stuck at 0x51C7DC, which shows that the core cannot advance past a sequence of instructions involving speculative reads and arithmetic operations. The sequence includes, among others, the following instructions:
    add  x0, x0, #0xAE4
    ldr  w0, [x0]
    add  x1, x1, #0xAE0
    ldr  w1, [x1]
    udiv w2, w0, w1
The on-chip memory’s device attribute configuration plays a critical role in this deadlock scenario. The device attribute does not support cacheable operations, meaning that the A72 core’s prefetch reads cannot be cached. This results in continuous arbitration requests for these reads, further blocking ACP writes. The RTL analysis suggests two potential solutions: configuring the on-chip memory MMU attribute to support write-back, write-allocate, and read-allocate operations, or enabling the replay threshold timeout attribute in the L2 Auxiliary Control Register.
Configuring MMU Attributes and L2 Auxiliary Control for Deadlock Resolution
To resolve the A72 ACP deadlock, two primary solutions have been proposed based on RTL analysis. The first solution involves configuring the on-chip memory MMU attribute to support write-back, write-allocate, and read-allocate operations. This configuration allows the memory controller to generate a pass signal, which masks the prefetch reads and unblocks the ACP port write path. By making the instruction area cacheable, the A72 core’s prefetch reads can be cached, reducing the number of arbitration requests and allowing ACP writes to proceed.
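As an illustration of the first solution, the sketch below computes the values involved in switching a region from a Device attribute to Normal write-back, read-allocate, write-allocate memory. The two attribute encodings and the AttrIndx field position follow the Armv8-A MAIR_ELx and stage 1 descriptor formats. The helper names and the choice of attribute slot are hypothetical, and the code only builds the values; writing MAIR_EL1, updating the translation tables, and the required TLB and cache maintenance are platform-specific and not shown.

```c
#include <assert.h>
#include <stdint.h>

/* MAIR_ELx attribute encodings (Armv8-A). */
#define MAIR_ATTR_DEVICE_nGnRnE  0x00u  /* Device, non-cacheable */
#define MAIR_ATTR_NORMAL_WB_RWA  0xFFu  /* Normal, inner and outer write-back,
                                           read-allocate, write-allocate */

/* Return mair with the 8-bit attribute field at slot `index` (0..7) replaced. */
uint64_t mair_set_attr(uint64_t mair, unsigned index, uint8_t attr)
{
    assert(index < 8);
    unsigned shift = index * 8u;
    mair &= ~((uint64_t)0xFF << shift);
    mair |= (uint64_t)attr << shift;
    return mair;
}

/* Return a stage 1 block/page descriptor with AttrIndx (bits [4:2])
 * pointing at MAIR slot `index`. */
uint64_t desc_set_attr_index(uint64_t desc, unsigned index)
{
    assert(index < 8);
    desc &= ~((uint64_t)0x7 << 2);
    desc |= (uint64_t)index << 2;
    return desc;
}
```

For example, if slot 0 holds the Device attribute and slot 1 is free, mair_set_attr(mair, 1, MAIR_ATTR_NORMAL_WB_RWA) followed by desc_set_attr_index(desc, 1) on the descriptors covering the on-chip memory would remap that region as cacheable. Which slot is free depends on the existing MAIR layout of the system.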
The second solution sets bit [1] of the L2 Auxiliary Control Register (L2ACTLR_EL1) to enable the replay threshold timeout. This adds a timeout to arbitration: if an ACP write has not been granted within the threshold, the arbitration logic is forced to grant it, which unblocks the write path. This solution leaves the MMU attributes unchanged, but an ACP write may have to wait for the timeout to expire before it is granted.
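A read-modify-write of the register might look like the sketch below. The bit position is taken from the description above and should be confirmed against the Technical Reference Manual for the core revision in use. The function names are hypothetical. The system register encoding S3_1_C15_C0_0 is the one used for L2ACTLR_EL1 on the Cortex-A72; the hardware access is compiled only for AArch64 with a GNU-compatible compiler, and it assumes the code runs at an exception level that is permitted to write the register.

```c
#include <assert.h>
#include <stdint.h>

/* Replay threshold timeout enable, bit [1] per the text above. */
#define L2ACTLR_REPLAY_TIMEOUT_EN  (UINT64_C(1) << 1)
static_assert(L2ACTLR_REPLAY_TIMEOUT_EN == 0x2, "expected bit [1]");

/* Pure helper: new register value with the enable bit set and all
 * other bits left unchanged. */
uint64_t l2actlr_enable_replay_timeout(uint64_t current)
{
    return current | L2ACTLR_REPLAY_TIMEOUT_EN;
}

#if defined(__aarch64__) && defined(__GNUC__)
/* Apply the setting on hardware. Assumes sufficient privilege; the TRM
 * also restricts when this register may be written, so this is normally
 * done early in boot firmware. */
static inline void l2actlr_apply_replay_timeout(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, S3_1_C15_C0_0" : "=r"(v));
    v = l2actlr_enable_replay_timeout(v);
    __asm__ volatile("msr S3_1_C15_C0_0, %0\n\tisb" : : "r"(v) : "memory");
}
#endif
```

Keeping the bit manipulation in a separate pure function lets the value be checked on a host machine, while the privileged access stays isolated in one small routine.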
The design intention behind configuring the instruction area as cacheable is to reduce the number of arbitration requests generated by the A72 core’s prefetch reads. By caching these reads, the core can avoid continuously requesting arbitration for the same memory locations, allowing ACP writes to gain access to the memory channel. However, this solution may have side effects, such as increased cache usage and potential cache thrashing if the prefetch reads are not properly managed.
Enabling the replay threshold timeout attribute in the L2 Auxiliary Control Register introduces a delay in the arbitration process, which may impact the overall system performance. This solution is less intrusive than changing the MMU attributes but may not be suitable for systems with strict timing requirements. Additionally, the timeout value must be carefully configured to balance between avoiding deadlocks and maintaining system performance.
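The trade-off can be made concrete with a toy arbiter. This is a deliberately simplified sketch and not a description of the actual L2 arbitration logic: it assumes one grant per cycle, reads that always win while they are pending, and a timeout counted in lost arbitration cycles. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns the 1-based cycle at which a pending write is granted, or -1
 * if it is still starved after max_cycles. timeout == 0 models the
 * replay threshold timeout being disabled. */
int cycles_until_write_granted(bool reads_always_pending,
                               unsigned timeout, unsigned max_cycles)
{
    assert(max_cycles > 0);
    unsigned starved = 0;  /* consecutive cycles the write has lost */
    for (unsigned cycle = 1; cycle <= max_cycles; cycle++) {
        bool force_write = (timeout != 0) && (starved >= timeout);
        if (!reads_always_pending || force_write)
            return (int)cycle;
        starved++;         /* a read won arbitration this cycle */
    }
    return -1;
}
```

With reads always pending and the timeout disabled, the write is never granted, which corresponds to the deadlock. With a threshold of 4, the write is granted on cycle 5, so in this model the worst-case added latency grows linearly with the threshold. A small threshold recovers quickly but preempts reads more often; a large one does the opposite.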
In conclusion, the A72 ACP deadlock is a complex issue that arises from the interaction between the GDMA controller, the A72 core’s prefetch mechanism, and the ACP arbitration logic. The proposed solutions involve either configuring the on-chip memory MMU attributes to support cacheable operations or enabling the replay threshold timeout attribute in the L2 Auxiliary Control Register. Both solutions have their trade-offs and must be carefully evaluated based on the specific system requirements and performance constraints.