ARMv8 PMU Cycle Counter Instability During Secure-EL1 Code Execution
The Performance Monitoring Unit (PMU) in the ARMv8 architecture is a key tool for measuring the performance of code, particularly in low-level environments such as Secure-EL1. However, when the PMU cycle counter (PMCCNTR_EL0) is used to time a simple loop in Secure-EL1 on an ARMv8 big.LITTLE system, the readings are unstable and inconsistent. They range from tens to thousands of cycles for the same loop, even though the code runs in an atomic context with interrupts disabled. This behavior is particularly puzzling because the instruction counter remains stable, which suggests that the loop executes the same instructions each time and that the variation lies in how long they take, or in the cycle counter and its underlying mechanisms.
The problem is compounded by the fact that the Generic Timer physical counter (CNTPCT_EL0) shows similar instability, which indicates that the issue is probably not isolated to the PMU but tied to broader system behavior. The system in question runs Android alongside a secure OS, with CPU caches enabled and other cores active, all of which complicates the analysis. Finding the root cause requires looking at the behavior of the PMU and the Generic Timer, and at the interaction between the secure and non-secure worlds in a multi-core environment.
Potential Causes: Cache Effects, Interrupts, and CPU Throttling
Several factors could contribute to the observed instability in the PMU cycle counter and timer readings. One potential cause is the effect of CPU caches. Even though the code executes in an atomic context with interrupts disabled, enabled caches make the cycle count depend on whether each instruction fetch and data access hits or misses. If the loop's code or data is not in the cache when the measurement starts, or is evicted between runs, the extra cycles needed to fetch it from a lower cache level or from main memory raise the count. If everything stays resident, the count is lower. The variability is more pronounced in a multi-core system, where other cores compete for shared cache capacity and generate coherency traffic on lines the loop touches.
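One way to tell cold-cache runs apart from steady-state runs is to repeat the measurement and look at the distribution instead of a single reading. The following is a minimal sketch under that assumption; the names `sample_summary` and `summarize_samples` are hypothetical and do not come from any ARM or secure OS API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical summary of repeated cycle-count samples. */
struct sample_summary {
    uint64_t min;
    uint64_t median;
    uint64_t max;
};

/* qsort comparator for uint64_t values (ascending). */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns their min, median and max. */
static struct sample_summary summarize_samples(uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);

    struct sample_summary s = {
        .min = samples[0],
        .median = samples[n / 2], /* upper median when n is even */
        .max = samples[n - 1],
    };
    return s;
}
```

A minimum and median that sit close together, with only the maximum far away, would point toward occasional disturbances such as cache misses rather than a fundamentally unstable counter.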
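Another potential cause is interrupt activity that the Secure-EL1 code cannot mask. Setting the PSTATE interrupt mask bits at Secure-EL1 only blocks interrupts that are taken to Secure-EL1 itself. An interrupt that is routed to EL3 (for example, a non-secure interrupt that arrives while the core is in the secure world, with SCR_EL3 configured to route it to the monitor) is taken regardless of those mask bits, because the masks do not apply to exceptions that target a higher exception level. Each such preemption adds the cost of EL3 entry, handling, and return to the measured interval, and the handler can also displace the loop's code and data from the caches. Trapped register accesses are a further possible source of overhead. The architecture allows reads of CNTPCT_EL0 to be trapped to a higher exception level: CNTKCTL_EL1.EL0PCTEN controls accesses from EL0, and CNTHCTL_EL2.EL1PCTEN controls accesses from EL0 and EL1 where EL2 is implemented and enabled for the current security state. If such a trap applies on this platform, the handler's latency would be included in every timer reading.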
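Because a preemption by higher-privileged firmware typically costs far more than the loop itself, interrupted runs tend to show up as large outliers above an otherwise tight baseline. The sketch below counts such outliers; the function name `count_suspect_samples` and the slack threshold are illustrative assumptions, and the right threshold depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Counts samples that exceed the smallest sample by more than slack_cycles.
 * The smallest sample is treated as the undisturbed baseline.
 */
static size_t count_suspect_samples(const uint64_t *samples, size_t n,
                                    uint64_t slack_cycles)
{
    assert(samples != NULL && n > 0);

    /* Find the baseline (minimum) first. */
    uint64_t baseline = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < baseline)
            baseline = samples[i];
    }

    /* Count samples that sit well above the baseline. */
    size_t suspects = 0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] - baseline > slack_cycles)
            suspects++;
    }
    return suspects;
}
```

If the number of suspect samples rises and falls with non-secure interrupt load (for example, heavy I/O on the Android side), that would support the interrupt hypothesis over the cache hypothesis.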
CPU throttling and core selection are further factors. In a big.LITTLE architecture, the operating system may dynamically move work between high-performance (big) and power-efficient (LITTLE) cores based on workload and thermal conditions. With interrupts masked the loop cannot migrate in the middle of a measurement, but successive measurements may run on different core types, and big and LITTLE cores need different numbers of cycles for the same instructions because their microarchitectures differ. Dynamic voltage and frequency scaling (DVFS) affects the two counters differently. CNTPCT_EL0 advances at the fixed system counter frequency, so the number of ticks that elapse while the loop runs changes directly with the core’s clock frequency. PMCCNTR_EL0 counts core clock cycles, so a frequency change mainly alters its reading through memory accesses, whose latency is roughly constant in time and therefore varies when expressed in core cycles.
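The effect of frequency scaling on timer readings can be made concrete with a little arithmetic. CNTPCT_EL0 advances at the fixed frequency reported in CNTFRQ_EL0, while the loop runs at whatever core clock is currently selected, so the same number of core cycles maps to a different number of counter ticks at each operating point. The sketch below assumes a 24 MHz system counter and core clocks of 2 GHz and 500 MHz purely for illustration; `expected_counter_ticks` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Expected CNTPCT_EL0 delta for a region that takes core_cycles core clock
 * cycles at cpu_hz, with the system counter running at cntfrq_hz.
 * The product core_cycles * cntfrq_hz must fit in 64 bits.
 */
static uint64_t expected_counter_ticks(uint64_t core_cycles, uint64_t cpu_hz,
                                       uint64_t cntfrq_hz)
{
    assert(cpu_hz > 0);
    return (core_cycles * cntfrq_hz) / cpu_hz;
}
```

With these assumed values, a region of 10,000 core cycles corresponds to 120 counter ticks at 2 GHz and 480 ticks at 500 MHz, a fourfold difference in the timer reading with no change in the work performed.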
Resolving Cycle Counter Instability: Cache Management, Interrupt Isolation, and Throttling Mitigation
To address the instability in the PMU cycle counter and timer readings, the potential causes need to be isolated and mitigated one at a time. The first step is proper cache management. Disabling the caches entirely is not a practical solution for performance measurement, because the added memory latency would distort the results. Instead, the aim is to minimize cache contention and keep the loop's code and data resident in the cache for the whole measurement. This can be done by running the loop once, untimed, before the measurement so that its code and data are already cached, and by keeping the working set small enough to stay resident. In addition, each counter read should be bracketed with an instruction synchronization barrier (ISB), because system register reads can otherwise be executed out of order relative to the surrounding instructions, so that the interval actually measured does not match the intended region.
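The warm-up and serialization described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: it assumes the cycle counter has already been enabled (PMCR_EL0.E and PMCNTENSET_EL0.C set) and is readable at the current exception level. The macro `MEASURE_AT_EL1` is hypothetical and selects the real register read for privileged AArch64 builds; other builds get a software stand-in so that the sketch still compiles. The names `read_cycle_counter`, `measure_min_cycles` and `sample_loop` are likewise illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads the PMU cycle counter with an ISB on each side of the read. */
static inline uint64_t read_cycle_counter(void)
{
#if defined(__aarch64__) && defined(MEASURE_AT_EL1)
    uint64_t v;
    __asm__ volatile("isb\n\t"
                     "mrs %0, pmccntr_el0\n\t"
                     "isb"
                     : "=r"(v)
                     :
                     : "memory");
    return v;
#else
    /* Stand-in for hosted or non-AArch64 builds: advances 100 per read. */
    static uint64_t fake;
    return fake += 100;
#endif
}

typedef void (*region_fn)(void);

/* Example region to time: a short loop the compiler cannot remove. */
static void sample_loop(void)
{
    for (volatile unsigned i = 0; i < 64; i++) {
        /* empty body; the volatile counter keeps the loop alive */
    }
}

/*
 * Runs fn once untimed to warm caches and predictors, then times it
 * `runs` times and returns the smallest observed cycle delta.
 */
static uint64_t measure_min_cycles(region_fn fn, unsigned runs)
{
    assert(fn != NULL && runs > 0);

    fn(); /* warm-up pass, not timed */

    uint64_t best = UINT64_MAX;
    for (unsigned i = 0; i < runs; i++) {
        uint64_t start = read_cycle_counter();
        fn();
        uint64_t end = read_cycle_counter();
        uint64_t delta = end - start;
        if (delta < best)
            best = delta;
    }
    return best;
}
```

Taking the minimum over several runs is a deliberate choice here: disturbances such as cache misses or preemption can only add cycles, so the smallest delta is the closest available estimate of the undisturbed cost.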
The second step is to isolate the measurement from non-secure interrupts. Although interrupts are masked at Secure-EL1, non-secure interrupts can still preempt the secure world through EL3 and introduce jitter. To mitigate this, the GIC (Generic Interrupt Controller) configuration and the EL3 interrupt routing should be examined to determine which interrupts can still reach the measuring core during the measurement period. Where the platform allows it, interrupts should be steered away from that core, and the EL3 firmware should be configured to minimize the cost of handling non-secure interrupts or, ideally, to defer them until the measurement is complete.
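The third step is to address CPU throttling and frequency scaling. In a big.LITTLE architecture, the measurement should always run on the same core, or at least the same core type (e.g., a big core), so that results from different microarchitectures are not mixed. In practice this usually means pinning the non-secure thread that invokes the secure code. The CPU frequency should also be held constant during the measurement so that DVFS cannot change the readings. This can be achieved by disabling DVFS or by selecting a fixed operating point before starting the measurement, for example through a fixed-frequency governor setting on the Android side. Thermal throttling can still override such a setting, so the device temperature should be kept under control during long measurement runs.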
Finally, the measurement setup should be validated by running the same loop in a bare-metal environment, where the influence of the operating system and other software layers is minimized. This removes many of the variables introduced by a complex software stack and provides a more controlled baseline. If the cycle counter readings stabilize in the bare-metal environment, the instability can be attributed to interactions with the operating system, the secure OS, or the firmware. If the instability persists, the problem is more likely related to hardware or low-level firmware behavior and may require further investigation and vendor support.
By systematically addressing cache effects, interrupt isolation, and CPU throttling, it should be possible to obtain substantially more stable cycle counter readings in Secure-EL1 on an ARMv8 big.LITTLE system, or at least to identify which factor dominates the remaining variation. The same approach provides a framework for performance measurement in other complex, multi-core environments.