Understanding Cortex-A72 Prefetchers and Their Impact on Performance
The Cortex-A72 processor, a high-performance ARM core, employs sophisticated prefetching mechanisms to optimize memory access patterns and improve overall system performance. These prefetchers, which include both L1 and L2 cache prefetchers, predict future memory accesses based on historical patterns and preload data into the cache to reduce latency. While these mechanisms are generally beneficial, there are scenarios where disabling them is necessary, such as during performance benchmarking, debugging, or when dealing with specific workloads that do not benefit from prefetching.
The L1 cache prefetcher in the Cortex-A72 is responsible for predicting and fetching data into the Level 1 cache, while the L2 cache prefetcher operates at a higher level, targeting the Level 2 cache. Both prefetchers are designed to work in tandem, but their behavior can be controlled through specific system registers. Understanding how to manipulate these registers is crucial for developers who need to fine-tune the performance of their systems.
The Cortex-A72 also includes a load/store hardware prefetcher, which generates prefetches targeting both the L1D cache and L2 cache. This prefetcher uses a hybrid mechanism based on both physical-address (PA) and virtual-address (VA) prefetching, allowing it to adapt to different memory access patterns. That adaptability does not suit every workload: irregular or pointer-chasing access patterns can trigger prefetches that are never used, which wastes bandwidth and cache capacity. In those cases it can be worth disabling or reconfiguring the prefetchers.
System Register Access and Privilege Levels in Cortex-A72
To disable or configure the prefetchers in the Cortex-A72, developers must interact with specific system registers. The primary instructions for accessing these registers in the A64 instruction set are MRS, which reads a system register into a general-purpose register, and MSR, which writes a general-purpose register to a system register. Using them correctly requires an understanding of the privilege levels and access controls in the ARM architecture, because an access from an insufficiently privileged exception level traps instead of completing.
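CPUECTLR_EL1 and CPUACTLR_EL1 are implementation-defined registers, so toolchains generally expect the generic S<op0>_<op1>_C<CRn>_C<CRm>_<op2> spelling rather than the register name. As a minimal sketch, the C helpers below build the raw MRS and MSR opcodes from those five fields, which can help when checking a disassembly or emitting the instruction by hand. The function names are illustrative, and the encoding assumed for CPUECTLR_EL1 (S3_1_C15_C2_1) should be confirmed against the TRM.

```c
#include <stdint.h>

/* Pack the system-register fields shared by MRS and MSR (register form).
 * op0 must be 2 or 3; its two bits land in instruction bits [20:19]. */
static uint32_t a64_sysreg_fields(unsigned op0, unsigned op1, unsigned crn,
                                  unsigned crm, unsigned op2, unsigned rt)
{
    return ((uint32_t)(op0 & 0x3u) << 19) | ((uint32_t)(op1 & 0x7u) << 16) |
           ((uint32_t)(crn & 0xFu) << 12) | ((uint32_t)(crm & 0xFu) << 8) |
           ((uint32_t)(op2 & 0x7u) << 5)  | (uint32_t)(rt & 0x1Fu);
}

/* MRS Xt, S<op0>_<op1>_C<crn>_C<crm>_<op2>: read a system register into Xt. */
static uint32_t a64_mrs_opcode(unsigned op0, unsigned op1, unsigned crn,
                               unsigned crm, unsigned op2, unsigned rt)
{
    return 0xD5200000u | a64_sysreg_fields(op0, op1, crn, crm, op2, rt);
}

/* MSR S<op0>_<op1>_C<crn>_C<crm>_<op2>, Xt: write Xt to a system register. */
static uint32_t a64_msr_opcode(unsigned op0, unsigned op1, unsigned crn,
                               unsigned crm, unsigned op2, unsigned rt)
{
    return 0xD5000000u | a64_sysreg_fields(op0, op1, crn, crm, op2, rt);
}
```

For example, `a64_mrs_opcode(3, 0, 0, 0, 0, 0)` gives the opcode for `MRS X0, MIDR_EL1`, and passing the fields 3, 1, 15, 2, 1 gives the read of the register assumed here to be CPUECTLR_EL1.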
The Cortex-A72 operates at multiple exception levels (ELs), with EL3 being the highest privilege level. Access to certain system registers, such as the CPU Extended Control Register (CPUECTLR_EL1) and the CPU Auxiliary Control Register (CPUACTLR_EL1), depends on the exception level at which the code is executing. For example, the CPUECTLR_EL1 register, which controls the L2 cache prefetcher, can never be accessed from EL0. Access from EL2 is gated by ACTLR_EL3, and access from EL1 is gated by both ACTLR_EL3 and ACTLR_EL2. If firmware has not enabled the relevant access-control bits, the prefetcher settings can only be changed by code running at EL3.
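The access rules above can be sketched in C as two small helpers: one decodes the CurrentEL register so that code can tell where it is running, and one computes the ACTLR value that firmware would write to let lower exception levels reach the prefetcher registers. The helper names are illustrative, and the two access-control bit positions are assumptions based on the Cortex-A72 TRM layout of ACTLR_EL3 and ACTLR_EL2; confirm them for your revision.

```c
#include <stdint.h>

/* CurrentEL reports the exception level in bits [3:2]. */
static unsigned current_el_from_raw(uint64_t currentel)
{
    return (unsigned)((currentel >> 2) & 0x3u);
}

/* Assumed ACTLR_EL3 / ACTLR_EL2 access-control bits on the Cortex-A72.
 * Verify both positions in the TRM before relying on them. */
#define ACTLR_CPUACTLR_ACCESS (1u << 0)  /* lower ELs may access CPUACTLR_EL1 */
#define ACTLR_CPUECTLR_ACCESS (1u << 1)  /* lower ELs may access CPUECTLR_EL1 */

/* Return the ACTLR value with both prefetcher-related grants added,
 * leaving every other bit as it was. */
static uint64_t actlr_grant_prefetch_regs(uint64_t actlr)
{
    return actlr | ACTLR_CPUACTLR_ACCESS | ACTLR_CPUECTLR_ACCESS;
}

#if defined(__aarch64__)
/* Reading CurrentEL is only valid at EL1 or higher; at EL0 it is undefined. */
static inline uint64_t read_currentel(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, CurrentEL" : "=r"(v));
    return v;
}
#endif
```

Firmware at EL3 would apply `actlr_grant_prefetch_regs` to ACTLR_EL3, and a hypervisor at EL2 would do the same for ACTLR_EL2, before an EL1 kernel attempts the write.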
In a Linux environment, accessing these registers can be particularly challenging due to the operating system’s management of privilege levels. User-space applications run at EL0, which does not have the necessary privileges to modify system registers, and a process cannot raise its own exception level. The access therefore has to be performed by code that already runs in the kernel at EL1, either through a kernel modification or a loadable kernel module, and it has to be performed on every core whose prefetcher you want to change, since these registers are per-core.
Practical Steps to Disable Cortex-A72 Prefetchers in Linux and Bare-Metal Environments
Disabling the prefetchers in the Cortex-A72 involves writing to specific bits in the CPUECTLR_EL1 and CPUACTLR_EL1 registers. The CPUECTLR_EL1 register controls the L2 cache prefetcher, while the CPUACTLR_EL1 register controls the L1 cache prefetcher and the load/store hardware prefetcher. Below, we outline the steps required to disable these prefetchers in both Linux and bare-metal environments.
Disabling the L2 Cache Prefetcher
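To disable the L2 cache prefetcher, you modify the CPUECTLR_EL1 register with a read-modify-write, so that the other fields in the register keep their values. This guide sets bit [2], but treat that position as something to verify rather than a given: the bit assignments are implementation defined, and in Arm's published Cortex-A72 TRM the low bits of CPUECTLR_EL1 belong to the CPU retention control field, with L2 prefetch behavior governed by the prefetch distance fields in the upper half of the register. Confirm the correct field in the Technical Reference Manual (TRM) for your revision before writing it.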
In a bare-metal environment, you can directly use the MRS and MSR instructions to read from and write to the CPUECTLR_EL1 register. For example, the following assembly code snippet demonstrates how to disable the L2 prefetcher:
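MRS X0, S3_1_C15_C2_1 // Read CPUECTLR_EL1 (assemblers generally need the encoded name)
ORR X0, X0, #(1 << 2) // Set bit 2, the position used in this guide; verify it against the TRM
MSR S3_1_C15_C2_1, X0 // Write the modified value back to CPUECTLR_EL1
ISB // Make sure the new setting is in effect before later instructions run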
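The same read-modify-write sequence can be sketched in C, which is often more convenient in firmware or kernel code. The pure function computes the new register value and can be checked on any host; the accessors are compiled only for AArch64 and must run at an exception level that has been granted access. The names `sysreg_update` and `cpuectlr_*` are illustrative, the encoding S3_1_C15_C2_1 is assumed for CPUECTLR_EL1, and the masks are deliberately left as parameters so that you pass the field confirmed in the TRM.

```c
#include <stdint.h>

/* Compute the value to write back: clear the bits in clear_mask,
 * then set the bits in set_mask. All other bits are preserved. */
static uint64_t sysreg_update(uint64_t old, uint64_t clear_mask,
                              uint64_t set_mask)
{
    return (old & ~clear_mask) | set_mask;
}

#if defined(__aarch64__)
/* Read CPUECTLR_EL1 through its assumed generic encoding. */
static inline uint64_t cpuectlr_read(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, S3_1_C15_C2_1" : "=r"(v));
    return v;
}

/* Write CPUECTLR_EL1, then synchronize so the change takes effect. */
static inline void cpuectlr_write(uint64_t v)
{
    __asm__ volatile("msr S3_1_C15_C2_1, %0\n\tisb" : : "r"(v) : "memory");
}

/* Read-modify-write of CPUECTLR_EL1 on the calling core only. */
static inline void cpuectlr_update(uint64_t clear_mask, uint64_t set_mask)
{
    cpuectlr_write(sysreg_update(cpuectlr_read(), clear_mask, set_mask));
}
#endif
```

The register is 64 bits wide, so masks should be built from 64-bit constants such as `1ULL << n`; a plain `1 << n` is an `int` and cannot represent fields in the upper half of the register.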
In a Linux environment, you would need to write a kernel module or modify the kernel source code so that the above instructions execute in kernel context at EL1. The write must run on each core you want to affect, and it only succeeds if firmware (and the hypervisor, if one is present) has granted EL1 access to the register; otherwise the instruction traps and the kernel reports an undefined-instruction fault.
Disabling the L1 Cache Prefetcher and Load/Store Hardware Prefetcher
To disable the L1 cache prefetcher and the load/store hardware prefetcher, you modify the CPUACTLR_EL1 register, again with a read-modify-write. This guide sets bit [0] for the L1 prefetcher and bit [1] for the load/store prefetcher. Verify both positions before use: Arm's published Cortex-A72 TRM and Trusted Firmware-A place the load/store hardware prefetcher disable at bit [56], and setting the wrong bit in this register can change unrelated core behavior.
In a bare-metal environment, you can use the following assembly code snippet to disable both prefetchers:
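MRS X0, S3_1_C15_C2_0 // Read CPUACTLR_EL1 (assemblers generally need the encoded name)
ORR X0, X0, #(1 << 0) // Set bit 0, used in this guide for the L1 prefetcher; verify against the TRM
ORR X0, X0, #(1 << 1) // Set bit 1, used in this guide for the load/store prefetcher; verify against the TRM
MSR S3_1_C15_C2_0, X0 // Write the modified value back to CPUACTLR_EL1
ISB // Make sure the new setting is in effect before later instructions run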
In a Linux environment, similar to the L2 prefetcher, you would need to write a kernel module or modify the kernel source code to execute the above instructions at the appropriate privilege level.
Additional Considerations and Caveats
When disabling the prefetchers, it is important to consider the potential impact on system performance. Prefetchers are designed to improve performance by reducing memory access latency, and disabling them can lead to increased latency and reduced throughput, particularly in workloads with predictable memory access patterns. Therefore, it is crucial to carefully evaluate the performance impact of disabling the prefetchers in your specific use case.
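One way to evaluate the impact is to time the same code with the prefetchers on and off, using both a prefetch-friendly access pattern (a sequential walk) and a prefetch-hostile one. The sketch below builds the hostile case: a pointer-chasing chain laid out as a single random cycle with Sattolo's algorithm, so every load depends on the previous one and no stride is visible. All names here are illustrative, and the timing harness around `chase_walk` is left to the reader.

```c
#include <stddef.h>
#include <stdint.h>

/* xorshift64 PRNG: small and deterministic, adequate for shuffling test data. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fill next[] with a single-cycle random permutation (Sattolo's algorithm),
 * so that next[i] names the element to visit after element i. */
static void build_chase(size_t *next, size_t n, uint64_t seed)
{
    uint64_t state = seed ? seed : 1;  /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n; i-- > 1; ) {
        size_t j = (size_t)(xorshift64(&state) % i);  /* j in [0, i-1] */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain from element 0 and count steps until it returns to 0. */
static size_t chase_cycle_length(const size_t *next, size_t n)
{
    size_t steps = 0, i = 0;
    if (n == 0)
        return 0;
    do {
        i = next[i];
        steps++;
    } while (i != 0 && steps < n);
    return steps;
}

/* The loop to time: each load address comes from the previous load. */
static size_t chase_walk(const size_t *next, size_t start, size_t steps)
{
    size_t i = start;
    while (steps--)
        i = next[i];
    return i;
}
```

For a meaningful measurement the array should be much larger than the L2 cache, and the value returned by `chase_walk` should be used (for example, printed) so the compiler cannot discard the loop.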
Additionally, the ability to modify the prefetcher settings may be restricted by the system’s firmware or hypervisor. If the access-control bits in ACTLR_EL3 (and in ACTLR_EL2, when a hypervisor is present) are left clear, any access to CPUECTLR_EL1 or CPUACTLR_EL1 from a lower exception level traps rather than taking effect. In such scenarios, you may need to work with the system’s firmware or hypervisor vendor to have access enabled, or to have the desired prefetcher setting applied at EL3 during boot.
Finally, it is important to note that the behavior of the prefetchers and the specific bits used to control them may vary between different revisions of the Cortex-A72. Therefore, it is essential to consult the relevant technical documentation for your specific processor revision to ensure that you are modifying the correct bits and achieving the desired behavior.
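Before applying revision-specific bit settings, it helps to confirm which core and revision the code is running on. The sketch below decodes a raw MIDR_EL1 value into its architectural fields and checks for the Cortex-A72 part number (0xD08 with Arm's implementer code 0x41). The struct and function names are illustrative; how you obtain the raw value (an MRS at EL1, or on Linux the per-CPU `midr_el1` sysfs file where the kernel provides it) depends on your environment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of MIDR_EL1 that identify the core and its revision. */
struct midr_info {
    unsigned implementer;  /* bits [31:24], 0x41 for Arm */
    unsigned variant;      /* bits [23:20], the major revision ("r" number) */
    unsigned part;         /* bits [15:4], the part number */
    unsigned revision;     /* bits [3:0], the minor revision ("p" number) */
};

static struct midr_info midr_decode(uint64_t midr)
{
    struct midr_info m;
    m.implementer = (unsigned)((midr >> 24) & 0xFFu);
    m.variant     = (unsigned)((midr >> 20) & 0xFu);
    m.part        = (unsigned)((midr >> 4) & 0xFFFu);
    m.revision    = (unsigned)(midr & 0xFu);
    return m;
}

/* True when the value identifies an Arm Cortex-A72. */
static bool midr_is_cortex_a72(uint64_t midr)
{
    struct midr_info m = midr_decode(midr);
    return m.implementer == 0x41u && m.part == 0xD08u;
}
```

With a sample value of 0x410FD083, the decode yields part 0xD08, variant 0, and revision 3, which corresponds to a Cortex-A72 r0p3.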
Conclusion
Disabling the prefetchers in the Cortex-A72 processor is a useful tool for developers who need to fine-tune or characterize the performance of their systems, but it requires a solid understanding of the ARM architecture, system registers, and privilege levels. The steps in this guide cover the L1 and L2 cache prefetchers in both bare-metal and Linux environments. Before relying on them, verify the register bit positions for your processor revision, measure the performance impact on your own workload, and check whether the system’s firmware or hypervisor permits the access at all.