Cortex-A72 Memory Benchmark Performance Degradation and Gradual Improvement
In the Cortex-A72 memory benchmark, performance improves gradually over multiple runs. The behavior arises from the interaction of the processor’s cache hierarchy, the DRAM subsystem, and the benchmark’s own memory access pattern. The benchmark performs pseudo-random accesses over a 400MB buffer, which stresses the memory subsystem by design. That the improvement continues over hundreds of runs indicates that the system does not reach a steady state immediately and instead converges slowly. Candidate explanations include cache warming, DRAM timing adjustments, and hardware prefetching.
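To make "not reaching a steady state immediately" measurable, one option is to record the time of every run and find the first run after which the timings stay close to their final level. The sketch below is a minimal illustration; the function name, the 20-run tail, and the 2% tolerance are assumptions chosen for the example, not part of the benchmark.

```python
from statistics import median


def first_steady_run(times, tail=20, tolerance=0.02):
    """Return the index of the first run from which every later run stays
    within `tolerance` (relative) of the median of the last `tail` runs.

    Returns None when there are fewer than `tail` runs to judge from.
    """
    if len(times) < tail:
        return None
    reference = median(times[-tail:])
    steady_from = len(times)
    # Walk backwards until a run falls outside the tolerance band.
    for i in range(len(times) - 1, -1, -1):
        if abs(times[i] - reference) > tolerance * reference:
            break
        steady_from = i
    return steady_from


# Hypothetical timings: five slow runs followed by twenty stable ones.
example = [2.0, 1.8, 1.6, 1.4, 1.2] + [1.0] * 20
print(first_steady_run(example))
```

A result in the hundreds for real measurements would confirm that the warm-up is far longer than cache effects alone would suggest.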
The Cortex-A72’s cache hierarchy consists of private L1 caches (a 48KB instruction cache and a 32KB data cache per core) and a shared L2 cache (2048KB for the 4-core cluster in this system; the L2 size is an implementation choice). Because the accesses are random and the 400MB buffer is roughly 200 times larger than the L2, lines are evicted almost as soon as they are filled, and the miss rate is high. If the benchmark repeats the same access sequence, the caches can retain a small amount of reused data, which lowers the miss rate slightly. Caches typically reach this steady state within a few iterations, however, so cache warming alone does not explain an improvement that continues over hundreds of runs.
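The limit on what cache warming can achieve follows from simple arithmetic. The sketch below assumes accesses are uniformly distributed over the buffer and independent of each other, in which case a cache cannot hit more often than the fraction of the buffer it can hold. The function name is illustrative, and KB and MB are read as binary units.

```python
def max_hit_fraction(cache_bytes, buffer_bytes):
    """Upper bound on the hit rate for uniformly random, independent
    accesses: the cache can hold at most this fraction of the buffer."""
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    return min(1.0, cache_bytes / buffer_bytes)


KIB = 1024
MIB = 1024 * KIB

# Sizes taken from the text: 32KB L1 data cache, 2048KB L2, 400MB buffer.
l1d_bound = max_hit_fraction(32 * KIB, 400 * MIB)
l2_bound = max_hit_fraction(2048 * KIB, 400 * MIB)
print(f"L1D bound: {l1d_bound:.4%}, L2 bound: {l2_bound:.4%}")
```

Under these assumptions even a fully warmed L2 can serve only about 0.5% of the accesses, which is why cache warming is unlikely to be the dominant effect.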
The DRAM subsystem also plays a significant role. DRAM devices are organized into banks (and, in newer standards, bank groups), each with a row buffer, and the memory controller schedules accesses around these structures. Depending on the controller, this can produce a "warming up" effect as its scheduling and page-management decisions adapt to the access stream. In addition, DRAM self-refresh and other power-saving states can add latency at first, which decreases once the memory is kept continuously active.
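As a rough illustration of how little a random access stream benefits from row buffers, the following sketch estimates the chance that an access lands in a row that is already open, and the resulting average latency. The 2KiB row size and the 8 open rows are hypothetical values chosen for illustration, not properties of the system discussed here, and the model assumes uniformly random, independent accesses.

```python
def open_row_hit_probability(buffer_bytes, row_bytes, open_rows):
    """Chance that a uniformly random access falls in an already-open row."""
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    return min(1.0, open_rows * row_bytes / buffer_bytes)


def expected_latency_ns(hit_probability, hit_ns, miss_ns):
    """Average access latency given row-hit and row-miss latencies."""
    return hit_probability * hit_ns + (1.0 - hit_probability) * miss_ns


MIB = 1024 * 1024

# Hypothetical DRAM geometry: 8 banks, each holding one open 2KiB row.
p = open_row_hit_probability(400 * MIB, row_bytes=2048, open_rows=8)
print(f"open-row hit probability: {p:.6%}")
```

With a hit probability this small, the average latency is effectively the row-miss latency, so any DRAM-side improvement over time would have to come from something other than row-buffer locality.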
Cache Warming, DRAM Timing, and Prefetching Mechanisms
The primary causes of the observed behavior can be broken down into three main areas: cache warming, DRAM timing adjustments, and prefetching mechanisms. Each of these factors contributes to the gradual improvement in benchmark performance.
Cache warming refers to the process by which the caches become populated with the data the benchmark accesses. On the Cortex-A72, the L1 and L2 caches start cold, meaning they hold no data relevant to the benchmark. As the benchmark runs, the caches fill with the lines touched by the random memory operations. Because the accesses are random, warming is slower than it would be for a sequential or predictable pattern, and because the buffer is far larger than the caches, only a small fraction of it can be resident at any time. Whatever reuse exists is captured within the first few iterations, after which the miss rate stops improving.
DRAM timing adjustments are another candidate. Core timing parameters such as row activation latency and column access strobe (CAS) latency are normally fixed when the memory is initialized at boot, so they are unlikely to change between runs. What a controller may adjust at run time is its page policy (how long rows stay open), its request scheduling, and its refresh rate (for example, in response to temperature). If the controller starts conservatively and adapts to the observed access stream, latency can fall over time. Whether this happens, and whether it takes hundreds of iterations, depends on the specific controller and should be checked against its documentation rather than assumed.
Prefetching mechanisms also play a role. The Cortex-A72 includes hardware prefetchers that predict future memory accesses from past patterns, typically sequential streams or fixed strides. A pseudo-random pattern gives them little to predict, so early prefetches are mostly inaccurate: they consume bandwidth and can evict useful lines, which increases memory latency. If the access sequence contains some regularity that the prefetchers can detect, they may become more effective over time and reduce the number of cache misses; for a truly random sequence, little improvement should be expected from this source.
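One common way to build an access pattern that hardware prefetchers cannot follow is a pointer chase over a random permutation with a single cycle, so that each address depends on the previous load and every slot is visited exactly once per pass. The sketch below uses Sattolo's algorithm to build such a permutation. It is an assumption about how a benchmark of this kind could be constructed, not a description of the benchmark discussed here, and the function names are illustrative.

```python
import random


def single_cycle_permutation(n, seed=0):
    """Sattolo's algorithm: return a list `nxt` such that following
    i -> nxt[i] visits all n slots once before returning to the start."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, so an element never stays in place
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def cycle_length(nxt, start=0):
    """Count the steps needed to return to `start` when chasing pointers."""
    steps = 0
    i = start
    while True:
        i = nxt[i]
        steps += 1
        if i == start:
            return steps


chain = single_cycle_permutation(1000, seed=42)
print(cycle_length(chain) == len(chain))
```

Because the same seed reproduces the same chain, a benchmark built this way repeats an identical sequence on every run, which makes run-to-run differences easier to attribute to the hardware rather than to the access pattern.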
Implementing Cache and DRAM Optimization Strategies
To address the gradual performance improvement observed in the Cortex-A72 memory benchmark, several optimization strategies can be implemented. These strategies focus on reducing the time required for cache warming, optimizing DRAM timing parameters, and improving the effectiveness of prefetching mechanisms.
One approach is to add a warm-up phase before the timed runs. A preliminary pass that touches the entire 400MB buffer in a controlled manner cannot make the buffer cache-resident, since it far exceeds the 2048KB L2, but it ensures that the timed runs do not start from a cold memory system. Discarding the warm-up results lowers the initial miss rate seen by the measurement and lets the benchmark reach steady-state performance more quickly.
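A warm-up pass of this kind can be sketched as a strided walk over the buffer that reads and rewrites one byte per step. The function name and the 4096-byte stride are assumptions for illustration; a real benchmark would implement the same loop in its own language, and a 64-byte stride would touch every cache line instead of every page.

```python
def warm_buffer(buf, stride=4096):
    """Read and rewrite one byte every `stride` bytes.

    Returns the number of locations touched. The contents of `buf`
    are left unchanged.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    touched = 0
    for offset in range(0, len(buf), stride):
        value = buf[offset]  # read brings the location into the memory hierarchy
        buf[offset] = value  # write back the same value so the data is unchanged
        touched += 1
    return touched


# Small stand-in buffer; the benchmark in the text uses 400MB.
buffer = bytearray(16 * 4096)
print(warm_buffer(buffer))
```

The timed runs would start only after this pass completes, and the time spent in it would be excluded from the results.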
Another strategy is to tune the DRAM timing parameters manually. Where the platform allows it, this is done through the DRAM controller’s configuration registers, adjusting parameters such as row activation latency, CAS latency, and refresh intervals. These values are usually programmed by boot firmware, so changing them may require a firmware change rather than a run-time write. Setting them to values suited to the benchmark’s access pattern can reduce DRAM latency and improve performance. However, this approach requires a deep understanding of the DRAM controller’s architecture and of the limits of the attached memory devices, and it should be done with caution to avoid instability.
Improving the effectiveness of prefetching can also help. This starts with analyzing the benchmark’s access pattern and, where the platform exposes prefetcher controls, configuring the hardware prefetchers to match it. Such controls are implementation-specific and are typically accessible only to firmware or other privileged software. For example, if the benchmark exhibits some spatial locality, prefetching adjacent memory locations is useful; if it exhibits none, disabling the prefetchers may avoid wasted bandwidth. Additionally, software prefetch instructions can be inserted into the benchmark code to fetch data explicitly, which helps only when the address of a future access is known far enough in advance.
Finally, it is important to consider the impact of the operating system and system configuration on the benchmark’s performance. Ensuring that the benchmark runs on an isolated core with minimal interference from other processes can help reduce variability and improve consistency. Additionally, disabling power-saving features that may introduce latency, such as CPU frequency scaling and DRAM self-refresh, can help maintain optimal performance throughout the benchmark.
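On Linux, two of these checks can be scripted: pinning the process to one core and confirming which CPU frequency governor is active. The sketch below uses `os.sched_setaffinity` and the cpufreq sysfs file `scaling_governor`. The helper names are illustrative, both helpers fall back gracefully on systems without these interfaces, and isolating the core from the scheduler itself is a separate boot-time setting not shown here.

```python
import os


def pin_to_core(core):
    """Pin the calling process to a single CPU core.

    Returns True on success, False if the platform does not support
    affinity control or the request is rejected.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {core})
    except OSError:
        return False
    return True


def read_governor(cpu=0):
    """Return the cpufreq governor for `cpu`, or None if it is not exposed."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor"
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print("pinned:", pin_to_core(0))
    print("governor:", read_governor(0))
```

A governor of `performance` keeps the core at its highest frequency; any other value means the clock may change during the benchmark and add to the run-to-run variation.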
By implementing these strategies, the warm-up period behind the gradual improvement in the Cortex-A72 memory benchmark can be shortened, leading to more consistent and predictable results.