ARM Cortex Cache Hierarchy and Memory Access Patterns
The ARM Cortex architecture, and in particular the ARMv8-A cores in the Snapdragon 765G SoC, features a memory hierarchy designed to balance performance and power efficiency. This hierarchy typically comprises L1, L2, and L3 caches, each serving a different role in data access and storage. The L1 cache is split into separate instruction and data caches, providing low-latency access to frequently used instructions and data. The L2 cache serves as a secondary level, offering larger capacity at higher latency, while the L3 cache, often shared among multiple cores, acts as the last-level cache before main memory.
To interpret the observed memory access patterns, it helps to trace how a request moves through this hierarchy. Normally, data requested by the CPU is looked up first in the L1 cache. On a miss, the request propagates to the L2 cache, and then to the L3 cache if the data is still not found. The profiling data from the Pixel 5, however, reveals anomalies: the number of memory accesses exceeds the number of cache accesses, and L3 cache accesses outnumber L2 cache accesses. This suggests that certain accesses bypass the typical hierarchy, either skipping levels entirely or being handled differently by the memory subsystem.
The ARMv8 architecture includes mechanisms for cache bypassing, particularly for data that is unlikely to be reused, such as streaming data or large blocks of memory that are written once and read once. These mechanisms are designed to prevent cache pollution, where transient data fills up the cache, evicting more frequently accessed data and degrading overall performance. The use of non-temporal (or streaming) stores is one such mechanism, allowing data to be written directly to memory without being cached.
Temporal and Non-Temporal Memory Access Instructions
The ARMv8-A architecture provides distinct instructions for temporal and non-temporal memory accesses. Temporal accesses are the standard memory operations: the data is expected to be reused, so it is cached at the various levels to speed up subsequent accesses. Non-temporal accesses are intended for data that will not be reused, such as streaming buffers that are touched only once. These instructions carry a hint that the access may bypass the cache hierarchy, avoiding pollution of the caches with data that gains nothing from being cached.
For this purpose the architecture provides the instructions STNP (Store Pair of Registers with Non-temporal hint) and LDNP (Load Pair of Registers with Non-temporal hint). They signal to the memory subsystem that the data being accessed is unlikely to be reused, and the hardware may optimize accordingly: a non-temporal store can write data to memory without allocating a cache line, and a non-temporal load can fetch data without retaining it in the cache. The hint is advisory, however; a given core is free to treat these instructions as ordinary loads and stores.
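As a sketch, an AArch64 copy loop that streams a buffer through LDNP/STNP rather than ordinary LDP/STP might look like the following; the label and register roles are illustrative only, and LDNP/STNP take only a base register plus immediate offset (no writeback), hence the separate pointer updates:

```asm
// Stream `count` bytes from src to dst with a non-temporal hint.
// x0 = dst, x1 = src, x2 = byte count (nonzero multiple of 32).
copy_stream:
    ldnp    x3, x4, [x1]          // load pair, non-temporal hint
    ldnp    x5, x6, [x1, #16]
    stnp    x3, x4, [x0]          // store pair, non-temporal hint
    stnp    x5, x6, [x0, #16]
    add     x1, x1, #32
    add     x0, x0, #32
    subs    x2, x2, #32
    b.ne    copy_stream
    ret
```

Because the hint is advisory, whether such a loop actually avoids cache allocation on a particular Cortex core has to be confirmed empirically, for example with the PMU counters discussed below.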
In the context of the profiling data, the higher number of memory accesses relative to cache accesses is consistent with extensive use of non-temporal accesses (or related streaming optimizations in the memory subsystem). This is particularly plausible for applications like Chrome and YouTube, which process large volumes of streaming data such as video frames and web page content. Such workloads benefit from non-temporal accesses: cache pollution is reduced, and frequently accessed data is more likely to stay resident in the cache.
Performance Monitoring Unit (PMU) and Cache Bypass Detection
The Performance Monitoring Unit (PMU) in ARM Cortex processors provides a wealth of information about the system’s performance, including cache accesses, memory accesses, and various other hardware events. However, detecting cache bypassing and understanding its impact on performance requires a deep understanding of the PMU’s capabilities and limitations.
The PMU can be configured to count specific events, such as L1 cache accesses, L2 cache accesses, L3 cache accesses, and memory accesses. By analyzing these counts, one can infer the behavior of the memory subsystem and identify anomalies such as cache bypassing. However, the PMU does not directly count cache bypass events, making it necessary to use indirect methods to detect and quantify them.
One approach to detecting cache bypassing is to compare the number of memory accesses with the number of cache accesses. If the number of memory accesses is significantly higher than the number of cache accesses, it suggests that a substantial portion of memory accesses are bypassing the cache hierarchy. Additionally, comparing the number of L2 and L3 cache accesses can provide insights into whether certain accesses are bypassing the L2 cache and going directly to the L3 cache.
To calculate cache misses accurately in the presence of bypassing, the bypassed accesses must be accounted for. In the absence of bypassing, one would typically estimate L1 misses as the number of L2 cache accesses, since each L1 miss should surface as an L2 lookup (equivalently, L1 hits are the L1 accesses minus the L2 accesses). If some L1 misses bypass the L2 cache and go directly to L3 or to memory, this estimate undercounts. The calculation therefore needs to be adjusted for the bypassed accesses, either by estimating their number from the profiling data or by using additional PMU events (such as the cache refill events) that describe the memory access patterns in more detail.
In conclusion, understanding and optimizing memory access patterns in ARM Cortex processors requires a comprehensive approach that considers the cache hierarchy, the use of temporal and non-temporal memory access instructions, and the capabilities of the Performance Monitoring Unit. By carefully analyzing the profiling data and accounting for cache bypassing, one can gain valuable insights into the system’s performance and identify opportunities for optimization.