ARM CCI-550 Snoop Filter Sizing and Configuration Challenges

The ARM CoreLink CCI-550 Cache Coherent Interconnect is a critical component in modern ARM-based systems, particularly in big.LITTLE configurations where multiple processor clusters with different cache sizes must maintain coherence. The snoop filter in the CCI-550 reduces unnecessary snoop traffic by tracking which cache lines are present in the private caches of the processors. Configuring it correctly requires an understanding of its architecture, the cache hierarchy of the system, and the implications of its associativity and size.

In the system considered here, which comprises four Cortex-A53 cores (32 KB L1 data cache per core, 512 KB L2 cache per cluster) and four Cortex-A73 cores (64 KB L1 data cache per core, 2 MB L2 cache per cluster), the snoop filter must be sized appropriately to avoid performance degradation due to excessive snoop traffic or snoop filter conflicts. The ARM Technical Reference Manual (TRM) recommends configuring the snoop filter directory to be 0.75 to 1 times the total size of the exclusive caches of the processors attached to the CCI-550. This recommendation, while seemingly straightforward, raises several implementation questions: which caches to include in the calculation, how to determine the snoop filter size in a given SoC, and how the snoop filter's associativity affects its behavior.

Defining Total Size of Exclusive Caches in big.LITTLE Systems

The term "total size of exclusive caches of processors that are attached to the CCI-550" refers to the combined size of all caches that are private to a core or a cluster, that is, not shared between clusters. In a big.LITTLE system, this includes the per-core L1 data caches and the per-cluster L2 caches. For the example system with four Cortex-A53 cores and four Cortex-A73 cores, the calculation must account for the L1 data caches of all eight cores and the L2 cache of each of the two clusters.

Each Cortex-A53 core has a 32 KB L1 data cache, and the cluster has a 512 KB L2 cache. Because the L2 cache is shared within the cluster, it is counted only once: the total for the Cortex-A53 cluster is the sum of the four L1 data caches plus the single L2 cache. The same applies to the Cortex-A73 cluster, where each core has a 64 KB L1 data cache and the shared 2 MB L2 cache is counted once.

To calculate the total size of exclusive caches, we can use the following formula:

Total Exclusive Cache Size = (Number of Cortex-A53 cores × L1 Cache Size per Cortex-A53 core) + (Number of Cortex-A73 cores × L1 Cache Size per Cortex-A73 core) + (L2 Cache Size per Cortex-A53 cluster) + (L2 Cache Size per Cortex-A73 cluster)

For the example system:

Total Exclusive Cache Size = (4 × 32 KB) + (4 × 64 KB) + 512 KB + 2 MB = 128 KB + 256 KB + 512 KB + 2048 KB = 2944 KB

Thus, the total size of exclusive caches in this system is 2944 KB. According to the ARM TRM, the snoop filter directory should be configured to be 0.75 to 1 times this size, which would be between 2208 KB and 2944 KB.
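The calculation above can be reproduced with a short script. This is a minimal sketch, assuming the cache sizes of the example system and counting each cluster's shared L2 once; the dictionary layout and the function name `total_private_cache_bytes` are illustrative choices, not part of any ARM tool.

```python
KB = 1024
MB = 1024 * KB


def total_private_cache_bytes(clusters):
    """Sum per-core L1 data caches plus one shared L2 per cluster."""
    total = 0
    for cluster in clusters:
        total += cluster["cores"] * cluster["l1d_bytes"] + cluster["l2_bytes"]
    return total


# Example system from the text (assumed values, one shared L2 per cluster).
clusters = [
    {"name": "Cortex-A53", "cores": 4, "l1d_bytes": 32 * KB, "l2_bytes": 512 * KB},
    {"name": "Cortex-A73", "cores": 4, "l1d_bytes": 64 * KB, "l2_bytes": 2 * MB},
]

total = total_private_cache_bytes(clusters)
low = total * 3 // 4   # 0.75 x total, lower bound of the recommended range
high = total           # 1.0 x total, upper bound of the recommended range

print(f"total: {total // KB} KB, recommended range: {low // KB} KB to {high // KB} KB")
```

Changing the entries in `clusters` is enough to repeat the estimate for a different core mix.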

Determining Snoop Filter Size in a Given SoC

The size of the snoop filter in the CCI-550 is implementation-defined, meaning it can vary between different SoCs. Unfortunately, there is no standardized method to query the snoop filter size directly from the hardware. However, there are several approaches to determine or estimate the snoop filter size in a given SoC.

One approach is to consult the SoC’s technical documentation or contact the SoC vendor directly. The vendor may provide details about the snoop filter size as part of the SoC’s datasheet or technical reference manual. If this information is not available, another approach is to perform empirical testing by measuring the performance impact of different workloads under varying cache pressure conditions. By observing the point at which performance degrades due to snoop filter conflicts, one can infer the approximate size of the snoop filter.

Additionally, some ARM-based SoCs provide performance monitoring counters that can be used to track snoop filter-related events, such as the number of snoop filter hits and misses. By analyzing these counters, it may be possible to estimate the effective size of the snoop filter. However, this approach requires a deep understanding of the performance monitoring unit (PMU) and the specific events that are relevant to the snoop filter.
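The empirical approach can be sketched as a working-set sweep followed by a search for the point where latency departs from its baseline. The sketch below covers only the analysis step. The function name `find_knee`, the 1.5× threshold, and the sample numbers are all hypothetical; the samples are synthetic placeholders, not measurements from any SoC, and a latency knee can also come from ordinary cache capacity limits rather than the snoop filter.

```python
def find_knee(samples, threshold=1.5):
    """Return the smallest working-set size whose latency exceeds
    threshold x the latency of the smallest working set, or None.

    samples: iterable of (working_set_kb, latency_ns) pairs.
    """
    ordered = sorted(samples)
    if not ordered:
        return None
    baseline = ordered[0][1]
    for size_kb, latency_ns in ordered:
        if latency_ns > baseline * threshold:
            return size_kb
    return None


# Synthetic placeholder data for illustration only.
samples = [(512, 10.0), (1024, 10.5), (2048, 11.0), (4096, 19.0)]
print(find_knee(samples))
```

In practice the sweep would need to run on several cores at once so that the lines being tracked exceed what the filter can hold, and the result should be treated as a rough estimate.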

Snoop Filter Associativity and Conflict Behavior

The snoop filter in the CCI-550 is described as being 8-way set associative. This means that the snoop filter is organized into sets, each of which can hold up to 8 tags. The concept of associativity in the snoop filter is indeed similar to that in a cache. In a cache, associativity determines how many cache lines can be stored in each set before a conflict occurs, leading to eviction. Similarly, in the snoop filter, associativity determines how many tags can be stored in each set before a conflict occurs.

When a conflict occurs in the snoop filter, the behavior is analogous to cache eviction: the snoop filter must evict an existing tag to make room for a new one. Because the filter no longer tracks the evicted line, this can lead to increased snoop traffic, as additional snoop requests may be needed (for example, to invalidate the line in the caches that hold it) to ensure coherence. Therefore, minimizing conflicts in the snoop filter is critical to maintaining system performance.
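The conflict behavior can be illustrated with a toy model of a set-associative directory. This is a sketch under stated assumptions, not a description of the CCI-550 implementation: the class name `SnoopFilterModel`, the LRU replacement policy, and the simple modulo set index are all assumed for illustration, since the actual replacement policy and index function are not given here.

```python
from collections import OrderedDict


class SnoopFilterModel:
    """Toy set-associative directory with LRU replacement (assumed policy)."""

    def __init__(self, num_sets, ways=8, line_bytes=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.evictions = 0

    def track(self, address):
        """Record that some cache holds the line containing address.

        Returns the base address of the evicted line, or None if no
        entry had to be evicted.
        """
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)  # refresh LRU position on a hit
            return None
        victim = None
        if len(entries) >= self.ways:
            victim_tag, _ = entries.popitem(last=False)  # evict the oldest tag
            self.evictions += 1
            victim = (victim_tag * self.num_sets + index) * self.line_bytes
        entries[tag] = True
        return victim


# Nine lines that all map to set 0 of a 4-set, 8-way filter:
# the ninth forces out the first.
model = SnoopFilterModel(num_sets=4, ways=8)
victims = [model.track(i * 4 * 64) for i in range(9)]
print(victims[-1], model.evictions)
```

Each returned victim address stands for a line the filter can no longer vouch for, which is where the extra coherence traffic described above comes from.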

The ARM TRM states that the snoop filter stores twice as many tags as the configured size to minimize conflicts. This means that even though the snoop filter is 8-way set associative, it can effectively store more tags by using additional storage. However, this does not eliminate the possibility of conflicts entirely, particularly in systems with large cache sizes or high levels of cache activity.

To understand the impact of snoop filter associativity, consider the following example. Suppose the snoop filter is sized to track 2944 KB of cache (1 times the total size of exclusive caches in the example system). Each directory entry holds the tag for one cache line, so, given that the snoop filter is 8-way set associative, the number of sets can be calculated as follows:

Number of Sets = (Tracked Cache Size) / (Associativity × Cache Line Size)

Assuming a cache line size of 64 bytes (the line size of both the Cortex-A53 and the Cortex-A73), the number of sets would be:

Number of Sets = 2944 KB / (8 × 64 B) = (2944 × 1024 B) / 512 B = 5888 sets

This means that the snoop filter can store up to 8 tags per set, for a total of 47,104 tags (5888 sets × 8 tags per set). However, if the snoop filter stores twice as many tags as the configured size, as the TRM statement above indicates, it can effectively store up to 94,208 tags. This additional storage helps reduce the likelihood of conflicts, but it does not eliminate them entirely.
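The same arithmetic can be checked in a few lines. This is a minimal sketch, assuming one directory entry per 64-byte cache line and 8 ways; the function name `snoop_filter_geometry` is illustrative, and the doubling factor simply applies the "twice as many tags" statement quoted above.

```python
def snoop_filter_geometry(tracked_bytes, ways=8, line_bytes=64):
    """Return (sets, entries) for a directory tracking tracked_bytes of cache."""
    entries = tracked_bytes // line_bytes  # one entry per cache line
    sets = entries // ways
    return sets, entries


sets, entries = snoop_filter_geometry(2944 * 1024)
doubled_entries = 2 * entries  # applying the "twice as many tags" statement

print(sets, entries, doubled_entries)
```

Running the function with 2208 KB instead gives the geometry at the lower end of the recommended range.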

Implementing Snoop Filter Configuration and Optimization

To optimize the snoop filter configuration in a big.LITTLE system, several steps can be taken. First, ensure that the snoop filter size is configured according to the ARM TRM recommendation of 0.75 to 1 times the total size of exclusive caches. This can be done by calculating the total exclusive cache size as described earlier and setting the snoop filter size accordingly.

Next, monitor the performance of the system under typical workloads to identify any potential issues related to snoop filter conflicts. This can be done using performance monitoring counters or by analyzing system performance metrics such as cache miss rates and snoop traffic. If performance degradation is observed, consider increasing the snoop filter size or adjusting the associativity if possible.

Finally, consider the impact of cache partitioning and memory access patterns on snoop filter behavior. In systems with large caches or high levels of cache activity, it may be beneficial to partition the cache or adjust memory access patterns to reduce the likelihood of snoop filter conflicts. For example, ensuring that frequently accessed data is distributed evenly across cache sets can help minimize conflicts and improve system performance.

In conclusion, configuring and optimizing the snoop filter in the ARM CCI-550 requires a thorough understanding of the cache hierarchy, the snoop filter architecture, and the system's workload characteristics. By carefully calculating the total size of exclusive caches, determining the snoop filter size, and monitoring system performance, it is possible to minimize snoop filter conflicts and their performance cost in big.LITTLE configurations.
