ARM Cortex-M7 Performance Bottlenecks: High CPU Load and Cache Configuration Issues

ARM Cortex-M7 High CPU Load Despite Higher Clock Speed

A Cortex-M7-based microcontroller, the CYT4BFBCJE running at 160 MHz, shows a CPU load of 95%, while a Cortex-M4-based system (CYT2B9X) running at 80 MHz shows only 25%. Both systems run the same Vector BSW (Basic Software), the same BSW configuration, and the same application software, built with the same compiler options. The result is counterintuitive: the Cortex-M7 is the more advanced core and runs at twice the clock speed, so a lower load would be expected, not a higher one. The likely explanation is that the Cortex-M7 spends most of its cycles waiting on memory rather than doing useful work, and those stalls show up as CPU load.
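
To put the gap in numbers, the two load figures can be converted into cycles consumed per second. The sketch below is plain arithmetic on the values quoted above; the function names are hypothetical and not part of any BSW or vendor API.

```c
#include <stdint.h>

/* Cycles consumed per second = core clock (Hz) * CPU load (%) / 100. */
static inline uint64_t busy_cycles_per_second(uint32_t clock_hz, uint32_t load_percent)
{
    return ((uint64_t)clock_hz * load_percent) / 100u;
}

/* Load that a given cycle budget would produce at another clock,
 * in tenths of a percent (125 means 12.5%). */
static inline uint32_t expected_load_permille(uint64_t busy_cycles, uint32_t clock_hz)
{
    return (uint32_t)((busy_cycles * 1000u) / clock_hz);
}
```

With the figures above, the Cortex-M4 consumes 20 million cycles per second and the Cortex-M7 consumes 152 million, roughly 7.6 times as many for the same software. If the Cortex-M7 needed only the same cycle count as the Cortex-M4, its load at 160 MHz would be about 12.5%.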

The bottleneck is most likely improper cache configuration. The Cortex-M7 can be implemented with instruction and data caches, and both matter for performance because memory outside the core, flash in particular, typically cannot keep up with the core clock. In this system only the instruction cache has been enabled (SCB_EnableICache()), which reduced the CPU load by 57%. That improvement confirms that the instruction cache was previously disabled, so every instruction fetch went directly to flash memory, which is significantly slower. The load remains high even so, which suggests that the data cache is either not enabled or improperly configured.

The Cortex-M7’s memory subsystem is more complex than that of the Cortex-M4, featuring a Harvard architecture with separate buses for instructions and data, coupled with optional instruction and data caches. The Cortex-M7 also supports Tightly Coupled Memory (TCM) and an optional Memory Protection Unit (MPU). These features, while powerful, require careful configuration to avoid performance degradation. For instance, the MPU must be configured to define memory regions and their attributes, such as cacheability and access permissions. Failure to configure the MPU correctly can lead to exceptions or suboptimal performance.

Improper Cache Configuration and MPU Misalignment

The high CPU load on the Cortex-M7 has several potential causes, most of them related to cache configuration and MPU settings. The instruction and data caches are disabled at reset, and software must enable them explicitly. Enabling the instruction cache already gave a large improvement, but the data cache remains a critical factor: it reduces access latency for frequent reads and writes to cacheable memory such as system SRAM, external memory, and constant data in flash. Peripheral registers are a separate case and must not be cached. With the data cache disabled, every load and store goes out to the slower memory, which stalls the core and raises the CPU load.

The MPU defines memory regions and their attributes, including cacheability. It lets developers specify which regions are cached and which bypass the cache. An incorrect MPU configuration can cause exceptions or performance problems. In this scenario, enabling the MPU resulted in an exception, which points to a misconfiguration. Two common causes are a region whose base address is not aligned to its size (an ARMv7-M MPU region must be a power of two in size, at least 32 bytes, with its base aligned to that size) and enabling the MPU without the default memory map (PRIVDEFENA) while the code, stack, or peripherals in use are not covered by any region. The region settings must match the system’s memory map, and the attributes must be chosen carefully to avoid conflicts or unintended behavior.
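
One cheap way to rule out the alignment problem is to check each planned region before it is programmed. The following is a minimal sketch of such a check under the ARMv7-M rule that a region is a power of two in size, at least 32 bytes, with its base aligned to that size. The function name is hypothetical, and the 4 GB region size is not representable in this 32-bit parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if (base, size_bytes) is a legal ARMv7-M MPU region:
 * size is a power of two, at least 32 bytes, and base is a multiple of size. */
static inline bool mpu_region_is_valid(uint32_t base, uint32_t size_bytes)
{
    bool power_of_two = (size_bytes != 0u) &&
                        ((size_bytes & (size_bytes - 1u)) == 0u);

    if (!power_of_two || (size_bytes < 32u)) {
        return false;
    }
    /* For a power-of-two size, (size - 1) masks the bits that must be zero. */
    return (base & (size_bytes - 1u)) == 0u;
}
```

For example, a 16 KB region at 0x28050000 passes the check, while the same size at 0x28051000 does not. A check like this can run in a debug build or a unit test, so a bad entry in the region table is caught before the MPU is enabled.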

Another potential cause is the interaction between the data cache and DMA (Direct Memory Access). The Cortex-M7 does not keep its data cache coherent with other bus masters in hardware, so software is responsible for giving the CPU and the DMA controller a consistent view of memory. If the cache is not managed, the CPU can read stale data from the cache after a DMA write to memory, and the DMA controller can read stale data from memory while the CPU’s latest writes still sit in the cache. CMSIS provides cache maintenance operations (e.g., SCB_InvalidateDCache()) for this purpose, but they must be used judiciously: each one costs cycles, and frequent whole-cache operations discard useful cache contents and degrade performance.

The flash memory interface also affects performance. Flash access time becomes a bottleneck if the CPU frequently fetches instructions or data from flash. The instruction cache mitigates this for code, and the data cache does the same for constants and lookup tables stored in flash. Some devices additionally place a prefetch or accelerator unit in front of the flash; ST’s ART Accelerator on STM32 parts is one example. Such units are vendor-specific rather than part of the Cortex-M7 core, so the device documentation determines whether one exists and how it is enabled and configured.

Optimizing Cache and MPU Configuration for Cortex-M7 Performance

To address the high CPU load on the Cortex-M7, a systematic approach to cache and MPU configuration is required. The following steps outline the necessary actions to optimize performance:

Enabling and Configuring the Instruction Cache

The instruction cache should be enabled using SCB_EnableICache(), as in the initial improvement. The cache must be invalidated before it is enabled so that it does not start with stale or undefined contents. The CMSIS implementation of SCB_EnableICache() performs this invalidation itself; SCB_InvalidateICache() is available for invalidating explicitly, for example after code in memory has been modified. The instruction cache needs no MPU configuration as long as the default memory map is in effect, because it operates transparently to the software. If the MPU is used, developers should ensure that the flash memory region containing the executable code is marked cacheable.

Configuring the Data Cache and MPU

The data cache must be enabled and configured in conjunction with the MPU. The following steps outline the process:

Define MPU Regions: The MPU must be configured to define memory regions and their attributes. For example, the flash memory region should be marked as cacheable, while peripheral regions should be marked as non-cacheable. The region settings must align with the system’s memory map, and each base address must be aligned to the region size. For instance, the following code snippet configures region 0 as a 16 KB region starting at address 0x28050000 with full access, execution disabled, and the cacheable and bufferable bits set (write-back caching):
```
ARM_MPU_SetRegionEx(0UL, 0x28050000, ARM_MPU_RASR(1UL, ARM_MPU_AP_FULL, 0UL, 0UL, 1UL, 1UL, 0x00UL, ARM_MPU_REGION_SIZE_16KB));
```
Enable the MPU: The MPU should be enabled with appropriate control bits. The MPU_CTRL_PRIVDEFENA_Msk bit enables the default memory map for privileged mode, while MPU_CTRL_HFNMIENA_Msk allows the MPU to operate during hard faults, NMI, and FAULTMASK handlers. The MPU can be enabled using:
```
ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk | MPU_CTRL_HFNMIENA_Msk);
```
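Enable the Data Cache: The data cache should be enabled after configuring the MPU. Like the instruction cache, it must be invalidated before it is enabled. The CMSIS implementation of SCB_EnableDCache() already does this, so the explicit call below is redundant but harmless at this point. Once the cache is running, SCB_InvalidateDCache() should not be called casually, because it discards any modified data that has not yet been written back to memory: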
```
SCB_InvalidateDCache();
SCB_EnableDCache();
```
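
When an MPU exception has to be debugged, it helps to know what the region attribute word actually contains. The sketch below encodes the fields of the ARMv7-M MPU_RASR register in the same argument order as the ARM_MPU_RASR macro used above. The helper names are hypothetical, and the bit positions should be checked against the CMSIS header shipped with the device before relying on them.

```c
#include <stdint.h>

/* Hypothetical helper: builds an MPU_RASR value from its individual fields. */
static inline uint32_t mpu_rasr_encode(uint32_t xn, uint32_t ap, uint32_t tex,
                                       uint32_t s, uint32_t c, uint32_t b,
                                       uint32_t srd, uint32_t size_field)
{
    return ((xn & 1u) << 28)              /* XN: execute never            */
         | ((ap & 7u) << 24)              /* AP: access permissions       */
         | ((tex & 7u) << 19)             /* TEX: type extension          */
         | ((s & 1u) << 18)               /* S: shareable                 */
         | ((c & 1u) << 17)               /* C: cacheable                 */
         | ((b & 1u) << 16)               /* B: bufferable                */
         | ((srd & 0xFFu) << 8)           /* SRD: subregion disable mask  */
         | ((size_field & 0x1Fu) << 1)    /* SIZE: region = 2^(SIZE + 1)  */
         | 1u;                            /* ENABLE                       */
}

/* SIZE field for a power-of-two region of at least 32 bytes. */
static inline uint32_t mpu_size_field(uint32_t size_bytes)
{
    uint32_t log2 = 0u;
    while (size_bytes > 1u) {
        size_bytes >>= 1;
        log2++;
    }
    return log2 - 1u;
}
```

For the 16 KB region above (XN = 1, AP = 3 for full access, TEX = 0, S = 0, C = 1, B = 1), the encoded value is 0x1303001B. Comparing that against the MPU_RASR value read back in the debugger is a quick way to confirm that the region was programmed as intended.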

Managing Cache Coherency with DMA

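When using DMA, software must keep the CPU’s cached view of memory and the DMA controller’s view consistent. Before a DMA transfer that reads from memory, the data cache should be cleaned so that any modified data in the cache is written back to memory. After a DMA transfer that writes to memory completes, the affected cache lines should be invalidated so that the CPU reads the new data from memory. The whole-cache functions below are the simplest form, but invalidating the entire cache also discards unrelated modified data, so the by-address variants SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr(), applied to buffers aligned to the 32-byte cache line, are usually the safer choice. Alternatively, DMA buffers can be placed in a region that the MPU marks as non-cacheable. The whole-cache functions are: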

```
SCB_CleanDCache();      // Clean the data cache before DMA transfer
SCB_InvalidateDCache(); // Invalidate the data cache after DMA transfer
```
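
By-address cache maintenance works on whole cache lines, so the range handed to it should be expanded to line boundaries, and a DMA buffer should not share a line with unrelated data. The sketch below computes that range, assuming the 32-byte data cache line of the Cortex-M7. The type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 data cache line size in bytes */

typedef struct {
    uintptr_t start;   /* buffer start rounded down to a line boundary */
    size_t    length;  /* covered bytes, a multiple of the line size   */
} cache_range_t;

/* Expands [addr, addr + size) to the cache lines that contain it. */
static inline cache_range_t cache_range_for_buffer(uintptr_t addr, size_t size)
{
    const uintptr_t mask = (uintptr_t)(DCACHE_LINE_SIZE - 1u);
    cache_range_t r;
    uintptr_t aligned_end = (addr + size + mask) & ~mask;

    r.start  = addr & ~mask;
    r.length = (size_t)(aligned_end - r.start);
    return r;
}

/* Returns 1 if the buffer starts and ends on line boundaries, so that
 * invalidating it cannot discard neighbouring variables. */
static inline int buffer_owns_its_cache_lines(uintptr_t addr, size_t size)
{
    return ((addr % DCACHE_LINE_SIZE) == 0u) && ((size % DCACHE_LINE_SIZE) == 0u);
}
```

On the target, the resulting start and length would be passed to SCB_CleanDCache_by_Addr() before a transfer that reads the buffer and to SCB_InvalidateDCache_by_Addr() after a transfer that writes it. If buffer_owns_its_cache_lines() reports 0 for a receive buffer, the buffer should be realigned and padded rather than invalidated as it stands.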

Optimizing Flash Memory Access

A flash prefetch or accelerator unit, where the device provides one, can significantly improve flash access performance by fetching instructions ahead of the core. Note that the ART Accelerator is specific to ST’s STM32 devices and is not part of the Cortex-M7 core, so for other devices the equivalent feature, if any, is described in the device documentation and should be enabled if it is not already active. Additionally, developers should ensure that the flash wait states are configured correctly for the operating frequency in the device-specific flash controller registers.

Monitoring and Tuning Performance

After implementing the above steps, it is essential to monitor the system’s performance to ensure that the CPU load is reduced. Tools such as performance counters and profiling tools can be used to identify any remaining bottlenecks. If necessary, further tuning of the cache and MPU settings may be required to achieve optimal performance.
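
One simple performance counter on the Cortex-M7 is the DWT cycle counter (DWT->CYCCNT), a free-running 32-bit register. The sketch below shows only the arithmetic for turning two counter samples into a load figure. The function names are hypothetical, and how the busy cycles are collected (for example by sampling the counter on entry to and exit from the idle task) is an assumption about the application, not something the BSW is known to provide.

```c
#include <stdint.h>

/* Elapsed cycles between two samples of a free-running 32-bit counter.
 * Unsigned subtraction gives the right result across a single wrap. */
static inline uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* CPU load in tenths of a percent (950 means 95.0%). */
static inline uint32_t cpu_load_permille(uint32_t busy_cycles, uint32_t window_cycles)
{
    if (window_cycles == 0u) {
        return 0u;  /* no measurement window yet */
    }
    return (uint32_t)(((uint64_t)busy_cycles * 1000u) / window_cycles);
}
```

At 160 MHz the 32-bit counter wraps roughly every 26.8 seconds, so the measurement window must stay shorter than that for the subtraction to be valid. Measuring the load before and after each configuration change shows which step actually reduced it.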

By following these steps, developers can effectively address the high CPU load on the Cortex-M7 and leverage its full performance potential. Proper cache and MPU configuration, combined with careful management of DMA operations and flash memory access, will ensure that the Cortex-M7 operates efficiently, even under demanding workloads.
