ARM Cortex-M7 Cache Coherency Problems During UART DMA Transfers

The issue at hand involves the STM32F746 microcontroller, which utilizes an ARM Cortex-M7 core. The problem manifests when using UART in DMA mode for receiving data. The first DMA transfer works correctly, but subsequent transfers fail to update the receive buffer, despite the DMA completion callback being triggered. This behavior is indicative of a cache coherency problem, which is a common issue when dealing with DMA transfers in systems with caches.

The ARM Cortex-M7 processor features both instruction and data caches (I-cache and D-cache). These caches are designed to improve performance by reducing the latency of memory accesses. However, they can introduce coherency issues when peripherals like DMA controllers access memory directly, bypassing the CPU and its caches. In this scenario, the DMA controller writes data directly to the memory, but the CPU may still be reading stale data from its cache, leading to the observed behavior.

The cache coherency problem arises because the DMA controller and the CPU operate independently. When the DMA controller writes data to memory, it does not invalidate the corresponding cache lines in the CPU’s cache. As a result, the CPU may continue to read outdated data from its cache, even though the memory has been updated by the DMA controller. This is particularly problematic in systems where the DMA controller and the CPU share the same memory space, as is the case with the STM32F746.

To understand the issue more deeply, consider the following sequence of events:

  1. First DMA Transfer: The DMA controller receives data from the UART and writes it to the receive buffer in memory. Since this is the first transfer, the cache lines corresponding to the receive buffer are likely invalid or empty. Therefore, when the CPU reads the data, it fetches the correct values from memory.

  2. Subsequent DMA Transfers: The DMA controller writes new data to the same memory location. However, the CPU’s cache lines for that memory location may still hold the old data. Since the cache is not invalidated, the CPU continues to read the stale data from the cache, leading to the perception that the receive buffer has not been updated.
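The two steps above can be replayed on a host PC with a toy model. This is only a sketch under simplifying assumptions: one byte of RAM, a one-entry cache, and hypothetical names (toy_system, dma_write, cpu_read, invalidate, second_read) that do not come from the HAL or CMSIS. It shows why the second read returns the old value unless the cached copy is dropped first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one byte of RAM with a one-entry cache in front of it. */
typedef struct {
    uint8_t ram;     /* what the DMA controller writes */
    uint8_t cached;  /* the CPU's private copy */
    bool    valid;   /* true once the CPU has loaded the byte */
} toy_system;

/* DMA writes go straight to RAM and never touch the cache. */
static void dma_write(toy_system *s, uint8_t value) { s->ram = value; }

/* CPU read: served from the cache on a hit, from RAM on a miss. */
static uint8_t cpu_read(toy_system *s)
{
    if (!s->valid) {
        s->cached = s->ram;
        s->valid = true;
    }
    return s->cached;
}

/* Invalidate: drop the cached copy so the next read goes to RAM. */
static void invalidate(toy_system *s) { s->valid = false; }

/* Replays both transfers and returns what the CPU sees after the second. */
static uint8_t second_read(bool invalidate_first)
{
    toy_system s = { 0, 0, false };

    dma_write(&s, 0x11);
    assert(cpu_read(&s) == 0x11);  /* first transfer: cache miss, correct data */

    dma_write(&s, 0x22);           /* second transfer updates RAM only */
    if (invalidate_first) {
        invalidate(&s);
    }
    return cpu_read(&s);
}
```

Without invalidation, second_read() returns the value from the first transfer; with it, the value from the second.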

This issue is exacerbated by the fact that the STM32 HAL library does not automatically handle cache coherency for DMA transfers. The library assumes that the developer will manage cache coherency manually, which is a common practice in embedded systems programming.

Memory Barrier Omission and Cache Invalidation Timing

The root cause of the problem is missing cache maintenance, compounded by the question of when that maintenance runs. Memory barriers are instructions that enforce the order of memory operations as seen by the CPU: accesses issued before the barrier complete before accesses after it begin. Barriers do not move data between the cache and memory, so on their own they cannot make the CPU see what the DMA controller wrote. Their role here is to order cache maintenance and peripheral register writes relative to the CPU’s other accesses.

Cache invalidation is another critical aspect. Invalidation ensures that the CPU’s cache lines are marked as invalid, forcing the CPU to fetch the latest data from memory the next time it accesses that memory location. Without proper cache invalidation, the CPU may continue to use stale data from its cache, leading to incorrect behavior.

On the Cortex-M7 the caches are disabled at reset, so the problem only appears once the application enables the D-cache, typically by calling SCB_EnableDCache() during startup. From then on the CPU caches data from normal memory such as the on-chip SRAM. When the DMA controller writes data to that memory, no hardware mechanism invalidates the corresponding lines in the CPU’s cache, so the CPU may continue to read outdated data even though the memory has been updated.

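The timing of cache invalidation is also crucial. If invalidation is performed before the DMA transfer completes, the CPU can refill the cache line from memory while the transfer is still in progress and end up with stale data again. If it is performed too late, the CPU may already have read stale data. Invalidation must therefore happen after the DMA transfer has completed but before the CPU accesses the data. In addition, if the CPU wrote to the buffer before the transfer started, those writes may still sit in the cache as dirty lines. They should be cleaned or invalidated before the DMA starts; otherwise a later write-back can overwrite the freshly received data in memory.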
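The ordering described above can be sketched as a host-side program that only records the sequence of steps. The stub_* functions are hypothetical stand-ins, not real HAL or CMSIS calls; on the target they would roughly correspond to SCB_CleanInvalidateDCache_by_Addr(), HAL_UART_Receive_DMA(), waiting on the completion flag, and SCB_InvalidateDCache_by_Addr().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Records the order in which the steps run. */
static char trace[96];

static void record(const char *step)
{
    assert(strlen(trace) + strlen(step) < sizeof trace);
    strcat(trace, step);
}

/* Hypothetical stand-ins: each one only logs its name. */
static void stub_clean_invalidate(const uint8_t *buf, size_t len)
{
    (void)buf; (void)len;
    record("clean_invalidate;");
}

static void stub_start_dma(uint8_t *buf, size_t len)
{
    (void)buf; (void)len;
    record("start_dma;");
}

static void stub_wait_done(void) { record("wait_done;"); }

static void stub_invalidate(const uint8_t *buf, size_t len)
{
    (void)buf; (void)len;
    record("invalidate;");
}

/* One blocking reception with cache maintenance on both sides of the DMA. */
static void receive_with_cache_maintenance(uint8_t *buf, size_t len)
{
    stub_clean_invalidate(buf, len); /* 1. no dirty lines left in the range */
    stub_start_dma(buf, len);        /* 2. the DMA now owns the buffer */
    stub_wait_done();                /* 3. the CPU leaves the buffer alone */
    stub_invalidate(buf, len);       /* 4. drop stale lines before reading */
}
```

The point of the sketch is the order of the four steps, not the stubs themselves: maintenance happens once before the DMA takes the buffer and once after it hands the buffer back.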

In the provided code, there is no explicit cache invalidation or memory barrier instructions. This omission leads to the observed behavior where the first DMA transfer works correctly, but subsequent transfers fail to update the receive buffer.

Implementing Data Synchronization Barriers and Cache Management

To resolve the issue, it is necessary to implement data synchronization barriers and proper cache management. The ARM Cortex-M7 provides several instructions and mechanisms for managing cache coherency, including Data Synchronization Barriers (DSB), Data Memory Barriers (DMB), and cache maintenance operations.

Data Synchronization Barriers (DSB)

A Data Synchronization Barrier (DSB) stalls the CPU until all explicit memory accesses issued before the barrier, including outstanding cache maintenance operations, have completed; no instruction after the DSB executes until then. A DSB concerns only the CPU’s own accesses. It does not wait for a DMA transfer to finish; completion has to be detected through the DMA or UART interrupt, the completion callback, or a status flag.

A DSB placed after the call that starts a transfer therefore guarantees only that the CPU’s earlier writes, such as buffer initialization and the register writes that configure the DMA, have completed:

HAL_UART_Receive_DMA(&UartHandle, pRxData, 2);
__DSB();

Data Memory Barriers (DMB)

A Data Memory Barrier (DMB) ensures that all explicit memory accesses before the barrier are observed before any explicit memory accesses after it. Unlike a DSB, it orders only memory accesses relative to each other and does not hold back instructions that do not access memory. Like a DSB, it does not wait for a DMA transfer to complete.

Placed after the call that starts a transfer, a DMB orders the CPU’s memory accesses before that point against those after it:

HAL_UART_Receive_DMA(&UartHandle, pRxData, 2);
__DMB();

Cache Maintenance Operations

Cache maintenance operations are used to invalidate, clean, or flush cache lines. In the context of DMA transfers, cache invalidation is typically used to ensure that the CPU fetches the latest data from memory.

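On the Cortex-M7, cache maintenance by address is performed by writing the address to registers in the System Control Block, and CMSIS wraps each operation in a function:

  • Invalidate Data Cache by address (DCIMVAC register, SCB_InvalidateDCache_by_Addr): Invalidates a specific cache line, forcing the CPU to fetch the latest data from memory the next time it accesses that memory location. Modified data in the line that has not been written back is discarded.
  • Clean Data Cache by address (DCCMVAC register, SCB_CleanDCache_by_Addr): Cleans a specific cache line, writing any modified data back to memory.
  • Clean and Invalidate Data Cache by address (DCCIMVAC register, SCB_CleanInvalidateDCache_by_Addr): Cleans and invalidates a specific cache line, writing any modified data back to memory and then invalidating the cache line.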
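The difference between the three operations is easiest to see on a line the CPU has modified. The following host-side sketch models a single write-back cache line; the type and function names (toy_line, cpu_write, clean, invalidate, clean_invalidate, ram_after) are hypothetical and only loosely mirror what the hardware does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy write-back cache line: one byte, with valid and dirty flags. */
typedef struct {
    uint8_t ram;
    uint8_t cached;
    bool    valid;
    bool    dirty;
} toy_line;

/* A CPU write lands in the cache only (write-back, write-allocate). */
static void cpu_write(toy_line *l, uint8_t value)
{
    l->cached = value;
    l->valid = true;
    l->dirty = true;
}

/* Clean: write a dirty line back to RAM and keep it valid. */
static void clean(toy_line *l)
{
    if (l->valid && l->dirty) {
        l->ram = l->cached;
        l->dirty = false;
    }
}

/* Invalidate: discard the line, including changes not yet written back. */
static void invalidate(toy_line *l)
{
    l->valid = false;
    l->dirty = false;
}

/* Clean and invalidate: write back first, then discard. */
static void clean_invalidate(toy_line *l)
{
    clean(l);
    invalidate(l);
}

/* RAM content after the CPU writes 0x55 and one operation is applied. */
static uint8_t ram_after(void (*op)(toy_line *))
{
    toy_line l = { 0x00, 0x00, false, false };

    cpu_write(&l, 0x55);
    assert(l.ram == 0x00);  /* RAM has not seen the write yet */
    op(&l);
    return l.ram;
}
```

In this model, invalidating a dirty line loses the CPU's write, which is why a buffer the CPU has filled itself should be cleaned rather than invalidated.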

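In the context of the STM32F746, cache invalidation should be performed after the DMA transfer has completed but before the CPU accesses the data. This can be done using the SCB_InvalidateDCache_by_Addr function, which is provided by CMSIS (core_cm7.h) rather than by the STM32 HAL itself. The function works on whole 32-byte cache lines, so the receive buffer should be aligned to 32 bytes and padded to a multiple of 32 bytes; otherwise invalidation can discard unrelated data that happens to share a line with the buffer: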

HAL_UART_Receive_DMA(&UartHandle, pRxData, 2);
while (UartReady != SET) {
    // Wait for DMA transfer to complete
}
UartReady = RESET;
SCB_InvalidateDCache_by_Addr((uint32_t *)pRxData, 32); // Whole cache line; assumes pRxData is 32-byte aligned and padded to 32 bytes
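Because maintenance operates on whole lines, it can help to compute which lines a buffer actually touches. The helper below is a minimal sketch, not part of CMSIS or the HAL: cache_span, cache_span_for, rx_buffer and rx_buffer_span are hypothetical names, and the aligned attribute is GCC/Clang syntax. It rounds a byte range outward to 32-byte boundaries and declares a buffer that owns exactly one line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-cache line size in bytes */

/* Hypothetical helper type: a byte range expanded to whole cache lines. */
typedef struct {
    uintptr_t start;  /* rounded down to a line boundary */
    size_t    size;   /* multiple of DCACHE_LINE_SIZE */
} cache_span;

static cache_span cache_span_for(uintptr_t addr, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    cache_span s;
    uintptr_t end = (addr + len + DCACHE_LINE_SIZE - 1u) & mask;

    s.start = addr & mask;
    s.size = (size_t)(end - s.start);
    assert(s.size % DCACHE_LINE_SIZE == 0u);
    return s;
}

/* A receive buffer that owns its cache line: aligned and padded to 32 bytes. */
static uint8_t rx_buffer[DCACHE_LINE_SIZE] __attribute__((aligned(32)));

/* Span covered when two bytes are received into rx_buffer. */
static cache_span rx_buffer_span(void)
{
    return cache_span_for((uintptr_t)rx_buffer, 2u);
}
```

A two-byte range inside one line expands to that single line, while a four-byte range that straddles a boundary expands to two lines. With a buffer declared like rx_buffer, the span starts at the buffer itself and no neighbouring variable shares the line.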

Disabling the Data Cache

As a temporary workaround, disabling the data cache (D-cache) can resolve the issue. However, this is not a recommended long-term solution, as it will significantly impact performance. Disabling the D-cache ensures that the CPU always reads data directly from memory, bypassing the cache. This eliminates the cache coherency issue but at the cost of increased memory access latency.

To disable the D-cache, the following code can be used:

SCB_DisableDCache();

However, this approach should only be used as a last resort or for debugging purposes. The preferred solution is to implement proper cache management using data synchronization barriers and cache maintenance operations.

Example Implementation

The following example demonstrates how to implement proper cache management in the provided code:

while (1) {
    HAL_GPIO_TogglePin(LED_PORT, LED_GREEN); // Led Example
    HAL_Delay(200);
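    pRxData[0] = 5;
    pRxData[1] = 0;
    // Write the bytes above back to RAM and drop the line before the DMA uses the buffer
    // (assumes pRxData is 32-byte aligned and padded to 32 bytes)
    SCB_CleanInvalidateDCache_by_Addr((uint32_t *)pRxData, 32);

    if (HAL_UART_Receive_DMA(&UartHandle, pRxData, 2) == HAL_OK) {
        __DSB(); // Orders the CPU's own accesses; it does not wait for the DMA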
        while (UartReady != SET) { // Wait for msg len
            HAL_GPIO_WritePin(LED_PORT, LED_GREEN, 1);
            HAL_Delay(20);
            HAL_GPIO_WritePin(LED_PORT, LED_GREEN, 0);
            HAL_Delay(400);
        }
        UartReady = RESET; // Reset transmission flag

        SCB_InvalidateDCache_by_Addr((uint32_t *)pRxData, 32); // Invalidate the whole line; assumes pRxData is 32-byte aligned and padded

        LEN = pRxData[0] + pRxData[1] * 256; // Calculate msg length
        if (LEN > 0) {
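            pTxData[0] = 25; // ACK
            // Write the ACK byte back to RAM so the DMA reads it (assumes pTxData is 32-byte aligned and padded)
            SCB_CleanDCache_by_Addr((uint32_t *)pTxData, 32);
            if (HAL_UART_Transmit_DMA(&UartHandle, pTxData, 1) == HAL_OK) {
                __DSB(); // Orders the CPU's own accesses; completion is signalled through UartReady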
                while (UartReady != SET) {
                    HAL_Delay(1);
                }
                UartReady = RESET;

                if (HAL_UART_Receive_IT(&UartHandle, pRxData, LEN) == HAL_OK) {
                    __DSB(); // Interrupt-driven reception: the CPU fills pRxData, no DMA is involved
                    while (UartReady != SET) { // Waiting loop for msg
                        HAL_GPIO_WritePin(LED_PORT, LED_GREEN, 1);
                        HAL_Delay(300);
                        HAL_GPIO_WritePin(LED_PORT, LED_GREEN, 0);
                        HAL_Delay(50);
                    }
                    UartReady = RESET;

                    // No invalidation here: in interrupt mode the CPU itself wrote pRxData through the cache,
                    // and invalidating would discard that data

                    if (MBU_StrCompareNC(pRxData, "LED_ON_BLUE", LEN) == 0) {
                        HAL_GPIO_WritePin(LED_PORT, LED_BLUE, 1);
                    }
                    if (MBU_StrCompareNC(pRxData, "LED_OFF_BLUE", LEN) == 0) {
                        HAL_GPIO_WritePin(LED_PORT, LED_BLUE, 0);
                    }
                    if (MBU_StrCompareNC(pRxData, "LED_ON_RED", LEN) == 0) {
                        HAL_GPIO_WritePin(LED_PORT, LED_RED, 1);
                    }
                    if (MBU_StrCompareNC(pRxData, "LED_OFF_RED", LEN) == 0) {
                        HAL_GPIO_WritePin(LED_PORT, LED_RED, 0);
                    }
                }
                HAL_UART_Abort(&UartHandle);
            }
        }
    }
}

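In this example, SCB_InvalidateDCache_by_Addr() is called once the DMA reception has completed and before the CPU reads the receive buffer, so the CPU fetches the data the DMA controller wrote rather than a stale cached copy. The __DSB() calls only order the CPU’s own memory accesses; waiting for the transfer to finish is done by polling the UartReady flag, which is expected to be set in the completion callback. For that polling to work reliably, UartReady must be declared volatile.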

Conclusion

The issue of UART DMA receiving data correctly only the first time is a classic example of cache coherency problems in systems with DMA and CPU caches. By understanding the underlying mechanisms of cache coherency and implementing proper data synchronization barriers and cache management, the issue can be resolved without disabling the cache. This approach not only fixes the problem but also maintains the performance benefits of using caches in embedded systems.
