MMU Translation Faults During DDR ECC Error Handling in Abort Mode

The core issue is a series of MMU translation faults that occur while handling DDR ECC uncorrectable errors on a dual-core ARM Cortex-A9 system. The system handles Data Abort and Prefetch Abort exceptions, specifically the asynchronous external aborts raised by DDR ECC uncorrectable errors. These errors appear as AXI slave errors on the AXI read bus and cause a Data Abort whose Data Fault Status Register (DFSR) reports fault status 0x16 (0b10110, asynchronous external abort) in the short-descriptor format. Because the five status bits are split across the register (FS[4] in bit 10, FS[3:0] in bits [3:0]), the raw DFSR value has bit 10 set and does not read as 0x16 directly. The goal is to diagnose the nature of the DDR ECC error (e.g., stuck bits or flipped bits) by performing a sequence of operations: disabling DDR ECC, writing specific patterns to the affected memory and reading them back, and re-enabling DDR ECC. During this sequence, however, MMU translation faults occur and make the system unstable.
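As a minimal sketch of the first check the Data Abort handler might perform, the helpers below reassemble the 5-bit fault status from a raw DFSR value and test for the asynchronous external abort encoding. The function names are illustrative and do not come from the original code; the CP15 read is compiled only for an ARM target, so the decode logic can be tried on a host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* DFSR field positions, short-descriptor format (ARMv7-A). */
#define DFSR_FS_LOW_MASK 0x0000000Fu /* FS[3:0] in bits [3:0]           */
#define DFSR_FS4_BIT     (1u << 10)  /* FS[4] in bit 10                 */
#define DFSR_WNR_BIT     (1u << 11)  /* 1 = abort caused by a write     */
#define DFSR_EXT_BIT     (1u << 12)  /* external abort type (IMP. DEF.) */

#define FS_ASYNC_EXTERNAL_ABORT 0x16u /* 0b10110 */

/* Reassemble the 5-bit fault status from its two separate fields. */
static inline uint32_t dfsr_fault_status(uint32_t dfsr)
{
    uint32_t fs4 = (dfsr & DFSR_FS4_BIT) ? 0x10u : 0u;
    return fs4 | (dfsr & DFSR_FS_LOW_MASK);
}

static inline bool dfsr_is_async_external_abort(uint32_t dfsr)
{
    return dfsr_fault_status(dfsr) == FS_ASYNC_EXTERNAL_ABORT;
}

#if defined(__arm__)
/* Read DFSR (CP15 c5, c0, 0). Only meaningful on the target. */
static inline uint32_t read_dfsr(void)
{
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c5, c0, 0" : "=r"(value));
    return value;
}
#endif

/* Host-side sanity check of the decode logic; not part of the handler. */
void dfsr_decode_selfcheck(void)
{
    /* Bit 10 plus 0b0110 is the asynchronous external abort. */
    assert(dfsr_is_async_external_abort(0x00000406u));
    /* FS = 0b00101 is a first-level (section) translation fault. */
    assert(!dfsr_is_async_external_abort(0x00000005u));
}
```

Checking the status this way also makes it easy to tell the original ECC abort apart from any translation or permission fault raised later inside the handler.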

The primary challenge lies in the interaction between the MMU, the DDR controller, and the CPU mode switch from System mode to Abort mode. The translation faults are triggered when the Data Abort handler accesses memory using the physical address reported by the DDR controller instead of the corresponding logical AXI address. Further exceptions occur when DDR ECC is re-enabled, particularly when non-inline functions are used to access the DDR controller registers. Three areas therefore need attention: MMU table management, cache behavior, and function calls made during exception handling.

Physical Address Translation and MMU Table Configuration Issues

The root cause of the MMU faults is the use of physical addresses instead of logical AXI addresses during the DDR ECC error handling sequence. The DDR controller reports the physical address of the error, but the MMU expects a logical AXI address for translation. When the Data Abort handler accesses memory through the physical address, it lands on a page with the NO_ACCESS attribute (AP[1:0] = 0x0) and the access faults. Strictly speaking, a valid descriptor with no-access permissions raises a permission fault, while a translation fault means the descriptor itself is invalid; the fault status in DFSR distinguishes the two, so it is worth confirming which one is actually reported. The problem is compounded because the MMU table is shared between System mode and Abort mode, and the table’s configuration does not account for the physical address space used by the DDR controller.
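One way to avoid touching an unmapped or no-access address is to inspect the first-level descriptor in software before the access. The sketch below assumes a short-descriptor first-level table made of 1 MB sections, SCTLR.AFE = 0, and a domain configured as client; it ignores second-level tables. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    L1_FAULT,      /* bits[1:0] == 0b00: access raises a translation fault */
    L1_PAGE_TABLE, /* bits[1:0] == 0b01: points to a second-level table    */
    L1_SECTION,    /* bits[1:0] == 0b10: 1 MB section (or supersection)    */
    L1_RESERVED    /* bits[1:0] == 0b11: reserved on Cortex-A9             */
} l1_type_t;

/* Index of the first-level entry covering addr (one entry per 1 MB). */
static inline uint32_t l1_index(uint32_t addr)
{
    return addr >> 20;
}

static inline l1_type_t l1_type(uint32_t desc)
{
    switch (desc & 0x3u) {
    case 0x0u: return L1_FAULT;
    case 0x1u: return L1_PAGE_TABLE;
    case 0x2u: return L1_SECTION;
    default:   return L1_RESERVED;
    }
}

/* AP[2:0] of a section: AP[2] is bit 15, AP[1:0] are bits [11:10]. */
static inline uint32_t section_ap(uint32_t desc)
{
    return (((desc >> 15) & 0x1u) << 2) | ((desc >> 10) & 0x3u);
}

/* Returns 1 only if addr lies in a section that privileged code may
 * read and write, i.e. AP[2] == 0 and AP[1:0] != 0b00. */
static inline int section_is_priv_rw(const uint32_t *l1_table, uint32_t addr)
{
    uint32_t desc = l1_table[l1_index(addr)];
    uint32_t ap;

    if (l1_type(desc) != L1_SECTION) {
        return 0;
    }
    ap = section_ap(desc);
    return (ap >= 0x1u && ap <= 0x3u) ? 1 : 0;
}

/* Host-side example with a table held in ordinary memory. */
void l1_lookup_example(void)
{
    static uint32_t table[4096];

    table[1] = 0x00100000u | (0x3u << 10) | 0x2u; /* full access */
    table[2] = 0x00200000u | 0x2u;                /* AP = 0b000  */
    assert(section_is_priv_rw(table, 0x00100004u) == 1);
    assert(section_is_priv_rw(table, 0x00200000u) == 0);
}
```

A handler could call such a check on the translated address and skip the pattern test when the region is not accessible, instead of taking a second abort.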

Another contributing factor is the configuration of the Translation Table Base Register (TTBR0). The system uses TTBR0 exclusively, with TTBR1 unused. While TTBR0 alone is typically sufficient for bare-metal applications, leaving TTBR1 unused means kernel and user space cannot be managed separately, which could otherwise simplify address translation during exception handling. The cache configuration also plays a role. Although the caches are disabled during application execution, the IRGN and RGN fields of TTBR0, which set the inner and outer cacheability of translation table walks, are programmed as write-back write-allocate. This mismatch could lead to unintended cache behavior, such as evictions or write-backs, even when caches are disabled.
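To check what the table-walk attributes actually are, the TTBR0 value can be decoded in software. The sketch below assumes the TTBR0 layout used with the ARMv7 Multiprocessing Extensions, where IRGN is split across bits 0 and 6 and RGN occupies bits [4:3]. The helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Cacheability encodings shared by the IRGN and RGN fields. */
typedef enum {
    WALK_NON_CACHEABLE = 0, /* 0b00                                 */
    WALK_WB_WA         = 1, /* 0b01: write-back, write-allocate     */
    WALK_WT            = 2, /* 0b10: write-through                  */
    WALK_WB_NO_WA      = 3  /* 0b11: write-back, no write-allocate  */
} walk_cache_t;

/* RGN (outer cacheability of table walks) is TTBR0 bits [4:3]. */
static inline walk_cache_t ttbr_rgn(uint32_t ttbr)
{
    return (walk_cache_t)((ttbr >> 3) & 0x3u);
}

/* IRGN (inner cacheability) is split: IRGN[1] is bit 0, IRGN[0] is bit 6. */
static inline walk_cache_t ttbr_irgn(uint32_t ttbr)
{
    uint32_t irgn1 = ttbr & 0x1u;
    uint32_t irgn0 = (ttbr >> 6) & 0x1u;
    return (walk_cache_t)((irgn1 << 1) | irgn0);
}

/* Clear IRGN and RGN so table walks become non-cacheable.
 * The table base address and the S and NOS bits are left unchanged. */
static inline uint32_t ttbr_make_walks_noncacheable(uint32_t ttbr)
{
    return ttbr & ~((1u << 0) | (1u << 6) | (0x3u << 3));
}

/* Host-side example: a table at 0x00104000 with WB/WA walks and S set. */
void ttbr_decode_example(void)
{
    uint32_t ttbr = 0x00104000u | (1u << 6) | (1u << 3) | (1u << 1);

    assert(ttbr_irgn(ttbr) == WALK_WB_WA);
    assert(ttbr_rgn(ttbr) == WALK_WB_WA);
    assert(ttbr_make_walks_noncacheable(ttbr) == 0x00104002u);
}
```

On the target, the value would be read from CP15 and, if changed, written back and followed by the appropriate barriers and TLB maintenance; those steps are omitted here.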

The use of non-inline functions for accessing DDR controller registers introduces additional complexity. When dramps_DisableECC and dramps_EnableECC are called as ordinary functions, execution jumps to code located in a different MMU section, so the MMU has to fetch a new Page Table Entry (PTE) through a table walk. This can trigger evictions or translation faults, especially if the MMU table is not properly synchronized or if the cache behavior is inconsistent. The issue is mitigated when the functions are inlined: there is no call, and the MMU does not need to load a new PTE.
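The following sketch shows what inline versions of the two accessors might look like. The real signatures and register layout of dramps_DisableECC and dramps_EnableECC are not given in the source, so the register-pointer parameter and the ECC enable bit position are assumptions made only so the example can run on a host. The always_inline attribute is GCC/Clang-specific.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ECC enable bit; take the real one from the DDR
 * controller documentation. */
#define DRAMPS_ECC_EN_BIT (1u << 0)

/* DSB on an ARMv7 target, compiler-only barrier on a host build. */
#if defined(__arm__)
#define DRAMPS_BARRIER() __asm__ volatile("dsb" ::: "memory")
#else
#define DRAMPS_BARRIER() __asm__ volatile("" ::: "memory")
#endif

/* Plain `inline` is only a hint and may be ignored at -O0;
 * always_inline asks the compiler to inline regardless. */
#define DRAMPS_INLINE static inline __attribute__((always_inline))

/* Clear the ECC enable bit (assumed signature). */
DRAMPS_INLINE void dramps_DisableECC(volatile uint32_t *ecc_ctrl)
{
    *ecc_ctrl &= ~DRAMPS_ECC_EN_BIT;
    DRAMPS_BARRIER(); /* complete the register write before continuing */
}

/* Set the ECC enable bit (assumed signature). */
DRAMPS_INLINE void dramps_EnableECC(volatile uint32_t *ecc_ctrl)
{
    *ecc_ctrl |= DRAMPS_ECC_EN_BIT;
    DRAMPS_BARRIER();
}

/* Host-side usage: an ordinary variable stands in for the register. */
void dramps_example(void)
{
    volatile uint32_t fake_reg = 0x5u;

    dramps_DisableECC(&fake_reg);
    assert(fake_reg == 0x4u);
    dramps_EnableECC(&fake_reg);
    assert(fake_reg == 0x5u);
}
```

Inspecting the disassembly of the abort handler is the reliable way to confirm that no branch to a separate function remains.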

Implementing Logical Address Translation and Inline Function Optimization

To resolve the MMU translation faults and ensure reliable DDR ECC error handling, the following steps are recommended:

  1. Logical Address Translation Function: Develop a translation function to convert the physical address reported by the DDR controller into a logical AXI address. This function should be integrated into the Data Abort handler to ensure that all memory accesses use valid logical addresses. The translation function must account for the memory attributes and access permissions defined in the MMU table to avoid NO_ACCESS faults.

  2. MMU Table Configuration Review: Review and update the MMU table configuration to ensure that the physical address space used by the DDR controller is properly mapped with the correct access permissions. Consider using TTBR1 for kernel space mappings to separate user and kernel address spaces, which can simplify address translation during exception handling.

  3. Cache Configuration and Synchronization: Ensure that the cache configuration is consistent with the system’s requirements. If caches are disabled, verify that the MMU table’s memory attributes do not inadvertently enable cache behavior. Implement data synchronization barriers (DSB) and instruction synchronization barriers (ISB) to ensure proper cache and MMU synchronization during mode switches and exception handling.

  4. Inline Function Optimization: Define dramps_DisableECC and dramps_EnableECC as static inline functions in the header file (dramps.h). This approach eliminates the function call and prevents unnecessary MMU table lookups or evictions. Because the inline keyword is only a hint that the compiler may ignore, for example in unoptimized builds, confirm in the generated code that the functions were actually inlined. Apply the same treatment to the other low-level driver functions that access hardware registers from the handler, to minimize runtime overhead and improve predictability.

  5. Error Injection and Testing: Implement error injection mechanisms to test the DDR ECC error handling sequence under controlled conditions. Verify that no extra read or write operations to DDR memory occur between disabling and re-enabling ECC, except for the intended error injection writes. This step ensures that the system behaves as expected during error recovery.

  6. Exception Handler Optimization: Optimize the Data Abort and Prefetch Abort handlers to minimize the time spent in Abort mode. Ensure that all necessary registers are saved and restored correctly, and that the handlers do not perform unnecessary operations that could trigger additional exceptions.
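Steps 1 and 5 above can be sketched as two small helpers: one that converts the address reported by the DDR controller into a logical AXI address, and one that runs the pattern test on a single word. The linear offset mapping, the ddr_map_t structure, and all function names are assumptions for illustration; the real mapping depends on the DDR controller and the system memory map. The pattern test assumes ECC is already disabled and caches are off, and it does not preserve the previous contents of the word.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__arm__)
#define MEM_BARRIER() __asm__ volatile("dsb" ::: "memory")
#else
#define MEM_BARRIER() __asm__ volatile("" ::: "memory")
#endif

/* Assumed description of where DDR sits on the AXI bus. */
typedef struct {
    uint32_t axi_base; /* logical AXI address where DDR begins */
    uint32_t size;     /* DDR size in bytes                    */
} ddr_map_t;

/* Convert a controller-reported offset into a logical AXI address,
 * assuming a linear mapping. Returns 0 on success, -1 if out of range. */
static inline int ddr_to_logical(const ddr_map_t *map, uint32_t ctrl_addr,
                                 uint32_t *logical)
{
    if (ctrl_addr >= map->size) {
        return -1;
    }
    *logical = map->axi_base + ctrl_addr;
    return 0;
}

typedef struct {
    uint32_t stuck_high; /* bits that read 1 after writing all zeros */
    uint32_t stuck_low;  /* bits that read 0 after writing all ones  */
} cell_report_t;

/* Pure classification step, separated so it can be tested on a host. */
static inline cell_report_t classify_readback(uint32_t after_zeros,
                                              uint32_t after_ones)
{
    cell_report_t report = { after_zeros, ~after_ones };
    return report;
}

/* Write all zeros and all ones to one word and read each back. */
static inline cell_report_t probe_cell(volatile uint32_t *cell)
{
    uint32_t after_zeros;
    uint32_t after_ones;

    *cell = 0x00000000u;
    MEM_BARRIER();
    after_zeros = *cell;

    *cell = 0xFFFFFFFFu;
    MEM_BARRIER();
    after_ones = *cell;

    return classify_readback(after_zeros, after_ones);
}

/* Host-side example: healthy memory reports no stuck bits. */
void probe_example(void)
{
    volatile uint32_t word = 0x12345678u;
    cell_report_t report = probe_cell(&word);

    assert(report.stuck_high == 0u);
    assert(report.stuck_low == 0u);
}
```

If both masks come back empty, the cell holds both patterns correctly, which points toward a transient flipped bit; a non-zero mask identifies the stuck bit positions.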

By addressing these issues systematically, the system can achieve reliable DDR ECC error handling without MMU translation faults or unexpected exceptions. The use of logical address translation, optimized MMU table configuration, and inline function definitions ensures that the hardware-software interaction is predictable and efficient, even during critical exception handling scenarios.
