ARM Cortex-A78 Atomic Instruction Execution Failure During Kernel Boot
The issue at hand involves the ARM Cortex-A78 CPU failing to execute atomic instructions, specifically Load-Exclusive (LDXR) and Store-Exclusive (STXR), during the Linux kernel boot process. The kernel version in use is 5.10.39, and the bootloader is U-Boot 2021.10-rc2. The problem manifests when the kernel attempts to execute the _percpu_or_case_32 function, which is part of the percpu.h file in the kernel source. The kernel execution halts at the pr_info function, and further debugging reveals that the CPU is unable to read from or write to memory using the LDXR and STXR instructions.
The Cortex-A78 is a high-performance CPU core that implements the Armv8.2-A architecture, which includes the exclusive-access instructions used for atomic operations. The failure of these instructions therefore suggests a deeper issue related to memory configuration, privilege levels, or system register settings. The problem is particularly critical because atomic operations are fundamental to data consistency in concurrent code such as the Linux kernel.
The issue is further complicated by the fact that the memory attributes are set to "normal" memory using the MAIR_EL1 system register, but the memory is still being treated as "device" memory. This discrepancy indicates that the memory type configuration is not being applied correctly, or there is an underlying issue with the memory management unit (MMU) or the translation tables.
Memory Attribute Configuration and Privilege Level Mismatch
The root cause of the issue appears to be related to the memory attribute configuration and the privilege level at which the kernel is executing. The Cortex-A78 CPU uses MAIR_EL1 (Memory Attribute Indirection Register) to define up to eight memory attribute encodings, each describing either Normal memory (with its cacheability) or one of the Device memory types (Device-nGnRnE, Device-nGnRE, Device-nGRE, Device-GRE). Device-nGnRnE is the Armv8 equivalent of what Armv7 called strongly-ordered memory. The MAIR_EL1 register is typically configured during the early stages of kernel boot, before the MMU is enabled, so that the memory attributes are in place before any translated memory access occurs.
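As a rough aid for inspecting a MAIR_EL1 value captured from the target (for example with a debugger), the following host-side sketch splits the register into its eight attribute fields and classifies each one coarsely. The function names and the example value are illustrative assumptions and are not taken from the failing system; only the most common Normal encodings are named explicitly.

```python
# Host-side helper: coarse decode of a 64-bit MAIR_EL1 value.
# All names and the example value below are illustrative assumptions.

DEVICE_TYPES = {
    0x0: "Device-nGnRnE",
    0x4: "Device-nGnRE",
    0x8: "Device-nGRE",
    0xC: "Device-GRE",
}


def classify_attr(attr):
    """Classify one 8-bit MAIR attribute field."""
    hi, lo = attr >> 4, attr & 0xF
    if hi == 0:
        # 0b0000dd00 encodes Device memory; other low-nibble values are not
        # valid Device encodings.
        return DEVICE_TYPES.get(lo, "Reserved/unpredictable")
    if attr == 0x44:
        return "Normal, Non-cacheable"
    if attr == 0xFF:
        return "Normal, Write-Back cacheable"
    # Coarse bucket: other Normal encodings are not broken down further here.
    return "Normal (other encoding)"


def decode_mair(mair):
    """Return the classification of Attr0..Attr7."""
    return [classify_attr((mair >> (8 * i)) & 0xFF) for i in range(8)]


# Illustrative value: Attr0=0x00, Attr1=0x04, Attr2=0x44, Attr3=0xFF.
example_mair = 0x00000000FF440400
for index, kind in enumerate(decode_mair(example_mair)):
    print(f"Attr{index}: {kind}")
```

If every field decodes as a Device type, or the Normal entry is not at the index the page tables use, that would be consistent with the behavior described above.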
In this case, the kernel is booting in EL1 (Exception Level 1), which is the typical privilege level for the Linux kernel. However, the memory attributes set in MAIR_EL1 are not being respected, and the memory is still being treated as device memory. This could be due to several reasons:
- Incorrect MAIR_EL1 Configuration: The MAIR_EL1 register might not be configured correctly during the early boot stages. The kernel might be setting the wrong attributes or not applying the changes correctly.
- Translation Table Configuration: The memory attributes defined in MAIR_EL1 are only effective if the translation tables (page tables) are correctly configured to use them. Each page or block descriptor selects one of the eight MAIR_EL1 attribute fields through its AttrIndx bits. If a descriptor selects a Device entry, or if the MMU is not yet enabled (SCTLR_EL1.M clear), data accesses are treated as Device memory regardless of the Normal encodings present in MAIR_EL1.
- Privilege Level Mismatch: Although the kernel is expected to boot in EL1, there might be a mismatch between the exception level actually in use and the registers being programmed. MAIR_EL1 governs only the EL1&0 translation regime, so if the code is in fact running at EL2, the EL2 registers determine its memory attributes.
- Cache Coherency Issues: Exclusive accesses to cacheable Normal memory are typically handled through the cache coherency mechanism, whereas exclusive accesses to Device or Non-cacheable memory depend on a global exclusive monitor in the memory system, which a given SoC may not provide. If the caches are disabled, or hold stale lines left behind by the bootloader, the LDXR and STXR instructions might fail.
- Hardware Bug or Errata: Although less likely, there could be a hardware bug or erratum in the Cortex-A78 CPU that affects the execution of atomic instructions under certain conditions. This would require a firmware update or workaround from the CPU vendor.
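To make the interaction between the translation tables and MAIR_EL1 concrete, the sketch below takes a stage 1 page or block descriptor and a MAIR_EL1 value, extracts AttrIndx (descriptor bits [4:2]) and SH (bits [9:8]), and reports which attribute byte the mapping actually selects. The function name, the dictionary layout, and the example values are assumptions made for illustration, and the MMU-off case ignores any EL2 overrides.

```python
# Host-side helper: which MAIR_EL1 attribute does a descriptor select?
# Names and example values are illustrative assumptions.

SHAREABILITY = {0: "Non-shareable", 1: "Reserved", 2: "Outer Shareable", 3: "Inner Shareable"}


def effective_memory_type(descriptor, mair, mmu_enabled=True):
    """Resolve the attribute byte used for a stage 1 page/block descriptor."""
    if not mmu_enabled:
        # With SCTLR_EL1.M clear, data accesses are treated as Device-nGnRnE
        # (ignoring EL2 controls such as HCR_EL2.DC).
        return {"attr_index": None, "attr": 0x00, "is_device": True, "shareability": None}
    attr_index = (descriptor >> 2) & 0x7        # AttrIndx[2:0]
    shareability = (descriptor >> 8) & 0x3      # SH[1:0]
    attr = (mair >> (8 * attr_index)) & 0xFF    # selected MAIR field
    return {
        "attr_index": attr_index,
        "attr": attr,
        # An upper nibble of zero means a Device (or reserved) encoding.
        "is_device": (attr >> 4) == 0,
        "shareability": SHAREABILITY[shareability],
    }


# Illustrative MAIR_EL1: Attr0 = 0x00 (Device-nGnRnE), Attr4 = 0xFF (Normal WB).
mair = 0x000000FF00000000
normal_block = 0x80000711   # AttrIndx=4, SH=0b11, AF=1, block descriptor
device_block = 0x80000701   # AttrIndx=0, same other bits

print(effective_memory_type(normal_block, mair))
print(effective_memory_type(device_block, mair))
print(effective_memory_type(normal_block, mair, mmu_enabled=False))
```

The third call shows why a correct MAIR_EL1 value alone is not sufficient: with the MMU off, the descriptor and MAIR_EL1 are not consulted at all.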
Implementing Correct Memory Attribute Configuration and Debugging Steps
To resolve the issue, the following steps should be taken to ensure that the memory attributes are correctly configured and applied:
- Verify MAIR_EL1 Configuration: The first step is to verify that the MAIR_EL1 register is correctly configured during the early boot stages. This can be done by adding debug prints or using a debugger to inspect the value of MAIR_EL1 before and after the kernel sets it. The register should contain an attribute encoding for Normal memory with the intended cacheability, such as Write-Back cacheable. Shareability is not part of MAIR_EL1; it is set separately in the translation table descriptors.
- Check Translation Table Configuration: The next step is to ensure that the translation tables are correctly configured to use the memory attributes defined in MAIR_EL1. This involves checking that the AttrIndx field of each relevant page table entry (PTE) references the MAIR_EL1 index that holds the Normal memory encoding, and that the SH field gives the expected shareability. The translation tables should map the memory as normal memory, not device memory.
- Ensure Correct Privilege Level: The kernel is expected to be running in EL1, and the memory should be mapped with the appropriate attributes for normal memory. If the kernel is running at a different exception level, such as EL2, the EL1 registers might not be the ones in effect. This can be verified by reading the CurrentEL register, whose bits [3:2] hold the current exception level. Note that on cores with the Virtualization Host Extensions, which include the Cortex-A78, a kernel entered at EL2 may legitimately stay there, in which case the EL2 register state is what needs to be inspected.
- Invalidate and Flush Caches: At the handoff from the bootloader to the kernel, the relevant cache lines should be cleaned and invalidated so that the CPU has a consistent view of memory. This can be done using the Data Cache Invalidate by VA to PoC (DC IVAC) and Data Cache Clean and Invalidate by VA to PoC (DC CIVAC) instructions. Stale or inconsistent cache contents at this point can prevent the LDXR and STXR instructions from operating correctly.
- Check for Hardware Errata: If the issue persists after verifying the memory configuration and cache coherency, it is worth checking for any known hardware errata related to the Cortex-A78 CPU. The ARM Cortex-A78 Technical Reference Manual (TRM) and any errata documents should be consulted to identify any potential issues and recommended workarounds.
- Debugging with a JTAG Probe: If the issue is still unresolved, using a JTAG probe to inspect the CPU state and memory contents can provide additional insights. The JTAG probe can be used to inspect the values of system registers, such as MAIR_EL1, TCR_EL1 (Translation Control Register), and SCTLR_EL1, as well as the translation tables. It can also be used to single-step through the kernel boot process to identify the exact point where the atomic instructions fail.
- Kernel Configuration and U-Boot Settings: Finally, the kernel configuration and U-Boot settings should be reviewed to ensure that they are compatible with the Cortex-A78 CPU. This includes checking the kernel configuration options related to memory management, cache coherency, and atomic operations. The U-Boot settings should also be verified to ensure that the memory is correctly initialized and mapped before the kernel starts execution.
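Several of the checks above reduce to reading a system register and interpreting a few bits. As a minimal sketch, assuming the CurrentEL and SCTLR_EL1 values have already been captured from the target (for example through a JTAG probe), the helper below reports the exception level and whether the MMU and caches are enabled. The function names, the wording of the findings, and the sample values are hypothetical.

```python
# Host-side helper: interpret CurrentEL and SCTLR_EL1 values captured from
# the target. Function names and sample values are illustrative assumptions.

def decode_current_el(current_el):
    """CurrentEL holds the exception level in bits [3:2]."""
    return (current_el >> 2) & 0x3


def decode_sctlr_el1(sctlr_el1):
    """Extract the MMU and cache enable bits from SCTLR_EL1."""
    return {
        "mmu_enabled": bool(sctlr_el1 & (1 << 0)),      # M bit
        "dcache_enabled": bool(sctlr_el1 & (1 << 2)),   # C bit
        "icache_enabled": bool(sctlr_el1 & (1 << 12)),  # I bit
    }


def boot_state_findings(current_el, sctlr_el1):
    """Return human-readable observations worth following up on."""
    findings = []
    el = decode_current_el(current_el)
    state = decode_sctlr_el1(sctlr_el1)
    if el != 1:
        findings.append(f"Running at EL{el}, so the EL1 registers may not be the ones in effect")
    if not state["mmu_enabled"]:
        findings.append("SCTLR_EL1.M is clear: data accesses are treated as Device-nGnRnE")
    if not state["dcache_enabled"]:
        findings.append("SCTLR_EL1.C is clear: data accesses to Normal memory are Non-cacheable")
    return findings


# Hypothetical captures: EL2 with MMU and caches off, then EL1 with them on.
for current_el, sctlr in [(0x8, 0x0), (0x4, 0x1005)]:
    print(f"CurrentEL={current_el:#x} SCTLR_EL1={sctlr:#x}")
    for finding in boot_state_findings(current_el, sctlr) or ["no findings"]:
        print("  -", finding)
```

An empty findings list does not prove the configuration is correct; it only rules out the most basic mismatches before moving on to the translation tables themselves.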
By following these steps, the issue with the ARM Cortex-A78 CPU failing to execute atomic instructions during kernel boot can be systematically diagnosed and resolved. The key is to ensure that the memory attributes are correctly configured and applied, and that the CPU is operating in the correct privilege level with proper cache coherency. If the issue is related to a hardware bug or errata, consulting the ARM documentation and applying the recommended workarounds will be necessary.
In conclusion, the failure of atomic instructions on the ARM Cortex-A78 CPU during kernel boot is a complex issue that requires a thorough understanding of the ARM architecture, memory management, and system configuration. Once the memory is confirmed to be mapped as normal memory at the exception level the kernel actually runs in, the kernel should be able to boot successfully and operate correctly on the Cortex-A78 CPU.