ARM Cortex-A73 and Cortex-A53 Performance Discrepancy in ResNet50 Inference

When running neural network inference on a HiKey970 board equipped with ARM Cortex-A73 (big) and Cortex-A53 (LITTLE) cores, an unexpected performance discrepancy was observed: ResNet50 inference took nearly twice as long on the big cores as on the LITTLE cores. This contradicts the expected performance hierarchy, in which big cores outperform LITTLE cores thanks to their higher clock speeds and more advanced microarchitecture. The issue was isolated to ResNet50; other networks such as VGG16 and VGG19 showed the expected speedup on the big cores.

The HiKey970 board ran Linux kernel 4.9.78-147538-g244928755bbe, and the inference process was pinned to specific cores using taskset. The LITTLE cores (0-3) and big cores (4-7) were tested independently, and the LITTLE cores consistently outperformed the big cores in ResNet50 inference. Because the anomaly did not appear with the other networks, it points to a specific interaction between ResNet50’s computational characteristics and the ARM big.LITTLE architecture.
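The pinning above was done with taskset; a comparable measurement can be sketched from inside a Python process with os.sched_setaffinity. This is a minimal sketch and not the original benchmark harness: time_on_cores and the workload callable are hypothetical names, and the core numbering (0-3 LITTLE, 4-7 big) is taken from the text.

```python
import os
import time
from typing import Callable, Iterable

LITTLE_CORES = {0, 1, 2, 3}  # Cortex-A53 cluster, numbering as stated in the text
BIG_CORES = {4, 5, 6, 7}     # Cortex-A73 cluster, numbering as stated in the text


def time_on_cores(workload: Callable[[], object],
                  cores: Iterable[int],
                  repeats: int = 5) -> float:
    """Pin the caller to `cores`, run `workload` `repeats` times,
    and return the best wall-clock time in seconds."""
    previous = os.sched_getaffinity(0)   # remember the original mask
    os.sched_setaffinity(0, set(cores))  # Linux-only call
    try:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            workload()
            best = min(best, time.perf_counter() - start)
        return best
    finally:
        os.sched_setaffinity(0, previous)  # always restore the mask
```

A call such as time_on_cores(run_resnet50, BIG_CORES), followed by the same call with LITTLE_CORES, would give the two numbers to compare, where run_resnet50 is a placeholder for one inference pass. On Linux the affinity applies to the calling thread and is inherited by threads it creates afterwards, so it should be set before the inference framework starts its worker threads.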

Thermal Throttling, Cache Coherency, and Frequency Scaling Misconfigurations

Several potential causes could explain the unexpected performance discrepancy between the LITTLE and big cores during ResNet50 inference. The first possibility is thermal throttling. The Cortex-A73 cores, being more power-hungry, generate more heat under load. If the thermal management system on the HiKey970 board is not adequately dissipating this heat, the big cores may throttle their frequency to prevent overheating, leading to reduced performance. However, monitoring the CPU frequency during inference showed no transitions, ruling out thermal throttling as the primary cause.
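The frequency check described above can be approximated by polling the standard Linux cpufreq sysfs files while inference runs. The sketch below assumes the board exposes scaling_cur_freq under /sys/devices/system/cpu; the function names are illustrative.

```python
import time
from pathlib import Path
from typing import Dict, Iterable, List

CPU_SYSFS = Path("/sys/devices/system/cpu")  # standard cpufreq location on Linux


def read_cur_freq_khz(cpu: int, root: Path = CPU_SYSFS) -> int:
    """Return the frequency (kHz) the cpufreq driver reports for one core."""
    path = root / f"cpu{cpu}" / "cpufreq" / "scaling_cur_freq"
    return int(path.read_text().strip())


def sample_frequencies(cpus: Iterable[int], samples: int, interval_s: float,
                       root: Path = CPU_SYSFS) -> Dict[int, List[int]]:
    """Poll each core `samples` times, sleeping `interval_s` between polls."""
    cpus = list(cpus)
    history: Dict[int, List[int]] = {cpu: [] for cpu in cpus}
    for i in range(samples):
        for cpu in cpus:
            history[cpu].append(read_cur_freq_khz(cpu, root))
        if i < samples - 1:
            time.sleep(interval_s)
    return history


def cores_with_transitions(history: Dict[int, List[int]]) -> List[int]:
    """Cores whose reported frequency took more than one value."""
    return sorted(cpu for cpu, freqs in history.items() if len(set(freqs)) > 1)
```

Polling can miss short dips between samples. If the kernel was built with cpufreq statistics, the cumulative counters under cpufreq/stats are a useful cross-check.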

Another potential cause is cache coherency overhead. The ARM big.LITTLE architecture relies on a cache coherency mechanism to keep data shared between the LITTLE and big clusters consistent. If that mechanism adds latency on the big cluster, performance there could suffer. This is particularly relevant for ResNet50, which involves extensive matrix multiplications and data transfers that are sensitive to cache performance. The Cortex-A53 cores, with their simpler and more predictable cache behavior, might be less affected. With the process pinned to a single cluster, cross-cluster traffic should be limited, so this remains a hypothesis to confirm with measurements, not an established cause.

Frequency scaling misconfigurations could also contribute to the performance discrepancy. If the big cores are not operating at their maximum frequency due to incorrect governor settings or power management policies, their performance would be suboptimal. The Cortex-A73 cores have a higher maximum frequency (2.36 GHz) compared to the Cortex-A53 cores (1.86 GHz), but if the big cores are not consistently reaching this frequency, their performance advantage would be negated. However, setting the CPU frequency to a fixed maximum value did not resolve the issue, indicating that frequency scaling is not the root cause.
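Fixing a cluster at its maximum frequency, as described above, can be sketched by writing the standard cpufreq limits. This assumes root privileges and the usual cpuinfo_max_freq, scaling_min_freq, scaling_max_freq and scaling_cur_freq files; the helper names are hypothetical.

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")  # standard cpufreq location on Linux


def _read_khz(cpufreq: Path, name: str) -> int:
    """Read one cpufreq attribute as an integer number of kHz."""
    return int((cpufreq / name).read_text().strip())


def lock_to_max_frequency(cpu: int, root: Path = CPU_SYSFS) -> int:
    """Set the floor and ceiling of `cpu`'s policy to the hardware maximum.
    Returns the frequency (kHz) that was written."""
    cpufreq = root / f"cpu{cpu}" / "cpufreq"
    max_khz = _read_khz(cpufreq, "cpuinfo_max_freq")
    # Raise the ceiling first so the floor never exceeds it.
    (cpufreq / "scaling_max_freq").write_text(f"{max_khz}\n")
    (cpufreq / "scaling_min_freq").write_text(f"{max_khz}\n")
    return max_khz


def is_locked(cpu: int, root: Path = CPU_SYSFS) -> bool:
    """True when the floor, ceiling and reported frequency all agree."""
    cpufreq = root / f"cpu{cpu}" / "cpufreq"
    names = ("scaling_min_freq", "scaling_max_freq", "scaling_cur_freq")
    return len({_read_khz(cpufreq, n) for n in names}) == 1
```

Cores in one cluster normally share a frequency policy, so writing through one core of each cluster (for example 0 and 4) is usually enough. Checking is_locked again during the run is worthwhile, because other kernel mechanisms such as thermal limits may still lower the effective frequency.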

Optimizing Cache Utilization and Investigating Hardware Counters

To address the performance discrepancy, several troubleshooting steps can be tried. The first is to optimize cache utilization. ResNet50’s workload involves large matrix multiplications, which benefit significantly from efficient cache usage. Ensuring that the data accessed by the big cores is properly aligned and prefetched can reduce cache misses and improve performance. The cache line size itself is fixed by the hardware and cannot be adjusted, but aligning buffers to cache line boundaries, choosing block sizes that fit the cache, and using software prefetch instructions can all help reduce cache-related stalls.

Investigating hardware counters is another critical step. ARM processors provide a wealth of performance counters that can offer insights into the underlying causes of performance bottlenecks. Tools like ARM Streamline can be used to monitor these counters and identify specific areas where the big cores are underperforming. Key counters to monitor include cache miss rates, branch mispredictions, and instruction pipeline stalls. By analyzing these metrics, it is possible to pinpoint the exact cause of the performance discrepancy and implement targeted optimizations.
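Once raw counts have been collected for each cluster, the metrics named above reduce to a few ratios. The sketch below assumes the counts are keyed by the generic Linux perf event names (cycles, instructions, cache-references, cache-misses, branch-misses); that naming is an assumption, since the text only mentions ARM Streamline, and the function names are illustrative.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning NaN when the denominator is zero or missing."""
    return numerator / denominator if denominator else float("nan")


def derived_metrics(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw event counts into comparable per-run ratios."""
    instructions = counts.get("instructions", 0)
    return {
        # Instructions retired per cycle: low values suggest pipeline stalls.
        "ipc": _ratio(instructions, counts.get("cycles", 0)),
        # Fraction of counted cache references that missed.
        "cache_miss_ratio": _ratio(counts.get("cache-misses", 0),
                                   counts.get("cache-references", 0)),
        # Branch mispredictions per thousand instructions.
        "branch_mpki": _ratio(1000 * counts.get("branch-misses", 0),
                              instructions),
    }


def compare_clusters(big: Dict[str, int],
                     little: Dict[str, int]) -> Dict[str, float]:
    """Return big/little for each derived metric (1.0 means no difference)."""
    big_m, little_m = derived_metrics(big), derived_metrics(little)
    return {name: _ratio(big_m[name], little_m[name]) for name in big_m}
```

A cache_miss_ratio or branch_mpki that is much higher on the big cluster for ResNet50 but not for VGG16 would point toward the network-specific interaction described earlier. Which cache level the generic events map to depends on the kernel's PMU driver, so these ratios are best compared between runs, not read as absolute values.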

Another approach is to experiment with different memory frequencies. The DDR memory frequency can significantly impact the performance of memory-bound workloads like ResNet50. Setting the DDR frequency to its maximum value ensures that memory bandwidth is not a limiting factor. On the HiKey970 board, the DDR frequency can be controlled by writing to the appropriate sysfs files, such as /sys/class/devfreq/ddr_devfreq/min_freq and /sys/class/devfreq/ddr_devfreq/max_freq. Ensuring that the DDR frequency is consistent across all tests eliminates memory bandwidth as a variable.
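Holding the DDR frequency constant amounts to writing the same value to both sysfs files named above. The following is a minimal sketch assuming root privileges; pin_ddr_frequency is a hypothetical helper, and the value must be one the devfreq driver accepts, in the unit the driver uses.

```python
from pathlib import Path

# Directory taken from the sysfs paths quoted in the text.
DDR_DEVFREQ = Path("/sys/class/devfreq/ddr_devfreq")


def pin_ddr_frequency(freq: int, devfreq_dir: Path = DDR_DEVFREQ) -> None:
    """Write `freq` to both min_freq and max_freq so the DDR clock cannot move."""
    min_path = devfreq_dir / "min_freq"
    max_path = devfreq_dir / "max_freq"
    current_min = int(min_path.read_text().strip())
    # Order the writes so that min_freq never exceeds max_freq in between.
    if freq >= current_min:
        order = (max_path, min_path)   # raising: lift the ceiling first
    else:
        order = (min_path, max_path)   # lowering: drop the floor first
    for path in order:
        path.write_text(f"{freq}\n")
```

Reading both files back after the write, and again after the benchmark, confirms that the setting was accepted and stayed in place for the whole run.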

Finally, running both the LITTLE and big cores at the same frequency can help isolate architectural differences. By setting both clusters to operate at 1.86 GHz, the performance impact of the Cortex-A73’s more advanced microarchitecture can be evaluated independently of frequency scaling. This approach can reveal whether the performance discrepancy is due to architectural differences or other factors like cache coherency or memory bandwidth.
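Running both clusters at the same clock requires an operating point that both support. The sketch below picks one from the space-separated kHz lists that cpufreq exposes in scaling_available_frequencies, where the driver provides that file; the function names are illustrative and no real board values are assumed.

```python
from typing import List, Optional, Tuple


def parse_available(text: str) -> List[int]:
    """Parse a scaling_available_frequencies string into sorted kHz values."""
    return sorted(int(token) for token in text.split())


def highest_common_frequency(little: List[int],
                             big: List[int]) -> Optional[int]:
    """Highest frequency offered by both clusters, or None if none match."""
    common = set(little) & set(big)
    return max(common) if common else None


def closest_pair(little: List[int], big: List[int]) -> Tuple[int, int]:
    """Fallback: the (little, big) pair with the smallest gap,
    preferring the higher LITTLE frequency on ties."""
    return min(((l, b) for l in little for b in big),
               key=lambda pair: (abs(pair[0] - pair[1]), -pair[0]))
```

If no exact match exists, the closest pair still allows a rough comparison: multiplying each measured inference time by its cluster's frequency gives an approximate cycle count that can be compared across the two clusters.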

In conclusion, the unexpectedly higher performance of the LITTLE cores over the big cores during ResNet50 inference on the HiKey970 board has not yet been traced to a single cause. Thermal throttling and frequency scaling were ruled out by the tests above, which leaves cache behavior, including possible coherency overhead, and architectural differences as the most likely candidates. Optimizing cache utilization, investigating hardware counters, and experimenting with memory and CPU frequencies should narrow down the root cause. Further analysis using tools like ARM Streamline and performance counters will provide deeper insight into how the ARM big.LITTLE architecture behaves under different workloads.
