Inaccurate Timing Measurements with ARM Cortex-M4 DWT_CYCCNT on FVP
When benchmarking code on the FVP_MPS2_Cortex-M4 simulator, one of the primary challenges is obtaining accurate timing measurements. The ARM Cortex-M4 processor provides a Data Watchpoint and Trace (DWT) unit, which includes a cycle counter (DWT_CYCCNT) that can be used to measure the number of clock cycles elapsed during code execution. However, when using the FVP_MPS2_Cortex-M4 simulator, the DWT_CYCCNT register does not always provide cycle-accurate results. This is particularly evident when attempting to measure short-duration events or when comparing the cycle count against wall-clock time.
The DWT_CYCCNT register is a 32-bit counter that increments with each clock cycle of the processor. It is typically used for performance profiling and benchmarking. However, the FVP_MPS2_Cortex-M4 simulator, which is part of ARM’s Fast Models, is designed for functional simulation rather than cycle-accurate simulation. As a result, the DWT_CYCCNT register may not reflect the actual number of clock cycles elapsed, especially for short loops or when the simulator is running in a high-speed mode.
For example, when measuring a simple loop with a low iteration count, the DWT_CYCCNT register may report a cycle count that differs significantly from the value expected on real hardware. This discrepancy arises because the Fast Models advance time in a "quantum" of instructions to accelerate simulation, so a short measured region can be smaller than the granularity at which time is kept. Additionally, the simulator’s internal timing may not align with wall-clock time on the host, so the two should not be compared without first checking how fast the simulation is actually running.
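For reference, enabling and reading the counter usually follows the pattern sketched below. The register addresses and bit positions are the architectural ones for ARMv7-M; the `cyccnt_regs_t` struct, the `CYCCNT_TARGET_REGS` macro, and the function names are hypothetical, introduced here only so the same code can be pointed at the real registers on the target or at plain variables on a host.

```c
#include <stdint.h>

/* Architecturally defined ARMv7-M addresses and bits. */
#define DEMCR_ADDR          0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR       0xE0001000u
#define DWT_CYCCNT_ADDR     0xE0001004u
#define DEMCR_TRCENA        (1u << 24)   /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)    /* starts the cycle counter */

/* Hypothetical wrapper: pointers to the three registers involved. */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *dwt_ctrl;
    volatile uint32_t *dwt_cyccnt;
} cyccnt_regs_t;

/* On the target, initialise a cyccnt_regs_t with this macro. */
#define CYCCNT_TARGET_REGS { \
    (volatile uint32_t *)(uintptr_t)DEMCR_ADDR, \
    (volatile uint32_t *)(uintptr_t)DWT_CTRL_ADDR, \
    (volatile uint32_t *)(uintptr_t)DWT_CYCCNT_ADDR }

/* Turns on trace, clears the counter, then starts it. */
static inline void cyccnt_enable(const cyccnt_regs_t *r)
{
    *r->demcr      |= DEMCR_TRCENA;
    *r->dwt_cyccnt  = 0u;
    *r->dwt_ctrl   |= DWT_CTRL_CYCCNTENA;
}

static inline uint32_t cyccnt_read(const cyccnt_regs_t *r)
{
    return *r->dwt_cyccnt;
}

/* Unsigned subtraction is modulo 2^32, so one wrap of the counter is tolerated. */
static inline uint32_t cyccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target, a measurement would call `cyccnt_read` before and after the region of interest and pass both values to `cyccnt_elapsed`.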
Fast Models’ Trade-Off Between Simulation Speed and Cycle Accuracy
The inaccuracies observed in the DWT_CYCCNT measurements are a direct consequence of the trade-off between simulation speed and cycle accuracy in ARM’s Fast Models. Fast Models are designed to provide fast functional simulation of ARM-based systems, making them suitable for software development, debugging, and early-stage hardware validation. However, this speed comes at the cost of cycle accuracy, which is not a primary goal of these models.
Fast Models achieve their high simulation speed by using a technique called "quantum-based simulation." In this approach, the simulator executes a block of instructions (a quantum) before synchronizing with other components in the system. This reduces the overhead associated with fine-grained cycle-by-cycle simulation but introduces inaccuracies in cycle counting and timing-sensitive behavior. As a result, the DWT_CYCCNT register, which relies on precise cycle counting, may not provide accurate results when used in a Fast Model environment.
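The effect on a short measurement can be illustrated with a deliberately simplified toy model. It is not how Fast Models are implemented; it only assumes, for illustration, that an observer sees time rounded down to the most recent quantum boundary. The function names and the quantum values in the note that follows are hypothetical.

```c
#include <stdint.h>

/* Toy assumption: the observed time is the true time rounded down to the
   most recent quantum boundary. `quantum` must be non-zero. */
static inline uint64_t observed_time(uint64_t true_ticks, uint64_t quantum)
{
    return (true_ticks / quantum) * quantum;
}

/* Delta an observer would measure for a region running from t0 to t1. */
static inline uint64_t observed_delta(uint64_t t0, uint64_t t1, uint64_t quantum)
{
    return observed_time(t1, quantum) - observed_time(t0, quantum);
}
```

With a quantum of 1000 ticks, a region from tick 100 to tick 400 is observed as 0 ticks, while a region from tick 900 to tick 1100 is observed as 1000 ticks; with a quantum of 1 both are observed exactly. In this toy model the error stays below one quantum, which is why it dominates short regions and fades for long ones.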
Furthermore, Fast Models do not model certain timing-sensitive behaviors, such as bus traffic, instruction scheduling, or memory access timing. These limitations can affect the accuracy of performance measurements, particularly when benchmarking code that involves memory accesses, interrupts, or other timing-sensitive operations. For example, the timing of memory accesses in the FVP_MPS2_Cortex-M4 simulator may not reflect the actual timing on real hardware, leading to discrepancies in cycle counts.
Strategies for Accurate Benchmarking on FVP_MPS2_Cortex-M4
Given the limitations of the FVP_MPS2_Cortex-M4 simulator in providing cycle-accurate timing measurements, several strategies can be employed to obtain more reliable benchmarking results. These strategies include adjusting simulation parameters, using alternative timing methods, and considering the use of cycle-accurate simulation tools when necessary.
Adjusting Simulation Parameters
One approach to improving the accuracy of timing measurements is to adjust the simulation parameters of the FVP_MPS2_Cortex-M4 simulator. The simulator provides command-line options that allow users to control the quantum size and the minimum synchronization latency. By reducing the quantum size (-Q option) and the minimum synchronization latency (-M option), users can increase the granularity of the simulation, potentially improving the accuracy of cycle counting.
However, it is important to note that reducing these parameters will also slow down the simulation speed. The trade-off between simulation speed and accuracy must be carefully considered, especially when benchmarking large or complex code segments. In some cases, the improved accuracy may not justify the significant increase in simulation time.
Using Alternative Timing Methods
When the DWT_CYCCNT register does not provide accurate results, alternative timing methods can be used to measure code execution time. One such method is to use the SysTick timer, which is a standard feature of the ARM Cortex-M4 processor. The SysTick timer can be configured to generate periodic interrupts, allowing it to function as a simple real-time clock.
While the SysTick timer typically has a lower resolution than the DWT_CYCCNT register when used as a tick source (e.g., 1 ms for a 1 kHz tick), it can still be useful for measuring longer-duration events. SysTick is a 24-bit down-counter driven by the processor clock or a reference clock, so on the simulator it is also driven by simulated time; its advantage is that a measurement spanning many ticks is proportionally less sensitive to the timing granularity of the Fast Models. For short-duration events or high-resolution timing, however, the SysTick timer may not be sufficient.
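A 1 kHz SysTick setup along these lines might look like the sketch below. The register addresses and bit positions are the architectural ones for SysTick; the struct, the function names, and the 25 MHz clock mentioned in the note that follows are assumptions made for illustration and should be checked against the actual platform.

```c
#include <stdint.h>

/* Architecturally defined SysTick registers and bits. */
#define SYST_CSR_ADDR       0xE000E010u  /* control and status */
#define SYST_RVR_ADDR       0xE000E014u  /* reload value */
#define SYST_CVR_ADDR       0xE000E018u  /* current value */
#define SYST_CSR_ENABLE     (1u << 0)
#define SYST_CSR_TICKINT    (1u << 1)    /* raise the SysTick exception at zero */
#define SYST_CSR_CLKSOURCE  (1u << 2)    /* 1 = processor clock */
#define SYST_MAX_RELOAD     0x00FFFFFFu  /* the counter is 24 bits wide */

/* Hypothetical wrapper so the code can target real or fake registers. */
typedef struct {
    volatile uint32_t *csr;
    volatile uint32_t *rvr;
    volatile uint32_t *cvr;
} systick_regs_t;

/* Reload value for a given tick rate; returns 0 if the request is invalid
   or the value does not fit in 24 bits. */
static inline uint32_t systick_reload(uint32_t clock_hz, uint32_t tick_hz)
{
    if (tick_hz == 0u || clock_hz < tick_hz) return 0u;
    uint32_t reload = clock_hz / tick_hz - 1u;
    return (reload <= SYST_MAX_RELOAD) ? reload : 0u;
}

/* Programs and starts the timer; returns -1 if no valid reload exists. */
static inline int systick_start(const systick_regs_t *r,
                                uint32_t clock_hz, uint32_t tick_hz)
{
    uint32_t reload = systick_reload(clock_hz, tick_hz);
    if (reload == 0u) return -1;
    *r->rvr = reload;
    *r->cvr = 0u;  /* on hardware, any write clears the current value */
    *r->csr = SYST_CSR_CLKSOURCE | SYST_CSR_TICKINT | SYST_CSR_ENABLE;
    return 0;
}

/* Incremented once per tick; call systick_on_tick from the SysTick handler. */
static volatile uint32_t g_ticks;
static inline void systick_on_tick(void) { g_ticks++; }
static inline uint32_t systick_ticks(void) { return g_ticks; }
```

With an assumed 25 MHz processor clock, `systick_reload(25000000u, 1000u)` yields 24999, giving one tick per millisecond of simulated time. Elapsed time is then the difference between two `systick_ticks()` readings taken around the code under test.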
Another alternative is to use external timing tools or profilers that can measure the execution time of code running on the simulator. These tools may provide more accurate results by bypassing the limitations of the simulator’s internal timing mechanisms. However, this approach may require additional setup and integration effort.
Considering Cycle-Accurate Simulation Tools
For applications that require precise cycle-accurate timing measurements, the FVP_MPS2_Cortex-M4 simulator may not be the most suitable tool. In such cases, it may be necessary to use a cycle-accurate simulator or an FPGA-based emulation platform. ARM offers cycle-accurate simulation tools as part of its Flexible Access program, which provides access to a range of simulation and emulation solutions.
Cycle-accurate simulators model the hardware at a much finer granularity than Fast Models, allowing them to provide accurate cycle counts and timing information. However, these simulators are typically much slower than Fast Models, making them less suitable for large-scale software development or debugging. FPGA-based emulation platforms offer a balance between speed and accuracy, providing a hardware-like environment for testing and benchmarking.
When selecting a simulation or emulation tool, it is important to consider the specific requirements of the application, including the need for cycle accuracy, simulation speed, and the complexity of the code being benchmarked. In some cases, a combination of tools may be used, with Fast Models employed for initial development and cycle-accurate tools used for final performance validation.
Practical Example: Benchmarking a Simple Loop
To illustrate the challenges and solutions discussed above, consider the example of benchmarking a simple loop on the FVP_MPS2_Cortex-M4 simulator. The loop consists of a fixed number of iterations, and the goal is to measure the number of clock cycles required to execute the loop.
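The measurement described above might be sketched as follows. The counter is reached through a hypothetical `read_counter_fn` hook so the same code can read DWT_CYCCNT on the target or a stub elsewhere; all names here, including the stub, are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical hook returning the current counter value. On the target this
   would read DWT_CYCCNT (0xE0001004); off-target it can be a stub. */
typedef uint32_t (*read_counter_fn)(void);

/* Runs a trivial loop `iterations` times and returns the counter delta. */
static inline uint32_t measure_loop(read_counter_fn read_counter,
                                    uint32_t iterations)
{
    volatile uint32_t sink = 0u;  /* volatile keeps the loop from being optimised away */
    uint32_t start = read_counter();
    for (uint32_t i = 0u; i < iterations; i++) {
        sink += i;
    }
    uint32_t end = read_counter();
    return end - start;  /* modulo 2^32, so one counter wrap is tolerated */
}

/* Per-iteration estimate from two runs of different length; the fixed cost
   of reading the counter and setting up the loop cancels in the difference.
   Requires n_long > n_short. */
static inline uint32_t cycles_per_iteration(uint32_t delta_short, uint32_t n_short,
                                            uint32_t delta_long, uint32_t n_long)
{
    return (delta_long - delta_short) / (n_long - n_short);
}

/* Stub counter for trying the code off-target: advances by 10 per read. */
static uint32_t stub_counter_value;
static inline uint32_t stub_counter(void)
{
    stub_counter_value += 10u;
    return stub_counter_value;
}
```

Measuring the loop at two iteration counts and passing both deltas to `cycles_per_iteration` reduces the influence of fixed overhead, though it cannot make a non-cycle-accurate model report hardware-accurate cycles.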
Using the DWT_CYCCNT register, the cycle count may be inaccurate, especially for small loop counts. For example, a loop with 100 iterations may report a cycle count of 1040. At roughly ten counts per iteration that figure can look plausible, but it should not be read as the number of cycles real hardware would need. The quantum-based simulation approach used by Fast Models groups instructions into blocks and does not model the timing of individual instructions, so the reported value reflects the model's own timekeeping rather than the pipeline and memory behavior of a Cortex-M4.
To address this issue, the simulation parameters can be adjusted by reducing the quantum size and the minimum synchronization latency. For example, setting the quantum size to 1 (-Q 1) and the minimum synchronization latency to 1 (-M 1) may improve the accuracy of the cycle count. However, this will significantly slow down the simulation, making it impractical for large loop counts or complex code.
Alternatively, the SysTick timer can be used to measure the elapsed time for the loop. While the resolution of the SysTick timer is lower than that of the DWT_CYCCNT register, it can provide a more reliable measurement for longer-duration events. For example, a loop with 10,000 iterations may take approximately 1.25 seconds to execute, as measured by the SysTick timer. Because SysTick reports simulated time, comparing this result against wall-clock time is a sanity check on how closely the simulation tracks real time rather than proof that the measurement matches real hardware.
Finally, if cycle-accurate timing is required, a cycle-accurate simulator or FPGA-based emulation platform can be used. These tools will provide precise cycle counts for the loop, allowing for accurate performance analysis. However, the increased simulation time and setup complexity must be taken into account.
Conclusion
Benchmarking code on the FVP_MPS2_Cortex-M4 simulator presents several challenges, particularly when it comes to obtaining accurate timing measurements. The DWT_CYCCNT register, while useful for cycle counting on real hardware, may not provide reliable results in a Fast Model environment due to the trade-off between simulation speed and cycle accuracy. To address these challenges, users can adjust simulation parameters, use alternative timing methods, or consider cycle-accurate simulation tools. Each approach has its own advantages and limitations, and the choice of method will depend on the specific requirements of the application. By carefully selecting the appropriate tools and techniques, it is possible to obtain meaningful benchmarking results even in a non-cycle-accurate simulation environment.