Wednesday, November 19, 2014

How Linux changes the CPU Frequency

The ondemand governor adjusts the processor frequency, trading off power against performance, and is an implementation of dynamic frequency scaling under Linux. However, it was also causing unexpected delays in my performance benchmarks. The following figure shows the delay density of RPC requests under the ondemand and performance CPU frequency governors. Delay variation is lower when the CPU frequency is fixed.

I wanted to visualize how the frequency changes over time and with scheduling. I used LTTng to record power_cpu_frequency and sched_switch events, loaded them in Trace Compass, and defined the analysis declaratively using XML (available here). The CPU frequency is shown in shades of red (stronger means higher frequency), and green shows whether the CPU is busy or idle (you guessed correctly, the holiday season is coming). The following figure shows that it takes some time for the frequency to ramp up.


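For reference, a recording session along these lines could look like the sketch below. It assumes the LTTng command-line tools and kernel tracer are installed, and that the user is allowed to trace the kernel (typically root or a member of the tracing group). The function name and the session name are made up for illustration.

```shell
# Record the two kernel events used in this post while a command runs.
# Hypothetical helper: the function name and the default session name
# are illustrative, not part of LTTng.
record_cpufreq_trace() {
        session=${LTTNG_SESSION-cpufreq-demo}
        lttng create "$session" &&
        lttng enable-event --kernel power_cpu_frequency,sched_switch &&
        lttng start || return 1
        "$@"                    # the workload to observe
        rc=$?
        lttng stop
        lttng destroy "$session"
        return $rc
}
```

Calling record_cpufreq_trace followed by the benchmark command wraps the run in a tracing session; the resulting trace directory can then be opened in Trace Compass.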

In fact, changing the CPU frequency distorts the elapsed time of an execution. Timestamps in the trace are in nanoseconds rather than clock cycles, but to compare executions the base should be clock cycles. We could scale the time using the frequency events to approximate the time in processor cycles.
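
The scaling idea can be sketched as a sum of frequency times duration over each interval between frequency events. The input format here is an assumption for illustration (one "timestamp_ns frequency_khz" pair per line for a single CPU, sorted by time); it is not what LTTng prints, so the events would have to be extracted into that shape first. The function name is hypothetical.

```shell
# Approximate elapsed cycles on one CPU from frequency-change events.
# Input on stdin (assumed format): "<timestamp_ns> <frequency_khz>" per
# line, sorted by timestamp. $1 is the end timestamp in ns; the last
# frequency is assumed to hold until then.
# Timestamps relative to the trace start keep awk's floating-point
# arithmetic exact.
approx_cycles() {
        awk -v end="$1" '
                NR > 1 { cycles += (($1 - t) * f) / 1000000 }  # ns * kHz / 1e6 = cycles
                { t = $1; f = $2 }
                END { if (NR > 0) cycles += ((end - t) * f) / 1000000
                      printf "%.0f\n", cycles }
        '
}
```

For example, one millisecond at 1 GHz followed by one millisecond at 2 GHz gives about three million cycles, even though the two halves take the same wall-clock time.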

The kworker task, woken up from a timer softirq, changes the CPU frequency. The kworker thread runs with a varying period, typically between 4 ms and 20 ms. The algorithm seems to take the average CPU usage into account, because if the CPU switches rapidly between idle and busy, the frequency never reaches the maximum.
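
To make that observation concrete, here is a toy model of a load-driven frequency choice. It is not the kernel's code: the threshold of 95 and the linear interpolation below it are assumptions chosen only to illustrate why an averaged load can keep the frequency under the maximum.

```shell
# Toy model of a load-driven frequency choice (NOT the kernel's code).
# Assumptions: load is a percentage averaged over the sampling window,
# frequencies are in kHz, and up_threshold defaults to an arbitrary 95.
next_freq() {
        load=$1 min=$2 max=$3 up_threshold=${4-95}
        if [ "$load" -gt "$up_threshold" ]; then
                echo "$max"
        else
                # Below the threshold, scale linearly between min and max.
                echo $(( min + load * (max - min) / 100 ))
        fi
}
```

In this model, a task that is busy for half of every sampling window is seen as a 50% load, so next_freq 50 800000 2400000 stays well below the maximum, while a fully busy window (load 100) jumps to it.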

Therefore, a task performing frequent blocking/running cycles may run faster if it shares the CPU with another task, because the average CPU usage will increase and the frequency will be raised accordingly. We could also estimate the relative power cost of a segment of computation with great precision.

In conclusion, disabling frequency scaling for benchmarks should reduce variability. Here is how to do it on recent Linux kernels:

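# Set the cpufreq governor on every CPU (default: performance).
set_scaling_gov() {
        gov=${1-performance}
        for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
                # tee runs under sudo because a plain redirection would not
                echo "$gov" | sudo tee "$f" > /dev/null
        done
}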
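
To confirm that the change took effect, or to log the governor alongside benchmark results, the same sysfs files can be read back. This is a minimal sketch; the function name and the CPU_SYSFS override are hypothetical and exist only to keep the example testable.

```shell
# Print "<cpu>: <governor>" for every CPU that exposes cpufreq.
# CPU_SYSFS is a hypothetical override of the sysfs root, for testing.
show_scaling_gov() {
        root=${CPU_SYSFS-/sys/devices/system/cpu}
        for f in "$root"/cpu*/cpufreq/scaling_governor; do
                [ -r "$f" ] || continue     # glob did not match, or no cpufreq
                cpu=${f#"$root"/}           # e.g. cpu0/cpufreq/scaling_governor
                echo "${cpu%%/*}: $(cat "$f")"
        done
}
```

A typical benchmark run would call set_scaling_gov performance, check the result with show_scaling_gov, and call set_scaling_gov ondemand afterwards to restore power saving.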

Here is the view for the whole trace. Happy hacking!