perf stat and Hyper-Threading
In the previous blog post, we had a look at how “background load” can impact the execution time of processes (basically by inducing thermal throttling which lowers the CPU frequency). I came to the conclusion that measuring CPU cycles is a better way than measuring execution time.
While it’s better, it’s still flawed quite a bit, especially when you use it on modern CPUs with standard configuration.
Since I’m a bit paranoid, Hyper-Threading is usually disabled on my machines. It’s probably not really necessary in my scenarios, but it doesn’t hurt that much, either. Those machines are blazing fast anyway (even though they’re from 2013/2014). So I’m not running a “standard configuration”. “Standard configuration” would mean having it enabled.
Because it was disabled, I didn’t realize the following effect until now.
Let’s start with a short program for Linux/amd64:
    .global _start

    _start:
        movq $1000000, %rax
    .L0:
        decq %rax
        cmp $0, %rax
        jne .L0

        /* _exit(0) */
        movq $60, %rax
        movq $0, %rdi
        syscall

    $ as -o prog.o prog.s
    $ ld -o prog prog.o
It doesn’t do anything useful. It decrements a register a million times. The point is to know in advance how many instructions the program executes: one movq to set up the counter, three instructions per loop iteration (decq, cmp, jne), and three for the _exit(0) call (movq, movq, syscall).
So, in total, we should see 1 + 3 * 1000000 + 3 instructions being executed. Let’s have a look:
    $ perf stat -e instructions:u ./prog

     Performance counter stats for './prog':

             3,000,005      instructions:u
Huh, one too many. Well, close enough for this example. :)
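For reference, the expected count can be double-checked in the shell. This is only a small sketch: the function name expected_instructions is made up for this example, and the per-section counts (1 for setup, 3 per iteration, 3 for the exit) are read off the assembly above.

```shell
# Hypothetical helper: expected user-space instruction count for prog.s.
#   1 setup instruction (movq),
#   3 instructions per loop iteration (decq, cmp, jne),
#   3 instructions for the _exit(0) sequence (movq, movq, syscall).
expected_instructions() {
    iterations=$1
    echo $((1 + 3 * $iterations + 3))
}

# Expected count for one million iterations.
expected_instructions 1000000

# Difference between what perf reported above (3,000,005) and the estimate.
echo $((3000005 - $(expected_instructions 1000000)))
```

The second line of output is the "one too many" from the measurement.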
As a reminder, different instructions can have different “costs”, so simply counting the number of instructions isn’t a good metric. That’s why the previous blog post introduced the number of CPU cycles:
    $ perf stat -e cycles:u,instructions:u ./prog

     Performance counter stats for './prog':

             1,001,300      cycles:u
             3,000,005      instructions:u            #    3.00  insn per cycle
What I was hoping for was that cycles:u remains more or less the same, no matter what the system is doing otherwise. (This assumption only makes sense for strictly CPU bound tasks.)
With Hyper-Threading disabled, this is true. But when it’s enabled and there’s background load happening on the system, we can see this:
    $ while true; do true; done    ← this runs in a bunch of terminals

    $ perf stat -e cycles:u,instructions:u ./prog

     Performance counter stats for './prog':

             1,971,297      cycles:u
             3,000,005      instructions:u            #    1.52  insn per cycle
Awwww, bummer. The number of instructions stays the same, but the CPU needs more cycles to execute them now.
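To put a number on the difference, the two measurements can be compared with a bit of awk. This is just a sketch: the helper names ipc and cycle_ratio are invented for this example, and the values passed in are the ones from the two perf runs above.

```shell
# Hypothetical helpers for comparing two perf runs.

# Instructions per cycle, rounded to two decimals.
ipc() {
    LC_ALL=C awk -v insn="$1" -v cycles="$2" \
        'BEGIN { printf "%.2f\n", insn / cycles }'
}

# How many times more cycles the loaded run needed than the idle run.
cycle_ratio() {
    LC_ALL=C awk -v loaded="$1" -v idle="$2" \
        'BEGIN { printf "%.2f\n", loaded / idle }'
}

ipc 3000005 1001300          # idle system
ipc 3000005 1971297          # with background load
cycle_ratio 1971297 1001300  # loaded cycles / idle cycles
```

The first two values reproduce the "insn per cycle" column that perf prints; the third shows that the same instructions needed almost twice as many cycles.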
When we look at the full output of perf stat, we get a hint:
    $ perf stat ./prog

     Performance counter stats for './prog':

                  0.61 msec task-clock:u              #    0.641 CPUs utilized
                     0      context-switches:u        #    0.000 /sec
                     0      cpu-migrations:u          #    0.000 /sec
                     1      page-faults:u             #    1.652 K/sec
             2,020,909      cycles:u                  #    3.339 GHz
    ------>  1,212,002      stalled-cycles-frontend:u #   59.97% frontend cycles idle
             3,000,005      instructions:u            #    1.48  insn per cycle
                                                      #    0.40  stalled cycles per insn
             1,000,002      branches:u                #    1.652 G/sec
                     4      branch-misses:u           #    0.00% of all branches
CPUs are truly complex beasts these days. It’s getting a bit complicated now to find out why exactly this happens. I don’t really know of a way to inspect what the CPU is doing here internally.
There’s this piece of documentation:
I haven’t dug through this yet. From what I understand so far, my program has been stalled for 1,212,002 cycles, because, well, the CPU was busy doing “something else”. What exactly, I’m not sure. If you can explain this in detail, please get in touch.
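If you want to track the stalled share across several runs, perf can emit machine-readable output with -x. The following is only a sketch, under the assumption that the counter value is in field 1 and the event name in field 3 of that CSV output; the layout is described in the CSV FORMAT section of the perf-stat man page and should be checked for your perf version. The function name stalled_share is made up for this example.

```shell
# Sketch: read perf's CSV output (as produced by `perf stat -x,`) on stdin
# and print the share of cycles that were frontend stalls.
# Assumption: field 1 = counter value, field 3 = event name.
stalled_share() {
    LC_ALL=C awk -F, '
        $3 == "cycles:u"                  { cycles = $1 }
        $3 == "stalled-cycles-frontend:u" { stalled = $1 }
        END {
            # Print nothing if no cycles:u line was seen.
            if (cycles > 0)
                printf "%.2f%%\n", 100 * stalled / cycles
        }'
}

# Possible usage (perf writes its statistics to stderr, hence the redirects):
#   perf stat -x, -e cycles:u,stalled-cycles-frontend:u ./prog 2>&1 >/dev/null | stalled_share
```

Fed with the two counter values from the run above, this yields the same 59.97% that perf prints in its comment column.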
This phenomenon only occurs when there are more than logical_cores / 2 - 1 background processes running.
Edit: Originally, this said real_cores, which was wrong. I meant the number of logical cores, which you can see in htop, for example. A full example: 8 logical cores in htop with HT enabled → we can see the effect when there are more than 3 active background processes.
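The threshold and the background load can be scripted instead of opening a bunch of terminals. This is a rough sketch: the names ht_threshold, start_background_load, stop_background_load and LOAD_PIDS are invented here, and the formula is simply the logical_cores / 2 - 1 observation from above, not something derived from CPU documentation.

```shell
# Hypothetical helper: number of busy background processes above which
# the effect showed up in this post, i.e. logical_cores / 2 - 1.
ht_threshold() {
    logical_cores=$1
    echo $((logical_cores / 2 - 1))
}

# Start N busy loops in the background; their PIDs are collected in
# LOAD_PIDS so they can be stopped again.
start_background_load() {
    n=$1
    LOAD_PIDS=""
    i=0
    while [ "$i" -lt "$n" ]; do
        ( while true; do true; done ) &
        LOAD_PIDS="$LOAD_PIDS $!"
        i=$((i + 1))
    done
}

# Stop the busy loops started by start_background_load.
stop_background_load() {
    # LOAD_PIDS is intentionally unquoted: one argument per PID.
    kill $LOAD_PIDS
}

# Possible usage (nproc prints the number of available logical cores):
#   start_background_load $(( $(ht_threshold "$(nproc)") + 1 ))
#   perf stat -e cycles:u,instructions:u ./prog
#   stop_background_load
```

For the example above, ht_threshold 8 prints 3, so the sketch would start 4 busy loops before measuring.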
time(1) is not a good way to measure “how much work your CPU has to do” in order to complete a task. Background tasks can trigger thermal throttling and thus affect the task you want to measure.
perf stat -e cycles:u is not a good way, either, if Hyper-Threading is enabled. “Hyper-Threads” are not true CPU cores after all and using them can stall execution of the task you want to measure.
cycles:u has little meaning. This number doesn’t tell you how many of the cycles have been completely wasted in idle mode. It’s just the number of cycles that have been attributed to your process.
Also look at stalled-cycles-frontend:u if Hyper-Threading is enabled.
cycles:u is not “the sum of all cycles”, as can be seen in perf stat date, where stalled-cycles-frontend:u is larger than cycles:u.
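Since these conclusions depend on whether Hyper-Threading is on, it can help to record that next to each measurement. A minimal sketch, assuming a Linux kernel that exposes /sys/devices/system/cpu/smt/active (1 means active, 0 means not active; older kernels do not have this file). The function name smt_state is made up for this example.

```shell
# Sketch: report whether SMT (Hyper-Threading) is currently active.
# Assumption: the kernel provides /sys/devices/system/cpu/smt/active.
# An alternative path can be passed as the first argument.
smt_state() {
    file=${1:-/sys/devices/system/cpu/smt/active}
    if [ ! -r "$file" ]; then
        # File missing or unreadable: we cannot tell.
        echo unknown
    elif [ "$(cat "$file")" = "1" ]; then
        echo enabled
    else
        echo disabled
    fi
}

smt_state
```

If the file is missing, lscpu and its “Thread(s) per core” line is another place to look.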
We’re barely scratching the surface here. There are so many more components involved. Doing “simple benchmarks”, like I originally wanted to do using time(1), is much more complicated than it seems, especially if your system is doing more than one thing at once (which it almost certainly is in 2021).