blog · git · desktop · images · contact & privacy · gopher
perf stat
and Hyper-Threading2021-08-19
In the previous blog post, we had a look at how “background load” can impact the execution time of processes (basically by inducing thermal throttling which lowers the CPU frequency). I came to the conclusion that measuring CPU cycles is a better way than measuring execution time.
While it’s better, it’s still flawed quite a bit, especially when you use it on modern CPUs with standard configuration.
Since I’m a bit paranoid, Hyper-Threading is usually disabled on my machines. It’s probably not really necessary in my scenarios, but it doesn’t hurt that much, either. Those machines are blazing fast anyway (even though they’re from 2013/2014). So I’m not running a “standard configuration”. “Standard configuration” would mean having it enabled.
Because it was disabled, I didn’t realize the following effect until now.
Let’s start with a short program for Linux/amd64:
.global _start
_start:
movq $1000000, %rax
.L0:
decq %rax
cmp $0, %rax
jne .L0
/* _exit(0) */
movq $60, %rax
movq $0, %rdi
syscall
Compile with:
$ as -o prog.o prog.s
$ ld -o prog prog.o
It doesn’t do anything useful. It decreases the value of a register a million times. The point of this is to have a good idea of how many instructions are executed by this program:
.L0
.So, in total, we should see 1 + 3 * 1000000 + 3
instructions being
executed. Let’s have a look:
$ perf stat -e instructions:u ./prog
Performance counter stats for './prog':
3,000,005 instructions:u
Huh, one too many. Well, close enough for this example. :)
To recall, different instructions can have different “cost”. So simply counting the number of instructions isn’t a good metric. That’s why the previous blog post introduced the number of CPU cycles:
$ perf stat -e cycles:u,instructions:u ./prog
Performance counter stats for './prog':
1,001,300 cycles:u
3,000,005 instructions:u # 3.00 insn per cycle
What I was hoping for was that cycles:u
remains more or less the
same, no matter what the system is doing otherwise. (This assumption
only makes sense for strictly CPU bound tasks.)
With Hyper-Threading disabled, this is true. But when it’s enabled and there’s background load happening on the system, we can see this:
$ while true; do true; done ← this runs in a bunch of terminals
$ perf stat -e cycles:u,instructions:u ./prog
Performance counter stats for './prog':
1,971,297 cycles:u
3,000,005 instructions:u # 1.52 insn per cycle
Awwww, bummer. The number of instructions stays the same, but the CPU needs more cycles to execute them now.
When we look at the full output of perf stat
, we get a hint:
$ perf stat ./prog
Performance counter stats for './prog':
0.61 msec task-clock:u # 0.641 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
1 page-faults:u # 1.652 K/sec
2,020,909 cycles:u # 3.339 GHz
———————→ 1,212,002 stalled-cycles-frontend:u # 59.97% frontend cycles idle
3,000,005 instructions:u # 1.48 insn per cycle
# 0.40 stalled cycles per insn
1,000,002 branches:u # 1.652 G/sec
4 branch-misses:u # 0.00% of all branches
CPUs are truly complex beasts these days. It’s getting a bit complicated now to find out why exactly this happens. I don’t really know of a way to inspect what the CPU is doing here internally.
There’s this piece of documentation:
I haven’t dug through this, yet. From what I understand so far, my program has been stalled for 1212002 cycles, because, well, the CPU was busy doing “something else”. What exactly, I’m not sure. If you can explain this in detail, please get in touch.
This phenomenon only occurs when
logical_cores / 2 - 1
background processes
running.– Edit: Originally, this said real_cores
, which was wrong. I meant the
number of logical cores, which you can see in htop
, for example. A
full example: 8 logical cores in htop
with HT enabled → we can see the
effect when there are more than 3 active background processes.
time(1)
is not a good way to measure “how much work your CPU has
to do” in order to complete a task. Background tasks can trigger
thermal throttling and thus affect the task you want to measure.perf stat -e cycles:u
is not a good way, either, if
Hyper-Threading is enabled. “Hyper-Threads” are not true CPU cores
after all and using them can stall execution of the task you want to
measure.cycles:u
has little meaning. This number
doesn’t tell you how many of the cycles have been completely wasted
in idle mode. It’s just the number of cycles that has been
attributed to your process. stalled-cycles-frontend:u
if
Hyper-Threading is enabled.cycles:u
is not “the sum of all cycles”, as can be seen in perf
stat date
where stalled-cycles-frontend:u
is larger than
cycles:u
…We’re barely scratching the surface here. There are so much more
components involved. Doing “simple benchmarks”, like I originally wanted
to do using time(1)
, is much more complicated than it seems,
especially if your system is doing more than one thing at once (which it
almost certainly is in 2021).