Scratching the surface: `perf stat` and Hyper-Threading

2021-08-19

In the previous blog post, we had a look at how “background load” can impact the execution time of processes (basically by inducing thermal throttling which lowers the CPU frequency). I came to the conclusion that measuring CPU cycles is a better way than measuring execution time.

While it’s better, it’s still flawed quite a bit, especially when you use it on modern CPUs with standard configuration.

Since I’m a bit paranoid, Hyper-Threading is usually disabled on my machines. It’s probably not really necessary in my scenarios, but it doesn’t hurt that much, either. Those machines are blazing fast anyway (even though they’re from 2013/2014). So I’m not running a “standard configuration”. “Standard configuration” would mean having it enabled.

Because it was disabled, I didn’t realize the following effect until now.

Let’s start with a short program for Linux/amd64:

.global _start

_start:
    movq $1000000, %rax

    .L0:
        decq %rax
        cmp $0, %rax
        jne .L0

    /* _exit(0) */
    movq $60, %rax
    movq $0, %rdi
    syscall

Compile with:

$ as -o prog.o prog.s
$ ld -o prog prog.o

It doesn’t do anything useful. It decreases the value of a register a million times. The point of this is to have a good idea of how many instructions are executed by this program:

1 instruction until we reach the label .L0.
3 instructions are executed in the loop, a million times.
3 instructions to exit.

So, in total, we should see 1 + 3 * 1000000 + 3 instructions being executed. Let’s have a look:

$ perf stat -e instructions:u ./prog

 Performance counter stats for './prog':

         3,000,005      instructions:u

Huh, one too many. Well, close enough for this example. :)

To recall, different instructions can have different “cost”. So simply counting the number of instructions isn’t a good metric. That’s why the previous blog post introduced the number of CPU cycles:

$ perf stat -e cycles:u,instructions:u ./prog

 Performance counter stats for './prog':

         1,001,300      cycles:u
         3,000,005      instructions:u            #    3.00  insn per cycle

What I was hoping for was that cycles:u remains more or less the same, no matter what the system is doing otherwise. (This assumption only makes sense for strictly CPU bound tasks.)

With Hyper-Threading disabled, this is true. But when it’s enabled and there’s background load happening on the system, we can see this:

$ while true; do true; done    ← this runs in a bunch of terminals

$ perf stat -e cycles:u,instructions:u ./prog

 Performance counter stats for './prog':

         1,971,297      cycles:u
         3,000,005      instructions:u            #    1.52  insn per cycle

Awwww, bummer. The number of instructions stays the same, but the CPU needs more cycles to execute them now.

When we look at the full output of perf stat, we get a hint:

$ perf stat ./prog

 Performance counter stats for './prog':

              0.61 msec task-clock:u              #    0.641 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                 1      page-faults:u             #    1.652 K/sec
         2,020,909      cycles:u                  #    3.339 GHz
———————→ 1,212,002      stalled-cycles-frontend:u #   59.97% frontend cycles idle
         3,000,005      instructions:u            #    1.48  insn per cycle
                                                  #    0.40  stalled cycles per insn
         1,000,002      branches:u                #    1.652 G/sec
                 4      branch-misses:u           #    0.00% of all branches

CPUs are truly complex beasts these days. It’s getting a bit complicated now to find out why exactly this happens. I don’t really know of a way to inspect what the CPU is doing here internally.

There’s this piece of documentation:

https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/top-down-microarchitecture-analysis-method.html

I haven’t dug through this, yet. From what I understand so far, my program has been stalled for 1212002 cycles, because, well, the CPU was busy doing “something else”. What exactly, I’m not sure. If you can explain this in detail, please get in touch.

This phenomenon only occurs when

Hyper-Threading is enabled in BIOS,
there are more than logical_cores / 2 - 1 background processes running.

– Edit: Originally, this said real_cores, which was wrong. I meant the number of logical cores, which you can see in htop, for example. A full example: 8 logical cores in htop with HT enabled → we can see the effect when there are more than 3 active background processes.

Conclusion (so far)

time(1) is not a good way to measure “how much work your CPU has to do” in order to complete a task. Background tasks can trigger thermal throttling and thus affect the task you want to measure.
perf stat -e cycles:u is not a good way, either, if Hyper-Threading is enabled. “Hyper-Threads” are not true CPU cores after all and using them can stall execution of the task you want to measure.
The plain value of cycles:u has little meaning. This number doesn’t tell you how many of the cycles have been completely wasted in idle mode. It’s just the number of cycles that has been attributed to your process. ~~At the very least, you should probably subtract the value of stalled-cycles-frontend:u if Hyper-Threading is enabled.~~ Nope, that’s not correct, either. cycles:u is not “the sum of all cycles”, as can be seen in perf stat date where stalled-cycles-frontend:u is larger than cycles:u …

We’re barely scratching the surface here. There are so much more components involved. Doing “simple benchmarks”, like I originally wanted to do using time(1), is much more complicated than it seems, especially if your system is doing more than one thing at once (which it almost certainly is in 2021).

Comments?

Scratching the surface: perf stat and Hyper-Threading

Conclusion (so far)

Scratching the surface: `perf stat` and Hyper-Threading