Usage of binfmt_misc in multiarch Docker

2022-09-02

You can pull Docker images of another architecture and try to run them:

$ docker run -it --rm --platform linux/arm64 alpine /bin/sh
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
9b18e9b68314: Pull complete
Digest: sha256:bc41182d7ef5ffc53a40b044e725193bc10142a1243f395ee852a8d9730fc2ad
Status: Downloaded newer image for alpine:latest
exec /bin/sh: exec format error

I’m on amd64 and the image I just pulled is for arm64:

# file /var/lib/docker/..../busybox
busybox: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-musl-aarch64.so.1, stripped

So, naturally, I cannot execute that binary.
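
What file reports here comes straight from the ELF header, and those are the same bytes the kernel inspects when it decides whether it can execute a binary. Here is a minimal Python sketch of that lookup, assuming only the standard ELF layout (e_machine is a 16-bit field at offset 18). The helper name elf_machine and the small machine table are mine, and the table covers only a handful of architectures:

```python
import struct

# A few e_machine values from the ELF specification (not exhaustive).
ELF_MACHINES = {3: "i386", 40: "arm", 62: "x86_64", 183: "aarch64", 243: "riscv"}


def elf_machine(header: bytes) -> str:
    """Return the architecture named in an ELF header (the first 20 bytes suffice)."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_type sits at offset 16 and e_machine at offset 18, both 16 bits wide.
    _e_type, e_machine = struct.unpack_from(byte_order + "HH", header, 16)
    return ELF_MACHINES.get(e_machine, f"unknown ({e_machine})")


# Synthetic header for illustration: 64-bit, little endian, ET_DYN, EM_AARCH64.
sample = b"\x7fELF\x02\x01\x01" + b"\x00" * 9 + struct.pack("<HH", 3, 183)
print(elf_machine(sample))
```

To check a real file, pass its first 20 bytes, for example open(path, "rb").read(20).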

People on the internet usually tell you to run something like the following, without ever explaining what happens:

$ docker run --rm --privileged tonistiigi/binfmt --install all
Unable to find image 'tonistiigi/binfmt:latest' locally
latest: Pulling from tonistiigi/binfmt
8d4d64c318a5: Pull complete 
e9c608ddc3cb: Pull complete 
Digest: sha256:66e11bea77a5ea9d6f0fe79b57cd2b189b5d15b93a2bdb925be22949232e4e55
Status: Downloaded newer image for tonistiigi/binfmt:latest
installing: mips64 OK
installing: arm64 OK
installing: arm OK
installing: s390x OK
installing: ppc64le OK
installing: riscv64 OK
installing: mips64le OK
{
  "supported": [
    "linux/amd64",
    "linux/arm64",
    "linux/riscv64",
    "linux/ppc64le",
    "linux/s390x",
    "linux/386",
    "linux/mips64le",
    "linux/mips64",
    "linux/arm/v7",
    "linux/arm/v6"
  ],
  "emulators": [
    "qemu-aarch64",
    "qemu-arm",
    "qemu-mips64",
    "qemu-mips64el",
    "qemu-ppc64le",
    "qemu-riscv64",
    "qemu-s390x"
  ]
}

Now it works:

$ docker run -it --rm --platform linux/arm64 alpine /bin/sh
/ # uname -m
aarch64

But why?

Judging from the output, this probably “installed” some QEMU binaries. But how, and where? You can stop the Docker daemon and wipe /var/lib/docker, and it will still work. So where are the QEMU binaries now? Have they been installed to some directory on the host? After all, we ran that command with --privileged, so it has full access to everything, and who knows what happened. (In other words, it might not be the best idea to run commands like that lightly just because someone on Stack Overflow said so.)

Inspect the source code of that image and you’ll find two things:

  1. Dockerfile: The Docker image carries the QEMU executables.
  2. The ENTRYPOINT is a Go program which does little more than write lines to the binfmt_misc interface, and thus talks directly to the kernel of the host.
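
To make the second point more concrete: a registration is a single line of the form :name:type:offset:magic:mask:interpreter:flags, written to the register file of binfmt_misc (usually /proc/sys/fs/binfmt_misc/register). The Python sketch below only builds such a line and writes nothing. The magic and mask are modeled on the values QEMU’s own configuration script uses for aarch64, and the interpreter path and the F flag are my assumptions about what the image registers, so check them against the actual source before relying on them.

```python
def hex_escape(data: bytes) -> str:
    """Spell every byte as a backslash-x escape, a form the register file accepts."""
    return "".join(f"\\x{b:02x}" for b in data)


def build_registration(name: str, magic: bytes, mask: bytes,
                       interpreter: str, flags: str = "F") -> str:
    """Build one binfmt_misc registration line (nothing is written anywhere)."""
    if len(magic) != len(mask):
        raise ValueError("magic and mask must have the same length")
    # Type "M" means "match by magic bytes"; an empty offset field means offset 0.
    fields = ["", name, "M", "", hex_escape(magic), hex_escape(mask), interpreter, flags]
    return ":".join(fields)


# Illustrative values: the first 20 bytes of a 64-bit little-endian ELF header
# with e_machine = 0xb7 (aarch64). The 0xfe in the mask ignores the lowest bit
# of e_type, so both ET_EXEC (2) and ET_DYN (3) match.
AARCH64_MAGIC = b"\x7fELF\x02\x01\x01" + b"\x00" * 9 + b"\x02\x00\xb7\x00"
AARCH64_MASK = b"\xff" * 7 + b"\x00" + b"\xff" * 8 + b"\xfe\xff\xff\xff"

# The interpreter path is an assumption about where the image keeps QEMU.
line = build_registration("qemu-aarch64", AARCH64_MAGIC, AARCH64_MASK,
                          "/usr/bin/qemu-aarch64")
print(line)
```

Note that the interpreter path is resolved at the moment of writing, from the point of view of the process that writes the line. That is why it can name a file that exists only inside the container.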

binfmt_misc is an interesting thing in and of itself, but the really interesting question remains: Where are the QEMU binaries located? It can’t be a reference to something in /var/lib/docker, because you’re free to delete that directory and it still works.

So let’s have a look at the Linux kernel source code.

Disclaimer: As always, this is to the best of my knowledge. I’m not a kernel developer. Please notify me if you spot mistakes.

We’ll be looking at binfmt_misc.c.

In line 777, you’ll see that the struct called bm_register_operations gets registered as the set of valid operations for the register file, which is the file the Go program writes to. The struct is defined in line 717 and points to bm_register_write() for write operations. In line 660, inside bm_register_write(), the kernel calls open_exec(e->interpreter). The argument is the full path to the interpreter file (i.e., our QEMU binary); see the function create_entry(). As far as I can tell, this call only happens for entries registered with the F flag (“fix binary”). open_exec() is declared in linux/fs.h and opens the file for us, so we get a reference to an open file, which is stored in the e struct as well. Later on, in line 699, the entire e is added to the entries list. This list is consulted by check_file() in line 90 to see whether an interpreter exists for a given binary.
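
The lookup side can be modeled in a few lines. This is a Python sketch of my reading of the magic comparison, not a translation of the kernel code: an entry matches when the header bytes equal the magic in every bit that the mask selects. The Entry class, find_interpreter(), and the sample values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    name: str
    offset: int
    magic: bytes
    mask: bytes
    interpreter: str


def matches(entry: Entry, header: bytes) -> bool:
    """True if the header equals the magic in all bits selected by the mask."""
    window = header[entry.offset:entry.offset + len(entry.magic)]
    if len(window) < len(entry.magic):
        return False
    return all(((h ^ m) & k) == 0
               for h, m, k in zip(window, entry.magic, entry.mask))


def find_interpreter(entries: List[Entry], header: bytes) -> Optional[str]:
    """Walk the list in order and return the first matching interpreter, if any."""
    for entry in entries:
        if matches(entry, header):
            return entry.interpreter
    return None


# Illustrative entry for aarch64; the interpreter path is an assumption.
entries = [Entry(
    name="qemu-aarch64",
    offset=0,
    magic=b"\x7fELF\x02\x01\x01" + b"\x00" * 9 + b"\x02\x00\xb7\x00",
    mask=b"\xff" * 7 + b"\x00" + b"\xff" * 8 + b"\xfe\xff\xff\xff",
    interpreter="/usr/bin/qemu-aarch64",
)]

# Synthetic headers: an aarch64 PIE executable (ET_DYN) and an x86_64 one.
arm64_pie = b"\x7fELF\x02\x01\x01" + b"\x00" * 9 + b"\x03\x00\xb7\x00"
amd64_pie = b"\x7fELF\x02\x01\x01" + b"\x00" * 9 + b"\x03\x00\x3e\x00"

print(find_interpreter(entries, arm64_pie))
print(find_interpreter(entries, amd64_pie))
```

The first lookup finds the QEMU entry even though e_type differs from the magic, because the mask ignores that bit. The second finds nothing, so the kernel would go on to try its other binary format handlers.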

(What also threw me off a bit: once check_file() has found an entry, its caller load_misc_binary() passes the stored file to file_clone_open() from fs.h and, when you keep digging, that code talks a lot about a “file path”. Are we talking about a name in a filesystem here? That would have been weird; it wouldn’t fit the concept of “this is an open file”. But no, that “path” is a struct that points directly to a dentry, so everything’s fine.)

So, long story short, this is the magic: Running that tonistiigi/binfmt image instructs the kernel to open a file from that image and keep a pointer to it around, even long after this Docker image has been disposed of.
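
The same effect can be reproduced from user space, as an analogy rather than as what the kernel literally does: an open file stays usable after its name has been removed from the filesystem. A minimal Python sketch for POSIX systems; read_after_unlink() is a name I made up:

```python
import os
import tempfile


def read_after_unlink(payload: bytes):
    """Write payload to a temp file, delete its name, then read it via the open fd.

    Returns (name_still_exists, data_read_back).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                    # the name is gone from the filesystem
        name_still_exists = os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)       # rewind the descriptor we still hold
        data = os.read(fd, len(payload))   # the content is still reachable
        return name_still_exists, data
    finally:
        os.close(fd)                       # only now can the storage be freed


exists, data = read_after_unlink(b"pretend this is qemu-aarch64")
print(exists, data)
```

In the binfmt_misc case the kernel plays the role of the process holding the descriptor, which is why removing the image, or all of /var/lib/docker, does not take the interpreter away.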

(Another interesting side effect of all this: Since we’re using binfmt_misc here, Docker itself isn’t concerned with QEMU at all. It simply tells the kernel: “Launch this binary in that namespace!” And, if the kernel can somehow run it, even if it’s through binfmt_misc, it will.)
