

fork() vs. vfork() and subprocess.Popen in Python

2023-11-26

In BundleWrap, we fork off a lot of processes. It’s part of bw’s design: It is supposed to operate over SSH and it is supposed to use your native SSH implementation (so that you can rely on all features being available). This means that, in order to check items like files and their permissions or whatever, we have to run something like ssh myhost stat /etc/foo.conf. In practice, we have around 1300 managed systems (called “nodes”) and each node has about 1000 items. That is a lot of forking.

We had long suspected that something might be “wrong” in this area: We kept seeing a high sys time for the main bw process in tools like htop, meaning it spent a lot of time stuck in syscalls. Until recently, though, we hadn’t had the chance to investigate this any further.

What bw’s code ultimately does is roughly this pattern:

#!/usr/bin/env python3


from concurrent.futures import ThreadPoolExecutor, wait
from subprocess import Popen


def target(i):
    p = Popen(['touch', f'/tmp/foo-{i}'])
    p.communicate()


executor = ThreadPoolExecutor(max_workers=64)
futures = set()

for i in range(10_000):
    futures.add(executor.submit(target, i))

wait(futures)

We have a ThreadPoolExecutor and each thread then spawns the processes.

(There are several such pools in bw. This allows the user to handle $n nodes in parallel and, for each node, $m items in parallel. Since those systems are usually independent of each other, it’s not unreasonable to set the size of the node worker pool to something like 32 or even 128 or 256, and the item worker pool to 8 or 16. All this hinges on SSH multiplexing and requires the target systems to be configured accordingly. It is certainly up for debate whether we have to use threads here, but that’s just how it is at the moment. Long story short, we have a lot of threads.)
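
Schematically, the nesting of those pools looks something like this – a simplified, hypothetical sketch with made-up names and sizes, not bw’s actual scheduler:

from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical pool sizes; bw lets the user configure both.
node_pool = ThreadPoolExecutor(max_workers=32)  # $n nodes in parallel
item_pool = ThreadPoolExecutor(max_workers=8)   # $m items in parallel

def handle_item(node, item):
    pass  # in bw, this would run something like: ssh <node> stat ...

def handle_node(node):
    # Each node worker fans out its items to the (shared) item pool.
    wait([item_pool.submit(handle_item, node, i) for i in range(1000)])

wait([node_pool.submit(handle_node, n) for n in range(1300)])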

Now, bw must prevent those SSH processes from receiving SIGINT when the user presses ^C. If they received it, they would terminate right away. We don’t want that; we want “soft shutdown” behavior: ^C shall tell bw to stop scheduling new jobs, and existing jobs shall get a chance to finish properly. Only a second ^C shall trigger a “hard shutdown” and thus kill the child processes right away.
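
As a sketch (this is not bw’s actual code), such a two-stage SIGINT handler in the main process could look like this:

import signal

soft_shutdown = False  # to be checked by the job scheduler

def on_sigint(signum, frame):
    global soft_shutdown
    if soft_shutdown:
        # Second ^C: hard shutdown.
        raise KeyboardInterrupt
    # First ^C: stop scheduling new jobs, let running ones finish.
    soft_shutdown = True

signal.signal(signal.SIGINT, on_sigint)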

We cannot tell the SSH processes to ignore signals. They manage their own signal handlers. Instead, we must make sure that our children never receive any such signals in the first place. On ^C, the kernel’s terminal driver sends SIGINT to the foreground process group – hence, if we put those SSH processes in a different process group, they won’t get the signal.

Historically, the only way to do this in Python was to use preexec_fn:

from os import setpgrp

p = Popen(['touch', f'/tmp/foo-{i}'], preexec_fn=setpgrp)

While investigating this whole issue, I noticed that performance dramatically increases when I omit preexec_fn=. It wasn’t really clear at that point why this happened, and it didn’t even matter which function I used – the mere presence of this argument made a difference. For unrelated reasons, preexec_fn is being deprecated anyway, and Python 3.11 (which is still relatively new) added a different way to create a new process group for the child:

p = Popen(['touch', f'/tmp/foo-{i}'], process_group=0)

With process_group=0, bw’s performance was still good, so I figured it had to be some internal Python thing. I closed the case and moved on.

Shortly after, we found out that the bad performance was back with Python 3.12. Bummer. This led me to have a closer look at CPython’s source code. Whether or not I found a bug is still being investigated by the CPython devs, but I did get a better understanding of what’s going on here.

The underlying cause appears to be: Setting preexec_fn triggers a code path in CPython that uses fork() on Linux. Without that setting, it uses vfork() instead.
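
This is easy to observe with a tiny script (my sketch, not bw code) run under strace:

#!/usr/bin/env python3

# Run as: strace -f -e trace=clone,vfork python3 repro.py
# The first Popen() forces CPython's fork() path; the second one may
# take the vfork() path instead, depending on your Python version.

from os import setpgrp
from subprocess import Popen

Popen(['true'], preexec_fn=setpgrp).wait()  # fork()
Popen(['true']).wait()                      # possibly vfork()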

To isolate this whole thing from any further CPython issues, let’s have a look at the following C snippet forker.c:

/*

    $ rm -Rf /tmp/forker; mkdir /tmp/forker; \
        cc -std=c99 -Wall -Wextra -O2 -o forker forker.c -lpthread && \
        time ./forker ...

*/

#define _DEFAULT_SOURCE /* vfork */
#define _XOPEN_SOURCE   /* getopt */

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

size_t num_threads = 64;             /* -n */
size_t desired_total = 32768;        /* -t: total number of children */
size_t per_thread;
pid_t (*fork_function)(void) = fork; /* -v switches this to vfork */

void *
thread(void *arg)
{
    size_t *tnum = (size_t *)arg;
    size_t i, this_task;
    char buf[255] = "";
    int r;

    for (i = 0; i < per_thread; i++)
    {
        this_task = per_thread * *tnum + i;
        r = snprintf(buf, sizeof buf, "/tmp/forker/%zu", this_task);
        if (r < 0 || (size_t)r >= sizeof buf)
        {
            perror("snprintf");
            abort();
        }

        switch ((*fork_function)())
        {
            case 0:
                /* Child: exec() immediately. With vfork(), doing
                   anything else here would be unsafe. */
                execl("/usr/bin/touch", "touch", buf, (char *)NULL);
                _exit(255);
            case -1:
                perror("fork failed");
                abort();
        }
    }

    return NULL;
}

int
main(int argc, char **argv)
{
    size_t i;
    size_t *i_copy = NULL;
    pthread_t *tids = NULL;
    int opt;

    while ((opt = getopt(argc, argv, "n:t:v")) != -1)
    {
        switch (opt)
        {
            case 'n': num_threads = atoi(optarg);   break;
            case 't': desired_total = atoi(optarg); break;
            case 'v': fork_function = vfork;        break;
            default:
                fprintf(stderr, "Unknown command line argument\n");
                return EXIT_FAILURE;
        }
    }

    if (desired_total % num_threads != 0)
    {
        fprintf(stderr, "desired_total not divisible by num_threads\n");
        return EXIT_FAILURE;
    }
    per_thread = desired_total / num_threads;

    tids = calloc(num_threads, sizeof (pthread_t));
    if (tids == NULL)
    {
        perror("calloc tids");
        abort();
    }

    for (i = 0; i < num_threads; i++)
    {
        i_copy = malloc(sizeof (size_t));
        if (i_copy == NULL)
        {
            perror("malloc i_copy");
            abort();
        }

        *i_copy = i;

        if (pthread_create(&tids[i], NULL, thread, (void *)i_copy) != 0)
        {
            perror("pthread_create");
            abort();
        }
    }

    for (i = 0; i < num_threads; i++)
    {
        if (pthread_join(tids[i], NULL) != 0)
        {
            perror("pthread_join");
            abort();
        }
    }

    return EXIT_SUCCESS;
}

It does roughly the same thing as the Python example above:

The program has a switch to either use fork() or vfork(). It will try to write files to /tmp/forker and you must create this directory beforehand.

Be aware that this creates a lot of processes. You might even run into your ulimit limits.

Running ./forker -n 1 -t 1 through strace shows the difference. Without -v:

10:10:17.195961 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f88b13b3990) = 2079279

With -v:

10:12:40.038518 vfork() = 2079623

In our simple example, this is pretty much the only difference. In the real world with CPython, more things differ, but I don’t think that they matter for the issue at hand.

On Linux, the description of vfork() roughly reads as follows:

vfork()  is  a  special case of clone(2).  It is used to create
new processes without copying the page  tables  of  the  parent
process.

...

vfork() differs from fork(2) in that the calling thread is sus‐
pended until the child terminates (either normally, by  calling
_exit(2),  or abnormally, after delivery of a fatal signal), or
it makes a call to execve(2).   Until  that  point,  the  child
shares  all  memory  with its parent, including the stack.

It is a much “lighter” version of fork() and comes with critical restrictions. vfork() is/was intended for cases where you call exec() pretty much immediately afterwards:

Historic description
    Under Linux, fork(2) is implemented using copy-on-write  pages,
    so  the only penalty incurred by fork(2) is the time and memory
    required to duplicate the parent's page tables, and to create a
    unique task structure for the child.  However, in the  bad  old
    days  a  fork(2)  would  require  making a complete copy of the
    caller's data space, often needlessly,  since  usually  immedi‐
    ately  afterward  an  exec(3) is done.  Thus, for greater effi‐
    ciency, BSD introduced the vfork() system call, which  did  not
    fully  copy  the  address space of the parent process, but bor‐
    rowed the parent's memory and thread of control until a call to
    execve(2) or an exit occurred.  The  parent  process  was  sus‐
    pended  while  the  child  was using its resources.  The use of
    vfork() was tricky: for example, not modifying data in the par‐
    ent process depended on knowing which variables were held in  a
    register.

With Python’s preexec_fn, it is clear that we cannot use vfork(), because preexec_fn means calling back into Python code after fork() and that just won’t work with vfork().

Now, one other thing about BundleWrap’s performance: We always suspected that it gets much worse when you ramp up the number of threads (but we never really investigated this, either). So, let’s have a look at how our little C program behaves when we do that.

Running time ./forker -t 32768 with different -n arguments, we get the following numbers (sys time in seconds):

Without -v for fork():

-n    sys
---------
  1   1.6
  2   1.7
  4   2.0
  8   2.4
 16   3.1
 32   4.6
 64   7.9
128  14.3
256  27.3
512  58.2

So, the more threads there are, the slower it gets. It appears to be O(n).

And with -v for vfork():

-n    sys
---------
  1   0.5
  2   0.6
  4   0.6
  8   0.6
 16   0.6
 32   0.6
 64   0.6
128   0.6
256   0.6
512   0.6

Boom, O(1).

And it appears that this is exactly the penalty we’re seeing in BundleWrap (on Linux) when you use lots of threads.

Trying to verify that it really is faster

This is such a dramatic difference that I’m a bit skeptical. Did I make a mistake here? Did I not measure the times correctly? Something else? If you find an error, please tell me.

What I can tell you is that the created files really do have the expected timestamps. For example, with vfork():

$ rm -Rf /tmp/forker; mkdir /tmp/forker; \
    cc -std=c99 -Wall -Wextra -O2 -o forker forker.c -lpthread && \
    time ./forker -n 256 -t 32768 -v

real    0m2.796s
user    0m0.036s
sys     0m0.613s

This run took about 3 seconds. And here are the timestamps:

$ stat --printf '%y\n' * | sort | sed -n '1p; $p'
2023-11-26 14:07:44.278949269 +0100
2023-11-26 14:07:49.525587566 +0100

This is a bit more than 3 seconds, but remember that our program only launches the processes and doesn’t wait for them to finish. It is to be expected that forker exits before all the children have done their work.

So let’s see how it looks without -v:

$ rm -Rf /tmp/forker; mkdir /tmp/forker; \
    cc -std=c99 -Wall -Wextra -O2 -o forker forker.c -lpthread && \
    time ./forker -n 256 -t 32768

real    0m29.658s
user    0m0.149s
sys     0m28.645s

The timestamps are 30 seconds apart now, as expected:

$ stat --printf '%y\n' * | sort | sed -n '1p; $p'
2023-11-26 14:10:29.371345172 +0100
2023-11-26 14:10:59.001168275 +0100

In both cases, find -type f | wc -l yields the expected number of 32768.

On platforms other than Linux

On OpenBSD 7.4, -v or no -v makes no difference.

openbsd$ time ./forker -n 128 -t 1024
    0m01.89s real     0m00.03s user     0m00.43s system

openbsd$ time ./forker -n 128 -t 1024 -v
    0m01.93s real     0m00.02s user     0m00.42s system

Their manpage says:

HISTORY
     The vfork() function call appeared in 3.0BSD with the additional
     semantics that the child process ran in the memory of the parent until it
     called execve(2) or exited.  That sharing of memory was removed in
     4.4BSD, leaving just the semantics of blocking the parent until the child
     calls execve(2) or exits.  On many other systems the original behavior
     has been restored, making this interface particularly unportable.

The Linux manpage mentions that NetBSD has vfork() (again), so let’s have a look at NetBSD 9.3:

netbsd$ time ./forker -n 128 -t 1024
       14.34 real         1.91 user        21.83 sys

netbsd$ time ./forker -n 128 -t 1024 -v
        0.81 real         0.00 user         0.04 sys

There you go.

Due to its general wonkiness, vfork() is no longer part of POSIX: it was marked obsolescent in POSIX.1-2001 and removed in POSIX.1-2008. The manpage is riddled with caveats and warnings. The CPython behavior (choosing a code path with either fork() or vfork()) is a Linux-only thing and in fact behind an #ifdef __linux__. This whole issue is very platform-specific.

Real world implications in BundleWrap

That C program is certainly an academic example. What is the actual performance benefit that we saw in bw on Linux?

This is a bit hard to measure, because we’re talking about real server systems here. They do more than just sit there and wait for me to run some tests. Another complication is that I made more optimizations in bw than just the vfork() thingy, even though that one has the largest impact. And finally, it was so slow before that proper benchmarking would take a couple of hours or a day – I’m not willing to do that at the moment.

What I can say is this:

We regularly run bw apply on all of our systems. We do that in batches of 100 nodes. Before, one such batch took roughly one hour. Now it is down to roughly 8 minutes. My guess is that the vfork() issue accounts for a factor of roughly 5 here.

Most importantly, though, we should now be able to use larger worker pools. My laptop at work only has 4 cores (plus hyperthreading). What if I had a larger machine? We no longer get a penalty for having lots of workers and I’d be really curious to see what a modern CPU with 16, 32 or 64 cores could do.

What about posix_spawn()?

That’s another area to explore. It looks promising:

fork() step
    Since glibc 2.24, the posix_spawn() function commences by call‐
    ing clone(2) with CLONE_VM and CLONE_VFORK flags.  Older imple‐
    mentations use fork(2), or possibly vfork(2) (see below).
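
For what it’s worth, Python exposes this call directly as os.posix_spawn() since 3.8. A minimal sketch (not something bw uses today; path and arguments are made up) which even puts the child into its own process group via a spawn attribute instead of a callback:

import os

# setpgroup=0 corresponds to POSIX_SPAWN_SETPGROUP: the child gets
# its own process group, without any callback into Python code.
pid = os.posix_spawn(
    '/usr/bin/touch',
    ['touch', '/tmp/foo'],
    os.environ,
    setpgroup=0,
)
os.waitpid(pid, 0)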

It might be similarly fast but with a much cleaner interface. It’s a topic for another time, though.
