fork() vs. vfork() and subprocess.Popen in Python
2023-11-26
In BundleWrap, we fork off a lot of processes. It’s part of bw’s
design: It is supposed to operate over SSH and it is supposed to use
your native SSH implementation (so that you can rely on all features
being available). This means that, in order to check items like files
and their permissions or whatever, we have to run something like ssh
myhost stat /etc/foo.conf
. In practice, we have around 1300 managed
systems (called “nodes”) and each node has about 1000 items. That is a
lot of forking.
We have long suspected that something was “wrong” in this area: We have
seen a high sys time for the main bw process in tools like htop, meaning
it spends a lot of time stuck in syscalls. Until recently, though, we
had not had the chance to investigate this any further.
What bw’s code ultimately does is roughly this pattern:
#!/usr/bin/env python3

from concurrent.futures import ThreadPoolExecutor, wait
from subprocess import Popen

def target(i):
    p = Popen(['touch', f'/tmp/foo-{i}'])
    p.communicate()

executor = ThreadPoolExecutor(max_workers=64)
futures = set()
for i in range(10_000):
    futures.add(executor.submit(target, i))
wait(futures)
We have a ThreadPoolExecutor
and each thread then spawns the
processes.
(There are several such pools in bw. It allows the user to handle $n
nodes in parallel and, for each node, $m
items in parallel. Since those systems are usually independent of each
other, it’s not unreasonable to set the size of the node worker pool to
something like 32 or even 128 or 256, and the item worker pool to 8 or 16.
All this hinges on SSH multiplexing and requires the target
systems to be configured accordingly. It is certainly up for debate
whether we have to use threads here, but that’s just how it is at the
moment. Long story short, we have a lot of threads.)
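In skeleton form, that nesting looks roughly like this (pool sizes,
node names, and the per-node item count here are made up for
illustration; bw’s real code is more involved):

```python
from concurrent.futures import ThreadPoolExecutor, wait

def handle_item(node, item):
    # In bw this would run something like `ssh <node> stat <item>`.
    return (node, item)

def handle_node(node):
    # Inner pool: this node's items, $m at a time.
    with ThreadPoolExecutor(max_workers=8) as items:
        wait({items.submit(handle_item, node, i) for i in range(20)})
    return node

# Outer pool: $n nodes at a time. Worst case, $n * $m threads are
# spawning child processes concurrently.
with ThreadPoolExecutor(max_workers=32) as nodes:
    results = [f.result() for f in [nodes.submit(handle_node, n)
                                    for n in ('node1', 'node2', 'node3')]]
```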
Now, bw must ensure that those SSH processes do not receive any SIGINT
signals when the user presses ^C. If they did receive that, they would
terminate right away. We don’t want that; we want a “soft shutdown”
behavior: ^C shall tell bw to stop scheduling new jobs, and existing
jobs shall get a chance to finish properly. Only a second ^C shall
trigger a “hard shutdown” and thus kill the child processes right away.
We cannot tell the SSH processes to ignore signals. They manage their
own signal handlers. Instead, we must prevent our children from
receiving any such signals in the first place. SIGINT is sent by the
terminal emulator to the foreground process group – hence, if we put
those SSH processes in a different process group, they won’t get the
signal.
Historically, the only way to do this in Python was to use preexec_fn
:
from os import setpgrp
p = Popen(['touch', f'/tmp/foo-{i}'], preexec_fn=setpgrp)
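We can verify that this really moves the child out of our process group
(a quick sketch; `sleep` just stands in for a long-running ssh process):

```python
import os
from subprocess import Popen

# Spawn a child that calls setpgrp() between fork() and exec().
p = Popen(['sleep', '30'], preexec_fn=os.setpgrp)

# The child now leads its own process group, which is not ours --
# so a terminal-generated SIGINT would not reach it.
assert os.getpgid(p.pid) == p.pid
assert os.getpgid(p.pid) != os.getpgid(0)

p.kill()
p.wait()
```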
While investigating this whole issue, I noticed that performance
dramatically increases when I omit preexec_fn=
. It wasn’t really clear
at that point why this happened and it didn’t even matter which
function I used – the mere presence of this argument made a difference.
For unrelated reasons, preexec_fn
is being
deprecated anyway and they added a different way to
create a new process group for the child in Python 3.11 (which is still
relatively new):
p = Popen(['touch', f'/tmp/foo-{i}'], process_group=0)
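Since code often has to run on interpreters older than 3.11 too, a
compatibility shim along these lines is an option (a sketch, not bw’s
actual code – whether the new keyword keeps the fast path depends on
the Python version, as we’ll see below):

```python
import os
import sys
from subprocess import Popen

def popen_in_new_pgroup(args, **kwargs):
    # Prefer the 3.11+ keyword, which avoids preexec_fn entirely;
    # fall back to preexec_fn (and thus fork()) on older Pythons.
    if sys.version_info >= (3, 11):
        return Popen(args, process_group=0, **kwargs)
    return Popen(args, preexec_fn=os.setpgrp, **kwargs)

p = popen_in_new_pgroup(['true'])
p.wait()
```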
The performance of bw was still good, so it had to be some internal Python thing, I thought. I closed the case and moved on.
Shortly after, we found out that the bad performance was back with Python 3.12. Bummer. This led me to have a closer look at CPython’s source code. Whether or not I found a bug is still being investigated by the CPython devs, but I did get a better understanding of what’s going on here.
The underlying cause appears to be: Setting preexec_fn
triggers a code
path in CPython that uses fork()
on Linux. Without that setting, it
uses vfork()
instead.
To isolate this whole thing from any further CPython issues, let’s have
a look at the following C snippet forker.c
:
/*
$ rm -Rf /tmp/forker; mkdir /tmp/forker; \
cc -std=c99 -Wall -Wextra -O2 -o forker forker.c -lpthread && \
time ./forker ...
*/

#define _DEFAULT_SOURCE /* vfork */
#define _XOPEN_SOURCE /* getopt */

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

size_t num_threads = 64;
size_t desired_total = 32768;
size_t per_thread;
pid_t (*fork_function)(void) = fork;

void *
thread(void *arg)
{
    size_t *tnum = (size_t *)arg;
    size_t i, this_task;
    char buf[255] = "";
    int r;

    for (i = 0; i < per_thread; i++)
    {
        this_task = per_thread * *tnum + i;
        r = snprintf(buf, sizeof buf, "/tmp/forker/%zu", this_task);
        if (r < 0 || (size_t)r >= sizeof buf)
        {
            perror("snprintf");
            abort();
        }
        switch ((*fork_function)())
        {
            case 0:
                execl("/usr/bin/touch", "touch", buf, (char *)NULL);
                _exit(255);
            case -1:
                perror("fork failed");
                abort();
        }
    }

    return NULL;
}

int
main(int argc, char **argv)
{
    size_t i;
    size_t *i_copy = NULL;
    pthread_t *tids = NULL;
    int opt;

    while ((opt = getopt(argc, argv, "n:t:v")) != -1)
    {
        switch (opt)
        {
            case 'n': num_threads = atoi(optarg); break;
            case 't': desired_total = atoi(optarg); break;
            case 'v': fork_function = vfork; break;
            default:
                fprintf(stderr, "Unknown command line argument\n");
                return EXIT_FAILURE;
        }
    }

    if (desired_total % num_threads != 0)
    {
        fprintf(stderr, "desired_total not divisible by num_threads\n");
        return EXIT_FAILURE;
    }
    per_thread = desired_total / num_threads;

    tids = calloc(num_threads, sizeof (pthread_t));
    if (tids == NULL)
    {
        perror("calloc tids");
        abort();
    }

    for (i = 0; i < num_threads; i++)
    {
        i_copy = malloc(sizeof (size_t));
        if (i_copy == NULL)
        {
            perror("malloc i_copy");
            abort();
        }
        *i_copy = i;
        if (pthread_create(&tids[i], NULL, thread, (void *)i_copy) != 0)
        {
            perror("pthread_create");
            abort();
        }
    }

    for (i = 0; i < num_threads; i++)
    {
        if (pthread_join(tids[i], NULL) != 0)
        {
            perror("pthread_join");
            abort();
        }
    }

    return EXIT_SUCCESS;
}
It does a similar thing to the Python example above: The program has a
switch to use either fork() or vfork(). It will try to write files to
/tmp/forker and you must create this directory beforehand.
Be aware that this creates a lot of processes. You might even run into
your ulimit limits.
Running ./forker -n 1 -t 1
through strace
shows the difference.
Without -v
:
10:10:17.195961 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f88b13b3990) = 2079279
With -v
:
10:12:40.038518 vfork() = 2079623
In our simple example, this is pretty much the only difference. In the real world with CPython, more things differ, but I don’t think that they matter for the issue at hand.
On Linux, the description of vfork()
roughly reads as follows:
vfork() is a special case of clone(2). It is used to create
new processes without copying the page tables of the parent
process.

...

vfork() differs from fork(2) in that the calling thread is
suspended until the child terminates (either normally, by
calling _exit(2), or abnormally, after delivery of a fatal
signal), or it makes a call to execve(2). Until that point,
the child shares all memory with its parent, including the
stack.
It is a much “lighter” version of fork()
and comes with critical
restrictions. vfork()
is/was intended to be used in cases where you
pretty much immediately call exec()
right after it:
Historic description
Under Linux, fork(2) is implemented using copy-on-write pages,
so the only penalty incurred by fork(2) is the time and memory
required to duplicate the parent's page tables, and to create
a unique task structure for the child. However, in the bad old
days a fork(2) would require making a complete copy of the
caller's data space, often needlessly, since usually
immediately afterward an exec(3) is done. Thus, for greater
efficiency, BSD introduced the vfork() system call, which did
not fully copy the address space of the parent process, but
borrowed the parent's memory and thread of control until a
call to execve(2) or an exit occurred. The parent process was
suspended while the child was using its resources. The use of
vfork() was tricky: for example, not modifying data in the
parent process depended on knowing which variables were held
in a register.
With Python’s preexec_fn, it is clear that we cannot use vfork(),
because preexec_fn means calling back into Python code between fork()
and exec() – and that just won’t work with vfork(), where the child
still shares all of the parent’s memory, including the interpreter’s
state.
Now, one other thing about BundleWrap’s performance is that we always suspected it gets much worse when you ramp up the number of threads (but we have never really investigated this, either). So, let’s have a look at how our little C program above behaves when we do that.
Running time ./forker -t 32768
with different -n
arguments, we get
the following numbers (sys
in seconds):
Without -v
for fork()
:
-n sys
---------
1 1.6
2 1.7
4 2.0
8 2.4
16 3.1
32 4.6
64 7.9
128 14.3
256 27.3
512 58.2
So, the more threads there are, the slower it gets. It appears to be
O(n)
.
And with -v
for vfork()
:
-n sys
---------
1 0.5
2 0.6
4 0.6
8 0.6
16 0.6
32 0.6
64 0.6
128 0.6
256 0.6
512 0.6
Boom, O(1)
.
And it appears that this is exactly the penalty we’re seeing in BundleWrap (on Linux) when you use lots of threads.
This is such a dramatic difference that I’m a bit skeptical. Did I make a mistake here? Did I not measure the times correctly? Something else? If you find an error, please tell me.
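The effect can also be reproduced from Python directly, since on Linux
preexec_fn forces the fork() path. Here is a rough harness (pool size
and process count are arbitrary; absolute numbers will vary a lot
between machines and Python versions):

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor, wait
from subprocess import DEVNULL, Popen

def spawn_many(n, preexec_fn):
    # Spawn n short-lived children from a pool of threads and time it.
    def target():
        p = Popen(['true'], preexec_fn=preexec_fn,
                  stdout=DEVNULL, stderr=DEVNULL)
        p.wait()
    with ThreadPoolExecutor(max_workers=32) as ex:
        start = time.monotonic()
        wait({ex.submit(target) for _ in range(n)})
        return time.monotonic() - start

t_plain = spawn_many(200, None)          # vfork() path (on Linux)
t_preexec = spawn_many(200, os.setpgrp)  # fork() path
print(f'plain: {t_plain:.2f}s  preexec_fn: {t_preexec:.2f}s')
```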
What I can tell you is that the created files really do have the
expected timestamps. For example, with vfork()
:
$ rm -Rf /tmp/forker; mkdir /tmp/forker; \
cc -std=c99 -Wall -Wextra -O2 -o forker forker.c -lpthread && \
time ./forker -n 256 -t 32768 -v
real 0m2.796s
user 0m0.036s
sys 0m0.613s
This run took about 3 seconds. And here are the timestamps:
$ stat --printf '%y\n' * | sort | sed -n '1p; $p'
2023-11-26 14:07:44.278949269 +0100
2023-11-26 14:07:49.525587566 +0100
This is a bit more than 3 seconds, but remember that our program only
launches the processes and doesn’t care if/when they return. It is to be
expected that forker
finished before all the children did their work.
So let’s see how it looks without -v
:
$ rm -Rf /tmp/forker; mkdir /tmp/forker; \
cc -std=c99 -Wall -Wextra -O2 -o forker forker.c -lpthread && \
time ./forker -n 256 -t 32768
real 0m29.658s
user 0m0.149s
sys 0m28.645s
The timestamps are 30 seconds apart now, as expected:
$ stat --printf '%y\n' * | sort | sed -n '1p; $p'
2023-11-26 14:10:29.371345172 +0100
2023-11-26 14:10:59.001168275 +0100
In both cases, find -type f | wc -l
yields the expected number of
32768
.
On OpenBSD 7.4, -v
or no -v
makes no difference.
openbsd$ time ./forker -n 128 -t 1024
0m01.89s real 0m00.03s user 0m00.43s system
openbsd$ time ./forker -n 128 -t 1024 -v
0m01.93s real 0m00.02s user 0m00.42s system
Their manpage says:
HISTORY
The vfork() function call appeared in 3.0BSD with the additional
semantics that the child process ran in the memory of the parent until it
called execve(2) or exited. That sharing of memory was removed in
4.4BSD, leaving just the semantics of blocking the parent until the child
calls execve(2) or exits. On many other systems the original behavior
has been restored, making this interface particularly unportable.
The Linux manpage mentions that NetBSD has vfork()
(again), so let’s
have a look at NetBSD 9.3:
netbsd$ time ./forker -n 128 -t 1024
14.34 real 1.91 user 21.83 sys
netbsd$ time ./forker -n 128 -t 1024 -v
0.81 real 0.00 user 0.04 sys
There you go.
Due to its general wonkiness, vfork() is no longer part of POSIX – it
was marked obsolescent and finally removed in POSIX.1-2008. The manpage
is riddled with caveats and warnings. The CPython behavior (different
code paths using either fork() or vfork()) is a Linux thing only and in
fact behind an #ifdef __linux__. This whole issue is very platform
specific.
That C program is certainly an academic example. What is the actual performance benefit that we saw in bw on Linux?
This is a bit hard to measure, because we’re talking about real server
systems here. They do more than just sit there and wait for me to do
some tests. Another complication is that I did some more optimizations
in bw than just the vfork()
thingy, even though this has the largest
impact. On top of that, it was so slow before that proper benchmarking
would take a couple of hours or a day – I’m not willing to do that at
the moment.
What I can say is this:
We regularly run bw apply
on all of our systems. We do that in batches
of 100 nodes. Before, one such batch took roughly one hour. Now it is
down to roughly 8 minutes. My guess is that the vfork()
issue
accounts for a factor of roughly 5 here.
Most importantly, though, we should now be able to use larger worker pools. My laptop at work only has 4 cores (plus hyperthreading). What if I had a larger machine? We no longer get a penalty for having lots of workers and I’d be really curious to see what a modern CPU with 16, 32 or 64 cores could do.
posix_spawn()?

That’s another area to explore. It looks promising:
fork() step
Since glibc 2.24, the posix_spawn() function commences by call‐
ing clone(2) with CLONE_VM and CLONE_VFORK flags. Older imple‐
mentations use fork(2), or possibly vfork(2) (see below).
It might be similarly fast but with a much cleaner interface. It’s a topic for another time, though.
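From Python, posix_spawn() is already reachable as os.posix_spawn()
(in the stdlib since Python 3.8), so a first experiment wouldn’t even
need C. A minimal sketch:

```python
import os

# Spawn a child directly via posix_spawn(3); on glibc >= 2.24 this
# should take the CLONE_VM|CLONE_VFORK path quoted above.
pid = os.posix_spawn('/bin/sh', ['sh', '-c', 'true'], os.environ)
_, status = os.waitpid(pid, 0)
assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```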