blog · git · desktop · images · contact & privacy · gopher
2021-05-28
This is a repost from my posting at nixers.net.
Dear UNIX diary,
today, I had fun with file descriptor flags at work.
I had a Python script that failed with EAGAIN
while doing a print()
on stdout
. Say what?
Looking at write(2)
, we can see this:
EAGAIN The file descriptor fd refers to a file other than a
socket and has been marked nonblocking (O_NONBLOCK), and
the write would block. See open(2) for further details
on the O_NONBLOCK flag.
EAGAIN or EWOULDBLOCK
The file descriptor fd refers to a socket and has been
marked nonblocking (O_NONBLOCK), and the write would
block. [...]
You can trigger the error using the following snippet of C code:
#include <fcntl.h>
#include <stdio.h>
int
main()
{
int i;
if (fcntl(1, F_SETFL, O_NONBLOCK) == -1)
{
perror("fcntl");
return 1;
}
for (i = 0; i < 4 * 8192; i++)
printf("hello world"); /* no trailing \n */
return 0;
}
Just run it in a terminal, maybe under strace
to see what’s going on.
It goes something like this:
17:53:50.053752 write(1, "ello worldhello worldhello world"..., 1024) = 1024
17:53:50.053779 write(1, "llo worldhello worldhello worldh"..., 1024) = 1024
17:53:50.053949 write(1, "rldhello worldhello worldhello w"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)
17:53:50.053975 write(1, "hello worldhello worldhello worl"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)
17:53:50.054436 write(1, "hello worldhello worldhello worl"..., 1024) = 1024
17:53:50.054569 write(1, " worldhello worldhello worldhell"..., 1024) = 1024
17:53:50.054596 write(1, "worldhello worldhello worldhello"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)
17:53:50.054620 write(1, "hello worldhello worldhello worl"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)
17:53:50.054645 write(1, "hello worldhello worldhello worl"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)
libc does some buffering and then eventually tries to write chunks of
1024 bytes. Since we set the file descriptor to nonblocking, the syscall
immediately returns and doesn’t wait for the data to be actually written
to its final destination. So, I guess, there’s another kernel-internal
buffer. When that one is full, we get EAGAIN
.
My original Python script was indeed writing lots of data. So, somehow,
my stdout
must have been set to nonblocking mode. I checked the code,
but I couldn’t find anything that did this. But okay, maybe Python does
this internally for some weird reason? I tried to reproduce this in
isolated test cases, but nope, I couldn’t see Python doing an fcntl()
.
Alright, what else does my script do? It forks and runs ssh. Mhm. But I
used Python’s check_output()
, so ssh’s stdout
was a newly created
pipe. At first, I couldn’t see how ssh could possibly botch my
stdout
?!
Then another idea: This script was being run as a systemd service.
Meaning, the stdout
we’re talking about is actually a socket which is
connected to journald. So maybe systemd screws up and sets this socket
to nonblocking for whatever reason … ? But I couldn’t verify that
theory, either.
And then it dawned on me: systemd connects the same socket to
both stdout
and stderr
. So, if ssh did something to its stderr
,
that could affect my Python script’s stdout
.
Let’s give this a shot:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int
main()
{
int ret, i, fd;
if (fork() == 0)
{
fd = open("/dev/null", O_WRONLY);
dup2(fd, 1);
close(fd);
execlp("strace", "strace", "-o", "ssh.log",
"ssh", "myserver.example.com", "date",
(char *)0);
return 1;
}
for (i = 0; i < 10; i++)
{
if ((ret = fcntl(1, F_GETFL)) < 0)
{
perror("fcntl");
return 1;
}
if (ret & O_NONBLOCK)
puts("Is NONblocking");
else
puts("Is blocking");
sleep(1);
}
return 0;
}
And here’s the glorious output, stdout
and stderr
in the parent
being the same pipe:
$ ./borked 2>&1 | cat
Is blocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
The child process did close its original stdout
(got reopened from
/dev/null
), but stdout
in the parent process still got changed.
Now, the original environment uses ssh multiplexing, so I used that in my now isolated test case as well. I reproduced it without multiplexing as well, I put that case at the end of the posting.
The ssh process I’ve spawned transfers the three standard file descriptors to the main multiplexing process:
19:11:32.186688 sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[0]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186753 sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[1]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186795 sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[2]}], msg_controllen=24, msg_flags=0}, 0) = 1
And that process, in turn, does this:
19:11:32.186811 recvmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[6]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186853 recvmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[7]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186903 recvmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[8]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186941 ioctl(6, TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:11:32.186976 ioctl(7, TCGETS, 0x7ffdce246db0) = -1 ENOTTY (Inappropriate ioctl for device)
19:11:32.187004 fcntl(7, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
19:11:32.187030 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
19:11:32.187054 ioctl(8, TCGETS, 0x7ffdce246db0) = -1 ENOTTY (Inappropriate ioctl for device)
19:11:32.187079 fcntl(8, F_GETFL) = 0x1 (flags O_WRONLY)
19:11:32.187104 fcntl(8, F_SETFL, O_WRONLY|O_NONBLOCK) = 0 <-------- TADA
There you have it. It sets fd 8, which is the stderr
it received, to
nonblocking. Since that is the same pipe as the parent process’s
stdout
, both are changed.
Duplicated file descriptors share some things, but not everything.
dup(2)
says:
After a successful return, the old and new file descriptors may
be used interchangeably. They refer to the same open file de‐
scription (see open(2)) and thus share file offset and file
status flags; [...]
The two file descriptors do not share file descriptor flags [...]
And “file status flags” are those set by fcntl(..., F_SETFL, ...)
, but
“file descriptor flags” are something else:
F_SETFL Set the file status flags, defined in <fcntl.h>, [...]
F_SETFD Set the file descriptor flags defined in [...]
Now the funny thing is, you might say: “Well, just don’t spawn processes
and let them inherit your file descriptors. Of course they can do nasty
things.” Yeah, but this was actually done intentionally: My
Python script spawned ssh and left its stderr
untouched in an effort
to automatically have ssh’s error messages end up at the same
destination (i.e., systemd journal) as my Python logs. I now have to
manually collect and handle ssh’s stderr
, which is more code and more
error prone.
Alright, so let’s quickly look at the case where we don’t use ssh
multiplexing. At first, I thought this didn’t trigger the “bug”, but it
indeed does, it’s just a little harder to spot. Here’s a partial
strace
log:
20:05:14.074435 dup(0) = 4
20:05:14.074505 dup(1) = 5
20:05:14.074532 dup(2) = 6
...
20:05:14.074674 fcntl(6, F_GETFL) = 0x1 (flags O_WRONLY)
20:05:14.074691 fcntl(6, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
... then at the very end:
20:05:14.167997 close(6) = 0
20:05:14.168232 fcntl(2, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK)
20:05:14.168249 fcntl(2, F_SETFL, O_WRONLY) = 0
20:05:14.168430 exit_group(0) = ?
In other words, ssh resets the flags in this case. My ssh command finished very quickly, so I didn’t see a “NONblocking” line at first.
All this is heavy usage of strace
. I love this tool. I haven’t read
ssh’s source code, though, that might be interesting as well. But that’s
something for another day.