A story about file descriptors and flags

2021-05-28

This is a repost from my posting at nixers.net.

Dear UNIX diary,

today, I had fun with file descriptor flags at work.

I had a Python script that failed with EAGAIN while doing a print() on stdout. Say what?

Looking at write(2), we can see this:

EAGAIN The  file  descriptor  fd  refers to a file other than a
       socket and has been marked nonblocking (O_NONBLOCK), and
       the  write would block.  See open(2) for further details
       on the O_NONBLOCK flag.

EAGAIN or EWOULDBLOCK
       The file descriptor fd refers to a socket and  has  been
       marked  nonblocking  (O_NONBLOCK),  and  the write would
       block. [...]

You can trigger the error using the following snippet of C code:

#include <fcntl.h>
#include <stdio.h>

int
main()
{
    int i;

    if (fcntl(1, F_SETFL, O_NONBLOCK) == -1)
    {
        perror("fcntl");
        return 1;
    }

    for (i = 0; i < 4 * 8192; i++)
        printf("hello world");  /* no trailing \n */

    return 0;
}

Just run it in a terminal, maybe under strace to see what’s going on. It goes something like this:

17:53:50.053752 write(1, "ello worldhello worldhello world"..., 1024) = 1024
17:53:50.053779 write(1, "llo worldhello worldhello worldh"..., 1024) = 1024
17:53:50.053949 write(1, "rldhello worldhello worldhello w"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)
17:53:50.053975 write(1, "hello worldhello worldhello worl"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)
17:53:50.054436 write(1, "hello worldhello worldhello worl"..., 1024) = 1024
17:53:50.054569 write(1, " worldhello worldhello worldhell"..., 1024) = 1024
17:53:50.054596 write(1, "worldhello worldhello worldhello"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)
17:53:50.054620 write(1, "hello worldhello worldhello worl"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)
17:53:50.054645 write(1, "hello worldhello worldhello worl"..., 1024) = -1 EAGAIN (Resource temporarily unavailable)

libc does some buffering and then eventually tries to write chunks of 1024 bytes. Since we set the file descriptor to nonblocking, the syscall immediately returns and doesn’t wait for the data to be actually written to its final destination. So, I guess, there’s another kernel-internal buffer. When that one is full, we get EAGAIN.

My original Python script was indeed writing lots of data. So, somehow, my stdout must have been set to nonblocking mode. I checked the code, but I couldn’t find anything that did this. But okay, maybe Python does this internally for some weird reason? I tried to reproduce this in isolated test cases, but nope, I couldn’t see Python doing an fcntl().

Alright, what else does my script do? It forks and runs ssh. Mhm. But I used Python’s check_output(), so ssh’s stdout was a newly created pipe. At first, I couldn’t see how ssh could possibly botch my stdout?!

Then another idea: This script was being run as a systemd service. Meaning, the stdout we’re talking about is actually a socket which is connected to journald. So maybe systemd screws up and sets this socket to nonblocking for whatever reason … ? But I couldn’t verify that theory, either.

And then it dawned on me: systemd connects the same socket to both stdout and stderr. So, if ssh did something to its stderr, that could affect my Python script’s stdout.

Let’s give this a shot:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main()
{
    int ret, i, fd;

    if (fork() == 0)
    {
        fd = open("/dev/null", O_WRONLY);
        dup2(fd, 1);
        close(fd);

        execlp("strace", "strace", "-o", "ssh.log",
               "ssh", "myserver.example.com", "date",
               (char *)0);
        return 1;
    }

    for (i = 0; i < 10; i++)
    {
        if ((ret = fcntl(1, F_GETFL)) < 0)
        {
            perror("fcntl");
            return 1;
        }
        if (ret & O_NONBLOCK)
            puts("Is NONblocking");
        else
            puts("Is blocking");

        sleep(1);
    }

    return 0;
}

And here’s the glorious output, stdout and stderr in the parent being the same pipe:

$ ./borked 2>&1 | cat
Is blocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking
Is NONblocking

The child process did close its original stdout (got reopened from /dev/null), but stdout in the parent process still got changed.

Now, the original environment uses ssh multiplexing, so I used that in my now isolated test case as well. I reproduced it without multiplexing as well, I put that case at the end of the posting.

The ssh process I’ve spawned transfers the three standard file descriptors to the main multiplexing process:

19:11:32.186688 sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[0]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186753 sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[1]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186795 sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[2]}], msg_controllen=24, msg_flags=0}, 0) = 1

And that process, in turn, does this:

19:11:32.186811 recvmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[6]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186853 recvmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[7]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186903 recvmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[8]}], msg_controllen=24, msg_flags=0}, 0) = 1
19:11:32.186941 ioctl(6, TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:11:32.186976 ioctl(7, TCGETS, 0x7ffdce246db0) = -1 ENOTTY (Inappropriate ioctl for device)
19:11:32.187004 fcntl(7, F_GETFL)       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
19:11:32.187030 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
19:11:32.187054 ioctl(8, TCGETS, 0x7ffdce246db0) = -1 ENOTTY (Inappropriate ioctl for device)
19:11:32.187079 fcntl(8, F_GETFL)       = 0x1 (flags O_WRONLY)
19:11:32.187104 fcntl(8, F_SETFL, O_WRONLY|O_NONBLOCK) = 0    <-------- TADA

There you have it. It sets fd 8, which is the stderr it received, to nonblocking. Since that is the same pipe as the parent process’s stdout, both are changed.

Duplicated file descriptors share some things, but not everything. dup(2) says:

After a successful return, the old and new file descriptors may
be used interchangeably.  They refer to the same open file  de‐
scription  (see  open(2))  and  thus share file offset and file
status flags; [...]

The two file descriptors do not  share  file  descriptor  flags [...]

And “file status flags” are those set by fcntl(..., F_SETFL, ...), but “file descriptor flags” are something else:

F_SETFL       Set the file status flags, defined in  <fcntl.h>, [...]

F_SETFD       Set  the  file  descriptor  flags defined in [...]

Now the funny thing is, you might say: “Well, just don’t spawn processes and let them inherit your file descriptors. Of course they can do nasty things.” Yeah, but this was actually done intentionally: My Python script spawned ssh and left its stderr untouched in an effort to automatically have ssh’s error messages end up at the same destination (i.e., systemd journal) as my Python logs. I now have to manually collect and handle ssh’s stderr, which is more code and more error prone.

Alright, so let’s quickly look at the case where we don’t use ssh multiplexing. At first, I thought this didn’t trigger the “bug”, but it indeed does, it’s just a little harder to spot. Here’s a partial strace log:

20:05:14.074435 dup(0)                  = 4
20:05:14.074505 dup(1)                  = 5
20:05:14.074532 dup(2)                  = 6
...
20:05:14.074674 fcntl(6, F_GETFL)       = 0x1 (flags O_WRONLY)
20:05:14.074691 fcntl(6, F_SETFL, O_WRONLY|O_NONBLOCK) = 0

... then at the very end:
20:05:14.167997 close(6)                = 0
20:05:14.168232 fcntl(2, F_GETFL)       = 0x801 (flags O_WRONLY|O_NONBLOCK)
20:05:14.168249 fcntl(2, F_SETFL, O_WRONLY) = 0
20:05:14.168430 exit_group(0)           = ?

In other words, ssh resets the flags in this case. My ssh command finished very quickly, so I didn’t see a “NONblocking” line at first.

All this is heavy usage of strace. I love this tool. I haven’t read ssh’s source code, though, that might be interesting as well. But that’s something for another day.

Comments?