blog · git · desktop · images · contact


Sudden death of a daemon

2020-07-22

Every now and then, a daemon of ours was exiting unexpectedly. It didn’t log anything. The only thing I saw was systemd logging something along these lines: “Process exited with status code 143.” Other than that, the daemon was just gone. Sometimes. Maybe every other day, maybe only once in a month. Fun.

The first question here is: Did the process choose to exit with code 143 or is this a result of a convention, where exit codes above 128 designate the receipt of some signal? If the latter was true, then it would be a SIGTERM, because 128 + 15 = 143.

In theory, there is a chance to find out:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    pid_t child;
    int wstatus;

    child = fork();
    if (child == 0)
    {
        if (argc >= 2 && strcmp(argv[1], "--loop") == 0)
        {
            printf("Please kill %d\n", getpid());
            for (;;)
                /* noop */ ;
        }
        else
        {
            return 143;
        }
    }
    else if (child == -1)
    {
        perror("fork");
        return 1;
    }

    if (waitpid(child, &wstatus, 0) != -1)
    {
        if (WIFEXITED(wstatus))
            printf("Exited: %d\n", WEXITSTATUS(wstatus));
        else if (WIFSIGNALED(wstatus))
            printf("Signal: %d\n", WTERMSIG(wstatus));
    }

    return 0;
}

But I haven’t found a way to get this information via systemd. And, honestly, it wouldn’t have helped a lot, because the next question is: Which other process sent the signal?

So, all I was seeing was “exit code 143”. Now what?

After digging for a while, I basically gave up and wrote a short wrapper:

/* Call as:
 *
 * ./tracer /usr/bin/my-real-daemon --foo --bar -Dbaz=1
 */

#define _POSIX_C_SOURCE 1

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

void
die(char *msg)
{
    perror(msg);
    exit(EXIT_FAILURE);
}

int
main(int argc, char **argv)
{
    sigset_t mask;
    int sfd, wstatus;
    struct signalfd_siginfo fdsi;
    ssize_t s;
    pid_t child;

    child = fork();
    if (child == -1)
        die("fork");
    else if (child == 0)
    {
        argc--;
        argv++;
        execv(argv[0], argv);
        die("exec");
    }

    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);
    sigaddset(&mask, SIGCHLD);

    if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
        die("sigprocmask");

    sfd = signalfd(-1, &mask, 0);
    if (sfd == -1)
        die("signalfd");

    syslog(LOG_DAEMON | LOG_ERR, "Me: %u Child: %u\n", getpid(), child);

    for (;;)
    {
        s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
        if (s != sizeof(struct signalfd_siginfo))
            die("read");

        syslog(LOG_DAEMON | LOG_ERR, "Got %u: code %d, pid %u, uid %u, status %d\n",
               fdsi.ssi_signo,
               fdsi.ssi_code,
               fdsi.ssi_pid,
               fdsi.ssi_uid,
               fdsi.ssi_status);

        switch (fdsi.ssi_signo)
        {
            case SIGCHLD:
                waitpid(child, &wstatus, 0);
                if (WIFEXITED(wstatus))
                {
                    syslog(LOG_DAEMON | LOG_ERR, "Child exited normally: %d\n",
                           WEXITSTATUS(wstatus));
                    return WEXITSTATUS(wstatus);
                }
                else if (WIFSIGNALED(wstatus))
                {
                    syslog(LOG_DAEMON | LOG_ERR, "Child exited with signal: %d\n",
                           WTERMSIG(wstatus));
                    return 128 + WTERMSIG(wstatus);
                }
                else
                {
                    syslog(LOG_DAEMON | LOG_ERR, "Child exited for unknown reason\n");
                    exit(EXIT_FAILURE);
                }
            case SIGTERM:
                syslog(LOG_DAEMON | LOG_ERR, "Relaying signal to %u\n", child);
                kill(child, SIGTERM);
                break;
            default:
                syslog(LOG_DAEMON | LOG_ERR, "Unexpected signal\n");
        }
    }
}

(I could also have used strace instead, but that tends to slow everything down massively.)

Anything on the system including systemd would think that my program is the real program. When they send the signal, I get it first and then I can see who sent it.

It didn’t work.

I still only saw “exit code 143”. From the perspective of my wrapper, the program still just quit.

Then a colleague eventually asked: “Maybe it’s a stale PID file?” No, that couldn’t be it. I had already checked that, because, yes, that daemon uses a PID file. And it wouldn’t have made a lot of sense, because the daemon was running properly for some time, and only then quit. On top of all that, the PID file now contained the PID of my wrapper, so I would have seen any signals sent there.

Needless to say, my colleague was right nevertheless.

I forgot about the fact that this daemon launches another daemon at some point. And that second daemon uses a PID file, too. A quick test confirmed the theory: The second daemon did indeed kill the first one because of a stale PID file. Congratulations.

But … why did my wrapper not indicate that?

The only explanation: The second PID file happened to contain the true PID of the first daemon. Yes, of course. I was assuming that something killed the first daemon intentionally, but that was just not the case. Something just happened to kill the true PID of the first daemon.

But what are the odds of that happening regularly? Yes, sometimes it didn’t happen for a month, but sometimes it happened every other day. Strange.

The answer is surprisingly simple: Threads.

A properly running instance of our daemon spawns a ton of threads:

systemd(1)-+-java(684)-+-java(762)
           |            |-{java}(731)
           |            |-{java}(732)
           |            |-{java}(733)
           |            |-{java}(734)
           |            |-... lots more ...
           |            `-{java}(958)

Each individual thread gets a “PID”, which is a potential target for a kill() (man 7 signal):

The signal disposition is a per-process attribute: in a  multi‐
threaded application, the disposition of a particular signal is
the same for all threads.

This increases the odds of a collision in a stale PID file dramatically.

The final ingredient for this disaster: That system was rebooted daily (on purpose – it’s turned off during non-business hours to save money). In other words, there’s a high chance of all those PIDs being in roughly the same range, because everything gets started during boot.

A better approach: auditd

I have to shamefully admit that I didn’t know a lot about the audit subsystem until now. I could have solved this problem a lot quicker.

I could have installed one simple rule to monitor all kill() invocations:

# auditctl -a always,exit -F arch=b64 -S kill -k mykill

That’ll produce log entries like this:

Jul 22 15:30:35 arch audit[996]: SYSCALL arch=c000003e syscall=62 success=yes exit=0 a0=3d6 a1=f a2=0 a3=560e1039942a items=0 ppid=858 pid=996 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts1 ses=1 comm="bla" exe="/root/bla" key="mykill"

That’s all I would have needed to know:

That would have led me very quickly to the offending process. And without the need to write a custom wrapper and what not.

And it was rather naïve of me to only log the sender’s PID in my own wrapper. I would have needed to log something like exe.

The “Please” section

Please, if you write a new daemon, do not use PID files anymore. Just don’t. It’s not needed anymore with today’s init systems, not even for “send a SIGHUP to reload” (see systemd’s $MAINPID, for example).

Please, if you write a new daemon, do not launch other daemons on your own. Just don’t. It is terrible to debug. Mention in the docs that a sysadmin has to launch two daemons and we’ll happily configure the system to do just that.

Comments?