blog · git · desktop · images · contact & privacy · gopher


Logging mishaps

2022-04-15

When you work as a sysadmin, you read lots and lots of log files. I’d like to point out some recurring issues that can be easily™ avoided.

I’ll be focusing on slightly more “modern” logging issues. Things like broken logrotate configs which fill up hard disks are well known and not particularly interesting anymore.

We’ll first cover some issues that arise from bad practice, and then get into, well, “programming errors” later on.

Missing or incomplete timestamps

Let’s get this out of the way first. It’s a (seemingly) simple error, but it can make your logs absolutely worthless, so I feel the need to point this out:

Log lines must show a timestamp in ISO 8601 format, with at least millisecond precision, and it must include the timezone.

As for “why ISO 8601”, I’ll refer to xkcd.

It’s obvious that timestamps are a very good idea, lots of people ignore the timezone, though. Here’s the thing: Not all daemons run in the same timezone, not even the ones running on the same server. When I see “2022-04-15 07:34:12.456”, I have no idea when that was. Is it UTC? Is it “my” local time? Something else?

While we’re at it: Better use numerical notation for timezones, so “2022-04-15 08:17:20 +0200” is better than “2022-04-15 08:17:20 CEST”, because it saves me the need to look up what “CEST” is (or rather, what “CEST” was when that log message was written, if we’re being pedantic).

Also be aware that daylight saving time still exists, annoying as it might be. So, “2022-10-30 02:01” is an ambiguous timestamp in Germany: It can mean either “2022-10-30 02:01 CEST” or “2022-10-30 02:01 CET” (DST on or off), the latter being an hour later than the first one.

Is it better to always use UTC for logs? As a general rule, no, because then I have to convert it in my head, which is very error prone. It depends on the use case, though. And even if you use UTC, then please say so by showing UTC or +0000.

Omitting the timezone is never a good idea. Without a timezone, some people say: “Well, of course it’s UTC, what did you think?” And others say: “Of course it’s local time, duh?!” It’s ambiguous.

(Ideally, you could store UNIX timestamps and let me configure in which timezone they should be displayed. That’s not always possible, though.)

Okay, so this is a trivial error that can be avoided easily, right? Just show the full date, time, timezone? Easy? Nope.

First, what if your daemon is connected to a logging system? The logging system will show its own timestamps, too, so all log lines will then have two timestamps, which is harder to read. You might be tempted to omit timestamps then, but bear in mind that remote logging exist: If you simply dump flat log lines to a remote system, it might take them a while to get there, so the logging system’s timestamps might be wrong: They then show when a message has arrived, not when it was created.

Second, to produce correct timestamps, you need an appropriate library and a timezone database. This opens a whole can of worms of its own. For example, Rust’s standard library has none of that and you need something like chrono. That’s just a very minor problem, though, because timezone databases need maintenance: You cannot just put the current version in your Docker image and be done with it. When timezones change – which they do all the time, just check the release notes of the most popular timezone database, tz –, you need to update your database, or else everything goes haywire.

Okay, so time is obviously horrible and chaotic. What should you do? I’d argue it’s best to go all the way, better safe than sorry:

(I’d have more things to say about the very broken Docker ecosystem, e.g. me relying on your Docker images is the wrong approach anyway, but that’s out of scope.)

Missing or non-useful context

There is nothing worse than reading this:

Connection reset by peer

Well, actually, there is: Reading those messages buried in a massive stack trace.

First, “connection reset by peer” doesn’t tell me anything. It fails to answer this question: Which peer? The hostname or IP address of that other system must be included in the message. That’s not always enough, though, because sometimes it’s just localhost:42132, meaning some process that ran at that time on some non-deterministic port and which has now quit (so it doesn’t run anymore, so I can’t look up what that thing was). So, ideally, these log messages should also include a reason why we were talking to that other system.

A good example:

Failed to fetch users from LDAP server foo.example.com:636: Connection timed out.

As for stack traces: Not only are they pretty much useless to anybody who does not have access to the source code (yes, maybe I can infer some details from them, but it’s mostly just guessing), but they are also often so long that they bury other useful information, hence they are almost harmful.

So, my point is: Do not assume that it’s good enough to just let your runtime or your framework throw exceptions. Catch errors, handle them if possible, show meaningful messages.

(This section is so obvious that I’ve almost wanted to remove it from the blog post. Then again, these errors are so utterly common that we should talk about it.)

Configurability and GDPR

This is still a relatively new issue, because GDPR has only been around since 2018. Logs just used to contain whatever information, because nobody ever cared.

It can sometimes be required to exclude certain information from logs, because we’re not allowed to store them according to some corporate policy. IP addresses are the main thing, but it could also affect other stuff. (Usernames, maybe? It always depends on your scenario.) These things should be configurable, because filtering logs afterwards is madness.

Some daemons still write their own log files (instead of just dumping it to stdout), which usually means they also implement their own version of logrotate. In this case: Please allow me to configure how long logs are being kept. And please don’t bury this setting somewhere deep down. It’s important, so please expose it prominently.

Internal buffering

Let’s get into the more technical stuff.

When you write print('foo') in your code, you’re usually not talking to the operating system directly, but only to the runtime of your programming language. This means your message does not necessarily show up right now. Your runtime might decide to just put it in a buffer for now and then, later on, show a bunch of messages all at once. This is done for performance reasons, because putting things into a buffer is much faster than talking to the OS.

You can usually tell your runtime to disable this buffer (e.g., setvbuf()) or to display the buffer’s contents now (e.g., fflush()).

This first stage, where your data has “left” your print() call but has not yet left your process, is what I call “internal buffering”.

When your program finally talks to the OS, it issues a syscall like write(). The kernel, too, uses buffers. You can ask it politely to flush those buffers, which can be done using a syscall like POSIX’s fsync(). (Not everything can be fsync()’d, though.) Let’s call this second stage “external buffering” or just “kernel buffering”.

Internal buffering: Disappearing logs

This is the most harmful issue of this class. Let’s take this little Python script:

#!/usr/bin/env python3

print('Hello')
print('World')

Run it through strace to see what it’s doing:

$ strace -tt -ff -e write ./disappear.py
06:56:37.017958 write(1, "Hello\n", 6)  = 6
06:56:37.018071 write(1, "World\n", 6)  = 6
06:56:37.019984 +++ exited with 0 +++

That’s what we expect, right? Problem is, it behaves differently when its stdout is not a terminal:

$ strace -tt -ff -e write ./disappear.py | cat
06:56:42.102999 write(1, "Hello\nWorld\n", 12) = 12
06:56:42.104639 +++ exited with 0 +++

Now Python buffers all output until some criterium is met. This means that if your program crashes before that happens, you will never get any logs. And, well, crashes are those situations where you really want to have logs.

It can be simulated easily:

#!/usr/bin/env python3


from os import getpid
from subprocess import run


print('Hello')
print('World')

run(['kill', '-9', str(getpid())])

Python scripts rarely crash on their own, they usually terminate due to some exception, and then the exception handler flushes any remaining buffers. But that’s not the only way a program might terminate, it could also be killed from the outside – and then it can happen that it’s just gone.

Internal buffering: Delayed logs

Another symptom of this error is logs showing up with great delay.

I regularly watch logs live during the startup of a daemon, like journalctl -fau foobar.service. Logs are pretty much the only way for me to find out what a daemon does, so it is important that they show up in a very timely fashion. (This is a big disadvantage when using remote logging.)

When your daemon starts and prints a bunch of lines – but only puts them in its internal buffer for now –, then I won’t see them. This means I have no idea what’s going on. Does it wait on some external dependency? Is it still doing internal initialization? Or has it long finished initializing, but the logs simply don’t appear yet due to buffering?

This is not as bad as lost logs, but still not great.

Internal buffering: Mixups

A simple script:

#!/usr/bin/env python3


from sys import stderr


print('Hello')
print('I am an error message', file=stderr)
print('world')

Again, on a terminal you get:

$ ./reorder.py
Hello
I am an error message
world

But as soon as it’s not a terminal anymore, you get this:

$ ./reorder.py | cat
I am an error message
Hello
world

The reason is that stdout usually gets buffered, but not stderr. This probably stems from the idea of stdout being for “data” (think UNIX filters here) and stderr being for “errors and diagnostics”. You want to see the latter immediately, but it’s legitimate to buffer data for performance. In other words, stdout and the conventions around it were probably never really meant for logging, but we now do it anyway.

Internal buffering: Conclusion

Internal buffering must always be turned off for log messages. The performance impact is usually negligible.

Multiple file descriptors for different log levels

Logs can get very noisy, so you want to split “informational” messages from “warnings” and “errors”. Traditionally, this has been done by using different log files, like an error.log and an access.log. This is going out of fashion, though, and these days, you usually just connect your daemon’s stdout and stderr to some other logging daemon (systemd’s journal, Docker, Kubernetes, …). So what we’re left with is usually stderr for “errors” and stdout for “anything else”.

Splitting log levels by using different file descriptors appears to have one big advantage: You don’t need to concern yourself with structured logging to distinguish between error levels. Structured logging has pros and cons, and more often than not I’m left with reading them verbatim, which is super bad, so I generelly don’t like them.

(The syslog format can be considered structured logging, too, and easier to read by humans – if you ever encounter it verbatim –, but it’s going out of fashion as well. “Structured logging” usually means massive dumps of JSON these days. Some logging systems are too lazy or dumb to actually understand that JSON, so I’m force to read it verbatim.)

So, you have two file descriptors. What that can mean is that your log messages could arrive at the logging daemon via two different ways. Even if your daemon does not suffer from internal buffering, you’re bound to see the order of messages mixed up.

Multiple file descriptors: Mixups at the receiver

Consider a simple C program that writes lots of data:

#include <unistd.h>

int
main()
{
    int i;

    for (i = 0; i < 100000; i++)
    {
        write(1, "hello\n", 6);
        write(2, "error\n", 6);
    }
}

When I run this through Docker, the file descriptors get connected like this:

# ls -al /proc/275510/fd/{1,2}
l-wx------ 1 root root 64 Apr 16 10:49 /proc/275510/fd/1 -> 'pipe:[315373]'
l-wx------ 1 root root 64 Apr 16 10:49 /proc/275510/fd/2 -> 'pipe:[315374]'

And at the other end:

# ls -al /proc/275490/fd/{12,14}
lr-x------ 1 root root 64 Apr 16 10:50 /proc/275490/fd/12 -> 'pipe:[315373]'
lr-x------ 1 root root 64 Apr 16 10:50 /proc/275490/fd/14 -> 'pipe:[315374]'

FD 1 and 2 are two different pipes between those two processes. The output that I see on the terminal or via docker logs are chunks of hello followed by other chunks of error, and so on. The reason is that there are two channels from my program to Docker and Docker has to decide when to read which one. It cannot know the original order of my log messages.

Note that systemd gets this right. When you inspect some daemon running under systemd, you will see this:

$ ls -al /proc/198378/fd/{1,2}
lrwx------ 1 root root 64 Apr 16 10:03 /proc/198378/fd/1 -> 'socket:[234066]'
lrwx------ 1 root root 64 Apr 16 10:03 /proc/198378/fd/2 -> 'socket:[234066]'

It’s the same socket for both stdout and stderr. Of course, this also means that systemd cannot tell messages from those streams apart, which is why they both end up getting the same “priority” in your journal.

(Fun fact: When I ran the above C program as a systemd unit, I still saw the occasional mixup. It was very rare, though. Maybe some kind of race condition internal to journald?)

Multiple file descriptors: Mixups at a proxy

The above problem is made worse when there is an additional component in between. You can then also run into this problem when using systemd.

Let’s say you want to connect your program to journald/syslog manually by doing this:

#!/bin/bash

exec 1> >(logger)
exec 2> >(logger -p user.error)

./simplewrite

# ... and maybe a much longer script follows ...

In this case, simplewrite is just this C program:

#include <unistd.h>

int
main()
{
    write(1, "hello\n", 6);
    write(2, "error\n", 6);
    write(1, "world\n", 6);
    return 0;
}

And what I get in my logs is this happy little accident:

Apr 15 09:08:20 pinguin void[77794]: hello
Apr 15 09:08:20 pinguin void[77794]: world
Apr 15 09:08:20 pinguin void[77796]: error

What happened here? Let’s add some strace calls and have a look. First, let’s make sure that our program did not accidentally suffer from internal buffering (which it shouldn’t, because write() in C should correspond to the syscall – but you never know):

09:08:20.353065 write(1, "hello\n", 6)  = 6
09:08:20.353111 write(2, "error\n", 6)  = 6
09:08:20.353147 write(1, "world\n", 6)  = 6

Looks good. Now the simple case, the logger which is responsible for stderr:

09:08:20.356940 read(0, "error\n", 4096) = 6
09:08:20.356962 sendmsg(3, {...}, MSG_NOSIGNAL) = 31

It reads our string and then sends it off to the logging daemon. Fine.

What about the logger that handles stdout?

09:08:20.356661 read(0, "hello\nworld\n", 4096) = 12
09:08:20.356689 sendmsg(3, {..., {iov_base="hello", iov_len=5}, ...}, MSG_NOSIGNAL) = 31
09:08:20.356723 sendmsg(3, {..., {iov_base="world", iov_len=5}, ...}, MSG_NOSIGNAL) = 31

Our two write() calls happened fast enough, so the kernel didn’t even wake up the other end of the pipe. When it finally did wake up, both lines were already ready to read in a single read() call.

For demonstration, let’s add a short delay between the two calls (I went for for loops here instead of sleep() to avoid any kind of potential side effects from scheduling the process to the S state, even though I’m not sure if there really are any):

#include <unistd.h>

int
main()
{
    write(1, "hello\n", 6);
    write(2, "error\n", 6);

    for (int i = 0; i < 10000; i++)
        for (int j = 0; j < 10000; j++)
            ;

    write(1, "world\n", 6);

    return 0;
}

To confirm, the delay really exists in the strace log of the sender, it’s about 150 ms:

08:12:44.666617 write(1, "hello\n", 6)  = 6
08:12:44.666645 write(2, "error\n", 6)  = 6
08:12:44.826014 write(1, "world\n", 6)  = 6

And now the receiver does get unmerged data (careful, 08:12:44.671246 is the beginning of that read() syscall):

08:12:44.671192 read(0, "hello\n", 4096) = 6
08:12:44.671213 sendmsg(3, {..., {iov_base="hello", iov_len=5}, ...}, MSG_NOSIGNAL) = 31
08:12:44.671246 read(0, "world\n", 4096) = 6
08:12:44.826152 sendmsg(3, {..., {iov_base="world", iov_len=5}, ...}, MSG_NOSIGNAL) = 31

But of course, having two logger processes also shuffles everything around a bit, before the data even reaches journald. I just saw this in my second test (the one with the additional delay):

Apr 16 08:12:44 pinguin void[14779]: error
Apr 16 08:12:44 pinguin void[14780]: hello
Apr 16 08:12:44 pinguin void[14780]: world

The reason is that the first logger process has processed the error message before the second one got its hello message. Look at the timestamps again from that last strace log above:

08:12:44.671192 read(0, "hello\n", 4096) = 6
08:12:44.671213 sendmsg(3, {..., {iov_base="hello", iov_len=5}, ...}, MSG_NOSIGNAL) = 31

And compare this to the other logger process:

08:12:44.670314 read(0, "error\n", 4096) = 6
08:12:44.670339 sendmsg(3, {..., {iov_base="error", iov_len=5}, ...}, MSG_NOSIGNAL) = 31

About a millisecond earlier. (From that log, it is not entirely clear whether hello arrived later than error, because, again, those timestamps show the beginning of the syscalls, not when they returned. The error message certainly got sent to the logging daemon earlier than the hello message.)

Multiple file descriptors: Conclusion

So, in terms of file descriptors, I’d argue that it’s better to just use a single one. This avoids this whole class of problems. If you want to distinguish between error levels, it’s probably better to use some kind of structured logging or prefix your lines using [I] for info, [E] for error, and so on – if that’s feasible.

Conclusion

Alright, these were some examples that came to my mind recently. This is not an exhaustive list, of course, it’s a complex topic.

Logs are one of those things that only “usually somehow kind of work”, probably because they’re often an afterthought. And yet they are incredibly important.

Comments?