2022-04-15
When you work as a sysadmin, you read lots and lots of log files. I’d like to point out some recurring issues that can be easily™ avoided.
I’ll be focusing on slightly more “modern” logging issues. Things like broken logrotate configs which fill up hard disks are well known and not particularly interesting anymore.
We’ll first cover some issues that arise from bad practice, and then get into, well, “programming errors” later on.
Let’s get this out of the way first. It’s a (seemingly) simple error, but it can make your logs absolutely worthless, so I feel the need to point this out:
Log lines must show a timestamp in ISO 8601 format, with at least millisecond precision, and it must include the timezone.
As for “why ISO 8601”, I’ll refer to xkcd.
It’s obvious that timestamps are a very good idea; lots of people ignore the timezone, though. Here’s the thing: Not all daemons run in the same timezone, not even the ones running on the same server. When I see “2022-04-15 07:34:12.456”, I have no idea when that was. Is it UTC? Is it “my” local time? Something else?
While we’re at it: Better use numerical notation for timezones, so “2022-04-15 08:17:20 +0200” is better than “2022-04-15 08:17:20 CEST”, because it saves me the need to look up what “CEST” is (or rather, what “CEST” was when that log message was written, if we’re being pedantic).
Also be aware that daylight saving time still exists, annoying as it might be. So, “2022-10-30 02:01” is an ambiguous timestamp in Germany: It can mean either “2022-10-30 02:01 CEST” or “2022-10-30 02:01 CET” (DST on or off), the latter being an hour later than the first one.
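As a sketch in Python (3.9+, stdlib zoneinfo), the same wall-clock minute really does map to two different instants; the `fold` attribute picks which one is meant:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

berlin = ZoneInfo("Europe/Berlin")

# The same wall-clock time occurs twice on this night; `fold`
# selects which of the two instants is meant.
cest = datetime(2022, 10, 30, 2, 1, tzinfo=berlin)           # DST still on
cet = datetime(2022, 10, 30, 2, 1, fold=1, tzinfo=berlin)    # DST already off

print(cest.isoformat())  # 2022-10-30T02:01:00+02:00
print(cet.isoformat())   # 2022-10-30T02:01:00+01:00
```

A log line carrying the numeric offset is unambiguous; one carrying only “02:01” is not.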
Is it better to always use UTC for logs? As a general rule, no, because then I have to convert it in my head, which is very error-prone. It depends on the use case, though. And even if you use UTC, then please say so by showing “UTC” or “+0000”.
Omitting the timezone is never a good idea. Without a timezone, some people say: “Well, of course it’s UTC, what did you think?” And others say: “Of course it’s local time, duh?!” It’s ambiguous.
(Ideally, you could store UNIX timestamps and let me configure in which timezone they should be displayed. That’s not always possible, though.)
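A minimal Python sketch of a timestamp that follows the rule above, millisecond precision and numeric offset included:

```python
from datetime import datetime, timezone

# An offset-aware local timestamp; isoformat() then always
# includes the numeric UTC offset.
now = datetime.now(timezone.utc).astimezone()
print(now.isoformat(timespec="milliseconds"))
# e.g. 2022-04-15T08:17:20.123+02:00
```

The `astimezone()` call converts to local time while keeping the datetime offset-aware, so the offset can never silently go missing.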
Okay, so this is a trivial error that can be avoided easily, right? Just show the full date, time, timezone? Easy? Nope.
First, what if your daemon is connected to a logging system? The logging system will show its own timestamps, too, so all log lines will then have two timestamps, which is harder to read. You might be tempted to omit timestamps then, but bear in mind that remote logging exists: If you simply dump flat log lines to a remote system, it might take them a while to get there, so the logging system’s timestamps might be wrong: They then show when a message arrived, not when it was created.
Second, to produce correct timestamps, you need an appropriate library and a timezone database. This opens a whole can of worms of its own. For example, Rust’s standard library has none of that, so you need something like chrono. That’s just a very minor problem, though. The bigger issue is that timezone databases need maintenance: You cannot just put the current version in your Docker image and be done with it. When timezones change – which they do all the time, just check the release notes of the most popular timezone database, tz –, you need to update your database, or else everything goes haywire.
Okay, so time is obviously horrible and chaotic. What should you do? I’d argue it’s best to go all the way and always show the full date, time, and numeric timezone offset – better safe than sorry.
(I’d have more things to say about the very broken Docker ecosystem, e.g. me relying on your Docker images is the wrong approach anyway, but that’s out of scope.)
There is nothing worse than reading this:
Connection reset by peer
Well, actually, there is: Reading those messages buried in a massive stack trace.
First, “connection reset by peer” doesn’t tell me anything. It fails to answer this question: Which peer? The hostname or IP address of that other system must be included in the message. That’s not always enough, though, because sometimes it’s just localhost:42132, meaning some process that ran at that time on some non-deterministic port and which has now quit (so it doesn’t run anymore, so I can’t look up what that thing was). So, ideally, these log messages should also include a reason why we were talking to that other system.
A good example:
Failed to fetch users from LDAP server foo.example.com:636: Connection timed out.
As for stack traces: Not only are they pretty much useless to anybody who does not have access to the source code (yes, maybe I can infer some details from them, but it’s mostly just guessing), but they are also often so long that they bury other useful information, which makes them borderline harmful.
So, my point is: Do not assume that it’s good enough to just let your runtime or your framework throw exceptions. Catch errors, handle them if possible, show meaningful messages.
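A sketch of what that can look like in Python. The LDAP fetch is reduced to a plain socket connect here; `fetch_users`, the host, and the port are illustrative, not any particular library’s API:

```python
import socket


def fetch_users(host: str, port: int) -> bytes:
    """Hypothetical helper: fetch user data from a directory server."""
    try:
        with socket.create_connection((host, port), timeout=5) as conn:
            return conn.recv(4096)
    except OSError as e:
        # Re-raise with context: what we tried to do, and which peer.
        raise RuntimeError(
            f"Failed to fetch users from LDAP server {host}:{port}: {e}"
        ) from e
```

The point is the `except` clause: the low-level error is kept, but the message now answers “which peer?” and “why were we talking to it?”.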
(This section is so obvious that I’ve almost wanted to remove it from the blog post. Then again, these errors are so utterly common that we should talk about them.)
This is still a relatively new issue, because GDPR has only been around since 2018. Logs used to contain whatever information happened to be available, because nobody ever cared.
It can sometimes be required to exclude certain information from logs, because we’re not allowed to store them according to some corporate policy. IP addresses are the main thing, but it could also affect other stuff. (Usernames, maybe? It always depends on your scenario.) These things should be configurable, because filtering logs afterwards is madness.
Some daemons still write their own log files (instead of just dumping everything to stdout), which usually means they also implement their own version of logrotate. In this case: Please allow me to configure how long logs are being kept. And please don’t bury this setting somewhere deep down. It’s important, so please expose it prominently.
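Sketched as a config fragment – the section and option names are made up, the point is only that the knob exists and is easy to find:

```ini
[logging]
# How long to keep our own rotated log files before deleting them.
retention_days = 14
```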
Let’s get into the more technical stuff.
When you write print('foo') in your code, you’re usually not talking to the operating system directly, but only to the runtime of your programming language. This means your message does not necessarily show up right now. Your runtime might decide to just put it in a buffer for now and then, later on, show a bunch of messages all at once. This is done for performance reasons, because putting things into a buffer is much faster than talking to the OS.
You can usually tell your runtime to disable this buffer (e.g., setvbuf()) or to display the buffer’s contents now (e.g., fflush()).
This first stage, where your data has “left” your print() call but has not yet left your process, is what I call “internal buffering”.
When your program finally talks to the OS, it issues a syscall like write(). The kernel, too, uses buffers. You can ask it politely to flush those buffers, which can be done using a syscall like POSIX’s fsync(). (Not everything can be fsync()’d, though.) Let’s call this second stage “external buffering” or just “kernel buffering”.
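In Python, the two stages map to two different calls, roughly like this (`app.log` is just an example path):

```python
import os

with open("app.log", "a") as f:
    f.write("critical line\n")
    f.flush()              # stage 1: drain the runtime's internal buffer
    os.fsync(f.fileno())   # stage 2: ask the kernel to flush its buffers
```

For log messages, stage 1 is the one that matters in practice; fsync() is only relevant when you write your own log files and care about surviving a power loss.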
This is the most harmful issue of this class. Let’s take this little Python script:
#!/usr/bin/env python3
print('Hello')
print('World')
Run it through strace to see what it’s doing:
$ strace -tt -ff -e write ./disappear.py
06:56:37.017958 write(1, "Hello\n", 6) = 6
06:56:37.018071 write(1, "World\n", 6) = 6
06:56:37.019984 +++ exited with 0 +++
That’s what we expect, right? Problem is, it behaves differently when its stdout is not a terminal:
$ strace -tt -ff -e write ./disappear.py | cat
06:56:42.102999 write(1, "Hello\nWorld\n", 12) = 12
06:56:42.104639 +++ exited with 0 +++
Now Python buffers all output until some criterion is met. This means that if your program crashes before that happens, you will never get any logs. And, well, crashes are exactly those situations where you really want to have logs.
It can be simulated easily:
#!/usr/bin/env python3
from os import getpid
from subprocess import run
print('Hello')
print('World')
run(['kill', '-9', str(getpid())])
Python scripts rarely crash on their own; they usually terminate due to some exception, and then the exception handler flushes any remaining buffers. But that’s not the only way a program might terminate: It could also be killed from the outside – and then it can happen that the output is just gone.
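A defensive sketch: flush explicitly (or run Python with -u, or set PYTHONUNBUFFERED=1), and the output survives even a SIGKILL. Here the killed script runs as a child process so we can inspect what came through the pipe:

```python
import subprocess
import sys

# Child: same script as above, but with flush=True before the SIGKILL.
child_code = (
    "import os, signal\n"
    "print('Hello', flush=True)\n"
    "print('World', flush=True)\n"
    "os.kill(os.getpid(), signal.SIGKILL)\n"
)

# stdout is a pipe here, yet both lines survive the SIGKILL,
# because each print() handed its data to the kernel immediately.
result = subprocess.run([sys.executable, "-c", child_code],
                        capture_output=True)
print(result.stdout)  # b'Hello\nWorld\n'
```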
Another symptom of this error is logs showing up with great delay. I regularly watch logs live during the startup of a daemon, like journalctl -fau foobar.service. Logs are pretty much the only way for me to find out what a daemon does, so it is important that they show up in a very timely fashion. (This is a big disadvantage when using remote logging.)
When your daemon starts and prints a bunch of lines – but only puts them in its internal buffer for now –, then I won’t see them. This means I have no idea what’s going on. Does it wait on some external dependency? Is it still doing internal initialization? Or has it long finished initializing, but the logs simply don’t appear yet due to buffering?
This is not as bad as lost logs, but still not great.
A simple script:
#!/usr/bin/env python3
from sys import stderr
print('Hello')
print('I am an error message', file=stderr)
print('world')
Again, on a terminal you get:
$ ./reorder.py
Hello
I am an error message
world
But as soon as it’s not a terminal anymore, you get this:
$ ./reorder.py | cat
I am an error message
Hello
world
The reason is that stdout usually gets buffered, but not stderr. This probably stems from the idea of stdout being for “data” (think UNIX filters here) and stderr being for “errors and diagnostics”. You want to see the latter immediately, but it’s legitimate to buffer data for performance. In other words, stdout and the conventions around it were probably never really meant for logging, but we now do it anyway.
Internal buffering must always be turned off for log messages. The performance impact is usually negligible.
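In Python, for instance, you can force line buffering on stdout even when it is a pipe (reconfigure() exists since Python 3.7); with that in place, the script above keeps its message order:

```python
import sys

# Flush stdout at every newline, even when it is not a terminal.
sys.stdout.reconfigure(line_buffering=True)

print('Hello')                                   # flushed at the newline
print('I am an error message', file=sys.stderr)  # stderr flushes anyway
print('world')
```

Assuming both streams end up in the same place, the three lines now arrive in their original order even through a pipe.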
Logs can get very noisy, so you want to split “informational” messages from “warnings” and “errors”. Traditionally, this has been done by using different log files, like an error.log and an access.log. This is going out of fashion, though, and these days, you usually just connect your daemon’s stdout and stderr to some other logging daemon (systemd’s journal, Docker, Kubernetes, …). So what we’re left with is usually stderr for “errors” and stdout for “anything else”.
Splitting log levels by using different file descriptors appears to have one big advantage: You don’t need to concern yourself with structured logging to distinguish between error levels. Structured logging has pros and cons, and more often than not I’m left with reading such logs verbatim, which is super bad, so I generally don’t like them.
(The syslog format can be considered structured logging, too, and it is easier for humans to read – if you ever encounter it verbatim –, but it’s going out of fashion as well. “Structured logging” usually means massive dumps of JSON these days. Some logging systems are too lazy or dumb to actually understand that JSON, so I’m forced to read it verbatim.)
So, you have two file descriptors. What that can mean is that your log messages arrive at the logging daemon by two different paths. Even if your daemon does not suffer from internal buffering, you’re bound to see the order of messages mixed up.
Consider a simple C program that writes lots of data:
#include <unistd.h>

int
main()
{
    int i;

    for (i = 0; i < 100000; i++)
    {
        write(1, "hello\n", 6);
        write(2, "error\n", 6);
    }
}
When I run this through Docker, the file descriptors get connected like this:
# ls -al /proc/275510/fd/{1,2}
l-wx------ 1 root root 64 Apr 16 10:49 /proc/275510/fd/1 -> 'pipe:[315373]'
l-wx------ 1 root root 64 Apr 16 10:49 /proc/275510/fd/2 -> 'pipe:[315374]'
And at the other end:
# ls -al /proc/275490/fd/{12,14}
lr-x------ 1 root root 64 Apr 16 10:50 /proc/275490/fd/12 -> 'pipe:[315373]'
lr-x------ 1 root root 64 Apr 16 10:50 /proc/275490/fd/14 -> 'pipe:[315374]'
FD 1 and 2 are two different pipes between those two processes. The output that I see on the terminal or via docker logs is chunks of hello followed by other chunks of error, and so on. The reason is that there are two channels from my program to Docker, and Docker has to decide when to read which one. It cannot know the original order of my log messages.
Note that systemd gets this right. When you inspect some daemon running under systemd, you will see this:
$ ls -al /proc/198378/fd/{1,2}
lrwx------ 1 root root 64 Apr 16 10:03 /proc/198378/fd/1 -> 'socket:[234066]'
lrwx------ 1 root root 64 Apr 16 10:03 /proc/198378/fd/2 -> 'socket:[234066]'
It’s the same socket for both stdout and stderr. Of course, this also means that systemd cannot tell messages from those streams apart, which is why they both end up getting the same “priority” in your journal.
(Fun fact: When I ran the above C program as a systemd unit, I still saw the occasional mixup. It was very rare, though. Maybe some kind of race condition internal to journald?)
The above problem is made worse when there is an additional component in between. You can then also run into this problem when using systemd.
Let’s say you want to connect your program to journald/syslog manually by doing this:
#!/bin/bash
exec 1> >(logger)
exec 2> >(logger -p user.error)
./simplewrite
# ... and maybe a much longer script follows ...
In this case, simplewrite is just this C program:
#include <unistd.h>

int
main()
{
    write(1, "hello\n", 6);
    write(2, "error\n", 6);
    write(1, "world\n", 6);
    return 0;
}
And what I get in my logs is this happy little accident:
Apr 15 09:08:20 pinguin void[77794]: hello
Apr 15 09:08:20 pinguin void[77794]: world
Apr 15 09:08:20 pinguin void[77796]: error
What happened here? Let’s add some strace calls and have a look.
First, let’s make sure that our program did not accidentally suffer from internal buffering (which it shouldn’t, because write() in C should correspond to the syscall – but you never know):
09:08:20.353065 write(1, "hello\n", 6) = 6
09:08:20.353111 write(2, "error\n", 6) = 6
09:08:20.353147 write(1, "world\n", 6) = 6
Looks good. Now the simple case, the logger which is responsible for stderr:
09:08:20.356940 read(0, "error\n", 4096) = 6
09:08:20.356962 sendmsg(3, {...}, MSG_NOSIGNAL) = 31
It reads our string and then sends it off to the logging daemon. Fine.
What about the logger that handles stdout?
09:08:20.356661 read(0, "hello\nworld\n", 4096) = 12
09:08:20.356689 sendmsg(3, {..., {iov_base="hello", iov_len=5}, ...}, MSG_NOSIGNAL) = 31
09:08:20.356723 sendmsg(3, {..., {iov_base="world", iov_len=5}, ...}, MSG_NOSIGNAL) = 31
Our two write() calls happened fast enough, so the kernel didn’t even wake up the other end of the pipe. When it finally did wake up, both lines were already ready to read in a single read() call.
For demonstration, let’s add a short delay between the two calls (I went for for loops here instead of sleep() to avoid any kind of potential side effects from scheduling the process into the S state, even though I’m not sure if there really are any):
#include <unistd.h>

int
main()
{
    write(1, "hello\n", 6);
    write(2, "error\n", 6);

    for (int i = 0; i < 10000; i++)
        for (int j = 0; j < 10000; j++)
            ;

    write(1, "world\n", 6);
    return 0;
}
To confirm: The delay really exists in the strace log of the sender, and it’s about 150 ms:
08:12:44.666617 write(1, "hello\n", 6) = 6
08:12:44.666645 write(2, "error\n", 6) = 6
08:12:44.826014 write(1, "world\n", 6) = 6
And now the receiver does get unmerged data (careful, 08:12:44.671246 is the beginning of that read() syscall):
08:12:44.671192 read(0, "hello\n", 4096) = 6
08:12:44.671213 sendmsg(3, {..., {iov_base="hello", iov_len=5}, ...}, MSG_NOSIGNAL) = 31
08:12:44.671246 read(0, "world\n", 4096) = 6
08:12:44.826152 sendmsg(3, {..., {iov_base="world", iov_len=5}, ...}, MSG_NOSIGNAL) = 31
But of course, having two logger processes also shuffles everything around a bit before the data even reaches journald. I just saw this in my second test (the one with the additional delay):
Apr 16 08:12:44 pinguin void[14779]: error
Apr 16 08:12:44 pinguin void[14780]: hello
Apr 16 08:12:44 pinguin void[14780]: world
The reason is that the first logger process has processed the error message before the second one got its hello message. Look at the timestamps again from that last strace log above:
08:12:44.671192 read(0, "hello\n", 4096) = 6
08:12:44.671213 sendmsg(3, {..., {iov_base="hello", iov_len=5}, ...}, MSG_NOSIGNAL) = 31
And compare this to the other logger process:
08:12:44.670314 read(0, "error\n", 4096) = 6
08:12:44.670339 sendmsg(3, {..., {iov_base="error", iov_len=5}, ...}, MSG_NOSIGNAL) = 31
About a millisecond earlier. (From that log, it is not entirely clear whether hello arrived later than error, because, again, those timestamps show the beginning of the syscalls, not when they returned. The error message certainly got sent to the logging daemon earlier than the hello message.)
So, in terms of file descriptors, I’d argue that it’s better to just use a single one. This avoids this whole class of problems. If you want to distinguish between error levels, it’s probably better to use some kind of structured logging or to prefix your lines with [I] for info, [E] for error, and so on – if that’s feasible.
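A minimal sketch of the prefix variant, everything on one stream (the helper name and the prefixes are just illustrative):

```python
import sys


def log(level: str, msg: str) -> None:
    # One file descriptor, one line per message, level as a prefix.
    print(f"[{level}] {msg}", file=sys.stderr, flush=True)


log("I", "starting up")
log("E", "could not reach database")
```

Because everything goes through the same file descriptor, the ordering problems from above simply cannot occur, and a downstream logging system can still filter by the prefix.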
Alright, these were some examples that came to my mind recently. This is not an exhaustive list, of course, it’s a complex topic.
Logs are one of those things that only “usually somehow kind of work”, probably because they’re often an afterthought. And yet they are incredibly important.