

Why octal notation should be used for UTF-8 (and Unicode)

2016-10-05

Disclaimer: I read about this topic in another blog first, but I just can’t find that posting anymore. If you, the “original author”, happen to read this, drop me a note and I’ll add a link to your posting.

Now. It’s 2016 and I finally took the time to write a very basic UTF-8 encoder and decoder. This is an important exercise that, IMHO, everyone should do at some point to get a better understanding of what’s going on.

When you start reading up on the topic, you quickly come across notations like this one from Wikipedia:

[Image: the UTF-8 byte layout as shown on Wikipedia, written in hexadecimal notation]

Hexadecimal notation. Of course, why not? Hex notation is used in many places when dealing with “raw” data. It’s common practice. Sadly, when talking about UTF-8, it makes everything more complicated. Let’s have a quick look at how UTF-8 works – a very quick look.

In UTF-8, there are “leading bytes” and “continuation bytes”. A leading byte tells you “a multibyte sequence begins here”. One to three continuation bytes follow immediately. Together, they can be decoded and you’ll get one Unicode code point.

One such multibyte sequence looks something like this:
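
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx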

Actually, that’s all possible options. In its original definition (RFC 2279), UTF-8 allowed sequences of up to 6 bytes, but RFC 3629 restricts it to a maximum of 4 bytes.

So, the x’s above can be arbitrary bits. They specify the Unicode code point. (Let’s ignore that some code points are not “allowed”.) Basically, a code point is an abstract number. Write that number in binary and copy the bits into the x’s. If the code point is a small number, you need just a few bytes (or only one byte) – if it’s a large number, you need more bytes.
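
For example, ö is U+00F6, which is 11110110 in binary. That doesn’t fit into the single byte form, so it goes into the two byte form: pad it to 11 bits, copy them into the x’s, and you get the two bytes 11000011 10110110.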

Now, look closely. Continuation bytes always carry exactly 6 bits of payload. In octal notation, those 6 bits are exactly the last two digits of the byte, and the leading bits 10... always become a leading 2.
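
Take the continuation byte from the ö example above: 10110110 is 266 in octal. The leading 10 becomes the 2, and the six payload bits 110110 become the 66.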

You can do something similar with the leading bytes. If they start with 110... or 11110..., the octal number will start with 3 or 36, respectively, and its remaining digits are part of the code point. The 1110... case is a little more tricky, but more on that below.
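
Again with ö: the leading byte 11000011 is 303 in octal. It starts with a 3, and the remaining 03 belongs to the code point.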

What’s all that good for? Well, when reading UTF-8 encoded data in octal notation, you can get the Unicode code point without any decoding process. Some examples:
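
303 266          ->  03 66        ->  octal 366     =  U+00F6  (ö)
360 237 221 215  ->  0 37 21 15   ->  octal 372115  =  U+1F44D (👍)

For a 3 byte sequence, the leading byte starts with 34 and you only keep its last digit:

342 200 231      ->  2 00 31      ->  octal 20031   =  U+2019  (’)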

There’s one pitfall, though. If you’re dealing with a 3 byte sequence and the code point is above hex <U+7FFF>, then you have to take the second octal digit into account as well – kind of. For example:
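
357 277 275      ->  1 7 77 75    ->  octal 177775  =  U+FFFD  (the replacement character)

The leading byte starts with 35 instead of 34, and that 5 is what tells you to prepend the 1.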

So, it all boils down to this:
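
- Bytes 0xx and 1xx are plain ASCII; the byte value is the code point.
- Bytes 2xx are continuation bytes; their last two digits are two octal digits of the code point.
- Bytes 30x to 33x start a 2 byte sequence; drop the 3 and keep the rest.
- Bytes 34x and 35x start a 3 byte sequence; keep only the last digit, and prepend a 1 if the second digit is a 5.
- Bytes 36x start a 4 byte sequence; drop the 36 and keep the last digit.

String the kept digits together and you have the code point in octal.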

These few simple rules should make it pretty easy to spot the UTF-8 multibyte sequences in the following dump and you can even “decode” them:

$ od -t o1 <data
0000000 124 150 141 164 342 200 231 163 040 141 156 040 145 170 145 155
0000020 160 154 141 162 171 040 144 165 155 160 056 040 342 200 234 110
0000040 141 154 154 303 266 143 150 145 156 054 342 200 235 040 150 145
0000060 040 163 141 151 144 056 012

If the convention of writing Unicode code points were octal, an octal <U+372115> instead of a hexadecimal <U+1F44D>, that would be really nice. Sadly, that’s not the case. Of course, nobody knew that UTF-8 would become the encoding for Unicode, so you can’t blame people.

Still, when reading octal dumps, you can easily pick out the octal digits of the code point and then just convert the result to hex, and you’re done. And if you’re not interested in “decoding” anything at all, just knowing the simple rules above can be helpful: It’s trivial to distinguish bytes starting with 3 or 2 from the rest – and, bam, you know you’re looking at UTF-8. That’s much more complicated to do in hex.
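
printf can also do that conversion: at least the bash and coreutils versions treat a number with a leading 0 as octal, so %X turns an octal code point into the usual hex form, and %o goes the other way:

$ printf '%X\n' 0372115
1F44D
$ printf '%o\n' 0x1F44D
372115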

Comments?