Experiments with writing a filesystem driver for Linux

2017-09-09

Filesystems are an extremely fascinating topic, because you use them every day. There is no way around them. Every bit of data you create will be organized by a filesystem.

And yet, I never took the time to fully understand how one particular filesystem works. I’m aware of most basic concepts like superblocks or inodes and I understand that hardlinks are just different names for the same object. Stuff like that.

I also knew/know close to nothing about how things work inside the kernel. I’ve heard the term “VFS”, but what does it do? How does it work?

My brain works in a peculiar way: Until I try to actually do something or implement something, there’s little chance for me to understand how it works. I can read about blocks and inodes and caches all day, that doesn’t help much. So, it was time to get my hands dirty. Let’s try to write a simple filesystem driver – how hard can it be?

First steps

Where do you begin? I need to write a kernel module. Well, I don’t actually have to, I could also use FUSE. I did that back in 2011, but I didn’t get far. Maybe I wasn’t motivated enough at the time. Doesn’t really matter. This time, it had to be an actual kernel module, because that’s what I wanted to do.

One of the first blog posts you come across is this one:

https://kukuruku.co/post/writing-a-file-system-in-linux-kernel/

At the end of the article, there’s a link to this repo on GitHub:

https://github.com/psankar/simplefs

These two resources are what got me started. They teach you the absolute basics:

How do you write and compile a simple kernel module?
How do you register a filesystem driver and what are the first steps to “mount” something?

It didn’t take long until I had mountablefs running: It does nothing but “mount” a device and provide a dummy inode showing you an empty filesystem tree. There is no actual data involved, so the “mounting” is just a dummy step.

To the sky and beyond

My next exercise was to read actual data from a disk. Oh, “disk”? I run everything in a virtual machine, of course. When your code crashes in userspace, it doesn’t really matter. Bugs in kernel space can crash your entire machine, though, and there’s no use in risking that.

onefilerofs was born. It’s dead simple:

There is exactly one file and its data starts at block number 1.
The first 64 bits of block number 0 store the size of that file.

The “ro” indicates that it’s read-only. To write data, I used a hex editor and then remounted the filesystem.
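The layout above is small enough to model in userspace. The following is only a sketch, not the actual kernel module: it treats a disk image as a byte buffer, and the 4096-byte block size, the little-endian size field, and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed block size; the post later mentions 4096-byte blocks. */
#define ONEFILE_BLOCK_SIZE 4096u

/* Decode the file size from the first 64 bits of block 0.
 * Little-endian byte order is an assumption of this sketch. */
uint64_t onefile_size(const uint8_t *image)
{
    uint64_t size = 0;
    for (int i = 7; i >= 0; i--)
        size = (size << 8) | image[i];
    return size;
}

/* Copy up to `len` bytes of the single file, starting at byte `offset`,
 * into `buf`. File data starts at block 1. Returns the number of bytes
 * copied, which is 0 at or past the end of the file. */
size_t onefile_read(const uint8_t *image, size_t image_len,
                    uint64_t offset, void *buf, size_t len)
{
    assert(image_len >= ONEFILE_BLOCK_SIZE);

    uint64_t size = onefile_size(image);
    uint64_t avail = image_len - ONEFILE_BLOCK_SIZE;

    if (size > avail)          /* never read past the end of the image */
        size = avail;
    if (offset >= size)
        return 0;
    if (len > size - offset)   /* clamp the request to the file size */
        len = (size_t)(size - offset);

    memcpy(buf, image + ONEFILE_BLOCK_SIZE + offset, len);
    return len;
}
```

Writing the size into the first eight bytes and the contents at offset 4096 is exactly the kind of change that can be made with a hex editor.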

But it worked! My first kernel module that reads data from a disk and presents it to userspace.

Soon after that, I discovered “address space operations”. They’re a layer of abstraction in VFS that spares you from implementing read and write directly. Instead, the kernel asks for the n-th block of a file and you have to tell it that it’s located on the m-th block on disk. All the nasty details like seeks or single-byte writes can be hidden behind that. It doesn’t even matter if a process wants to read or write data – you only have to define that mapping.
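To make the mapping idea concrete, here is a userspace sketch for the one-file layout, where file block n simply lives on disk block n + 1. It is not kernel code and doesn't use any VFS API; the names, the 4096-byte block size, and the "no block" marker are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ASOP_BLOCK_SIZE 4096u       /* assumed block size */
#define ASOP_NO_BLOCK   UINT64_MAX  /* hypothetical "not mapped" marker */

/* Answer the question "where is the n-th block of the file?".
 * Disk block 0 holds the size, so file data starts at disk block 1. */
uint64_t asop_map_block(uint64_t file_block, uint64_t disk_blocks)
{
    if (disk_blocks < 2 || file_block >= disk_blocks - 1)
        return ASOP_NO_BLOCK;       /* beyond the end of the device */
    return file_block + 1;
}

struct asop_location {
    uint64_t disk_block;
    uint32_t offset_in_block;
};

/* Everything else (byte offsets, partial reads and writes) can be
 * derived from the block mapping alone. Returns 0 on success, -1 if
 * the position is not backed by a disk block. */
int asop_locate(uint64_t pos, uint64_t disk_blocks,
                struct asop_location *out)
{
    uint64_t m = asop_map_block(pos / ASOP_BLOCK_SIZE, disk_blocks);

    if (m == ASOP_NO_BLOCK)
        return -1;
    out->disk_block = m;
    out->offset_in_block = (uint32_t)(pos % ASOP_BLOCK_SIZE);
    return 0;
}
```

The point of the sketch is that only asop_map_block knows anything about the filesystem layout. The code around it is generic, which is the part the kernel takes over.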

Thus, asopfs came to life. There was still only one file, but it could now be written from userspace. This means that the size of the file can change, so asopfs also needs to implement updates to inodes. An “inode” is still just the first 64 bits of the first block here.

To my surprise, page size and block size appear to be the same. One “chunk” of data in memory is 4096 bytes long, as is one block on disk. I’m still not sure if this is a particularity of my setup, though, and a lot of resources on the internet say that there should be a difference.

Mommy, I fell out of my chair!

All of the above happened in three or four days. I did a lot of reading and dug through the code of the minix filesystem. You pretty much have to do that because “documentation” is scarce. Once you leave userland and enter kernel space, the internet stops being helpful. Let’s be honest here, we all use Google and we use it all the time. Facing a problem? Let’s google it, someone will have solved it for us.

This no longer works when you’re writing a kernel module.

There are no “tutorials”. There are very few blog posts on the topic. Maybe there are books, but I wasn’t willing to spend another 100€ just yet.

What you have to read is vfs.txt, which lives in the kernel’s Documentation/filesystems directory. Beyond that, start reading the source code of some “simple” filesystems, like minix. There’s a lot of information to digest, even though it’s not handed to you on a silver platter.

Sometimes you need to know what a particular kernel function does. This is where a cross reference comes in handy.

Also, take baby steps. After I had asopfs running, I thought: “Okay, I know the basics now, I can start implementing an actual filesystem.” And so I did. I started writing basicfs, which was still rather simple in design, but it was supposed to have a bitmap for free blocks, and files and directories could grow to unlimited size. There were actual inodes on disk, and symlinks, hardlinks, and special files could have been implemented. How hard can it be?

Well, harder than expected. I spent about a month on the code until I threw it away.

Baby steps

basicfs was too much. Too much complexity, too many things to learn at once. I knew too little about VFS and I didn’t know how to structure my code in a meaningful way.

So, I threw basicfs away and started working on oneblockfs. It allowed me to implement more operations (most importantly creating files or directories) while still avoiding harder problems like managing block allocation for files or directories:

Disk block 0 is untouched, possibly allowing boot loaders. (Yeah, doesn’t really matter, but it’s easy to do.)
Disk block 1 contains a superblock which stores metadata about the filesystem.
Following that are inode blocks and data blocks:
- Even numbered blocks contain inodes.
- Odd numbered blocks contain data.

There are two important restrictions:

One “filesystem object” (i.e., a directory or a file) is always one block in size. Block n contains an inode of an object and block n + 1 contains the data of that object. This means the filesystem isn’t suitable for real-world usage, but it makes it a lot easier for me to implement the kernel module.
Inode numbers are not being reused. This makes it trivial to allocate a new inode: Just find an even numbered block which is marked as free and give it the next free number. This counter is stored in the superblock. Of course, this means that you will eventually run out of inode numbers, at which point the filesystem becomes unusable. Deleting stuff doesn’t help, either.
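The layout and the two restrictions can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the code from the repo: the tiny in-memory “disk”, the use of inode number 0 as the “free” marker, the 64-bit counter, and every name are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define OBFS_FIRST_INODE_BLOCK 2u  /* block 0: untouched, block 1: superblock */
#define OBFS_MAX_BLOCKS        64u /* size of the in-memory "disk" */

struct obfs_super {
    uint64_t next_ino;  /* next inode number to hand out; never decreases */
    uint64_t nblocks;   /* total number of blocks on the "disk" */
};

struct obfs_disk {
    struct obfs_super sb;
    /* Inode number stored in each inode block; 0 means "free"
     * (an assumed marker, so numbering starts at 1). */
    uint64_t ino_of_block[OBFS_MAX_BLOCKS];
};

/* The data of an object always sits in the block right after its inode. */
uint64_t obfs_data_block(uint64_t inode_block)
{
    assert(inode_block >= OBFS_FIRST_INODE_BLOCK && inode_block % 2 == 0);
    return inode_block + 1;
}

/* Find a free even-numbered block and give it the next inode number.
 * Returns the inode block, or 0 if the disk is full or the counter
 * is exhausted. */
uint64_t obfs_alloc(struct obfs_disk *d, uint64_t *ino_out)
{
    assert(d->sb.nblocks <= OBFS_MAX_BLOCKS);

    if (d->sb.next_ino == UINT64_MAX)
        return 0;   /* out of inode numbers: deleting doesn't help */

    for (uint64_t b = OBFS_FIRST_INODE_BLOCK; b + 1 < d->sb.nblocks; b += 2) {
        if (d->ino_of_block[b] == 0) {
            d->ino_of_block[b] = d->sb.next_ino;
            *ino_out = d->sb.next_ino++;
            return b;
        }
    }
    return 0;       /* no free inode/data block pair left */
}

/* Deleting an object frees its block pair, but not its inode number. */
void obfs_free(struct obfs_disk *d, uint64_t inode_block)
{
    assert(inode_block < OBFS_MAX_BLOCKS);
    d->ino_of_block[inode_block] = 0;
}
```

With a counter starting at 1 on an 8-block disk, allocating twice yields blocks 2 and 4 with inode numbers 1 and 2. Freeing block 2 and allocating again reuses block 2 but hands out inode number 3.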

As I said, baby steps. Important ones, though.

Where to go from here

The code so far is here:

filesystem-experiments

In no way do I claim that this code is perfect. It’s probably exactly the opposite. For example, there are no locks at all, which means that your kernel might explode when two processes use the filesystem simultaneously.

There is a lot to learn for me. VFS, caching, locking, you name it. I’m all ears for pull requests or comments. :-)

Why did I publish that repo? I think having more examples of kernel code available on the internet is a good thing. Most importantly, share the simple examples. The ones that help you get started.

Where to go? I hope that all this helps me in understanding other filesystems. Maybe, a long time from now, I can start to read and understand what goes on in beasts like ext4, XFS, or btrfs. Maybe.
