Filesystems are an extremely fascinating topic, because you use them every day. There is no way around them. Every bit of data you create will be organized by a filesystem.
And yet, I never took the time to fully understand how one particular filesystem works. I’m aware of most basic concepts like superblocks or inodes and I understand that hardlinks are just different names for the same object. Stuff like that.
I also knew/know close to nothing about how things work inside the kernel. I’ve heard the term “VFS”, but what does it do? How does it work?
My brain works in a peculiar way: Until I try to actually do something or implement something, there’s little chance for me to understand how it works. I can read about blocks and inodes and caches all day, that doesn’t help much. So, it was time to get my hands dirty. Let’s try to write a simple filesystem driver – how hard can it be?
Where do you begin? I need to write a kernel module. Well, I don’t actually have to, I could also use FUSE. I did that back in 2011, but I didn’t get far. Maybe I wasn’t motivated enough at the time. Doesn’t really matter. This time, it had to be an actual kernel module, because that’s what I wanted to do.
One of the first blog posts you come across is this one:
At the end of the article, there’s a link to this repo on GitHub:
These two resources are what got me started. They teach you the absolute basics:
It didn’t take long until I had mountablefs running: It does nothing but “mount” a device and provide a dummy inode showing you an empty filesystem tree. There is no actual data involved, so the “mounting” is just a dummy step.
My next exercise was to read actual data from a disk. Oh, “disk”? I run everything in a virtual machine, of course. When your code crashes in userspace, it doesn’t really matter. Bugs in kernel space can crash your entire machine, though, and there’s no use in risking that.
onefilerofs was born. It’s dead simple:
The “ro” indicates that it’s read-only. To write data, I used a hex editor and then remounted the filesystem.
But it worked! My first kernel module that reads data from a disk and presents it to userspace.
Soon after that, I discovered “address space operations”. It’s a layer of abstraction in VFS that saves you the need to implement write directly. Instead, the kernel asks for the n-th block of a file and you have to tell it that it’s located on the m-th block on disk. All the nasty details like single seeks or single-byte writes can be hidden behind that. It doesn’t even matter if a process wants to read or write data – you only have to define that mapping.
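To make the idea concrete, here is a small userspace sketch in Python. It is not kernel code and not taken from asopfs. The names `read_bytes`, `map_block` and `example_mapping`, the 4096-byte block size, and the layout (file data starting at disk block 1) are all assumptions made for illustration. The point is only that a generic routine can serve arbitrary byte ranges as long as the filesystem answers one question: which disk block holds block n of the file?

```python
BLOCK_SIZE = 4096  # assumed block size for this sketch


def read_bytes(disk, map_block, offset, length):
    """Serve a byte-granular read using nothing but a block mapping.

    disk:      bytes-like object standing in for the block device
    map_block: callable, file block index n -> disk block index m
    """
    out = bytearray()
    end = offset + length
    while offset < end:
        n = offset // BLOCK_SIZE        # which block of the file
        within = offset % BLOCK_SIZE    # position inside that block
        chunk = min(BLOCK_SIZE - within, end - offset)
        m = map_block(n)                # the only filesystem-specific step
        start = m * BLOCK_SIZE + within
        out += disk[start:start + chunk]
        offset += chunk
    return bytes(out)


def example_mapping(n):
    # Hypothetical layout: file block n lives in disk block n + 1.
    return n + 1


# A tiny fake "disk" with some data in disk blocks 1 and 2.
disk = bytearray(4 * BLOCK_SIZE)
disk[BLOCK_SIZE:BLOCK_SIZE + 5] = b"hello"
disk[2 * BLOCK_SIZE:2 * BLOCK_SIZE + 5] = b"world"

print(read_bytes(disk, example_mapping, 0, 5))
```

A read that straddles a block boundary simply triggers two lookups; the mapping function never has to know about byte offsets.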
asopfs came to life. There is still only one file, but it can now be written from userspace. This means that the size of the file can change, so asopfs also needs to implement updates to inodes. An “inode” is still just the first 64 bits of the first block here.
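What such an inode update might look like can be sketched in userspace Python. The assumption here, which is mine, is that those 64 bits hold the file size as a little-endian unsigned integer; the helper names `load_size`, `store_size` and `note_write` are hypothetical as well.

```python
import struct

INODE_FORMAT = "<Q"  # assumption: one little-endian unsigned 64-bit integer
INODE_BYTES = struct.calcsize(INODE_FORMAT)  # 8 bytes, i.e. 64 bits


def load_size(first_block):
    """Read the assumed size field from the start of the first block."""
    (size,) = struct.unpack_from(INODE_FORMAT, first_block, 0)
    return size


def store_size(first_block, size):
    """Write the assumed size field back to the start of the first block."""
    struct.pack_into(INODE_FORMAT, first_block, 0, size)


def note_write(first_block, offset, length):
    """Grow the recorded size if a write ends past the current end of file."""
    new_end = offset + length
    if new_end > load_size(first_block):
        store_size(first_block, new_end)


block = bytearray(4096)      # stand-in for the first block on disk
note_write(block, 0, 100)    # file grows to 100 bytes
note_write(block, 50, 10)    # write inside the file, size stays the same
print(load_size(block))
```

In a real module the updated inode would also have to be marked dirty and written back to disk; this sketch only shows the bookkeeping.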
To my surprise, page size and block size appear to be the same. One “chunk” of data in memory is 4096 bytes long, as is one block on disk. I’m still not sure if this is a particularity of my setup, though, and a lot of resources on the internet say that there should be a difference.
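One quick way to compare the two numbers on a given machine, at least from userspace, is the following Python snippet. It assumes a Unix-like system, and it reports the block size of whatever filesystem is mounted at `/`, which is not necessarily what a kernel module sees for its own block device.

```python
import mmap
import os

page_size = mmap.PAGESIZE                # size of one memory page
fs_block_size = os.statvfs("/").f_bsize  # block size reported for the root filesystem

print(f"page size:        {page_size} bytes")
print(f"filesystem block: {fs_block_size} bytes")
print("same" if page_size == fs_block_size else "different")
```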
All of the above happened in three or four days. I did a lot of reading and dug through the code of the minix filesystem. You pretty much have to do that because “documentation” is scarce. Once you leave userland and enter kernel space, the internet stops being helpful. Let’s be honest here, we all use Google and we use it all the time. Facing a problem? Let’s google it, someone will have solved it for us.
This no longer works when you’re writing a kernel module.
There are no “tutorials”. There are very few blog posts on the topic. Maybe there are books, but I wasn’t willing to spend another 100€ just yet.
What you have to read is vfs.txt. Start reading the source code of some “simple” filesystems, like minix. There’s a lot of information to digest, even though it’s not presented to you on a silver platter.
Sometimes you need to know what a particular kernel function does. This is where a cross reference comes in handy.
Also, take baby steps. After I had asopfs running, I thought: “Okay, I know the basics now, I can start implementing an actual filesystem.” And so I did. I started writing basicfs, which was still rather simple in design, but it was supposed to have a bitmap for free blocks, and files and directories could grow to unlimited size. There were actual inodes on disk; symlinks, hardlinks, and special files could have been implemented.
How hard can it be?
Well, harder than expected. I spent about a month on the code until I threw it away.
basicfs was too much. Too much complexity, too many things to learn at once. I knew too little about VFS and I didn’t know how to structure my code in a meaningful way.
So, I threw basicfs away and started working on a new, simpler filesystem. It allowed me to implement more operations (most importantly creating files or directories) while still avoiding harder problems like managing block allocation for files or directories:
There are two important restrictions:
Block n contains an inode of an object and block n + 1 contains the data of that object. This means the filesystem isn’t suitable for real-world usage, but it makes it a lot easier for me to implement the kernel module.
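The fixed pairing of inode block and data block can be illustrated with a few lines of userspace Python. This is a sketch under assumptions, not the module's code: the 4096-byte block size, the helper names `data_block_of` and `read_object`, and the idea that an object's data fits into that single block are mine, and the inode contents are treated as opaque bytes.

```python
BLOCK_SIZE = 4096  # assumed block size for this sketch


def data_block_of(inode_block):
    """With this layout, finding an object's data needs no allocation table."""
    return inode_block + 1


def read_object(disk, inode_block):
    """Return (raw inode block, raw data block) for the object at inode_block."""
    i_start = inode_block * BLOCK_SIZE
    d_start = data_block_of(inode_block) * BLOCK_SIZE
    inode_raw = bytes(disk[i_start:i_start + BLOCK_SIZE])
    data_raw = bytes(disk[d_start:d_start + BLOCK_SIZE])
    return inode_raw, data_raw


# Fake disk: one object whose inode sits in block 2 and whose data sits in block 3.
disk = bytearray(6 * BLOCK_SIZE)
disk[2 * BLOCK_SIZE:2 * BLOCK_SIZE + 5] = b"INODE"
disk[3 * BLOCK_SIZE:3 * BLOCK_SIZE + 4] = b"data"

inode_raw, data_raw = read_object(disk, 2)
print(inode_raw[:5], data_raw[:4])
```

The price of this simplicity is that every object permanently occupies two blocks, no matter how little data it holds.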
As I said, baby steps. Important ones, though.
The code so far is here:
In no way do I claim that this code is perfect. It’s probably exactly the opposite. For example, there are no locks at all, which means that your kernel might explode when two processes use the filesystem simultaneously.
There is a lot to learn for me. VFS, caching, locking, you name it. I’m all ears for pull requests or comments. :-)
Why did I publish that repo? I think having more examples of kernel code available on the internet is a good thing. Most importantly, share the simple examples. The ones that help you get started.
Where to go? I hope that all this helps me in understanding other filesystems. Maybe, a long time from now, I can start to read and understand what goes on in beasts like ext4, XFS, or btrfs. Maybe.