Brian Kernighan recently talked about the origin of the name “grep” on Computerphile. He also mentions that ed and then grep were written in PDP 11 assembly language. So, I thought that this would make for a pretty fun excercise: Write a grep in x86-64 assembly. :-)
There will be some restrictions:
Also, I most certainly am not Ken Thompson. This is going to take a while. :-) It’s going to be a series of blog posts over the next few months.
I’d also be very happy to hear about any mistakes that I made.
The first step will be a line-based
cat. Read a line, then print that
line. The following code will do exactly that. The next step is to
broaden my knowledge of x86-64 assembly (once again) and eliminate some
possibly inefficient or plain “stupid” parts of the code. After that, I
can look into implementing a regex engine and only print those lines
that actually match.
(It’s not even
cat as it can’t concatenate files. But you get the
The program will not use any third-party libraries. This includes not
using a libc. In other words, we’ll be using syscalls to communicate
with the kernel and we’ll have to implement anything else by ourselves.
objdump on the resulting binary, we only want to see our
Okay, so here’s the main program:
.text .global _start /* (0) */ _start: /* (1) */ pushq %rbp movq %rsp, %rbp /* (2) */ /* -8(%rbp), 8 = pointer to current line (char *) * -16(%rbp), 8 = size of line buffer (size_t) */ subq $16, %rsp /* Initialize our pointer with "NULL" and size with 0. */ movq $0, -8(%rbp) movq $0, -16(%rbp) .Lreadwriteloop: /* (3) */ /* Read a line. Memory management is done entirely by * readline(). */ leaq -8(%rbp), %rdi leaq -16(%rbp), %rsi call readline /* Test if readline() returned 0. If it did, exit our loop. */ cmpq $0, %rax je .Lreadwriteloop_end /* Print the line we've just read. */ movq -8(%rbp), %rdi movq %rax, %rsi call print jmp .Lreadwriteloop .Lreadwriteloop_end: movq $0, %rdi call exit
If you’ve never written assembly language before, here are some explanations and remarks.
(0) you can see a label called
_start. This is were program flow
begins. In C, you usually use a function called
main(), but that one
belongs to libc and we don’t want to use it. Every program really begins
main() is actually a rather thick wrapper around that.
(I’ve talked about that differences years ago, but the old posting is
only available in german.)
The two instructions at
(1) set up a “frame pointer” or a “stack
frame”. There’s another blog post by me about that
topic, but again, it’s old and has not been translated to
english. It even still uses 32 bit code, but it works the same on 64
bit. This article in good old english is a pretty good
The basic idea behind a frame pointer is to have one register that holds
the address to the base of your current stack frame. That value is fixed
during the lifetime of your current “function”.
%rbp is usually used
for that task. Addresses relative to
%rbp can then point to
“variables” on your stack.
movq $0, -8(%rbp)
movq means “copy a quad word” (that’s 8 bytes in our case). Which 8
bytes? I use an “immediate value” here and it’s a
0. Where will it be
copied to? To the memory addressed by the value which is currently
%rbp – minus 8 bytes. The stack in Linux grows from higher
addresses to lower addresses, hence the “minus”.
The code at
(2) alters the “size” of the stack frame and thus makes
room for our variables. It doesn’t really “allocate” anything (IIUC),
but it will influence future
push instructions: They work relative to
%rsp. In other words, by doing a
%rsp, we can pretend to
have pushed as many quad words as we like, without having to run actual
push instructions – but if we were to call foreign code that does a
push, it won’t clobber our memory. We, on the other hand, can access
our memory directly by using something like
So … we’re using stack memory here, but we’re not using it with the semantics of a stack. Simply because it’s more flexible that way. It’s just linear memory to us and we can pretty much access each byte individually.
So much for the recap on stack frames.
The loop that begins around
(3) will read a line, check if we actually
accomplished to read anything, and eventually print that line. The
readline are not yet defined. We’ll go
through them step by step.
“Calling a function” means you should adhere to a calling
convention. This is an absolute requirement if you want to call
library functions or if your code is called by them. We don’t have to
follow these conventions to the letter in our program, since we only
ever run our own code. For example, I ignored that some registers are
not to be clobbered by a callee. I’ll try to stick to the common
calling convention, though.
This one is simple:
exit: /* in: %rdi, exit code */ /* 60 = _exit(), %rdi already contains the exit code */ movq $60, %rax syscall
It’s just a syscall. This is the point where you can abandon all hope
that this code runs on anything other than Linux – unless your operating
system happens to use
60, too, to indicate a syscall that terminates
This, of course, applies to all syscalls. Their numbers and their arguments must match your operating system. And you again have to use the correct calling convention. It’s a little bit different for Linux syscalls on x86-64 than for normal userland calls (see A.2.1 in SysV ABI draft).
Not a very complicated function, either:
print: /* Print the given number of bytes. * * in: * %rdi, pointer to buffer (char *) * %rsi, length of string (size_t) */ /* Just shuffle arguments, call write(1, &buffer, len) (syscall #1) */ movq %rsi, %rdx movq %rdi, %rsi movq $1, %rax movq $1, %rdi syscall ret
In fact, it’s just a wrapper around Linux’s
write() syscall with a
fixed value of
1: File descriptor number 1 is expected to be stdout.
I could have ditched both
This is where the fun starts.
About 100 lines of code, this is the largest function so far. If it were written in C, it would have this signature:
size_t readline(char **buffer, size_t *buffer_size);
The first argument is a pointer to a “string”, but we won’t
NUL-terminate it, because we’re lazy. The second argument is a pointer
to some 8 bytes that hold the currently allocated size for our buffer.
Both of these variables will live on the stack of the main function –
they’re the ones allocated at
Our implementation of
readline will dynamically allocate memory on the
heap. That memory will store one entire line. In the future, the (yet
hypothetical) regex engine can work on this buffer. For now, this buffer
will simply be written to stdout by our
Allocating memory on the heap can be done through the
brk() syscall on
brk(0) will get you the current “program break” address. When
you run this the first time, you’ll get some pretty much random address.
You can then run
brk(first_result + some_size) to move the program
break – and that’s your heap. That’s what
malloc() does in C behind
the scenes (sometimes). Actually, it works very similar to managing the
stack, but in the other direction. Linux’s heap grows towards larger
So, what will
readline do? It will:
Let’s just have a look at the actual function:
readline: /* Read a line (terminated by \n or EOF) from stdin. * * in + out: * %rdi, pointer to pointer to buffer (char **) * %rsi, pointer to current size of buffer (size_t *) * * out: * %rax, length of string read (including \n), 0 if EOF */ pushq %rbp movq %rsp, %rbp /* -8(%rbp), 8 = pointer to pointer to buffer (char **) * -16(%rbp), 8 = pointer to size of buffer (size_t *) * -24(%rbp), 8 = length of string * (other 8 bytes are alignment) */ subq $32, %rsp movq %rdi, -8(%rbp) movq %rsi, -16(%rbp) movq $0, -24(%rbp) .Lreadline_next_byte: /* Check if there is still room in the buffer */ movq -16(%rbp), %rbx movq (%rbx), %rax cmpq %rax, -24(%rbp) je .Lreadline_increase_buffer .Lreadline_buffer_large_enough: /* read(stdin, &(buffer + len), 1) (syscall #0, stdin = fd 0) */ movq -8(%rbp), %rbx movq (%rbx), %rsi movq -24(%rbp), %rax addq %rax, %rsi movq $0, %rax movq $0, %rdi movq $1, %rdx syscall /* EOF? */ cmpq $0, %rax je .Lreadline_exit /* We read something, so increase the counter. */ movq -24(%rbp), %rax incq %rax movq %rax, -24(%rbp) /* \n found? */ movq -8(%rbp), %rbx movq (%rbx), %rsi movq -24(%rbp), %rax decq %rax addq %rax, %rsi cmpb $10, (%rsi) je .Lreadline_exit jmp .Lreadline_next_byte .Lreadline_increase_buffer: /* Get address to base of heap if pointer is still NULL. */ movq -8(%rbp), %rbx movq (%rbx), %rax cmpq $0, %rax je .Lreadline_initial_buffer_allocation .Lreadline_buffer_initially_allocated: /* Increase buffer by 32 bytes and call brk(base + new_size). * This includes saving the new buffer size to the memory * pointed by -16(%rbp). */ movq -8(%rbp), %rbx movq (%rbx), %rdi movq -16(%rbp), %rbx movq (%rbx), %rax addq $32, %rax movq %rax, (%rbx) addq %rax, %rdi movq $12, %rax syscall jmp .Lreadline_buffer_large_enough .Lreadline_initial_buffer_allocation: /* brk(0) (syscall #12) to get base of heap */ movq $12, %rax movq $0, %rdi syscall movq -8(%rbp), %rbx movq %rax, (%rbx) jmp .Lreadline_buffer_initially_allocated .Lreadline_exit: movq -24(%rbp), %rax leave ret
This is the “beauty” of assembly code. It’s really long, but it’s also
really simple. It’s just a bunch of “copy data”, “add or subtract
numbers”, “compare values”, “jump somewhere else”, and “issue a
syscall”. And you should actually be able to understand what’s going on
here. We’ve already talked about stack frames. We’ve talked about
brk(). You should already be familiar with the
That’s all there is to it. No further magic added.
I guess that it’s hard to understand nevertheless. Most importantly, there’s a lot of address shuffling going on. Beginners may have trouble understanding simple pointers in C, so here we have a double pointer in assembly. Still, I think it’s a very rewarding excercise: Once you’ve fully understood that it’s just about memory addresses and that it’s a fundamental and simple part of everyday programming to work with them, then you’ll find pointers in C trivial to understand.
Let’s go through it again.
_start, the register
%rbp holds a memory address. At that address
begins your stack memory. That already is a “pointer” and using
is a de-reference.
-8(%rbp) means subtracting
8 from the value in
%rbp (without altering
%rbp) and then using the result for the
leaq -8(%rbp), %rdi performs the same calculation – subtract
the value in
%rbp without altering it – and stores the result in the
lea stands for “load effective address”. The same
could be done by copying the value of
%rbp to another register and
lea is shorter and easier. Either way,
%rdi now holds the address of some memory where another address is
being stored – a “double pointer”.
%rdi will be set in the main routine (
_start) and then used as an
readline itself we put a backup copy of
that value on the local stack. If we want to work with this pointer, we
have to load the first address (original value of
%rdi) from the stack
into some register and then de-reference it. This leaves us with a
simple pointer, a „
char *“, which we can pass to
Some more remarks:
.Lare GNU as’ local symbols. I use them as jump labels within a function, and they get out of the way when debugging with GDB.
leaveis the opposite of
pushq %rbp; movq %rsp, %rbp. It discards the current stack frame. You may have noticed that
exitdidn’t even have
ret, because the syscall kills the process and never returns.
movwith arbitrary sources and destinations. That’s why I often had to first copy something from the stack to a register and then de-reference that address. This might be one of the obvious things to optimize, I hope.
You can find the complete code here:
Now, none of this is fast. We read one byte at a time, which is probably
the worst thing you can do. I didn’t take care to micro-optimize
anything. That’s why it’s called “a naïve
cat”. It’s written in
assembly language, but that doesn’t necessarily mean that it’s “highly
optimized”. I’ll have to look into that next.
Continuation: This series continues in
grep.asm – Step 2: A better