blog · git · desktop · images · contact · gopher
cat
2018-08-11
Brian Kernighan recently talked about the origin of the name “grep” on Computerphile. He also mentions that ed and then grep were written in PDP 11 assembly language. So, I thought that this would make for a pretty fun excercise: Write a grep in x86-64 assembly. :-)
There will be some restrictions:
Also, I most certainly am not Ken Thompson. This is going to take a while. :-) It’s going to be a series of blog posts over the next few months.
I’d also be very happy to hear about any mistakes that I made.
cat
The first step will be a line-based cat
. Read a line, then print that
line. The following code will do exactly that. The next step is to
broaden my knowledge of x86-64 assembly (once again) and eliminate some
possibly inefficient or plain “stupid” parts of the code. After that, I
can look into implementing a regex engine and only print those lines
that actually match.
(It’s not even cat
as it can’t concatenate files. But you get the
idea.)
The program will not use any third-party libraries. This includes not
using a libc. In other words, we’ll be using syscalls to communicate
with the kernel and we’ll have to implement anything else by ourselves.
When running objdump
on the resulting binary, we only want to see our
own code.
Okay, so here’s the main program:
.text
.global _start
/* (0) */
_start:
/* (1) */
pushq %rbp
movq %rsp, %rbp
/* (2) */
/* -8(%rbp), 8 = pointer to current line (char *)
* -16(%rbp), 8 = size of line buffer (size_t) */
subq $16, %rsp
/* Initialize our pointer with "NULL" and size with 0. */
movq $0, -8(%rbp)
movq $0, -16(%rbp)
.Lreadwriteloop:
/* (3) */
/* Read a line. Memory management is done entirely by
* readline(). */
leaq -8(%rbp), %rdi
leaq -16(%rbp), %rsi
call readline
/* Test if readline() returned 0. If it did, exit our loop. */
cmpq $0, %rax
je .Lreadwriteloop_end
/* Print the line we've just read. */
movq -8(%rbp), %rdi
movq %rax, %rsi
call print
jmp .Lreadwriteloop
.Lreadwriteloop_end:
movq $0, %rdi
call exit
If you’ve never written assembly language before, here are some explanations and remarks.
At (0)
you can see a label called _start
. This is were program flow
begins. In C, you usually use a function called main()
, but that one
belongs to libc and we don’t want to use it. Every program really begins
at _start
and main()
is actually a rather thick wrapper around that.
(I’ve talked about that differences years ago, but the old posting is
only available in german.)
The two instructions at (1)
set up a “frame pointer” or a “stack
frame”. There’s another blog post by me about that
topic, but again, it’s old and has not been translated to
english. It even still uses 32 bit code, but it works the same on 64
bit. This article in good old english is a pretty good
explanation, too.
The basic idea behind a frame pointer is to have one register that holds
the address to the base of your current stack frame. That value is fixed
during the lifetime of your current “function”. %rbp
is usually used
for that task. Addresses relative to %rbp
can then point to
“variables” on your stack.
For example:
movq $0, -8(%rbp)
movq
means “copy a quad word” (that’s 8 bytes in our case). Which 8
bytes? I use an “immediate value” here and it’s a 0
. Where will it be
copied to? To the memory addressed by the value which is currently
stored in %rbp
– minus 8 bytes. The stack in Linux grows from higher
addresses to lower addresses, hence the “minus”.
The code at (2)
alters the “size” of the stack frame and thus makes
room for our variables. It doesn’t really “allocate” anything (IIUC),
but it will influence future push
instructions: They work relative to
%rsp
. In other words, by doing a subq
on %rsp
, we can pretend to
have pushed as many quad words as we like, without having to run actual
push
instructions – but if we were to call foreign code that does a
push
, it won’t clobber our memory. We, on the other hand, can access
our memory directly by using something like -48(%rbp)
.
So … we’re using stack memory here, but we’re not using it with the semantics of a stack. Simply because it’s more flexible that way. It’s just linear memory to us and we can pretty much access each byte individually.
So much for the recap on stack frames.
The loop that begins around (3)
will read a line, check if we actually
accomplished to read anything, and eventually print that line. The
functions exit
, print
, and readline
are not yet defined. We’ll go
through them step by step.
“Calling a function” means you should adhere to a calling
convention. This is an absolute requirement if you want to call
library functions or if your code is called by them. We don’t have to
follow these conventions to the letter in our program, since we only
ever run our own code. For example, I ignored that some registers are
not to be clobbered by a callee. I’ll try to stick to the common cdecl
calling convention, though.
exit
This one is simple:
exit:
/* in: %rdi, exit code */
/* 60 = _exit(), %rdi already contains the exit code */
movq $60, %rax
syscall
It’s just a syscall. This is the point where you can abandon all hope
that this code runs on anything other than Linux – unless your operating
system happens to use 60
, too, to indicate a syscall that terminates
the program.
This, of course, applies to all syscalls. Their numbers and their arguments must match your operating system. And you again have to use the correct calling convention. It’s a little bit different for Linux syscalls on x86-64 than for normal userland calls (see A.2.1 in SysV ABI draft).
print
Not a very complicated function, either:
print:
/* Print the given number of bytes.
*
* in:
* %rdi, pointer to buffer (char *)
* %rsi, length of string (size_t) */
/* Just shuffle arguments, call write(1, &buffer, len) (syscall #1) */
movq %rsi, %rdx
movq %rdi, %rsi
movq $1, %rax
movq $1, %rdi
syscall
ret
In fact, it’s just a wrapper around Linux’s write()
syscall with a
fixed value of 1
: File descriptor number 1 is expected to be stdout.
I could have ditched both exit
and print
and could have used the
syscalls directly in our main program. I didn’t do so because I wanted
to keep the main loop a little bit more abstract and possibly easier to
understand.
readline
This is where the fun starts.
About 100 lines of code, this is the largest function so far. If it were written in C, it would have this signature:
size_t readline(char **buffer, size_t *buffer_size);
The first argument is a pointer to a “string”, but we won’t
NUL-terminate it, because we’re lazy. The second argument is a pointer
to some 8 bytes that hold the currently allocated size for our buffer.
Both of these variables will live on the stack of the main function –
they’re the ones allocated at (2)
.
Our implementation of readline
will dynamically allocate memory on the
heap. That memory will store one entire line. In the future, the (yet
hypothetical) regex engine can work on this buffer. For now, this buffer
will simply be written to stdout by our print
after we have returned
from readline
.
Allocating memory on the heap can be done through the brk()
syscall on
Linux. A brk(0)
will get you the current “program break” address. When
you run this the first time, you’ll get some pretty much random address.
You can then run brk(first_result + some_size)
to move the program
break – and that’s your heap. That’s what malloc()
does in C behind
the scenes (sometimes). Actually, it works very similar to managing the
stack, but in the other direction. Linux’s heap grows towards larger
addresses.
So, what will readline
do? It will:
Let’s just have a look at the actual function:
readline:
/* Read a line (terminated by \n or EOF) from stdin.
*
* in + out:
* %rdi, pointer to pointer to buffer (char **)
* %rsi, pointer to current size of buffer (size_t *)
*
* out:
* %rax, length of string read (including \n), 0 if EOF */
pushq %rbp
movq %rsp, %rbp
/* -8(%rbp), 8 = pointer to pointer to buffer (char **)
* -16(%rbp), 8 = pointer to size of buffer (size_t *)
* -24(%rbp), 8 = length of string
* (other 8 bytes are alignment) */
subq $32, %rsp
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq $0, -24(%rbp)
.Lreadline_next_byte:
/* Check if there is still room in the buffer */
movq -16(%rbp), %rbx
movq (%rbx), %rax
cmpq %rax, -24(%rbp)
je .Lreadline_increase_buffer
.Lreadline_buffer_large_enough:
/* read(stdin, &(buffer + len), 1) (syscall #0, stdin = fd 0) */
movq -8(%rbp), %rbx
movq (%rbx), %rsi
movq -24(%rbp), %rax
addq %rax, %rsi
movq $0, %rax
movq $0, %rdi
movq $1, %rdx
syscall
/* EOF? */
cmpq $0, %rax
je .Lreadline_exit
/* We read something, so increase the counter. */
movq -24(%rbp), %rax
incq %rax
movq %rax, -24(%rbp)
/* \n found? */
movq -8(%rbp), %rbx
movq (%rbx), %rsi
movq -24(%rbp), %rax
decq %rax
addq %rax, %rsi
cmpb $10, (%rsi)
je .Lreadline_exit
jmp .Lreadline_next_byte
.Lreadline_increase_buffer:
/* Get address to base of heap if pointer is still NULL. */
movq -8(%rbp), %rbx
movq (%rbx), %rax
cmpq $0, %rax
je .Lreadline_initial_buffer_allocation
.Lreadline_buffer_initially_allocated:
/* Increase buffer by 32 bytes and call brk(base + new_size).
* This includes saving the new buffer size to the memory
* pointed by -16(%rbp). */
movq -8(%rbp), %rbx
movq (%rbx), %rdi
movq -16(%rbp), %rbx
movq (%rbx), %rax
addq $32, %rax
movq %rax, (%rbx)
addq %rax, %rdi
movq $12, %rax
syscall
jmp .Lreadline_buffer_large_enough
.Lreadline_initial_buffer_allocation:
/* brk(0) (syscall #12) to get base of heap */
movq $12, %rax
movq $0, %rdi
syscall
movq -8(%rbp), %rbx
movq %rax, (%rbx)
jmp .Lreadline_buffer_initially_allocated
.Lreadline_exit:
movq -24(%rbp), %rax
leave
ret
This is the “beauty” of assembly code. It’s really long, but it’s also
really simple. It’s just a bunch of “copy data”, “add or subtract
numbers”, “compare values”, “jump somewhere else”, and “issue a
syscall”. And you should actually be able to understand what’s going on
here. We’ve already talked about stack frames. We’ve talked about
brk()
. You should already be familiar with the read()
syscall.
That’s all there is to it. No further magic added.
I guess that it’s hard to understand nevertheless. Most importantly, there’s a lot of address shuffling going on. Beginners may have trouble understanding simple pointers in C, so here we have a double pointer in assembly. Still, I think it’s a very rewarding excercise: Once you’ve fully understood that it’s just about memory addresses and that it’s a fundamental and simple part of everyday programming to work with them, then you’ll find pointers in C trivial to understand.
Let’s go through it again.
In _start
, the register %rbp
holds a memory address. At that address
begins your stack memory. That already is a “pointer” and using (%rbp)
is a de-reference. -8(%rbp)
means subtracting 8
from the value in
%rbp
(without altering %rbp
) and then using the result for the
actual operation.
leaq -8(%rbp), %rdi
performs the same calculation – subtract 8
from
the value in %rbp
without altering it – and stores the result in the
register %rdi
. lea
stands for “load effective address”. The same
could be done by copying the value of %rbp
to another register and
then subtracting 8
, but lea
is shorter and easier. Either way,
%rdi
now holds the address of some memory where another address is
being stored – a “double pointer”.
%rdi
will be set in the main routine (_start
) and then used as an
argument to readline
. In readline
itself we put a backup copy of
that value on the local stack. If we want to work with this pointer, we
have to load the first address (original value of %rdi
) from the stack
into some register and then de-reference it. This leaves us with a
simple pointer, a „char *
“, which we can pass to read()
.
Some more remarks:
.L
are GNU as’ local symbols. I
use them as jump labels within a function, and they get out of the
way when debugging with GDB.leave
is the opposite of pushq %rbp; movq %rsp, %rbp
. It
discards the current stack frame. You may have noticed that exit
and print
didn’t have any stack frame at all, so they don’t have
leave
, either. exit
didn’t even have ret
, because the syscall
kills the process and never returns.mov
with arbitrary sources and destinations. That’s
why I often had to first copy something from the stack to a register
and then de-reference that address. This might be one of the obvious
things to optimize, I hope.You can find the complete code here:
Now, none of this is fast. We read one byte at a time, which is probably
the worst thing you can do. I didn’t take care to micro-optimize
anything. That’s why it’s called “a naïve cat
”. It’s written in
assembly language, but that doesn’t necessarily mean that it’s “highly
optimized”. I’ll have to look into that next.
Continuation: This series continues in
grep.asm – Step 2: A better cat
.