blog · git · desktop · images · contact · gopher

grep.asm – Step 1: A naïve cat


Brian Kernighan recently talked about the origin of the name “grep” on Computerphile. He also mentions that ed and then grep were written in PDP 11 assembly language. So, I thought that this would make for a pretty fun excercise: Write a grep in x86-64 assembly. :-)

There will be some restrictions:

Also, I most certainly am not Ken Thompson. This is going to take a while. :-) It’s going to be a series of blog posts over the next few months.

I’d also be very happy to hear about any mistakes that I made.

A naïve line-based cat

The first step will be a line-based cat. Read a line, then print that line. The following code will do exactly that. The next step is to broaden my knowledge of x86-64 assembly (once again) and eliminate some possibly inefficient or plain “stupid” parts of the code. After that, I can look into implementing a regex engine and only print those lines that actually match.

(It’s not even cat as it can’t concatenate files. But you get the idea.)

The program will not use any third-party libraries. This includes not using a libc. In other words, we’ll be using syscalls to communicate with the kernel and we’ll have to implement anything else by ourselves. When running objdump on the resulting binary, we only want to see our own code.

Okay, so here’s the main program:

.global _start

/* (0) */
    /* (1) */
    pushq %rbp
    movq %rsp, %rbp

    /* (2) */
    /*  -8(%rbp), 8 = pointer to current line (char *)
     * -16(%rbp), 8 = size of line buffer (size_t) */
    subq $16, %rsp

    /* Initialize our pointer with "NULL" and size with 0. */
    movq $0, -8(%rbp)
    movq $0, -16(%rbp)

        /* (3) */
        /* Read a line. Memory management is done entirely by
         * readline(). */
        leaq -8(%rbp), %rdi
        leaq -16(%rbp), %rsi
        call readline

        /* Test if readline() returned 0. If it did, exit our loop. */
        cmpq $0, %rax
        je .Lreadwriteloop_end

        /* Print the line we've just read. */
        movq -8(%rbp), %rdi
        movq %rax, %rsi
        call print

        jmp .Lreadwriteloop

        movq $0, %rdi
        call exit

If you’ve never written assembly language before, here are some explanations and remarks.

At (0) you can see a label called _start. This is were program flow begins. In C, you usually use a function called main(), but that one belongs to libc and we don’t want to use it. Every program really begins at _start and main() is actually a rather thick wrapper around that. (I’ve talked about that differences years ago, but the old posting is only available in german.)

The two instructions at (1) set up a “frame pointer” or a “stack frame”. There’s another blog post by me about that topic, but again, it’s old and has not been translated to english. It even still uses 32 bit code, but it works the same on 64 bit. This article in good old english is a pretty good explanation, too.

The basic idea behind a frame pointer is to have one register that holds the address to the base of your current stack frame. That value is fixed during the lifetime of your current “function”. %rbp is usually used for that task. Addresses relative to %rbp can then point to “variables” on your stack.

For example:

movq $0, -8(%rbp)

movq means “copy a quad word” (that’s 8 bytes in our case). Which 8 bytes? I use an “immediate value” here and it’s a 0. Where will it be copied to? To the memory addressed by the value which is currently stored in %rbp – minus 8 bytes. The stack in Linux grows from higher addresses to lower addresses, hence the “minus”.

The code at (2) alters the “size” of the stack frame and thus makes room for our variables. It doesn’t really “allocate” anything (IIUC), but it will influence future push instructions: They work relative to %rsp. In other words, by doing a subq on %rsp, we can pretend to have pushed as many quad words as we like, without having to run actual push instructions – but if we were to call foreign code that does a push, it won’t clobber our memory. We, on the other hand, can access our memory directly by using something like -48(%rbp).

So … we’re using stack memory here, but we’re not using it with the semantics of a stack. Simply because it’s more flexible that way. It’s just linear memory to us and we can pretty much access each byte individually.

So much for the recap on stack frames.

The loop that begins around (3) will read a line, check if we actually accomplished to read anything, and eventually print that line. The functions exit, print, and readline are not yet defined. We’ll go through them step by step.

“Calling a function” means you should adhere to a calling convention. This is an absolute requirement if you want to call library functions or if your code is called by them. We don’t have to follow these conventions to the letter in our program, since we only ever run our own code. For example, I ignored that some registers are not to be clobbered by a callee. I’ll try to stick to the common cdecl calling convention, though.


This one is simple:

    /* in: %rdi, exit code */

    /* 60 = _exit(), %rdi already contains the exit code */
    movq $60, %rax

It’s just a syscall. This is the point where you can abandon all hope that this code runs on anything other than Linux – unless your operating system happens to use 60, too, to indicate a syscall that terminates the program.

This, of course, applies to all syscalls. Their numbers and their arguments must match your operating system. And you again have to use the correct calling convention. It’s a little bit different for Linux syscalls on x86-64 than for normal userland calls (see A.2.1 in SysV ABI draft).


Not a very complicated function, either:

    /* Print the given number of bytes.
     * in:
     *   %rdi, pointer to buffer (char *)
     *   %rsi, length of string (size_t) */

    /* Just shuffle arguments, call write(1, &buffer, len) (syscall #1) */
    movq %rsi, %rdx
    movq %rdi, %rsi
    movq $1, %rax
    movq $1, %rdi

In fact, it’s just a wrapper around Linux’s write() syscall with a fixed value of 1: File descriptor number 1 is expected to be stdout.

I could have ditched both exit and print and could have used the syscalls directly in our main program. I didn’t do so because I wanted to keep the main loop a little bit more abstract and possibly easier to understand.


This is where the fun starts.

About 100 lines of code, this is the largest function so far. If it were written in C, it would have this signature:

size_t readline(char **buffer, size_t *buffer_size);

The first argument is a pointer to a “string”, but we won’t NUL-terminate it, because we’re lazy. The second argument is a pointer to some 8 bytes that hold the currently allocated size for our buffer. Both of these variables will live on the stack of the main function – they’re the ones allocated at (2).

Our implementation of readline will dynamically allocate memory on the heap. That memory will store one entire line. In the future, the (yet hypothetical) regex engine can work on this buffer. For now, this buffer will simply be written to stdout by our print after we have returned from readline.

Allocating memory on the heap can be done through the brk() syscall on Linux. A brk(0) will get you the current “program break” address. When you run this the first time, you’ll get some pretty much random address. You can then run brk(first_result + some_size) to move the program break – and that’s your heap. That’s what malloc() does in C behind the scenes (sometimes). Actually, it works very similar to managing the stack, but in the other direction. Linux’s heap grows towards larger addresses.

So, what will readline do? It will:

  1. Manage a buffer that holds the data that we’ve read.
  2. Read data from stdin and store it in said buffer.
  3. Check if we’ve read a newline character that signifies the end of the line.
  4. Eventually return the number of bytes read to the caller.

Let’s just have a look at the actual function:

    /* Read a line (terminated by \n or EOF) from stdin.
     * in + out:
     *   %rdi, pointer to pointer to buffer (char **)
     *   %rsi, pointer to current size of buffer (size_t *)
     * out:
     *   %rax, length of string read (including \n), 0 if EOF */
    pushq %rbp
    movq %rsp, %rbp

    /*  -8(%rbp), 8 = pointer to pointer to buffer (char **)
     * -16(%rbp), 8 = pointer to size of buffer (size_t *)
     * -24(%rbp), 8 = length of string
     * (other 8 bytes are alignment) */
    subq $32, %rsp

    movq %rdi, -8(%rbp)
    movq %rsi, -16(%rbp)
    movq $0, -24(%rbp)

        /* Check if there is still room in the buffer */
        movq -16(%rbp), %rbx
        movq (%rbx), %rax
        cmpq %rax, -24(%rbp)
        je .Lreadline_increase_buffer

        /* read(stdin, &(buffer + len), 1) (syscall #0, stdin = fd 0) */
        movq -8(%rbp), %rbx
        movq (%rbx), %rsi
        movq -24(%rbp), %rax
        addq %rax, %rsi

        movq $0, %rax
        movq $0, %rdi
        movq $1, %rdx

        /* EOF? */
        cmpq $0, %rax
        je .Lreadline_exit

        /* We read something, so increase the counter. */
        movq -24(%rbp), %rax
        incq %rax
        movq %rax, -24(%rbp)

        /* \n found? */
        movq -8(%rbp), %rbx
        movq (%rbx), %rsi
        movq -24(%rbp), %rax
        decq %rax
        addq %rax, %rsi
        cmpb $10, (%rsi)
        je .Lreadline_exit

        jmp .Lreadline_next_byte

        /* Get address to base of heap if pointer is still NULL. */
        movq -8(%rbp), %rbx
        movq (%rbx), %rax
        cmpq $0, %rax
        je .Lreadline_initial_buffer_allocation

        /* Increase buffer by 32 bytes and call brk(base + new_size).
         * This includes saving the new buffer size to the memory
         * pointed by -16(%rbp). */
        movq -8(%rbp), %rbx
        movq (%rbx), %rdi
        movq -16(%rbp), %rbx
        movq (%rbx), %rax
        addq $32, %rax
        movq %rax, (%rbx)
        addq %rax, %rdi

        movq $12, %rax
        jmp .Lreadline_buffer_large_enough

        /* brk(0) (syscall #12) to get base of heap */
        movq $12, %rax
        movq $0, %rdi
        movq -8(%rbp), %rbx
        movq %rax, (%rbx)
        jmp .Lreadline_buffer_initially_allocated

        movq -24(%rbp), %rax

This is the “beauty” of assembly code. It’s really long, but it’s also really simple. It’s just a bunch of “copy data”, “add or subtract numbers”, “compare values”, “jump somewhere else”, and “issue a syscall”. And you should actually be able to understand what’s going on here. We’ve already talked about stack frames. We’ve talked about brk(). You should already be familiar with the read() syscall. That’s all there is to it. No further magic added.

I guess that it’s hard to understand nevertheless. Most importantly, there’s a lot of address shuffling going on. Beginners may have trouble understanding simple pointers in C, so here we have a double pointer in assembly. Still, I think it’s a very rewarding excercise: Once you’ve fully understood that it’s just about memory addresses and that it’s a fundamental and simple part of everyday programming to work with them, then you’ll find pointers in C trivial to understand.

Let’s go through it again.

In _start, the register %rbp holds a memory address. At that address begins your stack memory. That already is a “pointer” and using (%rbp) is a de-reference. -8(%rbp) means subtracting 8 from the value in %rbp (without altering %rbp) and then using the result for the actual operation.

leaq -8(%rbp), %rdi performs the same calculation – subtract 8 from the value in %rbp without altering it – and stores the result in the register %rdi. lea stands for “load effective address”. The same could be done by copying the value of %rbp to another register and then subtracting 8, but lea is shorter and easier. Either way, %rdi now holds the address of some memory where another address is being stored – a “double pointer”.

%rdi will be set in the main routine (_start) and then used as an argument to readline. In readline itself we put a backup copy of that value on the local stack. If we want to work with this pointer, we have to load the first address (original value of %rdi) from the stack into some register and then de-reference it. This leaves us with a simple pointer, a „char *“, which we can pass to read().

Some more remarks:


You can find the complete code here:


Now, none of this is fast. We read one byte at a time, which is probably the worst thing you can do. I didn’t take care to micro-optimize anything. That’s why it’s called “a naïve cat”. It’s written in assembly language, but that doesn’t necessarily mean that it’s “highly optimized”. I’ll have to look into that next.

Continuation: This series continues in grep.asm – Step 2: A better cat.