grep.asm – Step 2: A better `cat`

2018-09-22

Sorry for the long delay. This project is going much slower than I expected. Oh, well.

Again, this blogpost needs a disclaimer: I am not an expert on x86 Assembly. If you have any corrections or improvements, I’d be very happy to hear about them!

Reading large chunks

One of the major issues with the previous version was that we were reading one byte at a time. This makes the code a tiny bit easier, but it’s really slow. You should avoid doing this if you expect to read more than a couple of bytes. My bevelbar reads from STDIN and does indeed read one byte at a time – but the thing is, it reads so little data and spends most of its time idling, so nobody cares.

A grep is different. It took my original code about 1.8 seconds to process a file of 4 MB. That’s really bad. This is now down to about 60 milliseconds, which still isn’t fast, but it sure is faster than before.

This performance boost comes from not reading one byte at a time. Remember that “reading one byte at a time” means one syscall for each byte. A syscall is a lot of overhead and thus results in a time penalty. To avoid all this, the program now allocates a large buffer of, say, 4 MB and then issues the read syscall just once. (Ideally. read does not necessarily read the exact amount of bytes you want it to read, but it may return earlier. You then have to call it again to read the rest.)

I am still lazy: The program reads the entire input until it hits end-of-file. This means, if you want to grep through a file of 1 GB, you need 1 GB of memory. We can easily optimize this at a later stage – I have already organized the program in such a way that functions don’t need to care about this implementation detail. They all work on a “struct” that just tells them: “Here’s a large buffer, the current line begins here and it’s this many bytes in size.”

Okay, so the main loop now looks like this (without comments):

.text
.global _start

_start:
    pushq %rbp
    movq %rsp, %rbp

    subq $64, %rsp
    movq %rbp, %rdi
    call init_buffer_info

    .Lreadwriteloop:
        movq %rbp, %rdi
        call next_line

        cmpq $0, %rax
        je .Lreadwriteloop_end

        movq %rbp, %rdi
        call test_match

        cmpq $0, %rax
        je .Lreadwriteloop

        movq %rbp, %rdi
        call print_current

        jmp .Lreadwriteloop

    .Lreadwriteloop_end:
        movq $0, %rdi
        call exit

In pseudocode:

- Initialize a thing called "buffer info".
- Loop:
    - Read a line. (Update "buffer info" and mark the current line.)
    - Exit if no line left.
    - Test if the current line matches a pattern.
    - Loop again, if it doesn't.
    - Print the current line.

At the moment, the first invocation of next_line() reads everything into a large buffer. Later invocations only update pointers/counters. But as you can see, the main loop and the other functions don’t need to know anything about that.

How? The local stack of _start allocates space for 64 bytes. In that space we’re going to store that “buffer info” “struct”. “struct” basically means a struct in C. It’ll contain various fields:

 -8(%rbp), 8 = pointer to data (char *)
-16(%rbp), 8 = allocated size of entire buffer (size_t)
-24(%rbp), 8 = current fill of buffer (size_t)
-32(%rbp), 8 = offset for current line in buffer (size_t)
-40(%rbp), 8 = length of current line (size_t)

Pretty much all functions work with this struct: next_line() arranges for the pointer and the counters to “point” to the current line; test_match() will only check the current line; print_current() prints nothing but the current line.

I’ll spare you the details of the new next_line() function. It has a lot of comments. Similar to the previous version, it manages memory using the brk() syscall and it reads data using read().

Using `rep stosb` to initialize memory

An intermediate version of init_buffer_info() looked like this:

movq $0, -8(%rdi)
movq $0, -16(%rdi)
movq $0, -24(%rdi)
movq $0, -32(%rdi)
movq $0, -40(%rdi)

In other words, it simply writes a bunch of zeroes to memory. It’s just five instructions that are called exactly once during the lifetime of the program, so it probably doesn’t matter how you do it. The question still arose: What if you were to initialize a few megabytes of memory? Or more?

Sure, you could do it like this (for len >= 1):

movq $0, %rcx
movq $pointer, %rdi
.Lloop_start:
    movb $0, (%rdi)
    inc %rdi
    inc %rcx
    cmpq $len, %rcx
    jne .Lloop_start

Isn’t there a better way to do this? Since I don’t have that much background in Assembly language, the best thing I could find is the “rep” mechanism which works on “string instructions”.

For this particular case, let’s use rep stosb:

movq $pointer, %rdi
movq $40, %rcx
movb $0, %al
rep stosb

We first put a 40 in %rcx, then a NULL byte in %al. stosb reads the byte from %al and stores it at the memory location pointed to by the current value of %rdi. (Let’s ignore segmentation.) stosb then automatically increases %rdi by one, so if you run stosb again, it stores the same value at the next byte in the string pointed to by the initial %rdi. How do you run stosb again? rep does this. rep runs the instruction again and again, until %rcx hits zero.

So, this is a pretty concise version of memset().

There’s one more detail. String operations like stosb can increase %rdi, as we’ve seen above, but they can also decrease it. This depends on whether the “direction flag” is set. You can set or clear that flag using the instructions std and cld.

More `rep`: `repne scasb` to implement `strlen()`

Okay, our grep reads all the data and stores it in a buffer. It then has to seek through that data and find the individual lines – because grep is a line-oriented program.

Remember the struct that we introduced at the beginning:

 -8(%rbp), 8 = pointer to data (char *)
...
-32(%rbp), 8 = offset for current line in buffer (size_t)
-40(%rbp), 8 = length of current line (size_t)

On each invocation, next_line() has to increase offset and update length. It starts at pointer + offset and checks each of the following bytes if it is a newline character. Once it finds a newline, it’ll store the resulting length in length.

This is essentially the same as the traditional strlen(), just with a check for a newline character instead of a NULL byte.

Now, of course you can manually write a loop that does this. And that’s what I did. My initial version looked like this:

current_line_length:
    /* Find the length of the current line.
     *
     * in:
     *   %rdi, pointer to base of info struct
     *
     * out:
     *   %rax, current line length */

    movq $0, %rax
    movq -8(%rdi), %rbx
    movq -24(%rdi), %rcx
    movq -32(%rdi), %rdx
    addq %rdx, %rbx

    .Lfind_next_newline_loop:
        cmpb $10, (%rbx)
        je .Lfind_next_newline_found

        incq %rax
        incq %rbx
        incq %rdx
        cmpq %rdx, %rcx
        je .Lfind_next_newline_exceeded_bounds_check

        jmp .Lfind_next_newline_loop

    .Lfind_next_newline_found:
        incq %rax
        ret

    .Lfind_next_newline_exceeded_bounds_check:
        ret

Pretty long and verbose.

Again, we can use a rep construct to simplify things:

current_line_length:
    movq -8(%rdi), %rdx
    movq -32(%rdi), %rbx
    addq %rbx, %rdx

    movq -24(%rdi), %rcx
    movq -32(%rdi), %rbx
    subq %rbx, %rcx
    movq %rcx, %rbx

    movq %rdx, %rdi
    movb $10, %al
    cld
    repne scasb

    subq %rcx, %rbx
    movq %rbx, %rax
    ret

We use repne now instead of rep. This also repeats the instruction until %rcx hits 0, but it also checks the “zero flag”, as you can read in the Intel docs on the x86 instruction set.

You now know have to know a little detail: When you do a cmp %rax, %rbx, the CPU subtracts the two registers from another. A following je (“jump if equal”) is meant to perform a jump if those two values were equal – in other words, jump if subtracting them yields 0. That’s what the zero flag is used for. So, if the zero flag is set (subtraction yields 0), then the two values were equal.

So, when the Intel docs talk about “repeat while ZF flag is clear”, it’s the same as saying “repeat while a comparison did not determine two values to be equal”. This is also why repnz (“repeat while not zero”) is just an alias for repne (“repeat while not equal”).

Okay. Now we know what repne does. We now just need a string operation similar to stosb as used in init_buffer_info(), but it shall not change our target string as stosb does but compare it to some value. And that would be scasb: It’s short for “scan string, length byte”. It runs a cmpb %al, (%rdi) and increases/decreases %rdi afterwards. The repne around it will then evaluate the result of this cmp (and check if %rcx is now zero).

In short, we now have a strlen().

There is still a lot of mov and add and sub going on besides the actual repne scasb. But I think this version is better than the previous spaghetti mess. You just have to be familiar with rep*.

Conclusion

Still no sign of regular expressions, but we have eliminated some of the very naïve things about our first cat. The code now also evaluates the return values of syscalls.

Oh, if you have a look at the code of grep.asm, have a look at the Git tags. The tag blogpost-step-2 points to the code version as of this blogpost.

It’s now time to implement test_match().

Continuation: This series continues in grep.asm – Step 3: Matching regular expressions.

Comments?

grep.asm – Step 2: A better cat

Reading large chunks

Using rep stosb to initialize memory

More rep: repne scasb to implement strlen()

Conclusion

grep.asm – Step 2: A better `cat`

Using `rep stosb` to initialize memory

More `rep`: `repne scasb` to implement `strlen()`