blog · git · desktop · images · contact & privacy · gopher
Previous post in this series.
Sorry for the long delay. This project is going much slower than I expected. Oh, well.
Again, this blogpost needs a disclaimer: I am not an expert on x86 Assembly. If you have any corrections or improvements, I’d be very happy to hear about them!
One of the major issues with the previous version was that we were reading one byte at a time. This makes the code a tiny bit easier, but it’s really slow. You should avoid doing this if you expect to read more than a couple of bytes. My bevelbar reads from STDIN and does indeed read one byte at a time – but the thing is, it reads so little data and spends most of its time idling, so nobody cares.
A grep is different. It took my original code about 1.8 seconds to process a file of 4 MB. That’s really bad. This is now down to about 60 milliseconds, which still isn’t fast, but it sure is faster than before.
This performance boost comes from not reading one byte at a time.
Remember that “reading one byte at a time” means one syscall for each
byte. A syscall is a lot of overhead and thus results in a time penalty.
To avoid all this, the program now allocates a large buffer of, say, 4
MB and then issues the
read syscall just once. (Ideally.
not necessarily read the exact amount of bytes you want it to read, but
it may return earlier. You then have to call it again to read the rest.)
I am still lazy: The program reads the entire input until it hits end-of-file. This means, if you want to grep through a file of 1 GB, you need 1 GB of memory. We can easily optimize this at a later stage – I have already organized the program in such a way that functions don’t need to care about this implementation detail. They all work on a “struct” that just tells them: “Here’s a large buffer, the current line begins here and it’s this many bytes in size.”
Okay, so the main loop now looks like this (without comments):
.text .global _start _start: pushq %rbp movq %rsp, %rbp subq $64, %rsp movq %rbp, %rdi call init_buffer_info .Lreadwriteloop: movq %rbp, %rdi call next_line cmpq $0, %rax je .Lreadwriteloop_end movq %rbp, %rdi call test_match cmpq $0, %rax je .Lreadwriteloop movq %rbp, %rdi call print_current jmp .Lreadwriteloop .Lreadwriteloop_end: movq $0, %rdi call exit
- Initialize a thing called "buffer info". - Loop: - Read a line. (Update "buffer info" and mark the current line.) - Exit if no line left. - Test if the current line matches a pattern. - Loop again, if it doesn't. - Print the current line.
At the moment, the first invocation of
next_line() reads everything
into a large buffer. Later invocations only update pointers/counters.
But as you can see, the main loop and the other functions don’t need to
know anything about that.
How? The local stack of
_start allocates space for 64 bytes. In that
space we’re going to store that “buffer info” “struct”. “struct”
basically means a struct in C. It’ll contain various fields:
-8(%rbp), 8 = pointer to data (char *) -16(%rbp), 8 = allocated size of entire buffer (size_t) -24(%rbp), 8 = current fill of buffer (size_t) -32(%rbp), 8 = offset for current line in buffer (size_t) -40(%rbp), 8 = length of current line (size_t)
Pretty much all functions work with this struct:
for the pointer and the counters to “point” to the current line;
test_match() will only check the current line;
prints nothing but the current line.
I’ll spare you the details of the new
next_line() function. It has a
lot of comments. Similar to the previous version, it manages memory
brk() syscall and it reads data using
rep stosbto initialize memory
An intermediate version of
init_buffer_info() looked like this:
movq $0, -8(%rdi) movq $0, -16(%rdi) movq $0, -24(%rdi) movq $0, -32(%rdi) movq $0, -40(%rdi)
In other words, it simply writes a bunch of zeroes to memory. It’s just five instructions that are called exactly once during the lifetime of the program, so it probably doesn’t matter how you do it. The question still arose: What if you were to initialize a few megabytes of memory? Or more?
Sure, you could do it like this (for
len >= 1):
movq $0, %rcx movq $pointer, %rdi .Lloop_start: movb $0, (%rdi) inc %rdi inc %rcx cmpq $len, %rcx jne .Lloop_start
Isn’t there a better way to do this? Since I don’t have that much background in Assembly language, the best thing I could find is the “rep” mechanism which works on “string instructions”.
For this particular case, let’s use
movq $pointer, %rdi movq $40, %rcx movb $0, %al rep stosb
We first put a 40 in
%rcx, then a NULL byte in
the byte from
%al and stores it at the memory location pointed to by
the current value of
%rdi. (Let’s ignore segmentation.)
%rdi by one, so if you run
stosb again, it
stores the same value at the next byte in the string pointed to by the
%rdi. How do you run
rep does this.
runs the instruction again and again, until
%rcx hits zero.
So, this is a pretty concise version of
There’s one more detail. String operations like
stosb can increase
%rdi, as we’ve seen above, but they can also decrease it. This depends
on whether the “direction flag” is set. You can set or clear that flag
using the instructions
repne scasbto implement
Okay, our grep reads all the data and stores it in a buffer. It then has to seek through that data and find the individual lines – because grep is a line-oriented program.
Remember the struct that we introduced at the beginning:
-8(%rbp), 8 = pointer to data (char *) ... -32(%rbp), 8 = offset for current line in buffer (size_t) -40(%rbp), 8 = length of current line (size_t)
On each invocation,
next_line() has to increase
offset and update
length. It starts at
pointer + offset and checks each of the
following bytes if it is a newline character. Once it finds a newline,
it’ll store the resulting length in
This is essentially the same as the traditional
strlen(), just with a
check for a newline character instead of a NULL byte.
Now, of course you can manually write a loop that does this. And that’s what I did. My initial version looked like this:
current_line_length: /* Find the length of the current line. * * in: * %rdi, pointer to base of info struct * * out: * %rax, current line length */ movq $0, %rax movq -8(%rdi), %rbx movq -24(%rdi), %rcx movq -32(%rdi), %rdx addq %rdx, %rbx .Lfind_next_newline_loop: cmpb $10, (%rbx) je .Lfind_next_newline_found incq %rax incq %rbx incq %rdx cmpq %rdx, %rcx je .Lfind_next_newline_exceeded_bounds_check jmp .Lfind_next_newline_loop .Lfind_next_newline_found: incq %rax ret .Lfind_next_newline_exceeded_bounds_check: ret
Pretty long and verbose.
Again, we can use a
rep construct to simplify things:
current_line_length: movq -8(%rdi), %rdx movq -32(%rdi), %rbx addq %rbx, %rdx movq -24(%rdi), %rcx movq -32(%rdi), %rbx subq %rbx, %rcx movq %rcx, %rbx movq %rdx, %rdi movb $10, %al cld repne scasb subq %rcx, %rbx movq %rbx, %rax ret
repne now instead of
rep. This also repeats the instruction
%rcx hits 0, but it also checks the “zero flag”, as you can read
in the Intel docs on the x86 instruction set.
You now know have to know a little detail: When you do a
%rbx, the CPU subtracts the two registers from another. A following
je (“jump if equal”) is meant to perform a jump if those two values
were equal – in other words, jump if subtracting them yields 0. That’s
what the zero flag is used for. So, if the zero flag is set (subtraction
yields 0), then the two values were equal.
So, when the Intel docs talk about “repeat while ZF flag is clear”, it’s
the same as saying “repeat while a comparison did not determine two
values to be equal”. This is also why
repnz (“repeat while not zero”)
is just an alias for
repne (“repeat while not equal”).
Okay. Now we know what
repne does. We now just need a string operation
stosb as used in
init_buffer_info(), but it shall not
change our target string as
stosb does but compare it to some value.
And that would be
scasb: It’s short for “scan string, length byte”. It
cmpb %al, (%rdi) and increases/decreases
%rdi afterwards. The
repne around it will then evaluate the result of this
cmp (and check
%rcx is now zero).
In short, we now have a
There is still a lot of
sub going on besides the
repne scasb. But I think this version is better than the
previous spaghetti mess. You just have to be familiar with
Still no sign of regular expressions, but we have eliminated some of the
very naïve things about our first
cat. The code now also evaluates the
return values of syscalls.
Oh, if you have a look at the code of grep.asm, have a
look at the Git tags. The tag
blogpost-step-2 points to the code
version as of this blogpost.
It’s now time to implement
Continuation: This series continues in grep.asm – Step 3: Matching regular expressions.