blog · git · desktop · images · contact & privacy · gopher
cat
2018-09-22
Previous post in this series.
Sorry for the long delay. This project is going much slower than I expected. Oh, well.
Again, this blogpost needs a disclaimer: I am not an expert on x86 Assembly. If you have any corrections or improvements, I’d be very happy to hear about them!
One of the major issues with the previous version was that we were reading one byte at a time. This makes the code a tiny bit easier, but it’s really slow. You should avoid doing this if you expect to read more than a couple of bytes. My bevelbar reads from STDIN and does indeed read one byte at a time – but the thing is, it reads so little data and spends most of its time idling, so nobody cares.
A grep is different. It took my original code about 1.8 seconds to process a file of 4 MB. That’s really bad. This is now down to about 60 milliseconds, which still isn’t fast, but it sure is faster than before.
This performance boost comes from not reading one byte at a time.
Remember that “reading one byte at a time” means one syscall for each
byte. A syscall is a lot of overhead and thus results in a time penalty.
To avoid all this, the program now allocates a large buffer of, say, 4
MB and then issues the read
syscall just once. (Ideally. read
does
not necessarily read the exact amount of bytes you want it to read, but
it may return earlier. You then have to call it again to read the rest.)
I am still lazy: The program reads the entire input until it hits end-of-file. This means, if you want to grep through a file of 1 GB, you need 1 GB of memory. We can easily optimize this at a later stage – I have already organized the program in such a way that functions don’t need to care about this implementation detail. They all work on a “struct” that just tells them: “Here’s a large buffer, the current line begins here and it’s this many bytes in size.”
Okay, so the main loop now looks like this (without comments):
.text
.global _start
_start:
pushq %rbp
movq %rsp, %rbp
subq $64, %rsp
movq %rbp, %rdi
call init_buffer_info
.Lreadwriteloop:
movq %rbp, %rdi
call next_line
cmpq $0, %rax
je .Lreadwriteloop_end
movq %rbp, %rdi
call test_match
cmpq $0, %rax
je .Lreadwriteloop
movq %rbp, %rdi
call print_current
jmp .Lreadwriteloop
.Lreadwriteloop_end:
movq $0, %rdi
call exit
In pseudocode:
- Initialize a thing called "buffer info".
- Loop:
- Read a line. (Update "buffer info" and mark the current line.)
- Exit if no line left.
- Test if the current line matches a pattern.
- Loop again, if it doesn't.
- Print the current line.
At the moment, the first invocation of next_line()
reads everything
into a large buffer. Later invocations only update pointers/counters.
But as you can see, the main loop and the other functions don’t need to
know anything about that.
How? The local stack of _start
allocates space for 64 bytes. In that
space we’re going to store that “buffer info” “struct”. “struct”
basically means a struct in C. It’ll contain various fields:
-8(%rbp), 8 = pointer to data (char *)
-16(%rbp), 8 = allocated size of entire buffer (size_t)
-24(%rbp), 8 = current fill of buffer (size_t)
-32(%rbp), 8 = offset for current line in buffer (size_t)
-40(%rbp), 8 = length of current line (size_t)
Pretty much all functions work with this struct: next_line()
arranges
for the pointer and the counters to “point” to the current line;
test_match()
will only check the current line; print_current()
prints nothing but the current line.
I’ll spare you the details of the new next_line()
function. It has a
lot of comments. Similar to the previous version, it manages memory
using the brk()
syscall and it reads data using read()
.
rep stosb
to initialize memoryAn intermediate version of init_buffer_info()
looked like this:
movq $0, -8(%rdi)
movq $0, -16(%rdi)
movq $0, -24(%rdi)
movq $0, -32(%rdi)
movq $0, -40(%rdi)
In other words, it simply writes a bunch of zeroes to memory. It’s just five instructions that are called exactly once during the lifetime of the program, so it probably doesn’t matter how you do it. The question still arose: What if you were to initialize a few megabytes of memory? Or more?
Sure, you could do it like this (for len >= 1
):
movq $0, %rcx
movq $pointer, %rdi
.Lloop_start:
movb $0, (%rdi)
inc %rdi
inc %rcx
cmpq $len, %rcx
jne .Lloop_start
Isn’t there a better way to do this? Since I don’t have that much background in Assembly language, the best thing I could find is the “rep” mechanism which works on “string instructions”.
For this particular case, let’s use rep stosb
:
movq $pointer, %rdi
movq $40, %rcx
movb $0, %al
rep stosb
We first put a 40 in %rcx
, then a NULL byte in %al
. stosb
reads
the byte from %al
and stores it at the memory location pointed to by
the current value of %rdi
. (Let’s ignore segmentation.) stosb
then
automatically increases %rdi
by one, so if you run stosb
again, it
stores the same value at the next byte in the string pointed to by the
initial %rdi
. How do you run stosb
again? rep
does this. rep
runs the instruction again and again, until %rcx
hits zero.
So, this is a pretty concise version of memset()
.
There’s one more detail. String operations like stosb
can increase
%rdi
, as we’ve seen above, but they can also decrease it. This depends
on whether the “direction flag” is set. You can set or clear that flag
using the instructions std
and cld
.
rep
: repne scasb
to implement strlen()
Okay, our grep reads all the data and stores it in a buffer. It then has to seek through that data and find the individual lines – because grep is a line-oriented program.
Remember the struct that we introduced at the beginning:
-8(%rbp), 8 = pointer to data (char *)
...
-32(%rbp), 8 = offset for current line in buffer (size_t)
-40(%rbp), 8 = length of current line (size_t)
On each invocation, next_line()
has to increase offset
and update
length
. It starts at pointer + offset
and checks each of the
following bytes if it is a newline character. Once it finds a newline,
it’ll store the resulting length in length
.
This is essentially the same as the traditional strlen()
, just with a
check for a newline character instead of a NULL byte.
Now, of course you can manually write a loop that does this. And that’s what I did. My initial version looked like this:
current_line_length:
/* Find the length of the current line.
*
* in:
* %rdi, pointer to base of info struct
*
* out:
* %rax, current line length */
movq $0, %rax
movq -8(%rdi), %rbx
movq -24(%rdi), %rcx
movq -32(%rdi), %rdx
addq %rdx, %rbx
.Lfind_next_newline_loop:
cmpb $10, (%rbx)
je .Lfind_next_newline_found
incq %rax
incq %rbx
incq %rdx
cmpq %rdx, %rcx
je .Lfind_next_newline_exceeded_bounds_check
jmp .Lfind_next_newline_loop
.Lfind_next_newline_found:
incq %rax
ret
.Lfind_next_newline_exceeded_bounds_check:
ret
Pretty long and verbose.
Again, we can use a rep
construct to simplify things:
current_line_length:
movq -8(%rdi), %rdx
movq -32(%rdi), %rbx
addq %rbx, %rdx
movq -24(%rdi), %rcx
movq -32(%rdi), %rbx
subq %rbx, %rcx
movq %rcx, %rbx
movq %rdx, %rdi
movb $10, %al
cld
repne scasb
subq %rcx, %rbx
movq %rbx, %rax
ret
We use repne
now instead of rep
. This also repeats the instruction
until %rcx
hits 0, but it also checks the “zero flag”, as you can read
in the Intel docs on the x86 instruction set.
You now know have to know a little detail: When you do a cmp %rax,
%rbx
, the CPU subtracts the two registers from another. A following
je
(“jump if equal”) is meant to perform a jump if those two values
were equal – in other words, jump if subtracting them yields 0. That’s
what the zero flag is used for. So, if the zero flag is set (subtraction
yields 0), then the two values were equal.
So, when the Intel docs talk about “repeat while ZF flag is clear”, it’s
the same as saying “repeat while a comparison did not determine two
values to be equal”. This is also why repnz
(“repeat while not zero”)
is just an alias for repne
(“repeat while not equal”).
Okay. Now we know what repne
does. We now just need a string operation
similar to stosb
as used in init_buffer_info()
, but it shall not
change our target string as stosb
does but compare it to some value.
And that would be scasb
: It’s short for “scan string, length byte”. It
runs a cmpb %al, (%rdi)
and increases/decreases %rdi
afterwards. The
repne
around it will then evaluate the result of this cmp
(and check
if %rcx
is now zero).
In short, we now have a strlen()
.
There is still a lot of mov
and add
and sub
going on besides the
actual repne scasb
. But I think this version is better than the
previous spaghetti mess. You just have to be familiar with rep*
.
Still no sign of regular expressions, but we have eliminated some of the
very naïve things about our first cat
. The code now also evaluates the
return values of syscalls.
Oh, if you have a look at the code of grep.asm, have a
look at the Git tags. The tag blogpost-step-2
points to the code
version as of this blogpost.
It’s now time to implement test_match()
.
Continuation: This series continues in grep.asm – Step 3: Matching regular expressions.