blog · git · desktop · images · contact


grep.asm – Step 3: Matching regular expressions

2019-06-14

Previous post in this series.

Okay. What’s going on, why the silence? Many reasons, two big ones.

First one is RSI. Apparently, the human body does not like it when you use keyboard and mouse for 8 to 16 hours a day, every day, for some decades, much of it under stress – and then some. So one day I got pain in my hands and it didn’t go away anymore. There were no warning signs (if there were, I didn’t notice), it just happened. Bummer. Recovery is still in progress. I’m typing this text using some silly workaround at glacial speed, but it’s better than nothing. Maybe more on this topic in the future.

This is, of course, the reason why this posting is much shorter and much less detailed than I’d like it to be.

The other reason was simply motivation. I started this project (partly) because I’m annoyed by the complexity and the level of abstraction of today’s software. Programming in Assembly gets you a little bit closer to “the machine”. You see what’s really happening, it doesn’t hide so much from you. (Knowing well that things like register renaming exist, and everything is super-scalar and out-of-order anyway, so even on the level of ASM, things aren’t really what they appear to be.) But of course, handling regular expressions can be arbitrarily complicated. Do I really want to implement that in ASM? Isn’t that going to be a huge monstrosity? That, what I wanted to avoid?

I knew that right from the start and I simply hoped that I could come up with a clever trick along the way.

Nope.

Then I finally found this:

https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html

Funny enough, Brian Kernighan and Rob Pike were looking for simple regex engines, but all the existing ones were too big. As a result, Rob Pike wrote three very short C functions. With those, you can match a small but very useful set of regular expressions:

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

All in those three short functions. That’s just “wow”. Or as Brian Kernighan put it: “a superb example of beautiful code”.

Lazy as I am, I took his code and put an ASM version of it into grep.asm.

And that concludes our journey. Potential bug fixes aside, grep.asm is now finished. The tag blogpost-step-3 points to the code version as of this blogpost.

Comments?