This is a very long message, but I hope you won't mind -- 
there are some interesting things here.  Grab yourself a 
drink, sit down, and have a bit of a read.  GNU Grep's 
claim to be "The fastest grep in the West" is in for a bit 
of a shake.  

Back in the November 1997 issue of Dr. Dobb's Journal, I 
published an article entitled "High-Speed Finite-State 
Machines".  In that article, I presented some cute coding 
trickery combining threaded assembly with table-driven 
finite state machines to produce a stunningly fast 
architecture for handling byte-oriented applications.  The 
article included a word count utility, a grep program and a 
Huffman decoder program.  

In each case, the speed of the programs was (and is) 
amazing: the Huffman decoder decompresses a bitstream 
(bit-aligned, not byte-aligned) in just 14 cycles per 
output byte.  The word count program is faster than any 
comparable line/word counting program (although I note that 
if you skip line counting and use a 64k lookup table, you 
can create faster word counting code).  The grep 
program (Grouse Grep, a.k.a. ggrep) easily outstrips most 
comparable programs, regardless of whether they implement 
simple string searches or regular expression searches.  
[See later for some limitations of this grep program, 
however.]  

Well, I thought that such fast code, and the potential of 
this general-purpose architecture to provide stunning 
improvements in dozens of other applications, would cause 
quite a splash when the article was published.  While the 
reaction was quite pleasing, the interest was nowhere near 
as great as I'd hoped... partially because my expectations 
were very, very high.  

So why didn't the code go as far as I'd hoped?  There are 
quite a few reasons, once you start looking:

    1.  The programs ran on DOS, at a time when Windows 95 
        had mostly (arguably?) shed its DOS inheritance;

    2.  It was produced using TopSpeed C, instead of a 
        more mainstream C compiler such as Microsoft's 
        or Borland's; 

    3.  The DDJ article title and text somewhat underplayed 
        the benefits of the architecture.  This was 
        partially due to my originally writing TWO articles, 
        4000 words each, which ended up being distilled 
        into one 3000-word article.  Many of the things I 
        originally hoped to say got left out along the way.  
        (That's also why the article is very klunky in one 
        or two places); 

    4.  I shied away from directly comparing my code to 
        other packages -- as I didn't have time to properly 
        compare speed, features or correctness, and so 
        didn't want to publish inadequately-researched 
        comparisons.  Remember that everything I published 
        was achieved as an after-hours project, taking 
        from May 1995 to October 1997 to create.  

    5.  The key to the high-speed operation was some 
        hand-crafted Intel assembly, so the code was of 
        limited interest on non-Intel platforms.  

    6.  Grep is a utility closely associated with the UNIX 
        community; publishing a DOS program misses much of 
        the likely audience.  

So, I decided in early 1998 to correct these deficiencies 
by porting the code to GCC running on GNU/Linux, and by 
trying to take on GNU Grep head-to-head.  Linux provided 
the Intel PC platform I had originally designed for, so I 
was confident that the threaded assembly trickery I had 
(re)invented would be applicable.  

When I first started playing with the idea behind these 
applications (the lodsb/xlat/jmp ax sequence), I tried it 
out in the context of writing a word count utility.  So it 
was natural, when contemplating the GCC/Linux version, to 
port the word count utility first.  

(Incidentally, the word count utility was the driving force 
the first time round, in early May 1995: I tried more 
conventional coding techniques to implement a wc utility, 
but these didn't produce the speed I wanted.  I had 
previously dreamt up the "lodsb/xlat/jmp ax" state table 
idea and had noted it as an interesting idea, but had never 
tried to use it anywhere.  I threw it into the wc app only 
when other approaches proved to be too slow.)  

A key question was how to port the lodsb/xlat/jmp ax 
sequence.  Several issues came up:
    - Should I write a pure ANSI-compatible version, or 
        a GCC-extended version, or a GCC-extended version 
        plus assembly optimisation, or should I write 
        the match engine purely in assembly?  
    - Would GCC's flat memory model be unsuitable for the 
        page-aligned aspects of the assembly code?
    - Some feedback from the FORTH community suggested that 
        the 386 instruction sequence:
            movb (%ecx),%al
            incl %ecx
            jmp *(%ebx,%eax,4)
        would run faster because its load/store style 
        better exploits the newer processors' pipelined 
        architecture.  

In the end, I decided to base my code on a GCC-extended 
version, with GCC's computed goto extension being the 
nearest to a direct translation of the original assembly.  
Some of the limitations of the original assembly -- notably 
the restriction that all code entry points fit into a 
256-byte code page -- no longer applied in the GCC version, 
so the freedom to experiment in C was more important than 
squeezing the last iota of performance out of the 
assembly version.  

A highly condensed version of the C code for the inner loop 
of my word counting utility follows.  As noted in the DDJ 
article, we add the byte 0xee to the end of the buffer, so 
that we only test for buffer end when we see this byte -- 
which will hopefully be very rare for typical text files.  

entryEE_NOP:    if (pText == pEnd) goto Finished;
                /*FALLTHROUGH*/
entryNOP:       NextCh = *pText++;
                pAction = pTab[NextCh];
                goto *pAction;

entryEE_WORD:   if (pText == pEnd) goto Finished;
                /*FALLTHROUGH*/
entryWORD:      Words++;
                pTab = WordTable;
                NextCh = *pText++;
                pAction = pTab[NextCh];
                goto *pAction;

entryWHITE:     pTab = WhiteTable;
                NextCh = *pText++;
                pAction = pTab[NextCh];
                goto *pAction;

entryLF_WHITE:  pTab = WhiteTable;
                /*FALLTHROUGH*/
entryLF_NOP:    Lines++;
                NextCh = *pText++;
                pAction = pTab[NextCh];
                goto *pAction;

Finished:       /*End of buffer found: remember in-word context and exit*/
                pCurrentTable = pTab;
                return;
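
As an aside, here is a minimal sketch of the buffer 
preparation that the sentinel trick implies.  The plain 
read() interface and the names FillBuffer and SENTINEL_BYTE 
are purely illustrative -- the released code uses its own 
buffering modules -- but the essential point is that the 
buffer is allocated with one spare byte, the 0xee sentinel 
is appended after the data, and pEnd points one past the 
sentinel, so the engine's pText == pEnd test only fires 
once the appended sentinel has been fetched:

#include <unistd.h>

#define SENTINEL_BYTE 0xee

/*Read up to DataSize bytes into pBuf -- which must have room
  for DataSize + 1 bytes -- append the sentinel, and return a
  pointer one past the sentinel for use as pEnd*/
static char *FillBuffer(int Fd, char *pBuf, size_t DataSize)
{
        ssize_t Len = read(Fd, pBuf, DataSize);

        if (Len < 0) {
                Len = 0;        /*Error handling omitted in this sketch*/
        }
        pBuf[Len] = (char) SENTINEL_BYTE;
        return pBuf + Len + 1;
}

A genuine 0xee byte in the data also routes through the 
entryEE_* actions, but there the pText == pEnd test fails 
and the byte is then handled normally.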

The state tables need to be treated a little differently to 
the DOS version.  Each table contains absolute memory 
addresses instead of code page byte offsets, so the action 
type is a void * pointer.  Any code assuming that each 
entry was a byte (mainly fill operations like memset) 
has been changed to be more independent of the entry type.  
Mostly, this meant that memset calls were rewritten as 
loops.  

The actual entry addresses for each action (e.g. 
entryLF_WHITE) need to be advertised to the code 
constructing the state tables.  I implemented this by 
giving the engine a parameter that points to a structure; 
if this pointer isn't NULL, the engine writes its entry 
points into the structure and returns, so we can collect 
the entry points during initialisation and then run the 
engine during processing:

        /*Does the client want to receive the action entry points?*/
        if (pActionCodes != NULL) {
                /*Yes, write them now and then return*/
                pActionCodes->pEE_NOP   = &&entryEE_NOP;
                pActionCodes->pNOP      = &&entryNOP;
                pActionCodes->pEE_WORD  = &&entryEE_WORD;
                pActionCodes->pWORD     = &&entryWORD;
                pActionCodes->pWHITE    = &&entryWHITE;
                pActionCodes->pLF_WHITE = &&entryLF_WHITE;
                pActionCodes->pLF_NOP   = &&entryLF_NOP;
                return;
        }
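
To make the table construction concrete, here is a hedged 
sketch of how these advertised entry points might be used 
to fill the two 256-entry tables with void * actions, using 
loops rather than memset as described above.  The typedef 
name ActionCodes, the function name BuildTables and the 
reliance on isspace() are illustrative only; the released 
code differs in detail:

#include <ctype.h>

typedef struct {
        void *pEE_NOP, *pNOP, *pEE_WORD, *pWORD;
        void *pWHITE, *pLF_WHITE, *pLF_NOP;
} ActionCodes;

void *WhiteTable[256];          /*State: currently between words*/
void *WordTable[256];           /*State: currently inside a word*/

void BuildTables(const ActionCodes *pCodes)
{
        unsigned int Ch;

        /*memset can't fill pointer-sized entries: use a loop*/
        for (Ch = 0; Ch < 256; Ch++) {
                if (isspace(Ch)) {
                        WhiteTable[Ch] = pCodes->pNOP;   /*Stay between words*/
                        WordTable[Ch]  = pCodes->pWHITE; /*Word just ended*/
                } else {
                        WhiteTable[Ch] = pCodes->pWORD;  /*New word starts*/
                        WordTable[Ch]  = pCodes->pNOP;   /*Word continues*/
                }
        }

        /*Newline actions also bump the line count*/
        WhiteTable['\n'] = pCodes->pLF_NOP;
        WordTable['\n']  = pCodes->pLF_WHITE;

        /*The sentinel byte 0xee behaves the same way, but its
          actions also check for the end of the buffer*/
        WhiteTable[0xee] = pCodes->pEE_WORD;
        WordTable[0xee]  = pCodes->pEE_NOP;
}

The Finished action above stores the current table in 
pCurrentTable, so the in-word/between-words context 
survives across buffer boundaries.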

This version compared extremely favourably to the GNU wc 
utility: it ran almost 30% faster.  While it was possible 
that some of this was due to different implementation 
choices such as file buffering size, a few experiments 
quickly demonstrated that almost all the performance gain 
was due to the speed of the table-driven threaded assembly 
of the search engine.  

Next, I took a look at the assembly generated by GCC, to 
see how it compared to the lodsb/xlat/jmp ax sequence I'd 
started from.  I found that while most of the generated 
code was optimal, GCC hadn't spotted one of the features 
of the original: it kept clearing EAH (the bits of EAX 
above AL) to zero, not realising that it could be cleared 
once at startup and assumed to be clear thereafter.  I 
edited the assembly intermediate file by hand to reinstate 
this optimisation, reducing the number of instructions per 
action from (on average) 5 to 4, and the optimised program 
sped up by nearly 9%, to be nearly 35% faster than GNU wc.  

So this experiment, plus a quick look at the GNU grep 
sources, convinced me that it was worthwhile porting the 
grep program to GNU/Linux, and taking on the very 
formidable GNU grep head-to-head.  In particular, this 
meant that (a) bugs in Grouse Grep (ggrep), especially the 
mishandling of NULs, had to be fixed, (b) many, if not all, 
cases where ggrep was slower than grep had to be addressed, 
and (c) many of the additional features in grep needed to 
be added to ggrep.  In addition, I wanted to clean up 
various things that weren't well-written, but which were 
too risky to tackle as the DDJ publication date loomed.  

Here, then, are the results for a highly unscientific list 
of expressions that I used while developing the code.  
It compares the performance of grep (v 2.3) versus ggrep, 
when searching an 8MB text file containing a catalog of 
files available for download from a BBS.  I don't pretend 
that this comparison is unbiased; it just happens to be 
the reference set I used.  I'll try to be more objective 
and rigorous elsewhere.  

  ./ggrep      ./ggrep-c    grep23   Arguments

  4.20(  95%)   4.22(  95%)   4.42   -c e
  4.52(  86%)   4.52(  86%)   5.21   -c ee
  3.20(  91%)   3.21(  91%)   3.49   -c eee
  2.79(  94%)   2.78(  93%)   2.96   -c eeee
  2.62( 100%)   2.63( 101%)   2.60   -c J
  3.55(  88%)   3.55(  88%)   3.99   -c JJ
  2.85(  91%)   2.86(  91%)   3.13   -c JJJ
  2.51(  93%)   2.51(  93%)   2.69   -c JJJJ
  4.17(  44%)   4.29(  45%)   9.33   -c a.e
  4.27(  43%)   4.41(  44%)   9.87   -c a..e
  4.29(  43%)   4.49(  45%)   9.81   -c a...e
  6.25(  58%)   6.76(  63%)  10.72   -c th\\?d
  3.17(  51%)   3.31(  53%)   6.14   -c x.*y.*z
  4.13(  65%)   4.15(  65%)   6.33   -c [a]
  4.98(  72%)   5.61(  81%)   6.87   -c [qj].*[qj].*[qj]
  2.33(  95%)   2.32(  94%)   2.45   -c Kermit
  3.43(  50%)   3.60(  52%)   6.85   -c [abc][def][ghi]
  1.71(  98%)   1.75( 101%)   1.73   -c thisisaverylongstring
  4.49(  38%)   4.62(  39%)  11.70   -c e$
  2.79(  46%)   2.81(  47%)   5.96   -c ^W
  4.45(  71%)   4.58(  73%)   6.23   -ci e
  4.98(  70%)   4.90(  69%)   7.10   -ci ee
  3.43(  73%)   3.40(  72%)   4.67   -ci eee
  2.96(  73%)   3.00(  74%)   4.02   -ci eeee
  4.22(  52%)   4.80(  59%)   8.01   -ci J
  3.61(  76%)   3.61(  76%)   4.71   -ci JJ
  2.90(  79%)   2.95(  80%)   3.66   -ci JJJ
  2.51(  74%)   2.52(  75%)   3.36   -ci JJJJ
  5.16(  39%)   5.86(  45%)  12.92   -ci a.e
  5.26(  43%)   5.91(  49%)  11.97   -ci a..e
  5.12(  42%)   5.52(  45%)  12.19   -ci a...e
  7.24(  55%)   8.06(  61%)  13.04   -ci th\\?d
  8.23(  69%)   9.65(  81%)  11.81   -ci x.*y.*z
  4.44(  80%)   4.83(  87%)   5.51   -ci [a]
  5.56(  75%)   6.43(  87%)   7.38   -ci [qj].*[qj].*[qj]
  2.54(  83%)   2.57(  84%)   3.03   -ci Kermit
  3.64(  53%)   3.80(  55%)   6.80   -ci [abc][def][ghi]
  1.75(  87%)   1.82(  91%)   1.99   -ci thisisaverylongstring
  4.57(  32%)   5.07(  35%)  14.16   -ci e$
  2.78(  21%)   2.82(  21%)  12.90   -ci ^W

(The numbers are times in seconds to perform the search 
10 times.  "grep23" is GNU grep version 2.3, "ggrep-c" 
is Grouse Grep with the match engine using merely 
GCC-generated code; "ggrep" is the version where the 
EAH-is-zero optimisation has been applied to GCC's output.  
Percentages compare the Grouse times to the GNU times; 
less than 100% is faster.  Timing is a little crude, and 
variations of up to 0.2 second or so aren't significant.  
The file in question was in RAM, so the times ignore disk 
I/O overhead and focus on the CPU time required to perform 
the search.)  

As you can see, Grouse Grep is faster than GNU grep in 
almost every case above, sometimes over twice as fast.  

This message is already way too long, so very briefly, a 
few highlights of ggrep:

- I've written a Perl script to edit GCC's assembly output 
  and add the EAH optimisation.  This script (called 
  trimasm) assumes various things based on how matchgcc.c 
  is written.  The script is run as part of the program 
  compilation/linking commands in the Makefile.  

- ggrep includes a minor refinement that I've invented to 
  the Tuned Boyer-Moore simple string search algorithm, 
  called the Self-Tuning Boyer-Moore algorithm.  In 
  addition, I've found that GNU Grep bails out of using 
  BM-style searches for case-insensitive strings, despite 
  the fact that this is quite easy and efficient to 
  implement.  This is why ggrep's case-insensitive string 
  searches (3 chars or longer) are faster than grep's.  
  (A hedged sketch of a case-folded BM-style search appears 
  after this list.)  

- Parts of the code have been rewritten to make it more 
  flexible and reusable.  Most modules (with one or 
  two glaring exceptions) are now completely reentrant; 
  previously, module-wide configuration flags limited the 
  module's use.  This effort included substantial work to 
  represent options within the compiled format.  I've also 
  tried to prepare for Unicode support, as the table-driven 
  nature of the search engine should make things like UTF-8 
  handling a breeze.  

- Buffered file handling has been rewritten to (mostly) 
  match the memory-mapped file buffering interface of GNU 
  grep, but has been split out into a reusable module 
  called FastFile.  
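
As promised above, here is a hedged sketch of the sort of 
case-folded BM-style search alluded to in the second point.  
This is a plain Horspool-style skip loop, not the 
Self-Tuning Boyer-Moore refinement actually used in ggrep, 
and the name CaseFoldSearch is illustrative only; the point 
is simply that folding case does not force you to abandon 
the skip table:

#include <ctype.h>
#include <stddef.h>

/*Return a pointer to the first case-insensitive match of
  pPat[0..PatLen-1] within pText[0..TextLen-1], or NULL*/
const char *CaseFoldSearch(const char *pText, size_t TextLen,
                           const char *pPat, size_t PatLen)
{
        size_t Skip[256];
        size_t i;
        size_t Pos;

        if (PatLen == 0 || PatLen > TextLen) {
                return NULL;
        }

        /*Bytes not in the pattern allow a full-pattern skip*/
        for (i = 0; i < 256; i++) {
                Skip[i] = PatLen;
        }

        /*Give both cases of each pattern byte the same (smaller)
          skip, so case folding costs nothing during the scan*/
        for (i = 0; i + 1 < PatLen; i++) {
                unsigned char Ch = (unsigned char) pPat[i];

                Skip[tolower(Ch)] = PatLen - 1 - i;
                Skip[toupper(Ch)] = PatLen - 1 - i;
        }

        /*Align the pattern end against the text, compare
          backwards with case folding, skip on mismatch*/
        Pos = PatLen - 1;
        while (Pos < TextLen) {
                size_t j = 0;

                while (j < PatLen &&
                       tolower((unsigned char) pText[Pos - j]) ==
                       tolower((unsigned char) pPat[PatLen - 1 - j])) {
                        j++;
                }
                if (j == PatLen) {
                        /*Match found: report its starting position*/
                        return pText + Pos - (PatLen - 1);
                }
                /*Mismatch: skip on the text byte under the last
                  pattern position*/
                Pos += Skip[(unsigned char) pText[Pos]];
        }
        return NULL;
}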

Paradoxically, the software in this release that I'm most 
excited about has relatively little to do with performance.  
It's a trace facility called Tracery.  This module captures 
so many of the features I'd always wanted in a trace 
facility.  Its real importance comes from the fact that it 
is primarily designed to improve the reusability of 
software modules: the designer and implementor can capture 
and share their effort in tracing and debugging the code, 
with little or no impact on the speed of the program.  Even 
if nothing else in my software interests you, I urge you to 
start using Tracery on your C and C++ code NOW (and port the 
module to other languages!).  

However, just when it looks like ggrep can lay claim to 
being the "fastest grep in the West", some objections come 
up:

- It doesn't support all the options that grep does, either 
  on the command line or in the code; 

- It doesn't run on multiple platforms; 

- It is limited to GCC, being based on non-ANSI C code; 

- It doesn't support extended regular expression syntax 
  (grouping, alternation, backreferences, {n,m} iteration); 

- It has a user base of only 1; grep has millions of 
  users.  ggrep's bug level is therefore substantially 
  higher.  

- It uses an NFA architecture rather than a DFA.  

The last point is very important, and deserves expansion.  
Compare the performance of the following search:

277.20(3904%) 336.10(4734%)   7.10   -c \w.*[a-j].*[A-Z].*\w

As soon as the amount of backtracking involved in 
completing the search becomes significant, the NFA engine 
becomes hopelessly inefficient.  The DFA engine throws more 
memory at the search (and can run out of memory more easily 
in a few cases), but in return it delivers much more 
reliable and predictable execution times.  

The reason the earlier comparisons did not show this defect 
is that ggrep is aware of its deficiencies, and tries to 
minimise the impact.  For example, when searching for 
"x.*y.*z", ggrep first looks for the z, and only attempts 
a full match on lines where a z is found.  Even better, 
the full match uses the CPU byte search to look for the x.  
This approach is very much hit-and-miss: it means the 
code makes guesses about the input data, and can be 
severely punished if its guess is wrong, or if its 
best guess simply isn't good enough.  
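
To make that strategy concrete, here is a hedged sketch of 
this kind of literal prefilter, using "x.*y.*z" as the 
example.  The names CountMatches and LineMatch (a stand-in 
for the real match engine) are illustrative only:

#include <string.h>

/*Stand-in for the real match engine: returns nonzero if the
  line pLine..pLineEnd matches the full expression "x.*y.*z"*/
extern int LineMatch(const char *pLine, const char *pLineEnd);

/*Count matching lines by scanning for the trailing literal 'z'
  first, and only running the expensive full match on lines
  where a 'z' actually appears*/
long CountMatches(const char *pBuf, const char *pBufEnd)
{
        long Matches = 0;
        const char *p = pBuf;

        while (p < pBufEnd &&
               (p = memchr(p, 'z', pBufEnd - p)) != NULL) {
                const char *pLineStart = p;
                const char *pLineEnd;

                /*Recover the boundaries of the line holding the 'z'*/
                while (pLineStart > pBuf && pLineStart[-1] != '\n') {
                        pLineStart--;
                }
                pLineEnd = memchr(p, '\n', pBufEnd - p);
                if (pLineEnd == NULL) {
                        pLineEnd = pBufEnd;
                }

                /*Only now pay for the full match*/
                if (LineMatch(pLineStart, pLineEnd)) {
                        Matches++;
                }

                /*Resume the literal scan on the next line*/
                if (pLineEnd == pBufEnd) {
                        break;
                }
                p = pLineEnd + 1;
        }
        return Matches;
}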

Some comments from Henry Spencer, made in private email at 
the time of the DDJ publication, are relevant:

      The one real flaw I see in your code is that it uses 
      the backtracking approach, when there are superior 
      methods known -- at the cost of some extra setup time, 
      one can convert the regular expression into a form
      that permits matching with no backtracking, and this 
      is almost always faster in the match phase.

      [...] You do a very good job of keeping the 
      backtracking method's flaws under control.  Some of it 
      wouldn't generalize as well as one might like [...].  

      [...] The MatchDual business, with the RE split into 
      two pieces matched separately, is cute, although it 
      wouldn't generalize to RE syntaxes in which you can 
      apply operators to parenthesized subexpressions 
      (which means that concatenation isn't necessarily the 
      top-level operator and you can't necessarily split 
      the RE into two concatenated parts).  Some of the 
      other optimizations had similarly limited 
      applicability.

      [...] Assigning scores to strings to decide which one 
      is less likely doesn't work well if the application 
      is doing something non-standard -- foreign-language 
      text or non-text data.  (Searching DNA sequences is 
      the classic weird case; in fact, there are special 
      algorithms just for that problem.)  I don't know 
      whether it matters enough to bother having a way to 
      suppress this.

[MatchDual has been absorbed into ScanFile_SearchBuffer 
in this release.  The functionality is mostly the same; the 
new implementation is a little more straightforward.]  

So where to from here?  There are lots of issues, but here we 
go (in no particular order):

- GNU Grep is still the "Fastest Grep in the West"... but 
  its position is under serious threat, and the next 
  generation heralded by Grouse Grep is poised to 
  overrun it.  The only objection to Grouse Grep claiming 
  the crown for simple string searches (particularly where 
  case insensitivity is involved) is that Grouse cannot 
  compete on bug level, features or portability... but 
  it's only a matter of time before these appear.  More 
  widely, GNU Grep deserves the title because its DFA 
  engine is far more predictable... but it's only a matter 
  of time (perhaps 6-12 months) before the Grouse FSA can 
  be used to create a DFA engine that eclipses this as well.  

- ggrep is CERTAINLY buggy, much as I would wish it 
  otherwise.  Use it at your own risk.  E-mail me at 
  mailto:behoffski@grouse.com.au with bug reports (and 
  preferably patches), and I'll coordinate further 
  releases.  

- Using computed gotos in an undisciplined fashion is 
  frowned on, and I agree with that opinion.  However, 
  the discipline imposed by the state tables is very 
  substantial, and raises this technique to one that 
  should become mainstream in programming languages.  
  Table-driven finite-state machines have been around 
  for decades, anyway.  For example, the VAX has a 
  dedicated instruction for byte-indexed table-driven 
  jumps: CASEB.  

- The assembly optimisation implemented by my Perl script 
  could be incorporated into GCC's optimisation phases.  
  This would improve the portability of the C code, and 
  would make the optimisation more reliable.  

- Some people have advised me to add grep's DFA engine 
  to my code, so that (a) ggrep supports the full RE 
  syntax and (b) some of the performance problems can 
  be sidestepped.  I've resisted this, mainly because I 
  want the focus to remain on the table-driven 
  finite-state architecture I've presented here.  In 
  particular, a DFA enthusiast could compose a set of 
  DFA-oriented actions that can be added to the existing 
  framework.  In the long run, this will provide 
  substantially better performance than the existing DFAs 
  can produce.  I ran out of time to try this myself, sigh.  

- It's possible a handcrafted-assembly match engine would 
  outperform the C-generated one.  It's not an issue now, 
  as the flexibility of the C version is much more 
  valuable, but may be worth investigating once the dust 
  has settled a bit.  

- The remaining features of grep (e.g. leading and trailing 
  context) should be added to ggrep.  

- Virtually every module has been designed and implemented 
  to be a reusable library.  While the original 
  license/warranty statements on the 1997 sources published 
  along with the DDJ article are not explicit, they map most 
  closely to the GNU Lesser General Public License (LGPL).  
  Therefore, each and every source file included in ggrep is 
  published under this licence.  However, while the intent 
  has been reuse, little of this code has in fact been 
  applied in other projects, so your mileage may vary.  

- The table-driven nature of the search engine should lend 
  itself to superb support for foreign-language processing, 
  including Unicode.  There's no support at present, 
  although some bits have been designed with Unicode in 
  mind.  For example, the compact RE specification provides 
  16 bits to store literals (glyphs).  On a much more 
  mundane level, I've tried to match grep's locale support 
  in the implementation of things like case folding and 
  character classifications (untested, sorry).  (A hedged 
  sketch of byte-table UTF-8 handling appears after this 
  list.)  

- Byte-oriented interpreters such as Perl and Java might 
  benefit from this technique (although it's quite 
  possible that they implement it already).  

- There are many, many other areas where this architecture 
  could provide dramatic performance improvements.  For 
  example, notice the 14-cycle Huffman decoder published 
  along with the DDJ article.  

- If you can frame an application so that (a) you can 
  conveniently work using byte-sized pieces of information, 
  (b) you can compose a set of actions that describe how to 
  act on those bytes and (c) those actions can be organised 
  into a smallish set of state tables, then this 
  architecture will reward you with stunning performance.  
  1k bytes per state table gets costly if you have a 
  lot of tables, however.  (The original lodsb/xlat/jmp ax 
  code consumed only 256 bytes per table, but there were 
  substantial restrictions on the actions as a result.)  

- One possible application is analysing chess positions: 
  recognising and acting on the contents of a square can be 
  done very, very cheaply.  The tables themselves could be 
  created from text specification files, making 
  experimentation easier.  Also, there might be ways of 
  speeding up move generation.  Who knows?  
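
Following up the Unicode point above, here is a hedged 
sketch of why the byte-indexed table architecture suits 
UTF-8: one small table per state is enough to step through 
multi-byte characters one byte at a time.  No such code 
exists in this release; the names and the 
character-counting example are purely illustrative, 
malformed sequences are not rejected, and a real search 
engine would of course store action entry points (as in the 
word counter above) rather than small state numbers:

#include <stddef.h>

/*One state per "number of continuation bytes still expected"*/
enum { S_BASE, S_ONE_MORE, S_TWO_MORE, S_THREE_MORE, NUM_STATES };

static unsigned char NextState[NUM_STATES][256];

static void BuildUtf8Tables(void)
{
        unsigned int State;
        unsigned int Ch;

        for (State = 0; State < NUM_STATES; State++) {
                for (Ch = 0; Ch < 256; Ch++) {
                        if (Ch < 0x80) {
                                NextState[State][Ch] = S_BASE;       /*ASCII*/
                        } else if (Ch < 0xc0) {
                                /*Continuation byte: one fewer to go*/
                                NextState[State][Ch] =
                                   (State > S_BASE) ? State - 1 : S_BASE;
                        } else if (Ch < 0xe0) {
                                NextState[State][Ch] = S_ONE_MORE;   /*2-byte lead*/
                        } else if (Ch < 0xf0) {
                                NextState[State][Ch] = S_TWO_MORE;   /*3-byte lead*/
                        } else {
                                NextState[State][Ch] = S_THREE_MORE; /*4-byte lead*/
                        }
                }
        }
}

/*Count characters (not bytes): each return to the base state
  marks the end of one character*/
size_t CountUtf8Chars(const unsigned char *pText, size_t Len)
{
        size_t Chars = 0;
        unsigned int State = S_BASE;
        size_t i;

        for (i = 0; i < Len; i++) {
                State = NextState[State][pText[i]];
                if (State == S_BASE) {
                        Chars++;
                }
        }
        return Chars;
}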

At this point, I must pay tribute to the authors of GNU 
grep, and the other Open Source supporters: it would have 
been utterly impossible for me to write ggrep without the 
presence of superb tools like GCC, Perl, XFree86 and 
GNU/Linux.  Also, my grep code would have been much 
poorer if I hadn't had the luxury of reading the sources 
of the existing implementations.  I "doffs my titfer" in 
respect to you all, and pray that my contribution will in 
some way add to the resources available for all.  These 
improvements would be virtually impossible to achieve in 
anything other than an Open Source model.  

And now, an advert: I have other projects I would like to 
work on, that centre on trying to substantially improve 
code reuse -- things like Tracery just scratch the 
surface.  I'm trying to fund time to work on these projects 
through my contracting business (Grouse Software).  If you 
like my ideas and inventions, or maybe my emphasis on reuse, 
or even the automated test rig included in this release 
(over 2000 tests!), and you would like me to work on your 
projects for a while, contact Grouse!  

I've created a set of web pages describing Grouse Grep in 
more detail, including links to the FTP archive where you 
can download the source.  The pages are available via 
Grouse's home page:

    http://www.grouse.com.au/

Finally, a small exercise in reuse, although we are moving 
from arguably the sublime to almost certainly the 
ridiculous: Recall that we created a reusable module in 
ggrep called FastFile to present the file using mmap.  
The word count utility would benefit from this module, and 
FastFile can be configured very easily to provide what wc 
requires...

Here are the results when running the word count over an 
8MB text file which has been paged into RAM (disk I/O 
overheads would add a constant time to each program if the 
file were on disk instead of in RAM, but that's not what's 
being measured).  Times are in seconds, and speedups are 
given as the ratio of GNU wc's time to each program's time:

Time    Speedup    Program

20.8    1.0        GNU wc
15.1    1.3        Grouse FSA, Read interface, no EAH opt
14.0    1.4        Grouse FSA, Read interface, EAH opt
 9.8    2.1        Grouse FSA, FastFile i/f, no EAH opt
 8.7    2.3        Grouse FSA, FastFile i/f, EAH opt

cheers,

behoffski