String Searching

Searching for a simple string in a text file is one of the classic problems in the history of computing. The major algorithmic breakthrough was the Boyer-Moore (BM) algorithm published in 1977, and the most notable update to this algorithm has been the publication of the Tuned Boyer-Moore (TBM) algorithm in 1991 by Hume and Sunday. GNU Grep implements a variation of the Tuned Boyer-Moore search.

A quick tutorial

To start, let's briefly summarise the simple, BM and TBM searches, using as an example a search for "recent". Each search has a scan loop, where we look for a place where the string might be found, and a match loop, where we look more closely to see if we have a complete match once the scan has picked up a potential match case. Other search algorithms do not have such a clear division (e.g. Knuth-Morris-Pratt), but they're slower than the best algorithms given here, so we're skipping them in the interests of being brief.

The basic idea to speed up the search is to make the scan loop as fast as possible (both by keeping it extremely simple and streamlined, and by being smart in how we use the scan information), and to try to keep away from using the match loop. However, each line of code we add to try to make the search smarter will also make it slower, so we need to carefully balance where and how we use search information.

· Simple: We scan each byte of the buffer, looking for the first character of the string (r). Whenever we find an "a", we try to match the remainder of the string. At best, we can run in linear time (one character test per buffer byte); in some pathological cases we can test each buffer byte many, many times (e.g. searching for "xxxxxxxxxxxxxxy" in a buffer full of "x"s.)

Here is a diagram that shows the tests performed to complete the match. We also use this example below to demonstrate other algorithms, so the buffer text has been chosen to highlight algorithm differences rather than be truly representative of normal text:

simple.jpg

· Boyer-Moore: We scan first for the last character of the string (e.g. "t"). However, instead of merely trying to match the character in the buffer, we look at the character we fetched from the buffer: if it's not in the string at all, we can instantly skip forward by the length of the string! It's quite simple and fast to build the skip table for this search, and the time spent is more than paid back because the scan loop is almost always sublinear: we typically scan only 1 in every n-1 characters for an n-length string. In the example, if the buffer character is "x", we can immediately skip forward 6 characters. If we find a character from earlier in the string, we can still skip forward some places, e.g. the skip for "r" is 5, "c" is 3, "e" is 2 and "n" is 1. We set the skip for "t" to 0 since this is the target of the scan loop, and only enter the match loop when we find this character. The scan loop is often called the skip loop.

· Tuned Boyer-Moore: While the BM scan loop is amazingly fast, the match loop is still very slow. An additional guard check added to the scan loop helps filter out non-matching positions without incurring the overhead needed to set up the full match loop. The result is that we spend more time in the scan loop and less time in the match loop, and enjoy an increase in speed. There is some black magic that can go into the selection of the guard character, but the choice typically depends on knowing the characteristics of the data be searched, which in the case of an all-round search tool such as grep is too hard to implement. GNU Grep chooses the second-to-last character of the string as the guard ("n" in the example).

tunedbm.jpg

(The example pictured above isn't at all kind or fair to TBM, where the guard is usually much less likely to match. We use it merely to highlight differences between TBM and STBM below, rather than be truly representative of performance.)

TBM also optimises the scan loop by unrolling the inner loop, since typically there are many skips made for each time the scan loop locates the last character of the string. Since the target character has a skip value of 0, the unrolled code will stay on the matching code until we check to see if we've found it. We can't take the unrolling too far, however, as otherwise all the skip operations which don't advance the search turn into an overhead of their own.

Many, many thanks must go to Hume and Sunday for their 1991 paper, "Fast String Searching", especially for providing such a clear and concise taxonomy for describing BM-style algorithms.

However, I've discovered a few improvements to the Tuned Boyer-Moore search implementation in GNU Grep. Many (but not all) of these improvements come because my search engine configures the end of the search buffer to match the last character of the string, and so end-of-buffer checks can be eliminated from the main scan loop. In effect, I've merely chosen different engineering trade-offs: I've optimised more aggressively for speed, at the expense of providing code that can be reused for searching less-well-conditioned buffers.

Case-insensitive searches

A huge surprise is that Grouse Grep performs 10-25% faster than GNU Grep when performing case-insensitive searches on simple strings. As with the buffer conditioning case above, this came about mainly from a engineering trade-off: I found that GNU Grep bailed out of using a TBM search if case insensitivity was present, and instead used a more general multi-word matcher (but unfortunately which runs substantially slower than TBM).

But you can use a Boyer-Moore style search for case-insensitive searches: I'd already published this in the original DOS version of Grouse Grep, and some tests showed that the GCC/Linux version of the Grouse FSA was already 10-15% faster than GNU Grep. Producing a TBM-style search was preferable to using the Grouse FSA as (a) the TBM code is much more compact (so it's easier to reuse), (b) TBM has substantially lower search setup costs, and (c) TBM was likely to run faster as the inner loop of the TBM engine requires approximately 3 instructions per skip, whereas the Grouse FSA typically takes 4.

The key to the case-insensitive TBM search is the skip table: It's trivial to modify this table during the setup so that both uppercase and lowercase versions of letters receive the same skip value. Once this modification has been made, the unrolled scan loop will execute correctly and be nearly as efficient as when performing a case-sensitive search. However, when testing the guard character, and when matching the remainder of the string, the buffer text must be translated to fold uppercase letters into lower case, as this is the cheapest way to achieve case-insensitive matching. While this folding table lookup is very cheap to add to the code, it still slows down the TBM code, so the STBM module publishes both a case-sensitive and a case-insensitive version of the search function, so the client may obtain the best performance.

Self-Tuning Boyer-Moore

Another surprise is that I've discovered a minor improvement to the Tuned Boyer-Moore algorithm, which I've called the Self-Tuning Boyer-Moore algorithm. This improvement is notable in that it can avoid most pathological cases that can slow down TBM, yet its normal run time is usually identical or even fractionally faster. I made this discovery on 8 April 1998.

The improvement revolves around the guard character. Recall that TBM has the BM scan loop and match loop, but adds a guard character to reduce the number of times the slower match loop is attempted. Recall also that there's freedom to choose any character of string as the guard, but that general-purpose searches such as GNU Grep merely choose the second-to-last character as the guard as it's simple and effective.

There is some benefit in choosing a different guard in some circumstances, although some of the benefit is lost as the match loop becomes more costly where the guard is somewhere in the middle of the string -- we have to either test the guard twice if the match uses a single loop, or set up two loops. This, then is one reason why GNU Grep chooses the second-to-last character: It allows the code to operate using just a single match loop.

One suggestion from Hume and Sunday is that static character frequencies be used to "tune" the guard position, so that the character least likely to appear in the file is chosen as the guard -- the less the guard character matches, the less the code enters the slow match loop. The success of this technique depends heavily on knowing the characteristics of the text being searched. Again, in a general-purpose grep tool, no "best" character frequency distribution can be assumed, and in any case, the analysis to find or guess the correct frequencies may substantially slow down the program.

It turns out that the appropriate character frequencies to use when choosing the character least likely to match as the guard are more complex than a simple static distribution, because the guard test is performed when the scan loop has matched the last character of the string. Therefore, the true probability distribution to use depends on the last character, and on the offset from that position to the guard position. For example, when searching /usr/dict/words for "tree", the best guard character to choose -- given that we've just matched the final "e" -- is in fact the preceding "e", despite "e" being the most common letter in the file.

The true conditional probability seems far too complex to implement... except that I've found you can obtain it virtually for free! This is because there is a piece of information not exploited by TBM that can help tune the guard: The last mismatch position found by the match loop. It costs almost nothing to acquire this information, as it's a by-product of a loop that must be executed whenever a match is attempted. All the activity that occurs with both the guard and the match loop happen after the last character has matched the scan loop, so any information derived from the match is automatically conditional on the last character matching. The effect on the scan loop is very small, although this may depend to some extent on whether the guard offset can be held in a register.

The result is a guard test that tunes itself to the data being searched. This is because a poor guard will very quickly pass control to the match loop, which will result in a different guard being selected. However, once a good guard character is found, the fast scan loop may scan large portions of the buffer before finding a potential match candidate.

We have one more addition to the TBM search to bring STBM to life: We need to give each character of the string an equal chance of becoming the guard. Otherwise, a series of relatively poor guard characters near the end of the string may alternate as the guard, as starting the match loop from the same place each time gives some characters an unwanted priority over others. In addition, we would prefer not retesting the guard character in the match loop, as this also slows us down. So we use a round-robin scheme to organise how we search the match: we perform the match in two loops, first from the guard down to the start of the string, and then from the end of the string back to the guard.

Implementing the match with the round-robin discipline imposes a cost on the search: the match loop must set up two loops instead of one. Some experiments suggest that STBM is not worth using for short strings (length 4 or less) -- TBM's simplicity wins out. But in most cases, the extra efficiency granted to the guard means that this cost is cancelled out -- we can afford a slightly slower match loop if we use it less often.

selftbm.jpg

Finally, the self-tuning nature of STBM can pay another dividend -- we can eliminate some of the (fairly rare) pathological cases that slow down TBM. For example, searching for "yxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" in a file full of "x"es is extremely slow -- unless you choose the "y" as the guard. In this case, Grouse Grep, using STBM, is over 10 times faster than GNU Grep. STBM has far fewer pathological cases, thanks to its dynamic self-tuning nature. This is perhaps the strongest argument for using STBM in all cases (and this is in fact how Grouse Grep is implemented).

Buffer Configuration

I've taken the liberty of assuming that the search may address memory some distance past the end of the buffer without incurring any penalty. This would be a problem if the buffer happened to end upon some critical boundary defined by the memory map, but in Grouse Grep this problem is explicitly removed as FastFile provides a memory-mapped file buffer with guaranteed data appended to the end of the buffer. Again, this is a trade-off I've chosen as part of writing a search explicitly for GNU/Linux (and for other modern systems with a large and linear memory map).

The buffer end check has been eliminated altogether from the scan loop of the STBM and TBM algorithms, by appending characters to the end of the buffer. This is because we check for buffer end each time the last character of the string matches before testing the guard, and so by adding n end characters to the buffer for an n-length string, we can guarantee that this check will be executed and searching terminated as soon as the end of the buffer is reached. This is a worthwhile speedup, but needs to be amended (by adding an end check into the inner scan loop) if the code is reused in situations where the application does not have the luxury of such control over memory.

Code References

The code for the Self-Tuning Boyer Moore algorithm has been separated into a separate reusable module, stbm.c. Note that the code is optimised for searching preconfigured buffers: there is no buffer end check in the scan loop. This check needs to be added in most cases where STBM is reused, as the luxury of configuring the buffer end is fairly rare. This module also includes a Tuned Boyer-Moore search routine, so that the different algorithms can be compared in an identical environment. A shim (STBMShim ) is used to adapt the general-purpose STBM interface to the format required by GGrep.

Several optimisation switches are provided to select the string searching algorithm:

-OB Use STBM or TBM if appropriate (default)

-Ob Use Grouse FSA (table-driven finite-state machine)

-OT Use STBM (default)

-Ot Use TBM

The Grouse FSA (MatchEng/RETable /MatchGCC/TblDisp ) implements a wider Boyer-Moore-style search that features the ability to use BM methods even where classes are present (for example, Grouse Grep is over twice as fast as GNU Grep when searching for "[abc][def][ghi][jkl] " in a large text file). Preparation for this search is mostly contained in RETable_ScanByBoyerMoore , and uses SKIPn actions defined in MatchEng and implemented in MatchGCC_Match .