RETable takes a RE specification from RegExp and builds a state machine using the machine specification in MatchEng . This machine is used by MatchGCC to execute the search.
Building the tables is implemented in three phases:
· Each RE component is expanded into its own state table, working from first to last component,
· The tables are scanned from last to first, looking for optimisations between pairs of tables, and
· A match start state is added to the front, to scan for potential match positions within the buffer.
In addition to the tables themselves, RETable also writes a short summary of the contents of each table in a "flags" list. This list is used to direct the optimisation of the tables, since the nature of each table becomes difficult or impossible to infer after expansion. These flags contain some similarities to the types and modifier fields used by RegExp, but capture the information in a format tailored for RETable's own needs.
Expanding the state tables for literals, classes and all-but classes is straightforward: We initialise each table to NO_MATCH , then change the action for each matching character to ADVANCE (including catering for case insensitivity if selected). Finally, we apply any iteration specified to the table, including +, * and ? modifiers. For example, if a table has "* " iteration, we change each ADVANCE entry to AGAIN_PUSH_ADVANCE , and change NO_MATCH entries to BACK_AND_ADVANCE .
The optimisation phase analyses the tables right-to-left. This is done so that optimisations can ripple (cascade) through multiple tables in some cases. The classic optimisation case is iteration, such as searching for " x.*y": During the " .*" state, we can eliminate pushing a backtrack reference to the next state if we can see that it won't match: in the case of "x.*y ", this means that only the entry for "y " contains AGAIN_PUSH_ADVANCE ; all other instances of this action are optimised to become AGAIN since the next table has NO_MATCH in the corresponding entry. This represents an enormous saving in backtracking effort, and lifts the performance of the NFA to near or even past the DFA engine in some circumstances. Sadly, while we can often avoid the NFA's weaknesses, we can't always, and then the performance of the engine degrades exponentially.
The match start state scans the buffer looking for text that matches the leftmost components of the RE. The scan is optimised to use Boyer-Moore string searching or CPU byte scanning if appropriate. If line numbering is requested, this function can be executed in parallel to byte searching for a very low cost. The Boyer-Moore code is also interesting as it extends the concept to include classes, so strings of classes such as "[abc][def][ghi][jkl] " run extremely quickly (Grouse Grep was nearly 2.4 times faster than GNU grep in one test of counting lines with this pattern in a large text file).
Public routines:
Init -- Prepare module for operation Start -- Begin managing what has to be managed AllocBacktrack -- (Re-)Allocate space for backtracking Expand -- Convert "compact" RE codes to optimised table form
Private routines:
SetState -- Write actions and flags for state Create -- Acquire resources and initialise memory WordMetaState -- Set up entries to test word characteristics AddMatchChars -- Write action for chars that match RE code ApplyIteration -- Apply RE iteration modifier to tables AnchoredToEnd -- Set up tables where RE is merely end anchor ScanForLineStart -- Search for LF for anchored search AnalyseStart -- Look for easy(ish) components at start of RE ScanByBoyerMoore -- Set up start state to use BM ScanByByteSearch -- Set up start state to use CPU instruction ScanSequentially -- Set up start state to scan buffer PrepareStart -- Set up state to scan buffer for match IterationOpt -- Use following table to optimise backtracking ZeroTripOpt -- Use following table to optimise backtracking MetaMatchOpt -- Optimise zero-with (meta) tables TrimWordAdvance -- Trim table since next comes a word test ZeroTripOptEnd -- Eliminate redundancy where z-trip at end OptimiseTables -- Optimise by comparing table activities WordFinalState -- Prepare success state for word match CopyLFtoCR -- Make CR serve as a line terminator