FastFile

The search engine in modern grep programs (including GNU Grep and Grouse Grep) is a very, very fast and efficient program. These search engines can often analyse the file as quickly as the operating system can present the data for analysis. The normal file access primitives provided by the standard I/O library, while extremely concise, efficient and effective for less demanding programs, become a significant bottleneck when used with these ultra-efficient search engines.

These search engines get much of their speed by ignoring most of the facilities offered by the standard I/O library for dealing with the file in a civilised fashion -- these facilities are valuable but they slow down the code. This is because one of the basic goals of a grep program is to find lines that match the specified pattern, while spending as little effort as possible on lines that don't match.

The practical upshot of these performance goals is that the grep program loads the raw file data into a fairly large memory buffer, and scans the memory for the pattern. The scan includes dealing with line separators that appear in the data. To some extent, this makes the program less portable, as different operating systems may choose different ways of describing how the file is divided into lines. For example, three major line delimiter styles are Unix (an LF character at the end of each line), CP/M/MS-DOS/Windows ( CR/LF characters at the end of each line) and VMS (line structure encoded internally by RMS). Both GNU Grep and Grouse Grep are based on the LF line termination style, and both include some support for the CR /LF style. (Note that Grouse Grep's search engine is more sophisticated than GNU's in dealing with CF/ LF termination: It is able to handle mixtures of LF and CR/LF line termination seamlessly on-the-fly, even where both styles are mingled in the one file, and it is usually faster, as it doesn't need to execute an expensive file preprocessing loop to do so.)

GNU/Linux and other Unix systems provide a very low-cost interface for presenting the raw file data as a memory buffer: mmap(2) . This interface lets grep obtain the file data in a memory buffer with very little overhead.

Only problem is, mmap doesn't quite do what grep requires. Mmap presents the file data in a buffer whose size is based on the page size of the operating system; the grep search engines are based on matching lines, and require that the boundaries of the memory buffer match line boundaries (i.e. the start of the buffer is also the start of a line, and the end of the buffer is also the end of a line). The code in GNU Grep does an excellent job of resolving this conflict, which is not trivial, as setting up the memory buffers correctly requires care and attention. However, this code is mixed in with the other grep code (it's in the middle of grep.c ), and interacts with the search engine in one or two spots, and so is hard to reuse.

So, finally, we come to the reason for the existence of FastFile: It seeks to take the code in GNU Grep that maps the file into a buffer, and presents it as a simple, efficient, reusable module. It also seeks to hide mmap : Other operating systems may have similar facilities to present raw file data in a memory-mapped buffer, but may use a different interface. (When behoffski was writing the original MS-DOS version of Grouse Grep, he though about writing a double-buffered file memory mapping scheme so that the file reads could happen concurrently with the search (e.g. by using DMA) -- but this was far too complicated and almost certainly not portable.)

As with other parts of Grouse Grep, the module has only been used in restricted circumstances, so while the goal is to provide a highly reusable facility, the current implementation may fall short of this in both features and bug level. I beg your forgiveness of any difficulties you encounter, and ask that you help debug and improve the code so that it can live up to its promise.

The process of lopping off the partial last line of each buffer and prepending it to the start of the next buffer has been named feedover . This is one of many aspects of FastFile that can be traced using Tracery (providing that Tracery support has been compiled in).

FastFile does more than merely present the raw file data in line-oriented buffers: The memory buffer is extended to include leading and trailing memory blocks as specified by the client. The additional blocks let the application simplify and streamline its buffer handling code, which in turn leads to higher performance. For example, a LF is prepended to the start of the buffer: This means that anyone can ignore buffer issues when scanning for the start of a line. Another example is Boyer-Moore style string searches such as the search for " Unix" which looks first for the "x" at the end of the string: Instead of adding costly buffer-end checks into the innermost search loop, we can instead append "xxxx " to the buffer, and test for buffer end only where the " x" has been found.

Public routines:

Init -- Prepare module for operation

Start -- Begin managing what has to be managed

ReadFunction -- Receive function to read bytes from a file

Open -- Provide optimised access using specified file

Reopen -- Close existing file and open new file for access

StartCondition -- Specify how to set up start of buffer

EndCondition -- Specify how to set up end of buffer

Read -- Fetch next buffer of file (if any)

Stats -- Report file stats (from fstat(2))

TraceryLink -- Tell Tracery how to deal with us

Footnote: This section was written when GNU Grep 2.3 was current. Version 2.4 was released in December 1999... and it dispenses with mmap , and uses read instead! Even so, the value added by FastFile means that it is a worthwhile module regardless of the underlying implementation. Since Grouse Grep is currently restricted to being a GNU/Linux-based program, which supports mmap(), I haven't modified the code to operate in a comparable fashion to the latest version. Some quick tests suggest 2.3 runs much faster than 2.4 under Linux, so I've used 2.3 as the benchmark.