Regular Expressions As Implemented By BARF (The Awesome Kind)

I would recommend reading Generic Regular Expressions (The Boring Kind) for background on regular expressions before proceeding.

Necessarily Escaped Characters

In order to represent non-printable characters escape codes must be used. The printable characters (see the manpage for wctype) are defined by the current locale (BARF uses specifically ASCII). For ASCII, the printable characters are characters in the range from space (ASCII value 32), to ~ (tilde; ASCII value 126). Some terminals are able to print characters in the extended ASCII character set (values between 128 and 255), but for the sake of portability, these "extended" characters will be considered unprintable (and thus require an escape code).

An escaped character consists of a backslash followed by a single character (or formatted sequence in the case of hexadecimal escape characters). For example, a newline is represented as \n while \xF3 represents the ASCII character with numeric value 0xF3. The necessarily escaped characters are:

It is an error to have a single backslash at the end of a regular expression.

Atom-Context Normal And Special Characters

In the context of atoms (i.e. in the non-bracket-expression body of a regex), the following characters have special meaning, and must be escaped to be used literally.

All other printable characters (see Necessarily Escaped Characters) have no special meaning, and can be used directly, each accepting itself literally. Any normal character can be escaped, and unless it is one of those listed in Conditionals In BARF Regular Expressions, it will remain unchanged. Non-printable characters will be ignored. For example, a literal newline character within a regular expression will have no effect; it will be as if the newline didn't exist.

Some implementations of regexes have caveats about when certain special characters can be used as normal characters without escaping (such as allowing ) as a normal character in the atom-context of a POSIX regex). This is entirely avoided in BARF for purposes of simplicity and consistency. The rule is that any special character in the applicable context must be escaped to use literally. If this is ever not the case, it is a bug in BARF.

Bracket-Expression-Context Normal And Special Characters

In the context of bracket expressions, the following characters have special meaning, and must be escaped to be used literally.

Just like in the atom context, all other printable characters (see Necessarily Escaped Characters) have no special meaning, and can be used directly, each accepting itself literally. In the context of bracket expressions, there are no special escaped characters such as the conditionals described in Conditionals In BARF Regular Expressions. Escaping any character in a bracket expression will cause it to accept itself literally. The necessarily escaped characters such as hexadecimal escape characters, \t (tab), \n (newline), etc, accept themselves as would be expected. Non-printable characters will be ignored.

Some implementations of regexes have caveats about when certain special characters can be used as normal characters without escaping (such as allowing ] if it is the first character, possibly following a ^, as a normal character in the bracket-expression-context of a POSIX regex). This is entirely avoided in BARF for purposes of simplicity and consistency. The rule is that any special character in the applicable context must be escaped to use literally. If this is ever not the case, it is a bug in BARF.

Conditionals In BARF Regular Expressions

In addition to the ^ and $ (beginning and end of line) generic regex conditionals, BARF provides several others in the form of escaped characters. They are the following.

Example Regular Expressions

Here are some examples illustrating the usage of the forms described above, as they may be unfamiliar to someone used to a different implementation of regexes (e.g. grep's POSIX regexes).

Finite Automaton Generation

This section describes how BARF converts a regular expression into the finite automaton which accepts it. The final result is a DFA (which is the easiest FA to implement and the fastest to run). This happens in the following stages.

All the regular expression facilities of BARF are contained within the Barf::Regex namespace, all the files of which are located in the lib/regex directory.

Parsing Regular Expressions

BARF performs parsing of regular expressions using an LALR(1) grammar parser class (Barf::Regex::Parser) generated by trison. The grammar source file is barf_regex_parser.trison and the AST classes it uses are defined in the files barf_regex_ast.hpp and barf_regex_ast.cpp . Here is the AST resulting from parsing the regex ^/{2}.*|\n|"[^"]*" -- that is, one which accepts C++ style comments which start at the beginning of a line, newlines, or simple C-style string literals without any escaped characters.

inline_dotgraph_9

If an error is encountered, an exception is thrown -- this is done because the regex facilities in BARF are used as utility functions from within other applications, and to avoid uncontrolled printing of error messages, errors are indicated via exceptions.

NFA Generation

Once a regular expression has been successfully parsed and exists in AST form, an NFA can be generated by walking the AST recursively, generating sub-NFAs for each sub-regex and constructing certain forms of NFA for each construct. The generic Barf::Graph class and its subordinates (which are defined in barf_graph.hpp and barf_graph.cpp) are used to represent the NFA (as well as all FAs in BARF). The start and accept states are labeled, and each node is numbered, for reference later in the corresponding DFA.

inline_dotgraph_10

Again, since the regex facilities in BARF are used as utility functions from within other applications, and each application requires a certain amount of customizability in the use of the generated NFAs, the NFA-generating code does not create the NFA's start or accept states -- the client application must provide these. The client application must also take care to keep track of the provided start and accept states, as they are effectively the only point of entry/exit for the NFA.

DFA Generation

Once an NFA has been generated, the modified subset construction algorithm (TODO: make into ref) is used to generate an equivalent DFA. Each DFA state is actually a set of NFA states (hence the name "subset construction algorithm"). The start and accept states are labeled and each node is numbered, suffixed by ": DFA". The corresponding set of NFA states is also indicated, suffixed by ": NFA".

inline_dotgraph_11

Hosted by SourceForge.net Logo -- Generated on Mon Jan 7 22:58:02 2008 for BARF by doxygen 1.5.1