Regular Expressions As Implemented By BARF (The Awesome Kind)

I would recommend reading Generic Regular Expressions (The Boring Kind) for background on regular expressions before proceeding.

Necessarily Escaped Characters

In order to represent non-printable characters, escape codes must be used. Which characters are printable (see the manpage for wctype) is defined by the current locale; BARF specifically uses ASCII. For ASCII, the printable characters are those in the range from space (ASCII value 32) to ~ (tilde; ASCII value 126). Some terminals are able to print characters in the extended ASCII character set (values between 128 and 255), but for the sake of portability, these "extended" characters are considered unprintable (and thus require an escape code).

An escaped character consists of a backslash followed by a single character (or a formatted sequence in the case of hexadecimal escape characters). For example, a newline is represented as \n while \xF3 represents the character with numeric value 0xF3. The necessarily escaped characters are:

\a (bell character)
\b (backspace)
\t (tab)
\n (newline)
\v (vertical tab)
\f (form feed)
\r (carriage return)
\x# (hexadecimal character literal) -- # represents a string of at least one hexadecimal digit [0-9a-fA-F], giving the value of an unsigned hexadecimal integer. It is an error if \x is not followed by any hexadecimal digits. The value string may technically be any length, but values above \xFF will be truncated. It is preferred to use exactly two hexadecimal digits, matching the value space of the 8-bit value commonly used for a character. Any value, including printable characters which otherwise don't need any escaping, may be represented in this format.

It is an error to have a single backslash at the end of a regular expression.
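
The escape rules above can be sketched in a few lines of Python. This is only an illustration, not BARF's code: the name decode_necessary_escapes and the use of ValueError are hypothetical, and "truncated" is assumed to mean "keep the low 8 bits". The sketch follows the list above literally, so \b decodes to backspace; in the atom context \b is documented as a conditional instead, so the sketch fits the bracket-expression context best.

```python
# Hypothetical helper, not part of BARF: decodes the escape codes listed above.
SIMPLE_ESCAPES = {
    "a": "\a", "b": "\b", "t": "\t", "n": "\n",
    "v": "\v", "f": "\f", "r": "\r",
}
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_necessary_escapes(text):
    """Return `text` with each escape code replaced by the character it names."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 == len(text):
            raise ValueError("single backslash at end of regular expression")
        code = text[i + 1]
        i += 2
        if code == "x":
            start = i
            # Consume every hexadecimal digit that follows \x.
            while i < len(text) and text[i] in HEX_DIGITS:
                i += 1
            if i == start:
                raise ValueError("\\x must be followed by a hexadecimal digit")
            # Assumption: truncation keeps only the low 8 bits of the value.
            out.append(chr(int(text[start:i], 16) & 0xFF))
        else:
            # Any other escaped character simply stands for itself here.
            out.append(SIMPLE_ESCAPES.get(code, code))
    return "".join(out)
```

Because the digit string after \x has no upper length limit, a hexadecimal escape directly followed by a literal character in [0-9a-fA-F] will absorb that character into the value, which is worth keeping in mind when writing such escapes by hand.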

Atom-Context Normal And Special Characters

In the context of atoms (i.e. in the non-bracket-expression body of a regex), the following characters have special meaning, and must be escaped to be used literally.

( and ) (parentheses) -- grouping delimiters
{ and } (curly braces) -- bound delimiters
[ and ] (square brackets) -- bracket expression delimiters
| (pipe) -- alternation operator
? (question mark) -- indicates 0 or 1 of the previous atom
* (asterisk) -- indicates 0 or more of the previous atom
+ (plus) -- indicates 1 or more of the previous atom
. (period) -- matches any character except newline
^ (caret) -- matches the empty string at the beginning of a line
$ (dollar sign) -- matches the empty string at the end of a line
\ (backslash) -- used for escaping characters

All other printable characters (see Necessarily Escaped Characters) have no special meaning, and can be used directly, each accepting itself literally. Any normal character can be escaped, and unless it is one of those listed in Conditionals In BARF Regular Expressions, it will remain unchanged. Non-printable characters will be ignored. For example, a literal newline character within a regular expression will have no effect; it will be as if the newline didn't exist.

Some implementations of regexes have caveats about when certain special characters can be used as normal characters without escaping (such as allowing ) as a normal character in the atom-context of a POSIX regex). This is entirely avoided in BARF for purposes of simplicity and consistency. The rule is that any special character in the applicable context must be escaped to use literally. If this is ever not the case, it is a bug in BARF.
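
Since the rule is uniform, turning an arbitrary literal string into an atom-context regex is mechanical. The following Python sketch shows one way to do it; the function name escape_for_atom_context is hypothetical and not part of BARF, and rejecting non-printable characters (rather than silently passing them through, where they would be ignored) is a choice made for this example.

```python
# Hypothetical helper, not part of BARF.
# The atom-context special characters listed above.
ATOM_SPECIAL = set("(){}[]|?*+.^$\\")


def escape_for_atom_context(literal):
    """Backslash-escape every atom-context special character in `literal`."""
    out = []
    for ch in literal:
        # Printable ASCII runs from space (32) to tilde (126).
        if not 32 <= ord(ch) <= 126:
            raise ValueError("non-printable character needs an escape code: %r" % ch)
        out.append("\\" + ch if ch in ATOM_SPECIAL else ch)
    return "".join(out)
```

For instance, escape_for_atom_context("are my favorite.") yields the text "are my favorite\." with only the period escaped.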

Bracket-Expression-Context Normal And Special Characters

In the context of bracket expressions, the following characters have special meaning, and must be escaped to be used literally.

[ and ] (square brackets) -- character class (and bracket expression) delimiters
- (hyphen) -- character range operator
^ (caret) -- bracket expression negation operator
\ (backslash) -- used for escaping characters

Just like in the atom context, all other printable characters (see Necessarily Escaped Characters) have no special meaning, and can be used directly, each accepting itself literally. In the context of bracket expressions, there are no special escaped characters such as the conditionals described in Conditionals In BARF Regular Expressions. Escaping any character in a bracket expression will cause it to accept itself literally. The necessarily escaped characters such as hexadecimal escape characters, \t (tab), \n (newline), etc., accept themselves as would be expected. Non-printable characters will be ignored.

Some implementations of regexes have caveats about when certain special characters can be used as normal characters without escaping (such as allowing ] if it is the first character, possibly following a ^, as a normal character in the bracket-expression-context of a POSIX regex). This is entirely avoided in BARF for purposes of simplicity and consistency. The rule is that any special character in the applicable context must be escaped to use literally. If this is ever not the case, it is a bug in BARF.
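
The bracket-expression context has a different, smaller set of special characters, so the same literal text is escaped differently there. As a rough Python sketch (the function bracket_expression and its negate parameter are hypothetical, not part of BARF), a bracket expression accepting any one character from a given collection could be built like this:

```python
# Hypothetical helper, not part of BARF.
# The bracket-expression-context special characters listed above.
BRACKET_SPECIAL = set("[]-^\\")


def bracket_expression(chars, negate=False):
    """Build a bracket expression accepting (or, if negated, rejecting) `chars`."""
    body = "".join("\\" + ch if ch in BRACKET_SPECIAL else ch for ch in chars)
    return "[" + ("^" if negate else "") + body + "]"
```

Here bracket_expression("[]-^") gives [\[\]\-\^], while characters such as ( or . pass through unescaped because they are not special in this context.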

Conditionals In BARF Regular Expressions

In addition to the generic regex conditionals ^ and $ (beginning and end of line), BARF provides several others in the form of escaped characters. The complete set is the following.

^ (caret) -- the generic regex special character which accepts the empty string at the beginning of a line
$ (dollar sign) -- the generic regex special character which accepts the empty string at the end of a line
\b -- accepts the empty string at a word boundary (i.e. the previous character matches [a-zA-Z0-9_] and the next character doesn't, or vice versa)
\B -- is the opposite of \b in that it accepts the empty string anywhere that isn't at a word boundary (i.e. both the previous and next characters match [a-zA-Z0-9_] or they both don't)
\e -- is equivalent to $ and is included for consistency -- it accepts the empty string at the end of a line
\E -- is the opposite of $ and \e in that it accepts the empty string anywhere that isn't the end of a line
\l -- is equivalent to ^ and is included for consistency -- it accepts the empty string at the beginning of a line
\L -- is the opposite of ^ and \l in that it accepts the empty string anywhere that isn't the beginning of a line
\y -- accepts the empty string at the beginning of input (e.g. at the beginning of the input file)
\Y -- is the opposite of \y in that it accepts the empty string anywhere that isn't the beginning of input
\z -- accepts the empty string at the end of input (e.g. at the end of the input file)
\Z -- is the opposite of \z in that it accepts the empty string anywhere that isn't the end of input
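
To make the pairing of each conditional with its opposite concrete, here is a small Python sketch that reports which escaped conditionals would accept the empty string at a given position. It is not BARF's implementation. The function name conditionals_at is hypothetical, and the edge cases are assumptions: the beginning of input is treated as a beginning of line, the end of input as an end of line, and a missing neighbor at either edge as a non-word character.

```python
# Hypothetical helper, not part of BARF.
WORD_CHARS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789_"
)


def conditionals_at(text, pos):
    """Return the escaped conditionals that hold between text[pos-1] and text[pos]."""
    prev_ch = text[pos - 1] if pos > 0 else None
    next_ch = text[pos] if pos < len(text) else None
    at_input_start = prev_ch is None
    at_input_end = next_ch is None
    # Assumption: input edges also count as line edges.
    at_line_start = at_input_start or prev_ch == "\n"
    at_line_end = at_input_end or next_ch == "\n"
    # A word boundary is where exactly one neighbor matches [a-zA-Z0-9_].
    at_word_boundary = (prev_ch in WORD_CHARS) != (next_ch in WORD_CHARS)
    pairs = [
        (at_word_boundary, "\\b", "\\B"),
        (at_line_end, "\\e", "\\E"),
        (at_line_start, "\\l", "\\L"),
        (at_input_start, "\\y", "\\Y"),
        (at_input_end, "\\z", "\\Z"),
    ]
    # Exactly one member of each pair holds at any position.
    return {holds_name if holds else opposite for holds, holds_name, opposite in pairs}
```

Every position satisfies exactly one conditional from each lowercase/uppercase pair, which is why the result always has five members.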

Example Regular Expressions

Here are some examples illustrating the usage of the forms described above, as they may be unfamiliar to someone used to a different implementation of regexes (e.g. grep's POSIX regexes).

Accepts "ostrich" and "head" separated by a tab character.
```
ostrich\thead 
```
Alternate version of the above example -- 0x09 is the hex value for the ASCII tab character.
```
ostrich\x09head
```
Accepts "Content-Type: text/plain" followed by 2 newline characters.
```
Content-Type: text/plain\n\n 
```
Accepts the string "HIPPO".
```
\x48\x49\x50\x50\x4F 
```
This is an error because ] is a special character and must be escaped to use in the atom context.
```
] 
```
This is the correct form of the above attempt, which accepts the string "]".
```
\] 
```
Accepts the string "(){}[]|?*+.^$\".
```
\(\)\{\}\[\]\|\?\*\+\.\^\$\\
```
Accepts "hippos are my favorite.", "ostriches are my favorite." or "dromedaries are my favorite." (notice the escaped period at the end).
```
(hippos|ostriches|dromedaries) are my favorite\. 
```
Accepts any string of length 10 not containing a newline.
```
.{10} 
```
Accepts any string of even length consisting only of decimal digits (including the empty string).
```
([0-9][0-9])* 
```
Alternate form of the above example.
```
([0-9]{2})* 
```
This is an erroneous form of the above example -- a bound must not directly follow a bound. This is a limitation of the grammar, intended to reduce confusion with the currently unimplemented syntax for non-greedy matching (a question mark after a bound).
```
[0-9]{2}* 
```
Accepts the string "donkey" if it spans the entire line from beginning to end.
```
^donkey$ 
```
Alternate form of the above example, using BARF's equivalent conditional escape codes.
```
\ldonkey\e 
```
Accepts the string "donkey" as long as it doesn't occur at the beginning or end of the line.
```
\Ldonkey\E 
```
Accepts any character.
```
.|\n 
```
Accepts any string not containing a decimal digit.
```
[^0-9]* 
```
Alternate form of the above example, using a character class.
```
[^[:digit:]]* 
```
Non-bracket-expression regex which accepts any string containing only any of the atom-context special characters (e.g. "$$^^())([][]{}...$$\\$").
```
(\(|\)|\{|\}|\[|\]|\||\?|\*|\+|\.|\^|\$|\\)*
```
Bracket expression form of the above example. Note which characters are escaped and which are not.
```
[(){}\[\]|?*+.\^$\\]*
```
Non-bracket-expression regex which accepts any string containing only any of the bracket-expression-context special characters (e.g. "--[--[]-^^--\\\\--]["). Note which characters are escaped and which are not.
```
(\[|\]|-|\^|\\)*
```
Bracket expression form of the above example.
```
[\[\]\-\^\\]*
```
Accepts the word "LOL" typed at any and every level of enthusiasm (i.e. followed by zero or more exclamation points).
```
LOL!* 
```
Slightly more enthusiastic version of the above example, which also accepts "LOLOL", "LOLOLOL", and so on.
```
L(OL)+!* 
```
The most enthusiastic version yet, which requires at least one each of "!", "1" and "one" (e.g. "LOLOL!!!111oneone").
```
L(OL)+!+1+(one)+ 
```

Finite Automaton Generation

This section describes how BARF converts a regular expression into the finite automaton which accepts it. The final result is a DFA (which is the easiest FA to implement and the fastest to run). This happens in the following stages.

The regex string is parsed using an LALR(1) parser. An AST representation of the regex is the result of this process.
An NFA is generated by traversing the AST, generating sub-NFAs for each node. The starting state and set of accept states are recorded.
A DFA is generated from the NFA using the subset construction algorithm, which uses the starting state and set of accept states from the NFA. A new starting state and set of accept states are produced.

All the regular expression facilities of BARF are contained within the Barf::Regex namespace, all the files of which are located in the lib/regex directory.

Parsing Regular Expressions

BARF performs parsing of regular expressions using an LALR(1) grammar parser class (Barf::Regex::Parser) generated by trison. The grammar source file is barf_regex_parser.trison and the AST classes it uses are defined in the files barf_regex_ast.hpp and barf_regex_ast.cpp.

[Figure: the AST resulting from parsing the regex ^/{2}.*|\n|"[^"]*" -- that is, one which accepts C++ style comments which start at the beginning of a line, newlines, or simple C-style string literals without any escaped characters.]

If an error is encountered, an exception is thrown -- this is done because the regex facilities in BARF are used as utility functions from within other applications, and to avoid uncontrolled printing of error messages, errors are indicated via exceptions.

NFA Generation

Once a regular expression has been successfully parsed and exists in AST form, an NFA can be generated by walking the AST recursively, generating sub-NFAs for each sub-regex and constructing certain forms of NFA for each construct. The generic Barf::Graph class and its subordinates (which are defined in barf_graph.hpp and barf_graph.cpp) are used to represent the NFA (as well as all FAs in BARF). The start and accept states are labeled, and each node is numbered, for reference later in the corresponding DFA.

Again, since the regex facilities in BARF are used as utility functions from within other applications, and each application requires a certain amount of customizability in the use of the generated NFAs, the NFA-generating code does not create the NFA's start or accept states -- the client application must provide these. The client application must also take care to keep track of the provided start and accept states, as they are effectively the only point of entry/exit for the NFA.
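
The recursive walk can be illustrated with a textbook Thompson-style construction. The Python sketch below is not BARF's code and does not use Barf::Graph; the Nfa class, the tuple-based AST, and the functions build, closure and accepts are all hypothetical stand-ins. It does mirror the convention described above in one respect: build never creates the outermost start or accept state, so the caller supplies both.

```python
# Hypothetical sketch, not BARF's code: a textbook Thompson-style construction.
EPSILON = None  # label for transitions that consume no input


class Nfa:
    def __init__(self):
        self.transitions = []  # one list of (label, target) pairs per state

    def new_state(self):
        self.transitions.append([])
        return len(self.transitions) - 1

    def add(self, source, label, target):
        self.transitions[source].append((label, target))


def build(nfa, node, start, accept):
    """Add the sub-NFA for AST `node` between the given start and accept states."""
    kind = node[0]
    if kind == "char":        # ("char", "a")
        nfa.add(start, node[1], accept)
    elif kind == "concat":    # ("concat", left, right)
        middle = nfa.new_state()
        build(nfa, node[1], start, middle)
        build(nfa, node[2], middle, accept)
    elif kind == "alt":       # ("alt", branch, branch, ...)
        for branch in node[1:]:
            b_start, b_accept = nfa.new_state(), nfa.new_state()
            nfa.add(start, EPSILON, b_start)
            build(nfa, branch, b_start, b_accept)
            nfa.add(b_accept, EPSILON, accept)
    elif kind == "star":      # ("star", inner), i.e. zero or more
        i_start, i_accept = nfa.new_state(), nfa.new_state()
        nfa.add(start, EPSILON, i_start)
        nfa.add(start, EPSILON, accept)
        build(nfa, node[1], i_start, i_accept)
        nfa.add(i_accept, EPSILON, i_start)
        nfa.add(i_accept, EPSILON, accept)
    else:
        raise ValueError("unknown AST node kind: %r" % (kind,))


def closure(nfa, states):
    """Return all states reachable from `states` through epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for label, target in nfa.transitions[stack.pop()]:
            if label is EPSILON and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def accepts(nfa, start, accept, text):
    """Simulate the NFA on `text` by tracking the set of active states."""
    current = closure(nfa, {start})
    for ch in text:
        moved = {t for s in current for label, t in nfa.transitions[s] if label == ch}
        current = closure(nfa, moved)
    return accept in current


# The client creates and keeps track of the start and accept states.
nfa = Nfa()
start, accept = nfa.new_state(), nfa.new_state()
# AST for L(OL)+ written out as L, then OL, then (OL)*.
ol = ("concat", ("char", "O"), ("char", "L"))
ast = ("concat", ("char", "L"), ("concat", ol, ("star", ol)))
build(nfa, ast, start, accept)
```

Bounds, bracket expressions and conditionals are left out to keep the sketch short; each would be one more case in build.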

DFA Generation

Once an NFA has been generated, the modified subset construction algorithm is used to generate an equivalent DFA. Each DFA state is actually a set of NFA states (hence the name "subset construction algorithm"). The start and accept states are labeled and each node is numbered, suffixed by ": DFA". The corresponding set of NFA states is also indicated, suffixed by ": NFA".
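
For reference, the standard (unmodified) subset construction can be sketched as follows. This Python example is the textbook algorithm, not BARF's modified version, and it ignores conditionals; the dictionary representation of the NFA and the names epsilon_closure, subset_construction and dfa_accepts_text are assumptions made for the example.

```python
# Hypothetical sketch, not BARF's code: the textbook subset construction.
from collections import deque

EPSILON = None  # label for transitions that consume no input


def epsilon_closure(nfa, states):
    """Return the frozenset of states reachable via epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for label, target in nfa.get(stack.pop(), []):
            if label is EPSILON and target not in seen:
                seen.add(target)
                stack.append(target)
    return frozenset(seen)


def subset_construction(nfa, nfa_start, nfa_accepts):
    """Return (dfa, dfa_start, dfa_accepts); each DFA state is a set of NFA states."""
    dfa_start = epsilon_closure(nfa, {nfa_start})
    dfa = {}
    pending = deque([dfa_start])
    while pending:
        current = pending.popleft()
        if current in dfa:
            continue  # this subset has already been expanded
        dfa[current] = {}
        labels = {lbl for s in current for lbl, _ in nfa.get(s, []) if lbl is not EPSILON}
        for label in sorted(labels):
            moved = {t for s in current for lbl, t in nfa.get(s, []) if lbl == label}
            target = epsilon_closure(nfa, moved)
            dfa[current][label] = target
            pending.append(target)
    # A DFA state accepts if it contains at least one NFA accept state.
    dfa_accepts = {state for state in dfa if state & set(nfa_accepts)}
    return dfa, dfa_start, dfa_accepts


def dfa_accepts_text(dfa, start, accepts, text):
    """Run the DFA on `text`; a missing transition means rejection."""
    state = start
    for ch in text:
        if ch not in dfa[state]:
            return False
        state = dfa[state][ch]
    return state in accepts


# A small hand-written NFA for (a|b)*abb: state -> list of (label, target).
example_nfa = {
    0: [(EPSILON, 1)],
    1: [("a", 1), ("b", 1), ("a", 2)],
    2: [("b", 3)],
    3: [("b", 4)],
    4: [],
}
dfa, dfa_start, dfa_accepts = subset_construction(example_nfa, 0, {4})
```

Only subsets reachable from the start state are ever created, which keeps the DFA far smaller than the full power set in typical cases.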

-- Generated on Mon Jan 7 22:58:02 2008 for BARF by Doxygen 1.5.1