Generic Regular Expressions (The Boring Kind)

This page details the generic concept of regular expressions and may exclude certain implementation-specific concepts. The BARF implementation of regular expressions has certain extensions and caveats which are described fully in Regular Expressions As Implemented By BARF (The Awesome Kind).

A regular expression (regex) is a syntactical form for compactly specifying a regular language. In other words, a regex is a string which defines a set of strings which are acceptable to the machine using the regex. Examples of regular expressions:

xyz

hip{2}o

this|that

smashy( smashy)*
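
The four examples above can be tried out with Python's `re` module. This is only a rough sketch: Python's regex dialect is neither POSIX nor BARF, and the helper name `accepts` is invented for this illustration. `re.fullmatch` is used because "accepting" a string here means matching the whole string, not just part of it.

```python
import re

def accepts(regex, s):
    """Return True if regex accepts the entire string s (hypothetical helper)."""
    return re.fullmatch(regex, s) is not None

# Each entry pairs a regex from the list above with strings it should accept.
examples = [
    (r"xyz", ["xyz"]),
    (r"hip{2}o", ["hippo"]),
    (r"this|that", ["this", "that"]),
    (r"smashy( smashy)*", ["smashy", "smashy smashy", "smashy smashy smashy"]),
]

for regex, strings in examples:
    for s in strings:
        print(f"{regex!r} accepts {s!r}: {accepts(regex, s)}")
```

Strings outside the language are rejected: for instance, `accepts(r"hip{2}o", "hipo")` is false because the bound requires exactly two p's.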

Structure Of A Regular Expression

TODO: write about regexes, branches, pieces, atoms, etc -- in the context of parsing a regex

Atoms

In the context of a regex, atoms are the most basic components from which more complicated regexes are created (hence the "atom" metaphor). You can think of an atom as accepting a single character (technically the special "conditional" characters also count as atoms; they are described in Conditionals In Generic Regular Expressions). There are several forms of atoms.

A normal (non-special) character. The special characters are those used in operators, namely ( ) (parentheses), { } (curly braces), ? (question mark), * (asterisk), + (plus) and | (pipe), as well as [ ] (square brackets), ^ (caret), $ (dollar sign), . (period) and \ (backslash). Characters such as a or @ or space or \n (newline) are considered normal characters and have no special meaning. Basically any character that is not special in the current context is normal. Each normal character accepts only itself (e.g. a accepts only a, \n accepts only \n).
The match-all . (period). The period is used as shorthand for any non-newline character, and thus will accept any character except for \n (newline).
An escaped character -- a \ (backslash) followed by any character (though some implementations do not allow normal characters to be escaped). If the escaped character is a special character, then it loses its special meaning and becomes a normal character, accepting its literal value. An escaped normal character does not change, and continues to accept itself (assuming the specific implementation allows escaped normal characters). To accept a literal backslash, an escaped backslash must be used -- \\
A bracket expression is a terse way to specify a set of acceptable characters, without having to explicitly list each one. A bracket expression is enclosed by the [ ] (bracket) characters. The format of a bracket expression varies by implementation (so its full description can be found in Regular Expressions As Implemented By BARF (The Awesome Kind)), but generally they allow:
- Single characters -- the sets of normal and special characters within a bracket expression differ from those in the atom context described above. Escaping and use of normal/special characters varies by implementation. However, most of the characters that are normal in an atom are also normal in a bracket expression (the alphabetic and numeric characters, for example). This bracket expression accepts the string "j".
```
[j] 
```
  This bracket expression accepts the strings "j", "u", "n", or "k".
```
[junk] 
```
- Character ranges -- a hyphen-delimited range of characters. The numeric value of each character is indicated by the current locale (BARF specifically uses ASCII). This bracket expression accepts "a", "b", "c", "d", "e" or "f".
```
[a-f] 
```
  A character range may have only two endpoints -- the bracket expression [0-3-6] is invalid. The way to use a literal - (hyphen) within a range is implementation-specific. POSIX regular expressions indicate that it must be either the first or last character within the [ ] brackets, or the right-side (ending) character in a range (BARF implements this differently -- see Regular Expressions As Implemented By BARF (The Awesome Kind)).
- Character classes -- of the form [:classname:], character classes are predefined sets of characters which are defined by the current locale (BARF specifically uses ASCII). See the manpage for wctype for details on each.
- Set negation -- indicated by the ^ (caret) character immediately after the opening [ bracket. This serves to negate the set, causing the bracket expression to match every character except those specified within. This bracket expression accepts all single-character strings except "a", "b" and "c".
```
[^a-c] 
```
Bracket expressions can contain a series of the above-listed elements concatenated together. For example, the following bracket expression accepts the strings "a", "b", "c", "x", "0", "1", "2", "3", "4", "5", "6", "7", "8" or "9".
```
[a-cx[:digit:]] 
```
A parenthesized regular expression. Though this doesn't necessarily accept a single character, it does behave as a single atom within the context of the regex operators (see Operations).
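
The atom forms above can be checked with a short Python sketch. It assumes Python's `re` dialect, which is close to the generic description but is neither POSIX nor BARF; in particular, Python's `re` has no `[:digit:]` class, so `0-9` stands in for it here. The helper name `accepts` is hypothetical.

```python
import re

def accepts(regex, s):
    """Return True if regex accepts the entire string s (hypothetical helper)."""
    return re.fullmatch(regex, s) is not None

# A normal character accepts only itself.
normal_ok = accepts(r"a", "a") and not accepts(r"a", "b")

# The match-all period accepts any single character except newline.
dot_ok = accepts(r".", "@") and not accepts(r".", "\n")

# An escaped special character accepts its literal value.
escape_ok = accepts(r"\.", ".") and not accepts(r"\.", "x")
backslash_ok = accepts(r"\\", "\\")  # regex \\ accepts one literal backslash

# Bracket expressions: single characters, a range, and a negated set.
singles = [c for c in "junkyard" if accepts(r"[junk]", c)]   # ['j', 'u', 'n', 'k']
in_range = [c for c in "abcdefgh" if accepts(r"[a-f]", c)]   # ['a', 'b', 'c', 'd', 'e', 'f']
negated = [c for c in "abcd" if accepts(r"[^a-c]", c)]       # ['d']

# Concatenated elements; 0-9 replaces [:digit:], which Python's re lacks.
combined = [c for c in "abcdx059" if accepts(r"[a-cx0-9]", c)]

# A parenthesized regex behaves as a single atom, so + repeats all of "ab".
group_ok = accepts(r"(ab)+", "abab")

print(normal_ok, dot_ok, escape_ok, backslash_ok, group_ok)
print(singles, in_range, negated, combined)
```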

Operations

Regular expressions use a terse notation -- they wouldn't be very useful if each regex string was longer than the strings each accepts. For the sake of simplicity, the examples in this section will not use escaped characters or bracket expressions. The operations are as follows, in order of highest to lowest precedence.

Parenthesized subexpressions -- used for grouping sequences, overriding the higher precedence of other operations (as in common arithmetic). The grouped subexpression must be a regular expression in its own right. Examples:
- Trivial subgrouping which is equivalent to the regex hippo
```
(hippo) 
```
- The empty string is an acceptable regex. Accepts only the empty string.
```
() 
```
- Accepts one or more iterations of the sequence hippo -- it accepts "hippo", "hippohippo", "hippohippohippo", etc.
```
(hippo)+ 
```
- Contrast this example to the above -- because iteration has higher precedence than concatenation, only the last "o" will be repeated. This regex accepts "hippo", "hippoo", "hippooo", etc.
```
hippo+ 
```
  Equivalent to the unambiguously parenthesized regex: (hipp)((o)+)
- Accepts "what about a hippopotamus" or "what about an ostrich".
```
what about (a hippopotamus|an ostrich) 
```
  Equivalent to the unambiguously parenthesized regex: (what about )((a hippopotamus)|(an ostrich))
- Contrast this example to the above -- because concatenation has higher precedence than alternation, the entire sequence what about a hippopotamus will be alternated with the sequence an ostrich -- it accepts "what about a hippopotamus" or "an ostrich".
```
what about a hippopotamus|an ostrich 
```
  Equivalent to the unambiguously parenthesized regex: (what about a hippopotamus)|(an ostrich)
- Accepts "PirateNinja", "PirateZombie", "RobotNinja" or "RobotZombie".
```
(Pirate|Robot)(Ninja|Zombie) 
```
  Equivalent to the unambiguously parenthesized regex: ((Pirate)|(Robot))((Ninja)|(Zombie))
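
The effect of grouping on precedence can be seen by running the contrasting regexes above against one shared list of candidate strings. This is a sketch using Python's `re` module as a stand-in (not BARF); the helper `accepted` and the candidate list are made up for illustration.

```python
import re

CANDIDATES = [
    "hippo",
    "hippoo",
    "hippohippo",
    "what about a hippopotamus",
    "what about an ostrich",
    "an ostrich",
]

def accepted(regex):
    """Return the candidates that regex accepts in full (hypothetical helper)."""
    return [s for s in CANDIDATES if re.fullmatch(regex, s)]

# Grouping makes + repeat the whole sequence; without it only the last o repeats.
grouped_iteration = accepted(r"(hippo)+")   # ['hippo', 'hippohippo']
bare_iteration = accepted(r"hippo+")        # ['hippo', 'hippoo']

# Grouping limits alternation to the two noun phrases; without it the whole
# left-hand sequence is alternated with "an ostrich".
grouped_alternation = accepted(r"what about (a hippopotamus|an ostrich)")
bare_alternation = accepted(r"what about a hippopotamus|an ostrich")

# The empty parenthesized regex accepts only the empty string.
empty_group = re.fullmatch(r"()", "") is not None

print(grouped_iteration, bare_iteration)
print(grouped_alternation, bare_alternation, empty_group)
```
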
Iteration -- a right-side (postfix) unary operator. The left-side operand is repeated a number of times, the number being indicated by the right-side operator. There are several forms of this. The most generic is called a "bound" and uses a pair of non-negative integers indicating a range of acceptable iterations. There are two alternate versions of the bound notation, and three shorthand single-character operators for the commonly used "zero or one", "zero or more" and "one or more".
- The basic form of a bound is a range of non-negative integers, the left-side value being less than or equal to the right-side value. Accepts "aaa", "aaaa" or "aaaaa".
```
a{3,5} 
```
- Using zero as a range value is also acceptable, keeping in mind that the relation between the values must be respected. This regex accepts the empty string, "a", "aa" or "aaa".
```
a{0,3} 
```
- The second form of a bound leaves out the upper limit on the range. This particular regex indicates that at least 2 iterations of the operand "a" must occur for the string to be accepted. It accepts "aa", "aaa", "aaaa", etc.
```
a{2,} 
```
- Again, using zero as the lower limit for the open-ended bound is acceptable. It indicates that any number of iterations of the operand is acceptable. This regex accepts the empty string, "a", "aa", "aaa", etc.
```
a{0,} 
```
- The third form of a bound gives a single value indicating the only number of acceptable iterations. This regex accepts only "aaaaaaaaaaaaaaaaa" (that's 17 iterations of the letter 'a').
```
a{17} 
```
- Zero is also acceptable in the third form of a bound. This regex accepts only the empty string.
```
a{0} 
```
- The ? character is used as shorthand for a bound accepting 0 or 1 iterations of the operand (? is equivalent to {0,1}). This regex accepts the empty string or "a".
```
a? 
```
- The * character is used as shorthand for a bound accepting 0 or more iterations of the operand (* is equivalent to {0,}). This regex accepts the empty string, "a", "aa", "aaa", etc.
```
a* 
```
- The + character is used as shorthand for a bound accepting 1 or more iterations of the operand (+ is equivalent to {1,}). This regex accepts "a", "aa", "aaa", etc.
```
a+ 
```
- Iteration has higher precedence than concatenation. This regex accepts "ab", "abb", "abbb", etc.
```
ab+ 
```
  Equivalent to the unambiguously parenthesized regex: (a)((b)+)
- If iteration of the sequence ab was desired, then the following regex would be used to accept "ab", "abab", "ababab", etc.
```
(ab)+ 
```
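
The bound forms and their shorthand operators can be compared by counting which run lengths of "a" each regex accepts. The sketch below uses Python's `re` module, whose bound syntax happens to match the notation above; the helper `accepted_counts` and the cutoff of 20 are arbitrary choices for illustration.

```python
import re

def accepted_counts(regex, max_n=20):
    """Return every n in 0..max_n for which regex accepts 'a' repeated n times."""
    return [n for n in range(max_n + 1) if re.fullmatch(regex, "a" * n)]

closed_range = accepted_counts(r"a{3,5}")   # [3, 4, 5]
from_zero = accepted_counts(r"a{0,3}")      # [0, 1, 2, 3]
open_ended = accepted_counts(r"a{2,}")      # 2 through 20
exact = accepted_counts(r"a{17}")           # [17]
exact_zero = accepted_counts(r"a{0}")       # [0]

# Each shorthand operator accepts the same run lengths as its bound form.
shorthand = {r"a?": r"a{0,1}", r"a*": r"a{0,}", r"a+": r"a{1,}"}
equivalent = all(
    accepted_counts(short) == accepted_counts(bound)
    for short, bound in shorthand.items()
)

print(closed_range, from_zero, exact, exact_zero, equivalent)
```

Checking a finite range of lengths does not prove the equivalences, of course; it only demonstrates them for the lengths tested.
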
Concatenation -- indicates that the left-side operand should precede the right-side operand. There is no physical operator character for this operation; it is performed by simply placing the operands next to one another in the regex. Examples:
- Only accepts the string "ab"
```
ab 
```
- Accepts the strings "ac", "ad", "bc" or "bd"
```
(a|b)(c|d) 
```
  Equivalent to the unambiguously parenthesized regex: ((a)|(b))((c)|(d))
- Accepts the string "ostrich!"
```
ostrich! 
```
- Iteration has higher precedence than concatenation. Thus this regex accepts the strings "ostrich", "ostrich!", "ostrich!!", "ostrich!!!", etc.
```
ostrich!* 
```
  Equivalent to the unambiguously parenthesized regex: (ostrich)((!)*)
- Accepts a string containing at least one of each of the letters, in order -- for example, "slow", "ssssslowww", "ssslllooowww", "sslllllloooooooooow", etc.
```
s+l+o+w+ 
```
  Equivalent to the unambiguously parenthesized regex: ((s)+)((l)+)((o)+)((w)+)
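
One way to see exactly which strings a concatenation accepts is to enumerate every string of a given length over a small alphabet and keep the accepted ones. This is a brute-force sketch in Python (again a stand-in dialect, not BARF), with a hypothetical helper `accepted_strings`; it is only practical for short lengths.

```python
import re
from itertools import product

def accepted_strings(regex, alphabet, length):
    """List every string of the given length over alphabet that regex accepts."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=length))
    return [s for s in candidates if re.fullmatch(regex, s)]

# Every two-character string accepted by (a|b)(c|d).
pairs = accepted_strings(r"(a|b)(c|d)", "abcd", 2)   # ['ac', 'ad', 'bc', 'bd']

# Five-character strings accepted by s+l+o+w+: exactly one letter is doubled.
slow5 = accepted_strings(r"s+l+o+w+", "slow", 5)

# Iteration binds tighter than concatenation: only the ! is repeated.
bang_ok = re.fullmatch(r"ostrich!*", "ostrich!!") is not None
bang_rejected = re.fullmatch(r"ostrich!*", "ostrich!ostrich!") is None

print(pairs)
print(slow5)
print(bang_ok, bang_rejected)
```
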
Alternation -- indicates that either the left-side operand or the right-side operand is acceptable; each match uses one operand or the other, not both in sequence.
- Accepts "hippo" or "ostrich".
```
hippo|ostrich 
```
- Accepts "hippo" or the empty string.
```
hippo| 
```
  Equivalent to the unambiguously parenthesized regex: (hippo)|()
- Accepts a string containing any nonempty sequence of the words "Hippo" or "Ostrich" -- for example, "Hippo", "Ostrich", "HippoHippo", "HippoOstrich", "OstrichHippoOstrichOstrichHippo", etc.
```
(Hippo|Ostrich)+ 
```
  Equivalent to the unambiguously parenthesized regex: ((Hippo)|(Ostrich))+
- Iteration has higher precedence than alternation, so this regex alternates the sequences hippo and (ostrich)+, thus accepting the strings "hippo", "ostrich", "ostrichostrich", "ostrichostrichostrich", etc.
```
(hippo)|(ostrich)+ 
```
  Equivalent to the unambiguously parenthesized regex: (hippo)|((ostrich)+)
- Contrast this example to the above -- because iteration has a higher precedence than both concatenation and alternation, it accepts the strings "Hippo", "Ostrich", "Ostrichh", "Ostrichhh", etc.
```
Hippo|Ostrich+ 
```
  Equivalent to the unambiguously parenthesized regex: (Hippo)|((Ostric)((h)+))
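
The alternation examples differ only in where the iteration applies, which is easy to confirm against a fixed set of candidates. The sketch below uses Python's `re` module as an approximation of the generic behavior; the helper `accepted` and the candidate list are hypothetical, and all regexes are capitalized uniformly so they can share one list.

```python
import re

CANDIDATES = ["", "Hippo", "Ostrich", "Ostrichh", "OstrichOstrich", "HippoOstrich"]

def accepted(regex):
    """Return the candidates that regex accepts in full (hypothetical helper)."""
    return [s for s in CANDIDATES if re.fullmatch(regex, s)]

# An empty right-hand branch makes the empty string acceptable.
empty_branch = accepted(r"Hippo|")                 # ['', 'Hippo']

# + applied to the whole alternation: any nonempty sequence of the two words.
iterated_group = accepted(r"(Hippo|Ostrich)+")

# + applied to the right-hand branch only: repeats of Ostrich, or one Hippo.
iterated_branch = accepted(r"(Hippo)|(Ostrich)+")

# + applied to the final character only: Ostrich with one or more trailing h's.
iterated_char = accepted(r"Hippo|Ostrich+")

print(empty_branch)
print(iterated_group)
print(iterated_branch)
print(iterated_char)
```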

Conditionals In Generic Regular Expressions

From within the context of a regular expression (not inside a bracket expression), the characters ^ (caret) and $ (dollar sign) are special -- they match the empty string at the beginning of the line and the end of the line, respectively. The beginning of a line occurs at the beginning of input, or immediately after a newline has been accepted. The end of a line occurs at the end of input, or when the next character in the unread input is a newline.

These two special characters are referred to as conditionals, in that they don't accept any physical input, but rather require certain input conditions to be met to accept. In various regex implementations, other conditionals exist, such as:

Word boundary -- matches the empty string when the most recently accepted character is [a-zA-Z0-9_] and the next unread character is not, or vice versa.
Not a word boundary -- matches the empty string when the word boundary condition is not met
Beginning/end of input -- matches the empty string at the beginning or end of input
Not beginning/end of input -- matches the empty string when the beginning/end of input condition is not met
Not beginning/end of line -- the negation of the conditions indicated by ^ and $

Here is an illustrative example.

two
lines!

Using the above text as input, the beginning-of-line condition ^ (caret) is met before the characters t and l, since they're both at the beginning of a line.
The end-of-line condition $ (dollar sign) is met after the characters o and !, since they're both at the end of the line.
The word-boundary condition is met before the character t, after the character o, before the character l, and after the character s.
The not-a-word-boundary condition is met in between the character pairs tw, wo, li, in, ne, es and after the character !.
By now, you can probably guess where the remaining conditionals are satisfied (and frankly I'm sick of typing examples).
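
For the curious, the positions described above can be reproduced with Python's `re` module, which is used here only as a convenient stand-in (BARF's notation for conditionals may differ). In Python, `^` and `$` need the `MULTILINE` flag to behave as line conditionals, `\b` and `\B` are the word-boundary conditionals, and `\A` and `\Z` mark the beginning and end of input. Python's notion of a word character is somewhat broader than [a-zA-Z0-9_], but the two agree for this text. The helper `positions` is made up for this sketch.

```python
import re

TEXT = "two\nlines!"   # offsets: t=0 w=1 o=2 \n=3 l=4 i=5 n=6 e=7 s=8 !=9

def positions(regex, flags=0):
    """Return the offsets in TEXT where the zero-width regex is satisfied."""
    return [m.start() for m in re.finditer(regex, TEXT, flags)]

beginning_of_line = positions(r"^", re.MULTILINE)   # [0, 4]: before t and l
end_of_line = positions(r"$", re.MULTILINE)         # [3, 10]: after o and !
word_boundary = positions(r"\b")                    # [0, 3, 4, 9]
not_word_boundary = positions(r"\B")                # [1, 2, 5, 6, 7, 8, 10]
beginning_of_input = positions(r"\A")               # [0]
end_of_input = positions(r"\Z")                     # [10]

print(beginning_of_line, end_of_line)
print(word_boundary, not_word_boundary)
print(beginning_of_input, end_of_input)
```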

Again, see Regular Expressions As Implemented By BARF (The Awesome Kind) for implementation-specific details.

-- Generated on Mon Jan 7 22:58:00 2008 for BARF by Doxygen 1.5.1