The Grotesque Innards Of Reflex

Regular Expressions As Used By Reflex

I would recommend reading the pages on Language Theory, Generic Regular Expressions (The Boring Kind) and Regular Expressions As Implemented By BARF (The Awesome Kind) before proceeding. The latter is essential for understanding this section and the next.

The body of a reflex scanner source file consists of one or more named scanner modes. A scanner mode is composed of one or more rules. A rule is a regular expression with associated handler code. Essentially each scanner mode is a separate state machine, as it is impossible to transition from anywhere in one scanner mode to a different scanner mode without an explicit transition in the rule handler code. Each rule regex in a given scanner mode is essentially a separate branch in a massively alternated regex.

TODO: graph of AST produced by the example

Here is the NFA produced from the example reflex source given in Format Of The Reflex Source File.

inline_dotgraph_4

inline_dotgraph_5

Notice that the two scanner modes are entirely separate NFAs. Also notice that though there is only one start state (the double-lined graph nodes) for each scanner mode, there are multiple accept states (the green graph nodes). This is because not only does the scanner need to decide wether or not the input is accepted by any of the regexes in the scanner mode, but it has to determine which one(s) accept the input. However, notice that there is only one accept state for each regex. This fact will be contrasted after DFA generation.

In order to construct the DFA from which to generate the scanner targets, we will use the modified subset construction algorithm on each separate scanner mode's NFA. The result is an equal number of DFAs, each corresponding to the NFA it was generated from. Here are the DFAs corresponding to the above NFAs.

inline_dotgraph_6

inline_dotgraph_7

Because each DFA state is a set of NFA states, each node is labeled first with the DFA state index, then the list of NFA states it corresponds to. Again, start states are indicated by double-lined graph nodes and accept states are indicated by green graph nodes. For the DFA states which contain accept states from the NFA, the accepting regex is indicated after the state index, separated by a colon (just like in the NFA graph). If a DFA state contains an accept state from the NFA, it is also an accept state. Notice that in the DFA graph for %scanner_mode BLOCK_COMMENT there are two DFA states which contain the NFA state 1:([^*]+|[*]). This is a result of the subset construction algorithm. Another result is that over all states in a DFA, the corresponding NFA state set is unique.

If a DFA state contains more than one NFA accept state, the state with the lower index (i.e. declared first in the reflex source file) takes priority, as only one rule handler can be be executed for each string that is accepted.

Implementing A Reflex Target

Make sure to read the previous section and all referred pages before this one, but especially The Targetspec And Codespec Paradigm, as this section draws on the cumulative information contained in said material.

The goal of this section is to enumerate the concepts and algorithms necessary to implement a target using whatever crappy language your heart desires. What will be implemented is a machine which runs the FA generated by reflex (both an NFA and equivalent DFA are generated), and it must conform to certain standards and obey certain rules to provide consistent behavior across all targets.

First, you must gain an understanding of exactly what a scanner does. The scanner simply analyzes input, one character at a time, attempting to accept the longest string which satisfies its FA. Once the scanner can make no valid transitions based on the next input atom, it executes the code associated with the most recently encountered accept state(s). Assuming the executed code didn't cause the scanning to halt, the machine sets the current state to the starting state for the current scanner mode (e.g. for %scanner_mode MAIN in the above example, DFA state 6) and continues processing.

The main parts of a scanner, as defined in this document, are:

The input and automaton apparati are essentially unchangeable, to be used as common code to all customized scanner apparati (e.g. the input and automaton apparati could be implemented as base classes for each of many possible customized scanner apparati to subclass). The following is an example class hierarchy chart to illustrate the point.

inline_dotgraph_8

Because all of the customized scanner apparati (MapFileScanner, NetworkCommandScanner, and ArithmeticScanner) use AutomatonApparatus and InputApparatus as their base classes, there must be no scanner-specific customization in either. Specifically, this means that the codespec files which specify the input and automaton apparati must not use any of the scanner-specific codespec macros.

Definitions

First, some concepts will be defined for use in the forthcoming scanner algorithm definition.

An atom is one piece of input, in this case an ASCII character. It can be anything in the range 0-255 (though 0 is reserved to represent beginning-of-input and end-of-input).

A deque (double-ended queue) is a well-known abstract container type. It provides random access to indexed elements (like an array), and (amortized) constant-time insertion and removal of elements (hence the double-ended qualifier). It must also provide an accessor for its length. In the context of this scanner algorithm, a deque will contain elements of type atom.

The scanner uses the input buffer (which utilizes a deque) to track the accepted text and unread text. Each index of the input buffer will be referred to as a cursor. As will become apparent soon, a cursor is only valid if it has index greater than 0. The input buffer is split into four sections.

To illustrate the input buffer graphically:

previous  accepted   unaccepted   unread
    atom   string      string     string

       ABBBBBBBBBBBBCCCCCCCCCCCCDDDDDDDDDD
        ^           ^           ^         ^
        |           |           |         |
      index       accept       read     length
        1         cursor      cursor      of
                                        buffer

The accepted string (BBBBBBBBBBBB) and the unaccepted string (CCCCCCCCCCCC) may both independently have length 0.

When first constructing the input buffer, the value '\0' (null-terminating char) should be added as the very first element (index 0). This value for the previous atom indicates the beginning of input. As more atoms are read, the end of input will eventually be encountered. Once this happens, the value '\0' should be added to the end of the input buffer, at which point, no more atoms may be read. This final atom indicates the end of input. Correspondingly, any normal atom (not the beginning or end of input) must not be '\0'.

A conditional is a pair of bitfields -- a mask and a set of flags -- which indicate a set of conditions. The mask indicates which bits are active in the conditional, while the flags indicate the values of the active bits. Each bit corresponds to a different condition at the read cursor. The conditions and corresponding bitfield values are as follows. In the following pseudocode, the existence of the previous atom is shown to be necessary (by the subtraction of 1 within the array index expression)

The current conditional flags is the flags portion of a conditional, indicating the conditions active in the input buffer at the read cursor. The existence of the previous atom is solely to guarantee that there is always at least one atom before the read cursor upon which to perform the calculation of the current conditional flags.

The above definitions have been made for use in the following definition.

The Input Apparatus

The input apparatus is an interface which provides certain well-defined operations on the above-defined entities. Its encapsulation serves to simplify the interaction with the automaton apparatus (explained later) which runs the state machine. The interface follows (Note: types, structures and methods use CapitalizedWord identifiers, members use m_underscore_separated_lowercase identifiers prefixed with m_, and compile-time constants are indicated by UPPERCASE_WITH_UNDERSCORES identifiers).

interface InputApparatus {
    // types

    // An 8-bit (unsigned) integer value to store atoms.
    type Atom;
    // An integer large enough to store the highest conditional bit flag.
    type Bitfield;
    // A string of atoms.
    type String;
    // A type which can be used to call a method taking no arguments, returning a Boolean (e.g. a C++ method pointer).
    type MethodReturningBoolean;
    // A type which can be used to call a method taking no arguments, returning an Atom (e.g. a C++ method pointer).
    type MethodReturningAtom;
    // A type satisfying the definition of input buffer, given above.
    type InputBuffer;
    // An integer large enough to store an index into the input buffer.
    type BufferIndex;    
    
    // compile-time constants
    
    Bitfield BEGINNING_OF_INPUT = (1 << 0);
    Bitfield END_OF_INPUT       = (1 << 1);
    Bitfield BEGINNING_OF_LINE  = (1 << 2);
    Bitfield END_OF_LINE        = (1 << 3);
    Bitfield WORD_BOUNDARY      = (1 << 4);
    
    // members
    
    // The deque-object which will store the read-in atoms.
    member InputBuffer            m_input_buffer;
    // The index of the read cursor in the input buffer, as described above.
    member BufferIndex            m_read_cursor;
    // The index of the accept cursor in the input buffer, as described above.
    member BufferIndex            m_accept_cursor;
    // The cached value of the conditional flags at the location in the input buffer
    // indicated by the read cursor.  Because it is a cached value, it will have to
    // be updated before being accessed at certain times.
    member Bitfield               m_current_conditional_flags;
    // The method to call to determine if we're at the end of the input stream.
    member MethodReturningBoolean m_is_at_end_of_input_method;
    // The method to call to retrieve the next atom from the input stream.
    member MethodReturningAtom    m_read_next_atom_method;
    
    // methods
    
    // Should be called once and only once upon creation of this input apparatus -- one-time
    // initializations including construction of all members.
    void Construct (MethodReturningBoolean is_at_end_of_input_method, MethodReturningAtom read_next_atom_method);
    // Should be called when this object is going bye-bye.  Destroys all member variables
    // and performs any other necessary clean-up.
    void Destroy ();
    // Return true if and only if CurrentConditionalFlags() indicates that the end-of-input 
    // condition is true.
    Boolean IsAtEndOfInput ();
    // Returns the conditional flags for the current value of the read cursor and input buffer.
    // This method will first have to ensure that at least one atom on either side of the 
    // current read cursor is available before performing the calculations.
    Bitfield CurrentConditionalFlags ();
    // Returns true if and only if no relevant bits (the bits set in conditional_mask) differ 
    // between conditional_flags and CurrentConditionalFlags().  Using C-language operators, 
    // this can be calculated as 
    // ((conditional_flags ^ CurrentConditionalFlags()) & conditional_mask) == 0
    Boolean IsConditionalMet (Bitfield conditional_mask, Bitfield conditional_flags);
    // Ensures there is at least one atom in the input buffer after the read cursor, by 
    // using the is_at_end_of_input_method and read_next_atom_method methods (passed into
    // Construct) as necessary.
    Atom InputAtom ();
    // Increments the read cursor by one.  The read cursor should be greater than zero and 
    // less than the length of the input buffer.
    void AdvanceReadCursor ();
    // Assigns the read cursor's value to the accept cursor.
    void SetAcceptCursor ();
    // Extracts and returns the accepted string; the string of input buffer atoms in the range 
    // [1, accept cursor) -- if the accept cursor is 0, it is an error.  Pops the same number 
    // of atoms off the front of the input buffer.  Sets the read cursor to 1 (indicating the 
    // new beginning of the unread string).  Sets the accept cursor to 0 (indicating the 
    // non-existence of the accept string).  A result of the popping is that the new previous 
    // atom is the last atom in the accepted string.
    String Accept ();
    // Extracts and returns the atom at index 1 in the input buffer.  If the accept cursor 
    // is not 0, it is an error.  If the extracted atom is not 0 (null-terminating character), 
    // pops the front of the input buffer once.  Sets the read cursor to 1 (which is the new 
    // beginning of the unread string).
    Atom Reject ();
    // Clears the input buffer.  Pushes an atom with value 0 onto the input buffer (at this 
    // point, a single atom with value 0 should be the only contents of the input buffer).  
    // Sets the read cursor to 1 (indicating the beginning of the unread string).  Sets the 
    // accept cursor to 0 (indicating the non-existence of the accept string).  Must somehow 
    // indicate that the current conditional flags have not been calculated (so that when 
    // CurrentConditionalFlags is called, the current conditional flags can be calculated).  
    // This method must be called to initialize the input buffer before being used, as well 
    // as any time this input buffer is being reused for a new input source (e.g. reading 
    // from a second, different file).
    void ResetForNewInput ();
}

All of the following two definitions are independent of the input apparatus -- they are definitions for the components required to run the FA generated by reflex.

The current scanner mode (not to be confused with an FA state) is the currently active "mode" of the scanner -- the two scanner modes in the example from Format Of The Reflex Source File are MAIN and BLOCK_COMMENT. Each has an associated FA -- a set of states, transitions, a starting state, and a set of accept states.

A state identifier is an abstract indicator for a single state in an FA. It can be invalid (i.e. indicating no real FA state), and its validity must be detectable. A state identifier could be implemented as a pointer to a memory address (in which case the invalid value would be NULL), the FA state index (the invalid value could be one higher than the highest valid FA state index), or whatever.

The above definitions have been made for use in the following definition.

The Automaton Apparatus

The automaton apparatus is an interface which provides certain well-defined operations on the above-defined entities. Like the input apparatus, its encapsulation serves to simplify the interaction with the final generated scanner. The given definition is that of a DFA (and not an NFA); because all NFAs have an equivalent DFA, we will choose to implement the easier/faster DFA. The definition of the automaton apparatus for an NFA may be given at a later time. The interface follows.

The Scanner Apparatus

TODO: this section isn't finished

The scanner apparatus is the main interface and is publicly accessible to the user's code. The API defined here defines the canonical behavior which all implemented reflex targets should follow (though there certainly is no absolute requirement to do so). This conformance is intended to provide consistent behavior and semantics between different targets.

Each apparatus, in hiding its subordinate, provides a much cleaner and simpler API. The scanner apparatus API is the simplest of all. The previously defined input and FA apparati have been separated out for the following three reasons.

TODO: describe the "debug spew" and "scanner mode" properties. There are two important properties of the scanner apparatus which will now be described. Firstly, the "debug spew" property, a boolean value, indicates wether or not various "debug messages" will be printed during the execution of

The scanner apparatus API follows.

TODO: more words


Hosted by SourceForge.net Logo -- Generated on Mon Jan 7 22:58:02 2008 for BARF by doxygen 1.5.1