Here
is an excerpt from our book "Parsing with Visual
Parse++" which provides a brief description of parsing
technology and how Visual Parse++ fits into the larger picture of
compiler architecture.

Lexical Analysis (Tokenization)
After a source file has been
preprocessed so that it contains only those symbols that are
properly part of the programming language being compiled, it is
broken into discrete tokens. A token is the smallest
lexical unit that is understood to have some syntactic meaning
within a programming language. Consider the final output of the
preprocessor given above:
float ac = 0.0;
ac = 3.14 * ((3+1) * (3+1));
In this small bit of code, the
lexical tokens, in order (not including spaces, for the sake of
simplicity here), are:
float data type (float)
ac identifier
= operator (assignment)
0.0 real number
; semicolon
ac identifier
= assignment operator
3.14 real number
* operator (assignment) see
note below
( parenthesis (open)
( parenthesis (open)
3 integer number
+ operator (plus)
1 integer number
) parenthesis (close)
* operator (multiplication)
( parenthesis (open)
3 integer number
+ operator (plus)
1 integer number
) parenthesis (close)
) parenthesis (close)
; semicolon
Lexical analysis is the process by
which the input text is decomposed into such basic units. Note that,
in the above example, the classification of the lexeme * into
a token called operator (multiplication) is not completely
satisfactory, since in the C language, the lexeme * may also be used
for pointer dereferencing. Since this lexeme may be interpreted
later by the parsing phase, based upon its syntactical context to
either be the multiplication operator, the pointer type declaration,
or a pointer dereferencing operator, this particular lexeme might
better be classified by lexical analysis as a more generic token
called star.
Syntactic Analysis (Parsing)
In the process of compilation of a
source file, tokenization alone does not suffice. Consider the
following statement, applying the rules of the C language:
int 63 = n;
Although this would be tokenized
into perfectly legal tokens (namely: data type, integer number,
assignment operator, identifier, semicolon), in the C language,
variable declarations and initialization have the syntactic form
of:
datatype identifier
assigmentoperator value semicolon
The syntactical error in the above
declaration is that an identifier is expected, but an integer number
is supplied in its place. The process of determining whether or not
a series of tokens satisfies the expressed syntactic rules of a
language is known as syntactic analysis or parsing.
The parser requests tokens from the tokenizer (sometimes called the lexer)
and attempts to determine if the tokens occur in sequences that are
permitted by the general form of the language in question. This
general description of the syntactical form of a language is called
a grammar and will be discussed more fully in a later
section.
As already discussed, lexical
analysis, or tokenization, is the process of decomposing input text
into discrete tokens. In a general sense, a token is something that
represents something else. In the case of the lexical analysis of
computer programming language source code, for instance, the token identifier
represents any sequence of characters that is not a key word or
otherwise reserved word of the language. The task of the tokenizer
is to scan through the characters of the source text and assemble
tokens, one at a time, and feed these to the parser when it requests
them.
Visual Parse++ grammar
specifications begin by defining all of the unique tokens of the
input language they are designed to parse. Let us imagine a very
small text formatting language called Teeny. Teeny is a simple
language that can do only one thing: display text on a console,
either in a normal font, or in italics. An example of a Teeny file
follows:
The quick <brown
<i>fox</i> jumped over Tom’s <i>lazy</i>
dog.
The text of the above Teeny File
would display with the words fox and lazy in italics,
as follows:
The quick <brown fox
jumped over Tom’s lazy dog.
The Teeny formatting language does
not do very much, but it will serve our purposes for explaining
tokens. The tokens of the Teeny language, specified in Visual
Parse++, would be:
%expression Main
'[a-zA-Z0-9]+' word;
' ' space;
'<[iI]>' italics_on;
'</[iI]>' italics_off;
'.' any_other_character;
Tokenized, the example sentence
would become:
The word
space
quick word
space
< any_other_character
brown word
<i> italics_on
fox word
</i> italics_off
et cetera...
Notice that certain tokens are
expressed as patterns, rather than as literals. In Teeny, the token word
is any sequence of one or more characters from the ranges a-z, A-Z,
or 0-9. This is expressed using a regular expression. Regular
expressions allow the large sets of text sequences to be considered
the same type of token. In the case of Teeny, the only important
distinction between the tokens that has been made is that there are
words, spaces, italics commands, and other characters. (The regular
expression ‘.’ will match any single character.) Since Teeny
simply displays whatever it encounters in the text file, it does not
need to be able to distinguish one word from any other word, and so,
all words are the same. The exact text that matches does become
important later, when it comes time to actually display the text,
but for the purposes of determining whether or not a given input
file is a valid Teeny file, this is not an issue.
In the case of Teeny, there are
five distinct token types. The job of the lexical scanner is to read
through an input file and return one complete token whenever it is
certain it has consumed a complete token. It is fairly
simple, given such a basic language as Teeny, to imagine how this
happens. The scanner examines one character at a time, and then,
based upon the character it is examining, makes either continues to
examine more characters, or returns a token. When it sees
"T", for instance, it knows that it is in a word
token. When it scans ahead one more character, it sees
"h", and remains in the word token. Again, it scans
ahead one more character, and sees "e", remaining in the
same state. Finally, it sees the space, which is not part of
the word token, and enters an accepting state.
(It has accepted "The" as a complete word token.)
At this point, the lexical scanner returns whatever numeric constant
represents word tokens (let’s say this number is 1, since
it is the first type of token). Since the character that brought the
scanner out of the word token was a space, the scanner is now
in a space token. Since a single space is accepted as a word
token, the scanner then returns the numeric constant representing
the space token (for instance, 2).
This process continues, until the
scanner sees the first "<" character. At this point,
there are three possibilities:
- the next character is either "i" or
"I", in which case the scanner may be encountering an italics_on
token,
- the next character is a "/", in which
case the scanner may be encountering an italics_off
token, or
- the next character is any other character, in
which case the "<" is simple an any_other_character
token.
In the Teeny example, it is case 3
above – the next character is the "b" of
"brown" and the scanner returns the numeric constant for
an any_other_character token (let’s say 5).
We can imagine that the lexical
scanners are an efficient machine that shift from one state
to another, depending on what state they are currently in, what they
expect to encounter before they enter their next state, and that the
total number of states that such a machine can enter is finite. For
this reason, we say that a lexical scanner such as the one used by
Visual Parse++ to tokenize input text is a finite state automaton.
(As defined by the Merriam-Webster online dictionary, an automaton
is "a machine or control mechanism designed to follow
automatically a predetermined sequence of operations or respond to
encoded instructions.")
Although it is not necessary to fully understand
all of the theory of finite state automata in order to write a
grammar specification using Visual Parse++, understanding some of
how they work may assist you as you consider exactly how Visual
Parse++ tokenizes input text. Tokenization by an FSA is an extremely
efficient means of lexical scanning, since Visual Parse++ is able to
turn token specifications as expressed by regular expressions into a
deterministic machine that can return tokens without having
to scan characters more than once. This is easy to imagine from our
example, since, as each character is consumed from the input text,
there are a finite number of possibilities. In order to determine
which state to enter next, the scanner simply reads another
character and either remains in the same state it is in, or enters a
new state.
Click here
for our free informative book "Parsing with Visual Parse++"