Products: Visual Parse++: Tutorials: Topics: Other Support:
Visual_Parse++     Visual_Parse++_Features GUI Tutor 1 Reducing a Rule Parsing Technology
DataStruct Why Visual Parse++? GUI Tutor 2 Expression List Stack Documentation
Meta-S Download Trial C++ Tutor 1 Flexible File Format Further Reading
Ordering Accolades Java Tutor 1 Grammar Idioms Consulting Services
Support Our Customers
  Back to home page  Click here for our free informative book "Parsing with Visual Parse++"  
 

Visual Parse++

"The industry standard visual parsing tool."

"It is indeed a great tool you have created; especially the debugging facilities have impressed me."

—Ulla Peterson, 

Hydro Informatics Technologies

 
   For a 15 day free trial click here To order now click here!  
  Here is an excerpt from our book "Parsing with Visual Parse++"  which provides a brief description of parsing technology and how Visual Parse++ fits into the larger picture of compiler architecture.

 

Lexical Analysis (Tokenization)

After a source file has been preprocessed so that it contains only those symbols that are properly part of the programming language being compiled, it is broken into discrete tokens. A token is the smallest lexical unit that is understood to have some syntactic meaning within a programming language. Consider the final output of the preprocessor given above:

float ac = 0.0;

ac = 3.14 * ((3+1) * (3+1));

In this small bit of code, the lexical tokens, in order (not including spaces, for the sake of simplicity here), are:

float data type (float)

ac identifier

= operator (assignment)

0.0 real number

; semicolon

ac identifier

= assignment operator

3.14 real number

* operator (assignment) see note below

( parenthesis (open)

( parenthesis (open)

3 integer number

+ operator (plus)

1 integer number

) parenthesis (close)

* operator (multiplication)

( parenthesis (open)

3 integer number

+ operator (plus)

1 integer number

) parenthesis (close)

) parenthesis (close)

; semicolon

Lexical analysis is the process by which the input text is decomposed into such basic units. Note that, in the above example, the classification of the lexeme * into a token called operator (multiplication) is not completely satisfactory, since in the C language, the lexeme * may also be used for pointer dereferencing. Since this lexeme may be interpreted later by the parsing phase, based upon its syntactical context to either be the multiplication operator, the pointer type declaration, or a pointer dereferencing operator, this particular lexeme might better be classified by lexical analysis as a more generic token called star.

Syntactic Analysis (Parsing)

In the process of compilation of a source file, tokenization alone does not suffice. Consider the following statement, applying the rules of the C language:

int 63 = n;

Although this would be tokenized into perfectly legal tokens (namely: data type, integer number, assignment operator, identifier, semicolon), in the C language, variable declarations and initialization have the syntactic form of:

datatype identifier assigmentoperator value semicolon

The syntactical error in the above declaration is that an identifier is expected, but an integer number is supplied in its place. The process of determining whether or not a series of tokens satisfies the expressed syntactic rules of a language is known as syntactic analysis or parsing. The parser requests tokens from the tokenizer (sometimes called the lexer) and attempts to determine if the tokens occur in sequences that are permitted by the general form of the language in question. This general description of the syntactical form of a language is called a grammar and will be discussed more fully in a later section.

As already discussed, lexical analysis, or tokenization, is the process of decomposing input text into discrete tokens. In a general sense, a token is something that represents something else. In the case of the lexical analysis of computer programming language source code, for instance, the token identifier represents any sequence of characters that is not a key word or otherwise reserved word of the language. The task of the tokenizer is to scan through the characters of the source text and assemble tokens, one at a time, and feed these to the parser when it requests them.

Visual Parse++ grammar specifications begin by defining all of the unique tokens of the input language they are designed to parse. Let us imagine a very small text formatting language called Teeny. Teeny is a simple language that can do only one thing: display text on a console, either in a normal font, or in italics. An example of a Teeny file follows:

The quick <brown <i>fox</i> jumped over Tom’s <i>lazy</i> dog.

The text of the above Teeny File would display with the words fox and lazy in italics, as follows:

The quick <brown fox jumped over Tom’s lazy dog.

The Teeny formatting language does not do very much, but it will serve our purposes for explaining tokens. The tokens of the Teeny language, specified in Visual Parse++, would be:

%expression Main

'[a-zA-Z0-9]+' word;

' ' space;

'<[iI]>' italics_on;

'</[iI]>' italics_off;

'.' any_other_character;

Tokenized, the example sentence would become:

The word

space

quick word

space

< any_other_character

brown word

<i> italics_on

fox word

</i> italics_off

et cetera...

Notice that certain tokens are expressed as patterns, rather than as literals. In Teeny, the token word is any sequence of one or more characters from the ranges a-z, A-Z, or 0-9. This is expressed using a regular expression. Regular expressions allow the large sets of text sequences to be considered the same type of token. In the case of Teeny, the only important distinction between the tokens that has been made is that there are words, spaces, italics commands, and other characters. (The regular expression ‘.’ will match any single character.) Since Teeny simply displays whatever it encounters in the text file, it does not need to be able to distinguish one word from any other word, and so, all words are the same. The exact text that matches does become important later, when it comes time to actually display the text, but for the purposes of determining whether or not a given input file is a valid Teeny file, this is not an issue.

In the case of Teeny, there are five distinct token types. The job of the lexical scanner is to read through an input file and return one complete token whenever it is certain it has consumed a complete token. It is fairly simple, given such a basic language as Teeny, to imagine how this happens. The scanner examines one character at a time, and then, based upon the character it is examining, makes either continues to examine more characters, or returns a token. When it sees "T", for instance, it knows that it is in a word token. When it scans ahead one more character, it sees "h", and remains in the word token. Again, it scans ahead one more character, and sees "e", remaining in the same state. Finally, it sees the space, which is not part of the word token, and enters an accepting state. (It has accepted "The" as a complete word token.) At this point, the lexical scanner returns whatever numeric constant represents word tokens (let’s say this number is 1, since it is the first type of token). Since the character that brought the scanner out of the word token was a space, the scanner is now in a space token. Since a single space is accepted as a word token, the scanner then returns the numeric constant representing the space token (for instance, 2).

This process continues, until the scanner sees the first "<" character. At this point, there are three possibilities:

  1. the next character is either "i" or "I", in which case the scanner may be encountering an italics_on token,
  2. the next character is a "/", in which case the scanner may be encountering an italics_off token, or
  3. the next character is any other character, in which case the "<" is simple an any_other_character token.

In the Teeny example, it is case 3 above – the next character is the "b" of "brown" and the scanner returns the numeric constant for an any_other_character token (let’s say 5).

We can imagine that the lexical scanners are an efficient machine that shift from one state to another, depending on what state they are currently in, what they expect to encounter before they enter their next state, and that the total number of states that such a machine can enter is finite. For this reason, we say that a lexical scanner such as the one used by Visual Parse++ to tokenize input text is a finite state automaton. (As defined by the Merriam-Webster online dictionary, an automaton is "a machine or control mechanism designed to follow automatically a predetermined sequence of operations or respond to encoded instructions.")

Although it is not necessary to fully understand all of the theory of finite state automata in order to write a grammar specification using Visual Parse++, understanding some of how they work may assist you as you consider exactly how Visual Parse++ tokenizes input text. Tokenization by an FSA is an extremely efficient means of lexical scanning, since Visual Parse++ is able to turn token specifications as expressed by regular expressions into a deterministic machine that can return tokens without having to scan characters more than once. This is easy to imagine from our example, since, as each character is consumed from the input text, there are a finite number of possibilities. In order to determine which state to enter next, the scanner simply reads another character and either remains in the same state it is in, or enters a new state.

Click here for our free informative book "Parsing with Visual Parse++"