
Visual Parse++ Topics: Flexible File Format



You'll learn the following from this topic:

About Flexible File Formats.
How to tokenize the input.
About Delta Files.

From a programming perspective, lexing and parsing technology can be viewed as the construction of machines to break apart and reassemble or restructure some data. The lexer typically breaks apart and identifies the entities (pieces), and the grammar tells the parser how to interpret or structure (and verify) the pieces of data into some coherent form. 

The grammar describes the valid format, or structure of the data. The parser, using the grammar as a control, locates the valid elements, and presents them to a program to do some task. 

One useful but little-used application of parsing technology is adding some structure to files written to disk. Today, applications typically just write their data to a flat file, and the file typically maps to flat records defined by something like a C structure.

Have you ever tried to change such a format? It's very difficult and time-consuming: you can't just add or remove a field from your map (the C structure), because doing so makes the old format unreadable. You usually wind up making new maps called something like mapName0, mapName1, ..., and then rewriting some piece of code to handle them. This becomes tiresome very quickly, especially during initial design, when the file format can change several times a day.

This article proposes a new way to handle this common problem. Using Visual Parse++, you can put a parser in front of your flat file and gain a very flexible format in the process. Many common text file formats (RTF, HTML, etc.) already do this; this approach just extends the concept to binary files.

Tokenize Input

The first step is to tokenize the file somehow. An easy way to do this is to add a header to each piece of data. The header is 8 bytes long and contains two 32-bit fields: one for a length value and one for a token value. In C or C++ (on a platform where unsigned long is 32 bits), it would look like this:

struct Header
{
    unsigned long Length;   // length value (32 bits)
    unsigned long Token;    // token value (32 bits)
};

For this application, no lexer is required: the data is pre-lexed into tokens when it is written. The reader for this type of structure is trivial; it just reads the tokens in one at a time.

In C++, the tokens are passed to the parser by overriding nextLexeme.

We can design a grammar to describe the format of the data, then add code to the reduce skeleton to assemble the pieces. This adds an insulation layer between the flat file on disk and the data structures in your code, which makes changes, deletions, or additions easy to handle. You just need to modify one of the case (or equivalent) statements to account for the new or changed piece of data.
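To make the insulation idea concrete, here is a simplified stand-in for that kind of case-based dispatch. It is not the skeleton that Visual Parse++ generates; the token values, the Shape structure, and the functions decodeU32 and applyField are all hypothetical. The point it tries to show is that the in-memory structure no longer mirrors the bytes on disk, so supporting a new piece of data (here, a height added in a later revision) means adding one case, while older files that lack it simply leave the default in place.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical token values; a real file format would define its own.
enum FieldToken : std::uint32_t
{
    TokName   = 1,
    TokWidth  = 2,
    TokHeight = 3   // added in a later revision of the format
};

// In-memory structure, independent of the layout on disk.
struct Shape
{
    std::string   name;
    std::uint32_t width  = 0;
    std::uint32_t height = 1;   // default used when an old file has no height
};

// Decode a 32-bit value stored in host byte order; false if the size is wrong.
bool decodeU32(const std::string& data, std::uint32_t& value)
{
    if (data.size() != sizeof value)
        return false;
    std::memcpy(&value, data.data(), sizeof value);
    return true;
}

// One case per kind of data. Returns false for a malformed or unknown piece,
// leaving it to the caller to skip it or report an error.
bool applyField(Shape& shape, std::uint32_t token, const std::string& data)
{
    switch (token)
    {
    case TokName:   shape.name = data;                    return true;
    case TokWidth:  return decodeU32(data, shape.width);
    case TokHeight: return decodeU32(data, shape.height);
    default:        return false;
    }
}
```

In the approach the article describes, the grammar additionally verifies that the pieces arrive in a valid order; this sketch only covers the per-piece assembly step.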

Delta Files

Here is an example illustrating this method. The file format we will use is a delta file: a list of lines, with changes, additions, or deletions appended as new lines at the end of the file. There will be two types of tokens: one to carry the operations and one to carry the line number and data (if any). We will use an expression-like grammar to process the file.

We still need a %expression section, even though we are not using a 'real' lexer. The regular expressions are never used, but the names and aliases are still required by the %production section. Each entry is essentially a placeholder for the name and alias.

Here is what the rule file looks like:

%expression     Main
'a'             Line, 'line';
'a'             Add, '+';
'a'             Delete, '-';
'a'             Replace, '%';

%prec
1, '+', %left;
1, '-', %left;
1, '%', %left;

%production     start
Start           start -> lineExpr;
LineAdd         lineExpr -> lineExpr '+' lineExpr;
LineDelete      lineExpr -> lineExpr '-' lineExpr;
LineReplace     lineExpr -> lineExpr '%' lineExpr;
LineExprLine    lineExpr -> 'line';

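The %prec section is required to remove the shift/reduce conflicts that the grammar will generate. Every operator rule has the form lineExpr, operator, lineExpr, so the grammar is ambiguous on its own; declaring the three operators left-associative at the same precedence level makes the parser group a sequence of operations from left to right.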

This rule file allows you to add, replace, or delete lines. The file 'writer' just writes a list of lines separated by operation tokens, in this case '+', '-', or '%'. Extending this with other operations (insert, modify, etc.) is easy: just add new rules. You can also add things like parentheses, as you would in an expression grammar.
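To show what processing such a file might amount to, here is a minimal sketch that replays a delta token stream by hand, without Visual Parse++. Because all three operators are left-associative at the same precedence, a sequence such as line + line % line - line reduces from left to right, which a simple loop can imitate. The Tok values, the Token and Lines types, and applyDelta are hypothetical, and the meaning given to each operation (the right-hand line's number selects the line to add, delete, or replace) is an assumption rather than something the rule file specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Token kinds mirroring the names in the rule file; the numeric values are made up.
enum class Tok : std::uint32_t { Line = 1, Add = 2, Delete = 3, Replace = 4 };

// A pre-lexed token. Operation tokens carry no line number or text.
struct Token
{
    Tok           kind;
    std::uint32_t number;  // line number (Line tokens only)
    std::string   text;    // line data, if any (Line tokens only)
};

using Lines = std::map<std::uint32_t, std::string>;

// Evaluate "line (op line)*" from left to right, which is the order in which
// a left-associative parse would reduce the rules named in the comments.
Lines applyDelta(const std::vector<Token>& tokens)
{
    if (tokens.empty() || tokens[0].kind != Tok::Line)
        throw std::runtime_error("delta must start with a line");

    Lines lines;
    lines[tokens[0].number] = tokens[0].text;       // LineExprLine

    for (std::size_t i = 1; i < tokens.size(); i += 2)
    {
        if (i + 1 >= tokens.size() || tokens[i + 1].kind != Tok::Line)
            throw std::runtime_error("operation must be followed by a line");
        const Token& line = tokens[i + 1];
        switch (tokens[i].kind)
        {
        case Tok::Add:                               // LineAdd
            lines[line.number] = line.text;
            break;
        case Tok::Delete:                            // LineDelete
            lines.erase(line.number);
            break;
        case Tok::Replace:                           // LineReplace
            if (lines.count(line.number) == 0)
                throw std::runtime_error("cannot replace a missing line");
            lines[line.number] = line.text;
            break;
        default:
            throw std::runtime_error("expected an operation token");
        }
    }
    return lines;
}
```

Note that the token stream itself is never modified: deletions and replacements are just more tokens at the end, and the current state of the lines is recomputed by replaying them.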

What you are really doing is adding some intelligence to the file format, and at the same time making it quite flexible.

The above example could be used in something like a version control system, where you never want to actually delete anything from the file.  Essentially, it maintains a historical audit trail for the file.