http://www.cs.utexas.edu/users/mcguire/software/fbmodule/
Use Flex and Bison to create Python modules for parsing strings.
22 Mar 2002 - Releasing version 2.1. Cleans up push_position in FlexModule.h based on experience with APC and tries to handle some weird error rule behavior by Bison.
28 Jan 2002 - Releasing version 2.0. Major upgrade, with a complete rewrite of FlexModule.h. It no longer uses C++ and has several new features, including better position handling and inserting sub-files into the token stream (à la C's #include). BisonModule.h has much better error handling, using FM's positions.
| Some people like using Flex and Bison. (Yes, I pity them, too. (Actually, fooling with Flex is kind of fun, though.)) |
| Flex and Bison are both reasonably standard, reasonably popular, and pretty well documented. (See O'Reilly's lex & yacc by John R. Levine, Tony Mason and Doug Brown (preferably the second edition, which is much improved over the first) as well as the Flex and Bison docs.) |
| Flex and Bison are pretty fast (in their own way, of course). If more speed is needed when using both, it should be possible to merge the two modules into a single C module without changing any of the scanning or grammar rules. |
| Flex and Bison are flexible. If the speed of the Python program is not sufficient, you'll have debugged grammar rules which can be used to rewrite the program entirely in C. |
This software has been tested with:
| Flex 2.5.4 |
| Bison 1.28 |
| gcc 2.95.2 and 3.0.3 |
| Python 1.5.2 and 2.0 |
under Linux. BisonModule uses GCC's variable argument macros, if nothing else, and therefore probably won't work with another C compiler. FlexModule uses Flex's buffer calls and will not work with lex.
BisonModule.h
| C header file which should be included by a Bison grammar specification. |
FlexModule.h
| Similarly, a C header file used by a Flex scanner specification. |
C++FlexModule.h
| For reference, a copy of the original C++ FlexModule. It does not follow the interface below. |
Makefile.pre.in
| Modified version of Makefile.pre.in from Python 1.5.1; changed to add rules to run Flex and Bison to produce .c files. |
Setup.in
| Template for a Setup.in file. |
Symbols.py
| Sample Python code for Symbol (as in a non-terminal Bison grammar symbol) and Token classes (a subclass of Symbol, for terminal Flex symbols). |
example/hoc2
| Example based on hoc from The UNIX Programming Environment. |
Copy BisonModule.h, FlexModule.h, Symbols.py, and Makefile.pre.in to the directory where you will build the modules. Create a Setup.in and type
make -f Makefile.pre.in boot
make
The two .h files are used similarly, by including them into the Bison or Flex input and adding some declarations and a macro call at the end.
The actual Flex rules for tokens are pretty much as normal for Flex; each rule just returns a unique token type integer. For rules which do not return, like whitespace (spaces, tabs, newlines, etc.) and comments, call the macro ADVANCE:
ws [ \t\n]+
...
{ws} { ADVANCE; /* and skip */ }
To insert a sub-file into the token stream, use the PUSH_FILE macros:
str \"([^"\n]|\\.)*\"
...
"input "{str} { PUSH_FILE_YYTEXT(7,yyleng-1); }
for things like 'input "file"'. PUSH_FILE_YYTEXT's arguments are slice-like offsets into yytext; the 7 is the length of 'input "' and the yyleng-1 is the index of the last quote mark.
A PUSH_FILE_STRING macro takes a zero-terminated file name argument. Both of these call ADVANCE to update the position.
At the end of the Flex input file, add two things: An array of TokenValues, which provide a map between strings and the numerical token types, and a call to the FLEXMODULEINIT macro. For example:
...
%%
TokenValues module_tokens[] = {
{"NUMBER", NUMBER},
{"VAR", VAR},
{0,0}
};
FLEXMODULEINIT(hoclexer, module_tokens);
where NUMBER and VAR are constants and "hoclexer" is the name of the module.
Once imported, the module provides the functions:
onstring(maketoken, string)
| begin scanning string |
onfile(maketoken, file)
| begin scanning a file (name or object) |
readtoken()
| read the next token, returning a pair consisting of the token value and the object returned by maketoken. |
lasttoken()
| re-call maketoken on the last token |
close()
| free resources and stop scanning |
and the dictionaries:
names
| a map between numeric types and the string names of the tokens. This is created from the TokenValues array. |
types
| a map between string names and numeric types, also from TokenValues. |
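As a rough sketch of how these pieces might fit together from the Python side, the loop below drains a scanner and labels each token through names. It assumes that readtoken() signals end of input with a token type of 0 (the usual Flex convention, which this page does not confirm). The StubLexer class is a hypothetical stand-in with invented token numbers, included only so the example runs without a compiled module; with a real build one would pass the imported module (for instance hoclexer) instead.

```python
def scan_all(lexer, maketoken, text):
    """Collect (name, token object) pairs for every token in text."""
    lexer.onstring(maketoken, text)
    tokens = []
    try:
        while True:
            token_type, token = lexer.readtoken()
            if token_type == 0:  # ASSUMPTION: 0 marks end of input
                break
            tokens.append((lexer.names[token_type], token))
    finally:
        lexer.close()  # free scanner resources even if scanning fails
    return tokens


class StubLexer:
    """Hypothetical stand-in for a generated module such as hoclexer."""
    types = {"NUMBER": 258, "VAR": 259}  # invented token numbers
    names = {258: "NUMBER", 259: "VAR"}

    def onstring(self, maketoken, string):
        self._maketoken = maketoken
        self._words = string.split()

    def readtoken(self):
        if not self._words:
            return (0, None)
        word = self._words.pop(0)
        kind = "NUMBER" if word.isdigit() else "VAR"
        # Stub position only; a real module fills this in itself.
        position = ((1, 0), (1, len(word)), "<string>", [])
        token_type = self.types[kind]
        return (token_type, self._maketoken(token_type, word, position))

    def close(self):
        self._words = []


def maketoken(token_type, text, position):
    # Simplest possible token object: a (type, text) pair.
    return (token_type, text)


result = scan_all(StubLexer(), maketoken, "x 42")
```

Wrapping the loop in try/finally is a design choice of this sketch, so that close() runs even when maketoken raises.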
maketoken should be a function with three parameters:
| the type of the token, an integer |
| the text of the token, a string |
| the position of the token |
A position is a tuple of
| a pair with the beginning line and column |
| a pair with the ending line and column |
| the filename |
| a list of tuples, giving the file name, line, and column of stacked, yet-to-be-finished positions. The list does not include the current position. |
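A maketoken callback matching this description might look like the sketch below. The Token class here is a hypothetical, minimal one (the class distributed in Symbols.py may differ), and the sample token number, file names, and position are invented for illustration. It also assumes each stacked entry is a (file name, line, column) tuple, as the description above suggests.

```python
class Token:
    """Minimal token record; not the class shipped in Symbols.py."""

    def __init__(self, token_type, text, position):
        # Unpack the four-part position tuple described above.
        (self.begin_line, self.begin_col), (self.end_line, self.end_col), \
            self.filename, self.stack = position
        self.type = token_type
        self.text = text

    def where(self):
        """Format the start position, followed by any stacked positions."""
        here = "%s:%d:%d" % (self.filename, self.begin_line, self.begin_col)
        outer = ["%s:%d:%d" % tuple(entry) for entry in self.stack]
        return " included from ".join([here] + outer)


def maketoken(token_type, text, position):
    """Callback with the three parameters the scanner module passes."""
    return Token(token_type, text, position)


# Invented sample: a token in sub.hoc, which was pushed from main.hoc.
tok = maketoken(258, "42", ((3, 4), (3, 6), "sub.hoc", [("main.hoc", 10, 0)]))
location = tok.where()
```

Keeping the stacked positions on the token lets error messages report where a sub-file was pulled in, not only where the token sits.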
Each call to the readtoken and makesymbol functions below should return a pair (like the readtoken function described above) of the integer symbol type and the symbol object. The only requirements of the object are the methods "append" (for use by the REDUCELEFT macro) and "insert" (for use by the REDUCERIGHT macro). These objects are passed around by the rules to build a parse tree, where each non-terminal symbol has a list of child symbols.
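The object requirement can be met by a very small class. The sketch below is a hypothetical Symbol (the one in Symbols.py may look different) and assumes list-like signatures, append(child) and insert(index, child). How REDUCELEFT and REDUCERIGHT actually invoke these methods is defined in BisonModule.h and is not shown here; the symbol type 300 and the string children are invented placeholders.

```python
class Symbol:
    """Parse tree node holding a symbol type and a list of children."""

    def __init__(self, symbol_type, children=None):
        self.type = symbol_type
        self.children = list(children or [])

    def append(self, child):
        # ASSUMPTION: list-like signature, adds a child at the end.
        self.children.append(child)

    def insert(self, index, child):
        # ASSUMPTION: list-like signature, adds a child at the given index.
        self.children.insert(index, child)


# Growing an existing node at the end, as a left-recursive rule might.
node = Symbol(300, ["a"])
node.append("b")

# Growing the same node at the front, as a right-recursive rule might.
node.insert(0, "z")
```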
The grammar file read by Bison is fairly normal, but the code associated with the rules should be fairly limited:
| The start rule of the grammar should, when reduced, call the RETURNTREE macro with the value of the top of the tree. The argument of RETURNTREE is the symbol object that represents the top of the parse tree. For example: |