Assigned: Monday, June 2 in class
Content last revised: 12 noon Monday, June 2
Due date changed on June 9.
In this assignment you write a DFA-based Lexical Analyser that recognizes some of the basic lexemes used in an Interpreter. You should be able to use much of your code from Assignment Two in this program.
Your Lexical Analyser will be written as a separate compilation unit, to hide its internal details. You will use a small set of public functions to set up the Analyser and interact with it. Only one global variable is allowed (see the instructions). See the course notes for help in designing a separate compilation unit using static functions and data.
Design, write, and thoroughly test a C Language module that functions as a Lexical Analyser (Scanner). Write a driver program that calls your Scanner repeatedly, returning and printing each token found by the Scanner in the input stream. Your Scanner has some public and some private functions to enable it to work. (Private functions and data are defined using the static keyword.) Some of these functions are described here:
This function sets up the Scanner. It receives and records in private storage the FILE* pointer of the file opened by the calling program. It does any other initialization needed to begin the scanning of the input stream, such as setting up dynamic memory buffers.
This function reads the open file pointer and identifies a single token and the line number on which it was found. The token is a lexeme and its classification. The lexeme can be several different things, so it is stored as a C Language union. It is either empty, a character string in dynamic memory (char*), or an int or double numeric constant. The classification is an enum type denoting the type of the lexeme being returned. The enum has the following values, indicating what was found by the Scanner:
T_INT | An unsigned integer constant, e.g. 0, 123, 32767 |
T_ID | An identifier, as specified for previous assignments. |
T_PLUS | The plus sign: + |
T_MINUS | The minus sign: - |
T_MUL | The multiplication asterisk: * |
T_DIV | The division slash: / |
T_EQU | The assignment equals sign: = |
T_SEMI | The semicolon: ; |
T_EOF | A special token that indicates that the Scanner has reached end-of-file. |
T_ERR | Anything else. The union contains more details on the specific kind of error. |
The Scanner places the lexeme, its classification type, and its line number in a global structure named lex_token. (The Scanner may also return a pointer to this structure, to facilitate coding, if desired.)
The Scanner leaves the input stream positioned ready to read the next lexme. A calling program will make repeated calls to lex_gettoken() to read successive tokens from an open input file.
Call this function to close down the Scanner and release any memory resources it may have allocated. (Note that the open input file is not closed here, since it was not opened by the Scanner. Input files are the responsibility of the calling program [e.g. main()].)
The Scanner manages its own memory, including any memory allocated to or in the global lex_token structure. Functions outside the Scanner module must not modify the contents of the global token structure, and they must take care not to obtain pointers to storage that belongs to the Scanner module. (The Scanner module may eventually free or modify any storage it controls.)
This means that functions outside the Scanner module must copy any strings that they need to keep.
The Scanner module itself prints nothing except those error messages that cannot reasonably be returned as a T_ERR token to the calling function. Write a driver routine that opens an input file, initializes the Scanner, and repeatedly calls the Scanner to read all the tokens in the input file. Print the contents of the global token structure for each token found. When end-of-file is reached, tidy up and exit.
Example: The output should look something like this, only nicer:
Name: Ian Allen Input file: Assignment Three Data-1.txt Line Type Lexeme ------------------- 36 T_ID area 36 T_EQU = 36 T_ID PI 36 T_MUL * 36 T_ID radius 36 T_MUL * 36 T_ID radius 36 T_SEMI ; 37 T_ID print 37 T_ID area 37 T_SEMI ; 38 T_ERR @$%&%$@ 39 T_EOF
The deliverables for your assignment are due in the Ian Allen assignment box before the due date.
The aggregate of all assignment marks comprises 20% of
your final mark.
All assignments must be completed satisfactorily to get
credit for the course.
Please review the C Programming Style guidelines. Assignment source code is marked for readability. You earn marks by making the source clear, simple, and concise. You lose marks if I have to puzzle over what a piece of code does.
Late assignments are handled according to the policy given in the course outline.
Inside every big program is a little program struggling to get out.
See the course notes for programming help for the Scanner.
See also the code for a working sample DFA program.