CST 8152 - Assignment 3

Due 12 noon Monday June 16, 1997

Assigned: Monday, June 2 in class

Content last revised: 12 noon Monday, June 2

Due date changed on June 9.

Purpose

In this assignment you write a DFA-based Lexical Analyser that recognizes some of the basic lexemes used in an Interpreter. You should be able to use much of your code from Assignment Two in this program.

Your Lexical Analyser will be written as a separate compilation unit, to hide its internal details. You will use a small set of public functions to set up the Analyser and interact with it. Only one global variable is allowed (see the instructions). See the course notes for help in designing a separate compilation unit using static functions and data.

Instructions

Design, write, and thoroughly test a C Language module that functions as a Lexical Analyser (Scanner). Write a driver program that calls your Scanner repeatedly, returning and printing each token found by the Scanner in the input stream. Your Scanner has some public and some private functions to enable it to work. (Private functions and data are defined using the static keyword.) Some of these functions are described here:

Public function: `lex_init()`

This function sets up the Scanner. It receives and records in private storage the FILE* pointer of the file opened by the calling program. It does any other initialization needed to begin the scanning of the input stream, such as setting up dynamic memory buffers.

Public function: `lex_gettoken()`

This function reads the open file pointer and identifies a single token and the line number on which it was found. The token is a lexeme and its classification. The lexeme can be several different things, so it is stored as a C Language union. It is either empty, a character string in dynamic memory (char*), or anint or double numericconstant. The classification is an enum type denoting the type of the lexeme being returned. The enumhas the following values, indicating what was found by the Scanner:

`T_INT`	An unsigned integer constant, e.g. `0, 123, 32767`
`T_ID`	An identifier, as specified for previous assignments.
`T_PLUS`	The plus sign: `+`
`T_MINUS`	The minus sign: `-`
`T_MUL`	The multiplication asterisk: *``**
`T_DIV`	The division slash: `/`
`T_EQU`	The assignment equals sign: `=`
`T_SEMI`	The semicolon: `;`
`T_EOF`	A special token that indicates that the Scanner has reached end-of-file.
`T_ERR`	Anything else. The `union` contains more details on the specific kind of error.

The Scanner places the lexeme, its classification type, and its line number in a global structure named lex_token. (The Scanner may also return a pointer to this structure, to facilitate coding, if desired.)

The Scanner leaves the input stream positioned ready to read the next lexme. A calling program will make repeated calls tolex_gettoken() to read successive tokens from an open input file.

Public function: `lex_finished()`

Call this function to close down the Scanner and release any memory resources it may have allocated. (Note that the open input file is not closed here, since it was not opened by the Scanner. Input files are the responsibility of the calling program [e.g. main()].)

Memory Management

The Scanner manages its own memory, including any memory allocated to or in the global lex_tokenstructure. Functions outside the Scanner module must not modify the contents of the global token structure, and they must take care not to obtain pointers to storage that belongs to the Scanner module. (The Scanner module may eventually free or modify any storage it controls.)

This means that functions outside the Scanner module must copy any strings that they need to keep.

Output

The Scanner module itself prints nothing except those error messages that cannot reasonably be returned as a T_ERRtoken to the calling function. Write a driver routine that opens an input file, initializes the Scanner, and repeatedly calls the Scanner to read all the tokens in the input file. Print the contents of the global token structure for each token found. When end-of-file is reached, tidy up and exit.

Include your name in the output.
Write out the file name of the input file entered by the user.
Write out the contents of the global token structure for each token found in the input file.
All the output should go to the same output file.

Example: The output should look something like this, only nicer:

Name: Ian Allen
Input file: Assignment Three Data-1.txt

Line  Type   Lexeme
-------------------
  36  T_ID   area
  36  T_EQU  =
  36  T_ID   PI
  36  T_MUL  *
  36  T_ID   radius
  36  T_MUL  *
  36  T_ID   radius
  36  T_SEMI ;
  37  T_ID   print
  37  T_ID   area
  37  T_SEMI ;
  38  T_ERR  @$%&%$@
  39  T_EOF

Important Notes

Use the robust string prompting routine you wrote in Assignment One to obtain file names.
Use the dynamic memory code from Assignment Two to handle long identifiers.
Use the example DFA skeleton code given. You may only read the input file in one place. Do not spread multiple calls to fgetc() or fgets() throughout your Scanner. See the course notes for the skeleton of a DFA lexical analyser.
Use realloc() to handle strings of any length.
Make sure you detect identifiers that are followed immediately by end-of-file.
Only one global variable. All other data passing must be via function arguments and return values. Consider returning a structure that contains all the needed function return values, rather than using multiple function output arguments passed by reference.
Follow the C Programming Style guidelines carefully.

The deliverables for your assignment are due in the Ian Allen assignment box before the due date.

Deliverables for this assignment

The fully documented source listing of just your lexical analyser (including your own .h files).
The submission must follow the online course submission guidelines.
A description or listing of your input test file(s) (and output, if appropriate) showing how you tested your program. (Do not print out large test files. Describe them instead.) Remember to test empty files.
I will show you how to deposit your assignment source code into a directory on the Algonquin server when the server is ready. All assignment source code will be collected for analysis later in the term.

Evaluation

The aggregate of all assignment marks comprises 20% of your final mark.
All assignments must be completed satisfactorily to get credit for the course.

Please review the C Programming Style guidelines. Assignment source code is marked for readability. You earn marks by making the source clear, simple, and concise. You lose marks if I have to puzzle over what a piece of code does.

Late assignments are handled according to the policy given in the course outline.

“Inside every big program is a little program struggling to get out.”