Frequently Asked Questions

This page last updated: Sunday September 27, 1998 01:07

Here are some clarifications on the assignments, based on questions I've received so far:

What language are we Parsing?

The language we are parsing is the language defined by the grammar. It is not the C programming language, though it may look similar to C in many respects.  The grammar may differ from assignment to assignment. The grammar for Assignment 4 is Program Grammar #1.

What exactly is an identifier? Can it start with an underscore?

Identifiers are defined according to the rules for C language. Can a C language identifier start with an underscore? Look it up.

Does an identifier have to be separated from a number by a blank?

In C language, yes. In our language, no.

The string "123abc" contains the identifier abc preceded by the number 123. The string "abc123" is the identifier abc123.

What do I do with blanks and whitespace?

Leading whitespace - whitespace occurring when your Scanner is in its Start state - should be discarded by the Scanner.

What you do with whitespace when your Scanner is in another state than the Start state depends on what kind of token you're collecting.  In many cases, the whitespace may end the token.  In other cases, the whitespace may become part of the lexeme for the token.

What do I do about "junk"?

If your Scanner is called and can't make a recognized token out of the next lexeme in the input, you have some "junk" that needs to be handled.

One way of dealing with "junk" is to collect as many junk characters as you can into a token of type "Junk" and return that junk token to the Parser, letting the Parser discard it, print a message, and try to recover.

Another way to handle the unrecognized input is to have the Scanner start skipping characters until it can make a token out of something, printing an error message telling the user what was skipped.

How do I recognize many different tokens in the Scanner?

Start by designing a transition diagram that recognizes all the tokens, then convert the transition diagram to a Next State table and copy it into your Scanner.

See how the text (Section 3.4) builds up a more complex transition diagram from smaller ones in the two Figures 3.11 (p.100) and 3.12.

Each of the component transition diagrams ends in its own accepting state. Each accepting state will indicate the recognition of some token (or possibly more than one token) in the language.

Unlike the text, which uses several different start states in Figures 3.11 to 3.14, our Scanner needs only one start state.

We do not need to have different start states because we are only recognizing a few simple tokens, and each token is sufficiently distinct that our Scanner never needs to backtrack and try a different transition diagram with a different start state.

(The book shows a more general scanner that might try several different transition diagrams, each with its own different start state. We don't need that extra complexity.)

All of our component transition diagrams will have the same start state.

Start States vs. Start Symbol

The start state of your lexical analyser has nothing to do with the start symbol of a grammar for the Parser. Lexical analysis knows nothing about the grammar of a language.

Don't get confused trying to draw a transition diagram for your grammar. Transition diagrams are for recognizing lexemes and classifying them into tokens. See Section 3.4 in the text.

The structure of the grammar is handled by the Parser. The Parser recognizes the grammar based on the recursive-descent parsing functions that you write and put in your parser.c source file.

Where do I start?

Follow the step-by-step directions in the assignment. Get your Scanner working first, using the Assignment 3 driver, then write the Parser to drive your new Scanner.

How do I know it works?

Run the big a4test.txt file through your Assignment 4 program. The output should look like the supplied output.txt file.

Do I have to do any more testing if it works with a4test.txt?

Yes.

Test that all of your syntax error handling works by trying input that has syntax errors in it. Try to invent test input that causes every one of your syntax error messages to print. Try an empty file, a file with a single token, a file with two tokens, a file containing just a semicolon, etc.

Document what you tried, and hand that in too. Convince me that you have exercised every error message in your program by trying different input files.

p.s. Look at your test output and verify that your program is working; don't just hand in the output. I received several assignments that showed incorrect test result output.