Ian D. Allen
Rideau B215A
idallen@freenet.carleton.ca
Focused on Your Career
Customized Training
Workplace Skills
Quality Instruction
B.A.
Honours Psychology
University of Waterloo
1980
MMath
Computer Science
University of Waterloo
1985
Ian D. Allen
idallen@freenet.carleton.ca
Rideau B215A
Telephone (727-4723) x5949
Office hours
see my office door
Notes and Course Outlines
online:
http://idallen.com/teaching/
http://www.algonquincollege.com/cst/8152/
in library, if needed
Your Student Number
marks are grouped by number
Your Course Number
CST8152 section 010 - Ian Allen
Course Times
Lectures: three hours / week
010 Mon 10:00-10:50 RE201
Wed 10:00-10:50 RE201
Fri 08:00-08:50 RE201
Labs: one hour / week
011 Thu 16:00-16:50 RC207
012 Wed 11:00-11:50 RC207
013 Wed 12:00-12:50 RC207
course outlines are available online
please review academic policies
withdrawing
repeating courses
probation
academic discipline
Lectures
3 hours Class
1 hour Lab (mandatory)
Evaluation / Earning Credit
40% in-class tests & quizzes
40% final examination
20% assignments
Assignment Late Penalty
up to one week ... 20%
after one week ... 100%
A: 80-100
An outstanding level of achievement has been consistently demonstrated.
B: 70-79
Achievement has exceeded the required level.
C: 60-69
Course requirements are met.
D: 50-59
Achievement is at a marginal level. Consistent, ongoing effort is required for continuing success in the program.
Midterm #1
in class
Monday, June 9, 1997
Midterm #2
in class
Monday, July 7, 1997
Statutory Holidays
three
Final Exam Period
in auditorium
August 16 - 22 (inclusive)
Do the assignments
understand the theory
apply the theory
practice
Attend Lectures and Labs
remember your diskettes
back up your diskettes!
be on time
ask questions
Learn the tools
memorization isn't enough
There are no dumb questions
ask
ask
ask
Stop me if I go too fast
assist me in knowing your learning styles
Do your homework
use class time to ask questions
Be here
Lab attendance is mandatory
Be on time
use class time well
Listen
to me
to questions (and answers!)
Focus
don't distract me or others
don't distract yourself
Diskettes
3 ½ inch
high density (HD)
disks for assignments
spare disks for back-up
You
sign in
please be on time
attendance is mandatory
missed labs result in no course credit: see the course outline
Account Access Review
bring diskettes to Lab
verify account / password
start Borland C
test save to disk
test printing
test email
C Programming Review
review Algonquin Standard
prepare Standard Header
Algonquin Standard Header
purpose
history
inputs
from any source
outputs
to any destination
algorithm
high level description
not necessarily full pseudocode
accurate
write it once, and only once
modifications require one change
less code is better code
only -1, 0, and 1 as constants
Boolean != NULL != 0 != \0
check and validate input (including excess)
don't modify function arguments
fopen/fclose at same level
check function return codes
read input in only one place
avoid global variables
print really good error messages
get the code right first, then optimize
Compiler:
Translates your language to the target language.
Interpreter:
Creates a virtual machine on the target system that understands your language directly.
Compiled code runs faster than interpreted code, since direct execution of the target language is faster than the indirect emulation of the virtual machine interpreter.
Text: Fig. 1.3, 1.10
Text: Sect. 1.1, 1.2, 1.3; Fig. 1.3, 1.9
preprocessor
=> pure source program
compiler
=> target assembly program
assembler
=> relocatable machine code
loader / link editor
=> absolute machine code
Front End: Analysis
Lexical
Syntactic
Semantic
Back End: Code Generation
Intermediate representations
Optimization
Actual target machine code generation
Lexical Analysis (Scanning)
read characters and group them into lexemes in the language; categorize the lexemes into tokens
Syntactical Analysis (Parsing)
read tokens and check the nested, hierarchical ordering of the tokens
recognize/build trees of tokens that belong to the language grammar
Semantic Analysis
given a grammatically correct parse tree, what does each part mean?
The smallest element/unit/word recognized in the compiled or interpreted language is called a lexeme
e.g. getc } { "abc" 1.2 * ++ ,
Each lexeme is recognized by the Scanner to be in some category; that category is returned as the token associated with the lexeme
e.g. ( IDENTIFIER, "getc" )
The Parser recognizes/builds a hierarchical parse tree of tokens; the parse tree expresses the grammar of the language
The scanner is a tokenizer: it reads your program and recognizes lexemes
It categorizes the lexemes by token type and returns a (token,lexeme) pair to the syntax analyser
e.g. 12.57 : the token is number,
the lexeme is the value 12.57
e.g. My_Var : the token is identifier,
the lexeme is the string "My_Var"
The scanner is self-contained: it doesn't know anything about the syntax or grammar of the language
The analysis is usually non-recursive
A Parser: reads (token,lexeme) pairs from the Scanner and recognizes the tokens in the grammar of the language
builds a parse tree, corresponding to the grammar, for the semantic analyser
determines the structure of the language
doesn't know about the characters that make up the language (Scanner)
doesn't know anything about the meaning of the grammar (Semantics)
usually recursive, e.g. to handle nesting
parentheses: S := ( expr )
blocks: BEGIN, END
Parsing: a decomposition of the scanned tokens in an input stream
(a sentence in the language) into its component parts
Parse Tree
"A hierarchical representation of a sentence, starting at the root and working down toward the leaves."
parsing is based on scanned tokens in the language, not on lexemes
if the parsing succeeds, the input stream is syntactically correct
The semantic phase of compilation walks the parse tree built by the Parser and recognizes semantic constructs
e.g. type checking
int array[10];
array[2.34] = 0;
e.g. uniqueness checking
int array[10];
int array[20];
e.g. flow-of-control checking
main(){
break;
}
Lexical
1abc
2.4efg
get$value
Syntactic
a = 1 + ;
printf("Hello\n" ;
int ;
Semantic
see previous slide
/* This program is syntactically
* correct. Are you surprised?
*/
main(){
int a,b,c;
1;
2;
3;
1+1;
a;
b;
c;
a+b;
1 || a;
b && c;
1 + a / 2 * b + c % 5;
}
Intermediate Code
a program for an abstract machine
easy to produce
easy to further translate into the exact target language
might be directly interpretable
Code Optimization
improve the intermediate code
Code Generation
relocatable machine, or assembler
Text chapter 3
Regular Expressions 3.3
Finite State Machines
DFA
Transition Diagrams 3.4
Transition Tables
NextState() function
programming a DFA
abc
concatenation ("followed by")
a | b | c
alternation ("or")
*
zero or more occurrences
+
one or more occurrences
?
zero or one occurrence
FSM:
a finite set of states
a set of transitions (edges)
a start state
a set of final states
Deterministic Finite Automata
all outgoing edges are labelled with an input character
no two edges leaving a given state have the same label
Therefore: the next state can be determined uniquely, given the current state and the current input character
We need the FSM structure:
a finite set of states
a set of transitions (edges)
a start state
a set of accepting (final) states
We need the DFA criteria:
given a particular state and an input character, the next state is unique
We code it
Both a next state and a current state variable are used. This gives the action code three pieces of information to work with: current state, next state, and the character just read.
The semantic actions go between the call to the NextState() function and the setting of the current state variable.
The character ch is read and end-of-file is tested after testing for an accepting state. We want the loop to end as soon as the DFA accepts.
Since the loop may end without being in an accepting state, code at the end of the loop must tidy up and decide what to return if that happens.
typedef enum { S_START, S_ACCEPT, S_ERROR, ... } State;
State current, next;
...
current = S_START;
while( current != S_ACCEPT && (ch=fgetc(fp)) != EOF ){
next = NextState(current,ch);
if( next == S_ERROR ){
/* handle the failure */
}
/* remaining semantic actions go here */
current = next;
}
switch( current ){
case S_START:
case S_ACCEPT:
case S_ERROR:
default:
}
next = NextState(current,ch);
Given the current state of the DFA, current, and the current input character, ch, return the next state.
Determines the next state by one of:
indexing a Transition Table
switch statements
some if statements
/* Use a transition table to get the next state:
*/
State
NextState(State current, int ch){
return Table[current][Column(ch)];
}
/* Turn a character into a column index:
*/
int
Column(int ch){
if( isascii(ch) && isdigit(ch) ) return 0;
if( isascii(ch) && isalpha(ch) ) return 1;
switch( ch ){
case '+': return 2;
case '-': return 3;
case EOF: return 4; /* needs special handling */
default: return 5;
}
}
/*
* The defined token types for this sample grammar.
* These are recognized and returned by the scanner.
*/
typedef enum {
T_SEMI, T_EQUAL, T_EOF, T_ERR,
T_IDENT, T_STRING, T_INTEGER,
T_PLUS, T_MINUS, T_MUL, T_DIV,
} TypeList;
/*
* The structure holding the (lexeme,value) pair.
* A union is used to handle different lexeme types.
*/
typedef struct {
TypeList type; /* enum of types */
union {
char *string; /* for strings */
long integer; /* for integers */
/* other types go here as needed */
} value;
} TokenType;
/*
* The global current token, set by scanner().
* scanner() may return a pointer to this global
* for coding convenience. This storage is declared
* in either main() or in the scanner (never in a
* header file!).
*/
TokenType lex_token; /* GLOBAL */
/* This is one complete file of C source.
* It illustrates the use of the "static" keyword
* to prevent local module data structures and
* functions from being visible outside this file.
*/
static int x; /* data global to this file only */
static int local_func(int j); /* forward declaration for use below */
/* A global function to initialize the hidden
* data structure from outside the module:
*/
void
initialize(int j){
x = j;
}
/* A global function to return the value of
* the hidden data structure in the module:
*/
int
fetch(){
return local_func(x);
}
/* A local function, invisible outside this file,
* to perform support operations in the module:
*/
static int
local_func(int j){
return j * 2;
}
/*
* The scanner converts the lexeme to the correct
* type, inserts the value and the type into the
* global structure, then returns a pointer to the
* global structure (for coding convenience).
*/
TokenType lex_token; /* global token structure */
extern TokenType *scanner();
TokenType *scanner(){
extern TokenType lex_token; /* global token */
/* . . . */
lex_token.type = T_STRING; /* return char* */
lex_token.value.string = mystring;
return &lex_token;
/* . . . */
lex_token.type = T_IDENT; /* IDs are char* */
lex_token.value.string = mystring;
return &lex_token;
/* . . . */
/* convert ASCII digits to a real integer */
status = convert_to_integer(lexbuf,&myint);
if( status != GOOD_INT ) do_error(. . .);
lex_token.type = T_INTEGER; /* return integer */
lex_token.value.integer = myint;
return &lex_token;
A simple, recursive Toy Grammar for an assignment statement, consisting of four productions:
<assignment> -->
ID = <expression> ;
<expression> -->
<term> ( (+|-) <term> )*
<term> -->
<factor> ( (*|/) <factor> )*
<factor> -->
ID | CONST | ( <expression> )
The grammar has ten terminals (recognized token types):
ID, CONST, (, ), +, -, *, /, ;, =
Four non-terminals (each of which will be coded as a separate function in a top-down, recursive-descent, predictive parser):
<assignment>, <expression>, <term>, <factor>
Plus and minus tokens separate the terms of the expression.
Multiply and divide tokens separate the factors of the terms.
The current input token uniquely determines which of the four productions must be applied, with no ambiguity. No backtracking will be needed to parse sentences in this language.
Suitable for top-down, recursive-descent, predictive parsing (text p.41): Start with the root non-terminal (<assignment>) and work down toward the terminals, the leaves of the grammar parse tree. The expected terminals in the grammar are matched by the parser against the actual token types returned by the scanner.
The input is correct if the parsing succeeds without any errors.
A predictive parser can be built directly from the productions in the grammar. Non-terminal symbols become function calls; terminal symbols are matched directly against tokens according to the token type returned by the Scanner. To simplify coding and avoid having to pass around a pointer to the current token type, the returned token type is kept in a global variable.
The scanner is always called after a match of a terminal token, to keep the global tokentype always pointing to the next token. This is called the look-ahead token. The very first look-ahead token has to be read by an initial call to scanner() before any of the grammar's parsing functions are called. This is done first thing in the Parser, or in main() before calling the Parser.
A function in the Parser may discover that a look-ahead token fails to match an expected symbol in the Grammar. If the match is mandatory, this failure is a syntax error; the Parser must issue an error message and either exit or try to recover. However, if the match is optional, the parsing function simply leaves the look-ahead token alone; some other part of the Parser will match it later (or will issue a syntax error message).
Note: We start with just parsing the input. No actions will be taken on anything we match; we simply match it and move on. (This is like having a DFA inside a scanner that doesn't have any actions in it to save characters.) Thus, no output will be generated by the parsing process unless the parsing encounters an error.
A small example translating a simple two-production grammar into two recursive-descent parsing functions. Terminal symbols are matched directly; the two non-terminals turn into function calls. This grammar is similar to the first and last productions of the Toy Grammar:
<stmt> --> ID = <item> ;
<item> --> ID | CONST
stmt(){
if( tokentype != ID )
error("Missing identifier");
scanner(); /* get the next lookahead token */
if( tokentype != EQUALS )
error("Missing =");
scanner(); /* get the next lookahead token */
item(); /* non-terminal is function call */
if( tokentype != SEMICOLON )
error("Missing ;");
scanner(); /* get the next lookahead token */
}
item(){
switch( tokentype ){
case ID:
case CONST:
scanner(); /* get next lookahead token */
break;
default:
error("Missing identifier or constant");
}
}
This is the first of the four functions that parse the Toy Grammar. It handles the start non-terminal, which forms the root of the Parse Tree for this Grammar. All other productions reduce to this one.
<assignment> --> ID = <expression> ;
assignment(){
if( tokentype != ID )
error("Missing identifier");
scanner(); /* get the next lookahead token */
if( tokentype != EQUALS )
error("Missing =");
scanner(); /* get the next lookahead token */
expression(); /* non-terminal is function call */
if( tokentype != SEMICOLON )
error("Missing ;");
scanner(); /* get the next lookahead token */
}
Here are two more of the four functions needed to parse the Toy Grammar. Note the use of a while() loop to match possibly repeated (optional) elements in a grammar production, based on whether the look ahead token indicates that a repeated element is present.
The Scanner is called to re-load the look-ahead token only if a match occurs. If the token isn't matched, we assume some other part of the Grammar must match it, so the Scanner is not called on a mis-match.
<expression> --> <term> ( (+|-) <term> )*
expression(){
term(); /* non-terminal is function call */
while( tokentype == PLUS || tokentype == MINUS ){
scanner(); /* get next lookahead token */
term(); /* non-terminal is function call */
}
}
<term> --> <factor> ( (*|/) <factor> )*
term(){
factor(); /* non-terminal is function call */
while( tokentype == MULT || tokentype == DIV ){
scanner(); /* get next lookahead token */
factor(); /* non-terminal is function call */
}
}
Here is the last of the four functions of the Toy Grammar. This grammar rule has several alternatives. A switch() statement selects the right alternative to parse based on what the current look ahead token is.
Remember, no actions are yet added to these functions. All they do is recognize the input and move on; they dont do anything with it yet.
<factor> --> ID | CONST | ( <expression> )
factor(){
switch( tokentype ){
case ID:
scanner(); /* get next lookahead token */
break;
case CONST:
scanner(); /* get next lookahead token */
break;
case LEFTPAREN:
scanner(); /* get next lookahead token */
expression();
if( tokentype != RIGHTPAREN )
error("Missing )");
scanner(); /* get next lookahead token */
break;
default:
error("Missing ID, CONST, or (");
}
}
See Aho section 4.1
A simple panic-mode error handling system requires that we return to the top of the parser when any error is detected. At the top of the parser, we skip forward until we find a semi-colon (;), then resume parsing.
To return to the top of the parser when an error is detected in one of the parsing functions, we need to add error-handling code to all the functions.
Each parsing function may succeed (in which case we continue parsing) or fail (in which case we stop parsing and return). For example:
Boolean expression(){
if( ! term() )
return FALSE; /* failure */
while( tokentype == PLUS || tokentype == MINUS ){
scanner();
if( ! term() )
return FALSE; /* failure */
}
return TRUE; /* parsing succeeded so far */
}
Boolean term(){
if( ! factor() )
return FALSE; /* failure */
while( tokentype == MULT || tokentype == DIV ){
scanner();
if( ! factor() )
return FALSE; /* failure */
}
return TRUE; /* parsing succeeded so far */
}
To complete the panic-mode error recovery, the topmost parsing function must detect the failure to parse and skip forward until a semi-colon is found. Once the semi-colon is found, one more token of look ahead is read; this prepares the parser to resume parsing after the statement with the syntax error. The code for this looks something like this:
parser(){
scanner(); /* load first look-ahead token */
while( tokentype != T_EOF ){
switch( tokentype ){
case ID:
if( ! assignment() )
panic(); /* recover */
break;
/* other statement types will go here */
default:
errprint("Unknown statement type");
panic(); /* recover */
}
}
}
/* Skip forward to EOF or to after next semi-colon */
panic(){
while( tokentype != T_SEMI
&& tokentype != T_EOF)
scanner();
if( tokentype == T_SEMI )
scanner(); /* skip the ; */
}
Now that we have a parser that can recognize assignment statements, we need to add semantic actions to the parser so that useful work is done. These actions are specified by embedding semantic action symbols in the grammar productions, resulting in a set of rules called a translation scheme (Aho p.37-40: Translation Schemes, Emitting a Translation). First, look at the simple grammar without any action symbols:
<stmt> --> ID = <item> ;
<item> --> ID | CONST
Now, turn it into a translation scheme, with the semantic action symbols embedded in the grammar productions in the correct places:
<stmt> --> ID {enter} = <item> {pop;copy} ;
<item> --> ID {lookup;push} | CONST {eval;push}
Here are what each of the action symbols mean:
{enter}: enter the ID into the symbol table if it isn't already there
{pop}: pop the top value off the value stack
{copy}: copy the value to the symbol table at the location of ID
{lookup}: find this ID in the symbol table; get its current value
{push}: push the current value on the value stack
{eval}: evaluate the numeric constant
The action symbols are placed in the exact places we want the actions to occur. We map these locations to the corresponding locations in the code for the recursive-descent parsing functions, and we insert code that performs these actions. The value stack is used by the parser to hold values of expressions (including constants and identifiers).
Here is one of the augmented grammar productions that make up the translation scheme, with the semantic action symbols inserted as before:
<stmt> --> ID {enter} = <item> {pop;copy} ;
The code for the recursive-descent parsing function that implements the semantic actions for this production looks something like this:
Boolean stmt(){
if( tokentype != ID ){
errprint("Missing identifier");
return FALSE;
}
/* {enter} action goes here: Call a function that
 * returns a pointer to the identifier in the
 * symbol table. If not found there, enter it.
 */
. . . usual code to match EQUALS goes here . . .
if( ! item() )
return FALSE;
/* {pop;copy} action goes here: Pop the top value
 * off the value stack and copy it into the
 * symbol table at the location of the identifier
 */
if( tokentype != SEMICOLON ){
errprint("Missing ;");
return FALSE;
}
scanner();
return TRUE; /* parsing succeeded */
}
An interpreter doesn't generate code. It performs semantic actions as it parses the input. For expressions, such as those used in the Toy Grammar, it uses a value stack to hold values and their types.
The value stack stores intermediate results of parsing for later use in a grammar rule. Here's an example using a simple translation scheme with semantic actions inserted in their proper places:
<stmt> --> ID {enter} = <plus> {pop;copy} ;
<plus> --> <item> + <item> {pop;pop;add;push}
<item> --> ID {lookup;push} | CONST {eval;push}
Look at how the semantic actions use the value stack:
{lookup;push}: When the parsing function implementing item() recognizes an ID token, it looks up the value and type of the identifier in the symbol table and pushes the current value and type of that identifier onto the value stack. (An undefined identifier would be a run-time error.)
{eval;push}: When the parsing function implementing item() recognizes a constant, it evaluates the constant and pushes the value and type of the constant onto the value stack.
{pop;pop;add;push}: The parsing function implementing plus() parses two items. Each call to item() pushes a value on the stack. After both items have been pushed, plus() pops each one off, adds them together, and pushes the result back on the stack. In more complex grammars, a separate execute(operator) function does the arithmetic.
{pop;copy}: When the plus() function returns control to the parsing function implementing stmt(), stmt() pops the stack and copies the top value and type into the symbol table at the location for the identifier located at the start of stmt(). (Type checking would be done here, too.)
{enter}: This semantic action doesn't use the value stack. It looks up the identifier in the symbol table and returns a pointer to its location there. If the identifier doesn't exist, it adds it and returns a pointer to it.
What is a compiler? an interpreter?
How do they differ?
Where does compilation fit in the steps that make up Language Processing?
What is the output of each step of the process?
What is the "front end" of a compiler? the "back end"?
Name and describe the function of each of the three parts of the compiler front end.
What is the output of each of the three parts?
Give C Language examples of errors that would be detected in each part.
Name and describe the basic functions of the compiler back end.
Define these terms: lexeme, token, pattern, deterministic automaton, nondeterministic automaton,
Why is syntax analysis usually recursive, where lexical analysis is not?
What are the meanings of the Regular Expression operators *, +, and ?
Given a description of a set of strings, write a Regular Expression that matches those strings.
e.g. strings of letters ending in a or bcd
e.g. strings of digits containing a 3 before a 5
e.g. unsigned integer constant, unsigned floating constant
Given a Regular Expression, give examples of strings matched by the expression.
What is a Finite State Machine (FSM)?
What restrictions are placed on a FSM to make it a Deterministic Finite Automaton (DFA)?
Given a Regular Expression, write the DFA transition diagram that recognizes it (or vice-versa).
Given a DFA, write its corresponding Transition Table (or vice-versa).
Write a C Language NextState() function that implements the transitions of a given DFA or Transition Table.
Write a while() loop that uses the NextState() function to implement a DFA.
Write a Column() function that classifies characters for indexing a Transition Table.