CST 8152 - PowerPoint Slides

Algonquin College Computer Studies CST 8152 Compilers

Ian D. Allen

Rideau B215A

idallen@freenet.carleton.ca

Welcome to Algonquin College

Focused on Your Career

Customized Training

Workplace Skills

Quality Instruction

Instructor Ian D. Allen

B.A.
    Honours Psychology
    University of Waterloo
    1980

MMath
    Computer Science
    University of Waterloo
    1985

Contact Information

Ian D. Allen

idallen@freenet.carleton.ca

Rideau B215A

Telephone (727-4723) x5949

Office hours

see my office door

Notes and Course Outlines

online:

http://idallen.com/teaching/

http://www.algonquincollege.com/cst/8152/

in library, if needed

Things You Should Know

Your Student Number

marks are grouped by number

Your Course Number

CST8152 section 010 - Ian Allen

Course Times

Lectures: three hours / week

010 Mon 10:00-10:50 RE201

Wed 10:00-10:50 RE201

Fri 08:00-08:50 RE201

Labs: one hour / week

011 Thu 16:00-16:50 RC207

012 Wed 11:00-11:50 RC207

013 Wed 12:00-12:50 RC207

CST 8152 Course Outline

course outlines are available online

please review academic policies

withdrawing

repeating courses

probation

academic discipline

Course Learning Requirements

Lectures

3 hours Class

1 hour Lab (mandatory)

Evaluation / Earning Credit

40% in-class tests & quizzes

40% final examination

20% assignments

Assignment Late Penalty

up to one week ... –20%

after one week ... –100%

Marking Scheme

A:    80-100
    An outstanding level of achievement has been consistently demonstrated.

B:    70-79
    Achievement has exceeded the required level.

C:    60-69
    Course requirements are met.

D:    50-59
    Achievement is at a marginal level. Consistent, ongoing effort is required for continuing success in the program.

CST 8152 Major Academic Events

Midterm #1

in class

Monday, June 9, 1997

Midterm #2

in class

Monday, July 7, 1997

Statutory Holidays

three

Final Exam Period

in auditorium

August 16 - 22 (inclusive)

How to Succeed in this Course

Do the assignments

understand the theory

apply the theory

practice

Attend Lectures and Labs

remember your diskettes

back up your diskettes!

be on time

ask questions

Learn the tools

memorization isn’t enough

Lab and Lecture Dialogue Protocol

There are no dumb questions

ask

ask

ask

Stop me if I go too fast

assist me in knowing your learning styles

Do your homework

use class time to ask questions

Ground Rules

Be here

Lab attendance is mandatory

Be on time

use class time well

Listen

to me

to questions (and answers!)

Focus

don’t distract me or others

don’t distract yourself

Required for Laboratories

Diskettes

3 ½ inch

high density (HD)

disks for assignments

spare disks for back-up

You

sign in

please be on time

attendance is mandatory

missed labs result in no course credit: see the course outline

CST 8152 Assignment 0

Account Access Review

bring diskettes to Lab

verify account / password

start Borland C

test save to disk

test printing

test email

C Programming Review

review Algonquin Standard

prepare Standard Header

C Programming Style - CST 8152

Algonquin Standard Header

purpose

history

inputs

from any source

outputs

to any destination

algorithm

high level description

not necessarily full pseudocode

accurate

C Programming Style

write it once, and only once

modifications require one change

less code is better code

only -1, 0, and 1 as constants

Boolean != NULL != 0 != ‘\0’

check and validate input (including excess)

don’t modify function arguments

fopen/fclose at same level

check function return codes

read input in only one place

avoid global variables

print really good error messages

get the code right first, then optimize

Compilation in Context

Compiler:

Translates your language to the target language.

Interpreter:

Creates a virtual machine on the target system that understands your language directly.

Compiled code runs faster than interpreted code, since direct execution of the target language is faster than the indirect emulation of the virtual machine interpreter.

Text: Fig. 1.3, 1.10

Language Processing

Text: Sect. 1.1, 1.2, 1.3; Fig. 1.3, 1.9

preprocessor

=> pure source program

compiler

=> target assembly program

assembler

=> relocatable machine code

loader / link editor

=> absolute machine code

Compilation: Front End / Back End

Front End: Analysis

Lexical

Syntactic

Semantic

Back End: Code Generation

Intermediate representations

Optimization

Actual target machine code generation

Three Phases of Analysis

Lexical Analysis (Scanning)

read characters and group them into lexemes in the language; categorize the lexemes into tokens

Syntactical Analysis (Parsing)

read tokens and check the nested, hierarchical ordering of the tokens

recognize/build trees of tokens that belong to the language grammar

Semantic Analysis

given a grammatically correct parse tree, what does each part mean?

Lexemes, Tokens, Trees, and Grammar

The smallest element/unit/word recognized in the compiled or interpreted language is called a lexeme

e.g. getc } { "abc" 1.2 * ++ ,

Each lexeme is recognized by the Scanner to be in some category; that category is returned as the token associated with the lexeme

e.g. ( IDENTIFIER, "getc" )

The Parser recognizes/builds a hierarchical parse tree of tokens; the parse tree expresses the grammar of the language

Lexical Analysis: The Scanner

The scanner is a tokenizer: it reads your program and recognizes lexemes

It categorizes the lexemes by token type and returns a (token,lexeme) pair to the syntax analyser

e.g. 12.57 : the token is number,
the lexeme is the value 12.57

e.g. My_Var : the token is identifier,
the lexeme is the string "My_Var"

The scanner is self-contained: it doesn’t know anything about the syntax or grammar of the language

The analysis is usually non-recursive

Syntactic Analysis: The Parser

A Parser: reads (token,lexeme) pairs from the Scanner and recognizes the tokens in the grammar of the language

builds a parse tree, corresponding to the grammar, for the semantic analyser

determines the structure of the language

doesn’t know about the characters that make up the language (Scanner)

doesn’t know anything about the meaning of the grammar (Semantics)

usually recursive, e.g. to handle nesting

parentheses: S := ‘(’ expr ‘)’

blocks: BEGIN, END

Syntactic Analysis: The Parse Tree

Parsing: a decomposition of the scanned tokens in an input stream
(a sentence in the language) into its component parts

Parse Tree

"A hierarchical representation of a sentence, starting at the root and working down toward the leaves."

parsing is based on scanned tokens in the language, not on lexemes

if the parsing succeeds, the input stream is syntactically correct

Semantic Analysis: Finding Meaning

The semantic phase of compilation walks the parse tree built by the Parser and recognizes semantic constructs

e.g. type checking

int array[10];

array[2.34] = 0;

e.g. uniqueness checking

int array[10];

int array[20];

e.g. flow-of-control checking

main(){

break;

}

Examples of C Language Errors

Lexical

1abc

2.4efg

get$value

Syntactic

a = 1 + ;

printf("Hello\n" ;

int ;

Semantic

see previous slide

C Language Degenerate Expressions

/* This program is syntactically

* correct. Are you surprised?

*/

main(){

        int a,b,c;

        1;

        2;

        3;

        1+1;

        a;

        b;

        c;

        a+b;

        1 || a;

        b && c;

        1 + a / 2 * b + c % 5;

}

Back End: Code Generation

Intermediate Code

a program for an abstract machine

easy to produce

easy to further translate into the exact target language

might be directly interpretable

Code Optimization

improve the intermediate code

Code Generation

relocatable machine, or assembler

Lexical Analysis in detail

Text chapter 3

Regular Expressions 3.3

Finite State Machines

DFA

Transition Diagrams 3.4

Transition Tables

NextState() function

programming a DFA

Regular Expressions

abc

concatenation ("followed by")

a | b | c

alternation ("or")

*

zero or more occurrences

+

one or more occurrences

?

zero or one occurrence

Finite State Machines and DFA

FSM:

a finite set of states

a set of transitions (edges)

a start state

a set of final states

Deterministic Finite Automata

all outgoing edges are labelled with an input character

no two edges leaving a given state have the same label

Therefore: the next state can be determined uniquely, given the current state and the current input character

Programming a DFA

We need the FSM structure:

a finite set of states

a set of transitions (edges)

a start state

a set of accepting (final) states

We need the DFA criteria:

given a particular state and an input character, the next state is unique

We code it

Code Fragment for a DFA Scanner

Both a next state and a current state variable are used. This gives the action code three pieces of information to work with: current state, next state, and the character just read.

The semantic actions go between the call to the NextState() function and the setting of the current state variable.

The character ch is read and end-of-file is tested after testing for an accepting state. We want the loop to end as soon as the DFA accepts.

Since the loop may end without being in an accepting state, code at the end of the loop must tidy up and decide what to return if that happens.

typedef enum { S_START, S_ACCEPT, S_ERROR, ... } State;

State current, next;

...

current = S_START;

while( current != S_ACCEPT && (ch=fgetc(fp)) != EOF ){

   next = NextState(current,ch);

   if( next == S_ERROR ){

      … /* handle the failure */ …

   }

   … /* remaining semantic actions go here */ …

   current = next;

}

switch( current ){

case S_START: …

case S_ACCEPT: …

case S_ERROR: …

   …

default:       …

}

The NextState() Function

   next = NextState(current,ch);

Given the current state of the DFA, current, and the current input character, ch, return the next state.

Determines the next state by one of:

indexing a Transition Table

switch statements

some if statements

/* Use a transition table to get the next state:

*/

   State

NextState(State current, int ch){

   return Table[current][Column(ch)];

}

/* Turn a character into a column index:

*/

   int

Column(int ch){

   if( isascii(ch) && isdigit(ch) ) return 0;

   if( isascii(ch) && isalpha(ch) ) return 1;

   switch( ch ){

   case ‘+’: return 2;

   case ‘-’: return 3;

   case EOF: return 4; /* needs special handling */

   default: return 5;

   }

}

Typedefs for a Scanner

/*

* The defined token types for this sample grammar.

* These are recognized and returned by the scanner.

*/

typedef enum {

        T_SEMI, T_EQUAL, T_EOF, T_ERR,

        T_IDENT, T_STRING, T_INTEGER,

        T_PLUS, T_MINUS, T_MUL, T_DIV,

} TypeList;

/*

* The structure holding the (lexeme,value) pair.

* A union is used to handle different lexeme types.

*/

typedef struct {

    TypeList type;    /* enum of types */

    union {

       char *string;    /* for strings */

       long integer;    /* for integers */

       /* other types go here as needed … */

    } value;

} TokenType;

/*

* The global current token, set by scanner().

* scanner() may return a pointer to this global

* for coding convenience. This storage is declared

* in either main() or in the scanner (never in a

* header file!).

*/

TokenType lex_token;        /* GLOBAL */

C Language Modularity: Using "static" data and functions

/* This is one complete file of C source.

* It illustrates the use of the "static" keyword

* to prevent local module data structures and

* functions from being visible outside this file.

*/

static int x;    /* data global to this file only */

/* A global function to initialize the hidden

* data structure from outside the module:

*/

    void

initialize(int j){

    x = j;

}

/* A global function to return the value of

* the hidden data structure in the module:

*/

    int

fetch(){

    return local_func(x);

}

/* A local function, invisible outside this file,

* to perform support operations in the module:

*/

    static int

local_func(int j){

    return j * 2;

}

Returning a pointer to the global token structure in your scanner()

/*

* The scanner converts the lexeme to the correct

* type, inserts the value and the type into the

* global structure, then returns a pointer to the

* global structure (for coding convenience).

*/

TokenType lex_token;   /* global token structure */

extern TokenType *scanner();

TokenType *scanner(){

    extern TokenType lex_token; /* global token */

    /* . . . */

    lex_token.type = T_STRING;    /* return char* */

    lex_token.value.string = mystring;

    return &lex_token;

    /* . . . */

    lex_token.type = T_IDENT;    /* ID’s are char* */

    lex_token.value.string = mystring;

    return &lex_token;

    /* . . . */

    /* convert ASCII digits to a real integer */

    status = convert_to_integer(lexbuf,&myint);

    if( status != GOOD_INT ) do_error(. . .);

    lex_token.type = T_INTEGER; /* return integer */

    lex_token.value.integer = myint;

    return &lex_token;

Parsing: A Simple Toy Grammar

A simple, recursive Toy Grammar for an assignment statement, consisting of four productions:

<assignment> -->
    ID ‘=‘ <expression> ‘;’

<expression> -->
    <term> ( (‘+’|‘-’) <term> )*

<term> -->
    <factor> ( (‘*’|‘/’) <factor> )*

<factor> -->
    ID | CONST | ‘(’ <expression> ‘)’

The grammar has ten terminals (recognized token types):
       ID, CONST, (, ), +, -, *, /, ;, =

Four non-terminals (each of which will be coded as a separate function in a top-down, recursive-descent, predictive parser):
        <assignment>, <expression>, <term>, <factor>

Plus and minus tokens separate the terms of the expression.

Multiply and divide tokens separate the factors of the terms.

The current input token uniquely determines which of the four productions must be applied, with no ambiguity.     No backtracking will be needed to parse sentences in this language.

Suitable for top-down, recursive-descent, predictive parsing (text p.41): Start with the root non-terminal (<assignment>) and work down toward the terminals, the leaves of the grammar parse tree. The expected terminals in the grammar are matched by the parser against the actual token types returned by the scanner.

The input is correct if the parsing succeeds without any errors.

Recursive Decent Parsing: How to turn a Grammar into C code

A predictive parser can be built directly from the productions in the grammar. Non-terminal symbols become function calls; terminal symbols are matched directly against tokens according to the token type returned by the Scanner. To simplify coding and avoid having to pass around a pointer to the current token type, the returned token type is kept in a global variable.

The scanner is always called after a match of a terminal token, to keep the global tokentype always pointing to the next token. This is called the look ahead token. The very first look ahead token has to be read by an initial call to scanner() before any of the grammar’s parsing functions are called. This is done first thing in the Parser, or in main() before calling the Parser.

A function in the Parser may discover that a look-ahead token fails to match an expected symbol in the Grammar. If the match is mandatory, this failure is a syntax error; the Parser must issue an error message and either exit or try to recover. However, if the match is optional, the parsing function simply leaves the look-ahead token alone; some other part of the Parser will match it later (or will issue a syntax error message).

Note: We start with just parsing the input. No actions will be taken on anything we match; we simply match it and move on. (This is like having a DFA inside a scanner that doesn’t have any actions in it to save characters.) Thus, no output will be generated by the parsing process unless the parsing encounters an error.

Recursive Descent Parsing: Coding A Very Small Grammar

    A small example translating a simple two-production grammar into two recursive-descent parsing functions. Terminal symbols are matched directly; the two non-terminals turn into function calls. This grammar is similar to the first and last productions of the Toy Grammar:

       <stmt> --> ID ‘=’ <item> ‘;’

       <item> --> ID | CONST

stmt(){

    if( tokentype != ID )

       error("Missing identifier");

    scanner();    /* get the next lookahead token */

    if( tokentype != EQUALS )

       error("Missing ‘=’");

    scanner();    /* get the next lookahead token */

    item();    /* non-terminal is function call */

    if( tokentype != SEMICOLON )

       error("Missing ‘;’");

    scanner();    /* get the next lookahead token */

}

item(){

    switch( tokentype ){

    case ID:

    case CONST:

       scanner(); /* get next lookahead token */

       break;

    default:

       error("Missing identifier or constant");

    }

}

Recursive Decent Parsing: Functions to parse the Toy Grammar I

    This is the first of the four functions that parse the Toy Grammar. This first function handles the root non-terminal that is the root of the Parse Tree for this Grammar. All other productions reduce to this one.

<assignment> --> ID ‘=‘ <expression> ‘;’

assignment(){

    if( tokentype != ID )

       error("Missing identifier");

    scanner();    /* get the next lookahead token */

    if( tokentype != EQUALS )

       error("Missing ‘=’");

    scanner();    /* get the next lookahead token */

    expression();    /* non-terminal is function call */

    if( tokentype != SEMICOLON )

       error("Missing ‘;’");

    scanner();    /* get the next lookahead token */

}

Recursive Decent Parsing: Functions to parse the Toy Grammar II

Here are two more of the four functions needed to parse the Toy Grammar. Note the use of a while() loop to match possibly repeated (optional) elements in a grammar production, based on whether the look ahead token indicates that a repeated element is present.

The Scanner is called to re-load the look-ahead token only if a match occurs. If the token isn’t matched, we assume some other part of the Grammar must match it, so the Scanner is not called on a mis-match.

<expression> --> <term> ( (‘+’|‘-’) <term> )*

expression(){

    term();    /* non-terminal is function call */

    while( tokentype == PLUS || tokentype == MINUS ){

       scanner(); /* get next lookahead token */

       term();    /* non-terminal is function call */

    }

}

<term> --> <factor> ( (‘*’|‘/’) <factor> )*

term(){

    factor();    /* non-terminal is function call */

    while( tokentype == MULT || tokentype == DIV ){

       scanner(); /* get next lookahead token */

       factor();/* non-terminal is function call */

    }

}

Recursive Decent Parsing: Functions to parse the Toy Grammar III

Here is the last of the four functions of the Toy Grammar. This grammar rule has several alternatives. A switch() statement selects the right alternative to parse based on what the current look ahead token is.

Remember, no actions are yet added to these functions. All they do is recognize the input and move on; they don’t do anything with it yet.

<factor> --> ID | CONST | ‘(’ <expression> ‘)’

factor(){

    switch( tokentype ){

    case ID:

       scanner(); /* get next lookahead token */

       break;

    case CONST:

       scanner(); /* get next lookahead token */

       break;

    case LEFTPAREN:

       scanner(); /* get next lookahead token */

       expression();

       if( tokentype != RIGHTPAREN )

          error("Missing ‘)’");

       scanner(); /* get next lookahead token */

       break;

    default:

       error("Missing ID, CONST, or ‘(’");

    }

}

Recursive Decent Parsing: Error Handling in nested functions

See Aho section 4.1

A simple panic-mode error handling system requires that we return to the top of the parser when any error is detected. At the top of the parser, we skip forward until we find a semi-colon (‘;’), then resume parsing.

To return to the top of the parser when an error is detected in one of the parsing functions, we need to add error-handling code to all the functions.

Each parsing function may succeed (in which case we continue parsing) or fail (in which case we stop parsing and return). For example:

Boolean expression(){

    if( ! term() )

       return FALSE;       /* failure */

    while( tokentype == PLUS || tokentype == MINUS ){

       scanner();

       if( ! term() )

          return FALSE;    /* failure */

    }

    return TRUE;    /* parsing succeeded so far */

}

Boolean term(){

    if( ! factor() )

       return FALSE;       /* failure */

    while( tokentype == MULT || tokentype == DIV ){

       scanner();

       if( ! factor() )

          return FALSE;    /* failure */

    }

    return TRUE;    /* parsing succeeded so far */

}

Recursive Decent Parsing: Error Handling by finding the semi-colon

    To complete the panic-mode error recovery, the topmost parsing function must detect the failure to parse and skip forward until a semi-colon is found. Once the semi-colon is found, one more token of look ahead is read; this prepares the parser to resume parsing after the statement with the syntax error. The code for this looks something like this:

parser(){

    scanner();    /* load first look-ahead token */

    while( tokentype != T_EOF ){

       switch( tokentype ){

       case ID:

          if( ! assignment() )

             panic();    /* recover */

          break;

       /* other statement types will go here */

       default:

          errprint("Unknown statement type");

          panic();       /* recover */

       }

    }

}

/* Skip forward to EOF or to after next semi-colon */

panic(){

       while( tokentype != T_SEMI

           && tokentype != T_EOF)

          scanner();

       if( tokentype == T_SEMI )

          scanner(); /* skip the ‘;’ */

}

Recursive Decent Parsing: Action Symbols in the Grammar

Now that we have a parser that can recognize assignment statements, we need to add semantic actions to the parser so that useful work is done. These actions are specified by embedding semantic action symbols in the grammar productions, resulting in a set of rules called a translation scheme (Aho p.37-40: Translation Schemes, Emitting a Translation). First, look at the simple grammar without any action symbols:

    <stmt> --> ID ‘=‘ <item> ‘;’

    <item> --> ID | CONST

Now, turn it into a translation scheme, with the semantic action symbols embedded in the grammar productions in the correct places:

    <stmt> --> ID {enter} ‘=’ <item> {pop;copy} ‘;’

    <item> --> ID {lookup;push} | CONST {eval;push}

Here are what each of the action symbols mean:

{enter}: enter the ID into the symbol table if it isn’t already there

{pop}: pop the top value off the value stack

{copy}: copy the value to the symbol table at the location of ID

{lookup}: find this ID in the symbol table; get its current value

{push}: push the current value on the value stack

{eval}: evaluate the numeric constant

The action symbols are placed in the exact places we want the actions to occur. We map these locations to the corresponding locations in the code for the recursive-descent parsing functions, and we insert code that performs these actions. The value stack is used by the parser to hold values of expressions (including constants and identifiers).

Recursive Decent Parsing: Actions in the Parsing Functions

Here is one of the augmented grammar productions that make up the translation scheme, with the semantic action symbols inserted as before:

<stmt> --> ID {enter} ‘=’ <item> {pop;copy} ‘;’

The code for the recursive-descent parsing function that implements the semantic actions for this production looks something like this:

Boolean stmt(){

    if( tokentype != ID ){

       errprint("Missing identifier");

       return FALSE;

    }

    \\ {enter} action goes here: Call a function that

    \\ returns a pointer to the identifier in the

    \\ symbol table. If not found there, enter it.

    . . . usual code to match EQUALS goes here . . .

    if( ! item() )

       return FALSE;

    \\ {pop;copy} action goes here: Pop the top value

    \\ off the value stack and copy it into the

    \\ symbol table at the location of the identifier

    if( tokentype != SEMICOLON ){

       errprint("Missing ‘;’");

       return FALSE;

    }

    scanner();

    return TRUE; /* parsing succeeded */

}

Recursive Descent Parsing: The Interpreter Value Stack

An interpreter doesn’t generate code. It performs semantic actions as it parses the input. For expressions, such as those used in the Toy Grammar, it uses a value stack to hold values and their types.

The value stack stores intermediate results of parsing for later use in a grammar rule. Here’s an example using a simple translation scheme with semantic actions inserted in their proper places:

<stmt> --> ID {enter} ‘=’ <plus> {pop;copy} ‘;’

<plus> --> <item> ‘+’ <item> {pop;pop;add;push}

<item> --> ID {lookup;push} | CONST {eval;push}

Look at how the semantic actions use the value stack:

{lookup;push}: When the parsing function implementing item() recognizes an ID token, it looks up the value and type of the identifier in the symbol table and pushes the current value and type of that identifier onto the value stack. (An undefined identifier would be a run-time error.)

{eval;push}: When the parsing function implementing item() recognizes a constant, it evaluates the constant and pushes the value and type of the constant onto the value stack.

{pop;pop;add;push}: The parsing function implementing plus() parses two items. Each call to item() pushes a value on the stack. After both items have been pushed, plus() pops each one off, adds them together, and pushes the result back on the stack. In more complex grammars, a separate execute(operator) function does the arithmetic.

{pop;copy}: When the plus() function returns control to the parsing function implementing stmt(), stmt() pops the stack and copies the top value and type into the symbol table at the location for the identifier located at the start of stmt(). (Type checking would be done here, too.)

{enter}: This semantic action doesn’t use the value stack. It looks up the identifier in the symbol table and returns a pointer to its location there. If the identifier doesn’t exist, it adds it and returns a pointer to it.

CST 8152 Compilers Midterm #1 Review Questions

What is a compiler? an interpreter?

How do they differ?

Where does compilation fit in the steps that make up Language Processing?

What is the output of each step of the process?

What is the "front end" of a compiler? the "back end"?

Name and describe the function of each of the three parts of the compiler front end.

What is the output of each of the three parts?

Give C Language examples of errors that would be detected in each part.

Name and describe the basic functions of the compiler back end.

Define these terms: lexeme, token, pattern, deterministic automaton, nondeterministic automaton,

Why is syntax analysis usually recursive, where lexical analysis is not?

What are the meanings of the Regular Expression operators ‘*’, ‘+’, and ‘?’

Given a description of a set of strings, write a Regular Expression that matches those strings.

e.g. strings of letters ending in ‘a’ or ‘bcd’

e.g. strings of digits containing a ‘3’ before a ‘5’

e.g. unsigned integer constant, unsigned floating constant

Given a Regular Expression, give examples of strings matched by the expression.

What is a Finite State Machine (FSM)?

What restrictions are placed on a FSM to make it a Deterministic Finite Automaton (DFA)?

Given a Regular Expression, write the DFA transition diagram that recognizes it (or vice-versa).

Given a DFA, write its corresponding Transition Table (or vice-versa).

Write a C Language NextState() function that implements the transitions of a given DFA or Transition Table.

Write a while() loop that uses the NextState() function to implement a DFA.

Write a Column() function that classifies characters for indexing a Transition Table.