Recursive Descent Predictive Parsing

How to turn a Grammar into C code

A predictive parser that uses one token of lookahead can be built directly from the productions in the grammar using the methods described in the text (p.46 "Designing a Predictive Parser").

Non-terminal symbols become function calls; expected terminal symbols in the grammar are matched against tokens read from the input according to the token type returned by the Scanner.

The scanner is always called after a match of a terminal token, to skip over that token and load the next one. This token is called the lookahead token. The very first lookahead token has to be read by an initial call to scanner() made before any of the grammar's parsing functions are called. This is usually done first thing as part of the initialization of the Parser.
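
A minimal sketch of that initialization is shown below. It uses the GETLOOKAHEAD macro and match() function defined later on this page; start_symbol() and EOFTOKEN are placeholder names for the start symbol's function and an assumed end-of-input token type.

   void
parse(void){
   GETLOOKAHEAD;       /* initial call to scanner(): load the first lookahead token */
   start_symbol();     /* function for the grammar's start symbol,
                          e.g. stmt() or assignment() in the examples below */
   match(EOFTOKEN);    /* assumed end-of-input token type: nothing may follow */
}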

A function in the Parser may discover that a lookahead token fails to match an expected terminal symbol in the Grammar. If the match is mandatory, this failure is a syntax error; the Parser must issue an error message and either exit or try to recover. However, if the match is optional, that particular parsing function simply leaves the look-ahead token alone; some other part of the Parser will match it later (or will issue a syntax error message).

Note: We start with just parsing the input. No actions will be taken on anything we match; we simply match it and move on. (This is like having a DFA inside a scanner that doesn't have any actions in it to save characters.) Thus, no output will be generated by the parsing process unless the parsing encounters an error.

Coding A Very Small Grammar

This is a small example of how to translate a simple two-production grammar into two recursive-descent parsing functions. Terminal symbols in the grammar are matched against the input read by the Scanner; the two non-terminals in the grammar turn into function calls.

The match() function

To simplify the code, we write a match() function that matches the current lookahead symbol against the expected symbol in the grammar, and define some macros:

We make extensive use of tokenType(), a function that returns the English description of a token's type as a string, suitable for printing.  Where possible, we print the lexeme value of a token when an error is encountered.  The function myerror() is assumed to exit the program after printing the given message.

static Token lookaheadT;  /* token is global only to the parser */

#define TYPEOFTOKEN    lookaheadT.type_of_token
#define LEXEMESTR      lookaheadT.lexeme_string
#define LINENUMBER     (long int)lookaheadT.line_number
#define GETLOOKAHEAD   {lookaheadT = scanner();}

   static void
match(TokenType tt){   
   if( TYPEOFTOKEN != tt ){
      myerror("File %s Line %ld: Expecting %s;"
         " found %s '%s'",
         filename,
         LINENUMBER,
         tokenType(tt),
         tokenType(TYPEOFTOKEN),
         LEXEMESTR );
      /*NOTREACHED*/
   }
   GETLOOKAHEAD;
}
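
The code above assumes that tokenType() and myerror() already exist. Here is one minimal sketch of what they might look like; the names printed by tokenType() (shown only for the ID and CONST token types used in the grammars below) and the exact behaviour of myerror() are illustrative assumptions, not a specification.

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

   static const char *
tokenType(TokenType tt){          /* printable English name of a token type */
   switch( tt ){
   case ID:      return "identifier";
   case CONST:   return "constant";
   default:      return "token";  /* remaining types omitted in this sketch */
   }
}

   static void
myerror(const char *format, ...){ /* print a formatted message, then give up */
   va_list ap;
   va_start(ap, format);
   vfprintf(stderr, format, ap);
   va_end(ap);
   fputc('\n', stderr);
   exit(EXIT_FAILURE);            /* never returns */
}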

When panic-mode error recovery is added, the above function becomes Boolean. It should not exit the program; instead, it should return FALSE if the token type doesn't match, and TRUE if it does.
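
A minimal sketch of that Boolean version, assuming a Boolean type with TRUE and FALSE is available; the panic-mode recovery itself (what the caller does when FALSE comes back) is not shown here:

   static Boolean
match(TokenType tt){
   if( TYPEOFTOKEN != tt )
      return FALSE;    /* no match: leave the lookahead token alone */
   GETLOOKAHEAD;       /* match: consume the token and load the next one */
   return TRUE;
}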

A Very Small Grammar

Here is the two-production Very Small Grammar:

       <stmt> --> ID '=' <item> ';'
       <item> --> ID | CONST

And here are the two functions that recognize it, derived from the above two productions using the match() function and the above macros:

   static void
stmt(void){
   match(ID);
   match(EQUALS);
   item();    /* every non-terminal is a function call */
   match(SEMICOLON);
}
   static void
item(void){
   switch( TYPEOFTOKEN ){
   case ID:
      match(ID);
      break;
   case CONST:
      match(CONST);
      break;
   default:
      myerror("File %s Line %ld: Expecting %s or %s;"
         " found: %s '%s'",
         filename,
         LINENUMBER,
         tokenType(ID),
         tokenType(CONST),
         tokenType(TYPEOFTOKEN),
         LEXEMESTR );
      /*NOTREACHED*/
   }
}

We could have used the GETLOOKAHEAD macro to load the next lookahead token instead of calling match() inside the switch statement, since we are already certain that the lookahead token type matches.
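
For example, item() could have been written this way; the behaviour is identical, because the switch has already verified the token type:

   static void
item(void){
   switch( TYPEOFTOKEN ){
   case ID:
   case CONST:
      GETLOOKAHEAD;    /* token type already verified by the switch: just consume it */
      break;
   default:
      myerror("File %s Line %ld: Expecting %s or %s;"
         " found: %s '%s'",
         filename,
         LINENUMBER,
         tokenType(ID),
         tokenType(CONST),
         tokenType(TYPEOFTOKEN),
         LEXEMESTR );
      /*NOTREACHED*/
   }
}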

A Simple Toy Grammar

Here is a simple, recursive Toy Grammar for an assignment statement, consisting of four productions:

<assignment> -->
     ID '=' <expression> ';'

<expression> -->
     <term> ( ('+'|'-') <term> )*

<term> -->
     <factor> ( ('*'|'/') <factor> )*

<factor> -->
     ID | CONST | '(' <expression> ')'

The grammar has four non-terminals (each of which will be coded as a separate function in a top-down, recursive-descent, predictive parser):

        <assignment>, <expression>, <term>, <factor>

All the other symbols in the grammar must be terminal symbols; consequently, the grammar has ten recognized token types:

       ID, CONST, (, ), +, -, *, /, ;, =

The Scanner would return the TokenType classification of these token types to the Parser using an enum or a set of #define statements to specify the different types, e.g.  ID, CONST, LEFTPAREN, RIGHTPAREN, PLUS, MINUS, MULT, DIV, SEMI, EQUALS.
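
For example, the shared declaration might be an enum such as this one (illustrative only; the actual names and their order are whatever the Scanner's author chooses):

typedef enum {
   ID, CONST, LEFTPAREN, RIGHTPAREN,
   PLUS, MINUS, MULT, DIV, SEMI, EQUALS
} TokenType;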

Plus and minus tokens separate the terms of the expression. Being high in the grammar, close to the start symbol <assignment>, these operators have the lowest precedence.

Multiply and divide tokens separate the factors of the terms. Being low in the grammar, close to the tokens, these operators have the highest precedence. For example, in the input a = b + c * d ; the sub-expression c * d is gathered into a single <term> before expression() ever sees the '+', so multiplication binds more tightly than addition.

At every point, the current lookahead token uniquely determines which alternative of a production must be applied, with no ambiguity. No backtracking is needed to parse sentences in this language.

This grammar is suitable for top-down, recursive-descent, predictive parsing (text p.41): Start with the root non-terminal (<assignment>) and work down toward the terminals, the leaves of the grammar parse tree.  The expected terminals in the grammar are matched by the parser against the actual token types returned by the scanner.

The input is correct if the parsing succeeds without any errors.

Functions to parse the Toy Grammar

Here are four functions that parse the Toy Grammar. Each function gets its name from a non-terminal symbol in the grammar.

assignment()

This first function handles the start symbol, the non-terminal at the root of the Parse Tree for this Grammar; every other production is reached from this one.

<assignment> --> ID '=' <expression> ';'
static void assignment(void){
   match(ID);
   match(EQUALS);
   expression();    /* non-terminal is function call */
   match(SEMI);
}

expression() and term()

In these functions, note the use of a while() loop to match possibly repeated (optional) elements in a grammar production, based on whether the lookahead token indicates that a repeated element is present.

The Scanner is called (in the match() function) to re-load the lookahead token only if the current token type matches what we expect. If the token type isn't matched, this is not an error, because the matching here is optional according to the grammar. We assume some other part of the Grammar will match the token, so the Scanner is not called on a mismatch; the function simply returns without touching the lookahead token.

<expression> --> <term> ( ('+'|'-') <term> )*
static void expression(void){
    term();       /* non-terminal is function call */
    while( TYPEOFTOKEN == PLUS || TYPEOFTOKEN == MINUS ){
       match(TYPEOFTOKEN);
       term();    /* non-terminal is function call */
    }
}
 
<term> --> <factor> ( ('*'|'/') <factor> )*

static void term(void){
    factor();     /* non-terminal is function call */ 
    while( TYPEOFTOKEN == MULT || TYPEOFTOKEN == DIV ){
       match(TYPEOFTOKEN);
       factor();  /* non-terminal is function call */ 
    }
}

factor()

This is the last of the four functions of the Toy Grammar. This grammar rule has several alternatives. A switch() statement selects the right alternative to parse based on what the current lookahead token is.

<factor> --> ID | CONST | '(' <expression> ')'
static void factor(void){
   switch( TYPEOFTOKEN ){
   case ID:
      match(ID);
      break;
   case CONST:
      match(CONST);
      break;
   case LEFTPAREN:
      match(LEFTPAREN);
      expression();
      match(RIGHTPAREN);
      break;
   default:
      myerror("File %s Line %ld: Expecting %s, %s, or %s;"
         " found: %s '%s'",
         filename,
         LINENUMBER,
         tokenType(ID),
         tokenType(CONST),
         tokenType(LEFTPAREN),
         tokenType(TYPEOFTOKEN),
         LEXEMESTR );
      /*NOTREACHED*/
   }
}

As with item() above, we could have used the GETLOOKAHEAD macro to load the next lookahead token instead of calling match() inside the case statements of the switch statement, since we are already certain that the lookahead token type matches.

Remember, no actions have been added to these functions yet. All they do is recognize the input and move on; they don't do anything with what they recognize. Making the parser act on the input is the job of semantic actions.

The above parsing functions exit the program when any error is encountered.  This is not very friendly.  We must improve the error handling using Panic-Mode Error Recovery.