Semantic Action Coding

This page last updated: Sunday September 27, 1998 01:07

Overview

Given a translation scheme that contains pseudocode semantic actions,  we explore how to insert the code into the recursive descent predictive parsing functions that recognize the grammar.

The Parser of the interpreter we are building uses a semantic value stack.  In the examples that follow, the stack is a stack of Item structures.  An Item structure contains a type and some data, some of which may be part of a C Language union. The type field tells us what is currently stored in the Item.  The Item structure is typedef'd to be identical to the Token structure returned from the Scanner.

Using the Token structure as the stack data structure

Items pushed on the value stack may be of various types, including strings, integers, real numbers, etc., depending on the requirements of the grammar.  Thus, the elements pushed on the stack are not simple numbers, they are structures containing both a type field and a C Language union of different types.  The type field remembers what type of data was stored in the union, so that when we pop the item we know what type it is.

The requirements for a basic value stack element are quite similar to those for a Token. In both, we need to keep track of some data and the type of the data.  For convenience, we make the stack Item structure identical to the Token structure returned from the Scanner:

typedef Token Item;

This definition lets us use all the TokenType constants to indicate the type of item pushed on the stack. Tokens can be assigned directly to Items, and vice-versa.  We can even keep track of the line numbers of intermediate expressions that are pused on the stack, using the lineno field of the structure:

/* The Token returned by the Scanner.
 * A token is a classified lexeme.  Here, you find the lexeme
 * and its classification, and the line number on which it was found.
 * The union holds the converted form of the lexeme for those tokens
 * that need to be converted (e.g. to integers).
 */
typedef struct Token {
        TokenType type; /* type of the lexeme recognized */
        char *lexeme;   /* actual lexeme character string */
        long lineno;    /* line number on which lexeme was found */
        union {
            long uint;  /* for holding unsigned integers */
        } u;
} Token;

Thus, token.lexeme or item.lexeme refers to the string element inside the structure; token.u.uint or item.u.uint refers to the integer in the structure.

To copy a Token to an Item, you can use simple structure assignment, since they are the same basic type:

Token t = scanner();
Item i;

i = t;   /* this copies a Token structure to an Item structure */

The semantic action examples in this file use this Item structure as the basic data type for push and pop

The stack routines know nothing about the contents of the structure.  The stack is only responsible for pushing and popping copies of the structures.

Managing dynamic memory

You must be careful about who "owns" dynamic memory among the Scanner, Parser, and Stack:

Let's look at how the Parser must handle pushing and popping a token received from the Scanner:

Token t = scanner();
Item i;

i = t;
i.lexeme = my_strdup(i.lexeme);    -- get a copy of the string
push( &i );
...
i = pop();
mem_free(i.lexeme);                -- free the memory again
...

Since we must do this lexeme allocation and copy for anything pushed on the stack, the Parser code can be simplified if we write some utility functions that handle this memory management:

Utility functions for handling Items

Here are the prototypes and specifications for some useful utility functions used by the parser to handle the Item values on the stack.  Code for these may be found in the assignment source directories.

static char *item_value(
   Item *itemp  /* IN: pointer to item to make printable */
);

static void copy_push_item(
   Token *tokenp /* IN: pointer to token to copy */
);

static void item_free(
   Item *itemp  /* IN: pointer to item whose memory is to be freed */
);

A Print Statement - <pstate>

Here is an augmented grammar production that is part of a translation scheme, with the semantic action symbols inserted:

<pstate> --> PRINT <xpress> {pop;print} ';'

The code that implements the semantic actions for this production is inserted into the recursive-descent parsing function exactly where it appears in the production.  It looks something like this:

static Boolean pstate(void){
    Item val     -- structure containing type and value

    CALL match(PRINT), return FALSE if it fails

    CALL xpress(), return FALSE if it fails

    -- {pop;print} action goes here:

    val = pop()                       -- pop top of the value stack
    printf("%s", item_value(&val) )   -- print what we popped
    item_free( &val )                 -- free memory of popped Item

    CALL match(SEMICOLON), return FALSE if it fails

    return TRUE                      -- parsing succeeded
}

A Simple Expression - <xpress>

Here is another augmented grammar production that is part of a translation scheme, with the semantic action symbols inserted:

<xpress> --> CONST {eval;push}
         --> STR {eval;push}
         --> '-' <xpress> {pop;negate if numeric;push}

Unary Minus I

The semantic action for Unary Minus in an interpreted grammar is to pop the value of the operand off the semantic expression value stack, negate it, then push it back. Since the operand has to be evaluated first, before it can be negated, the semantic action for negation appears after the expression operand is parsed.

The operation for unary minus is not the same as two-operand subtraction; but, the operator used (the minus sign) is the same.  Only the Parser knows, by the placement of the minus sign at the beginning of a certain production, that this use of the minus sign is the Unary Minus operation.  The Scanner cannot make this distinction (since the Scanner knows nothing about the grammar), so the Scanner will always return the MINUS token type for both operations.

The code for the recursive-descent parsing function that parses and implements the semantic actions for the above productions looks something like this:

static Boolean xpress(void){
    Item val;    -- structure containing type and value

    SWITCH TYPEOFTOKEN
    CASE CONST:  /*FALLTHROUGH*/
    CASE STR:
       -- copy_push_item is a utility function described above
       copy_push_item( &token )  -- copy/push the token onto the stack
       get next lookahead token
       BREAK
    CASE MINUS:   -- grammar says this must be a Unary Minus
       lineno = current line number
       get next lookahead token
       CALL xpress(), return FALSE if it fails
       val = pop()                -- pop the value off the stack
       IF val.type is not an integer type THEN
          -- type mismatch: print a good error message here
          -- indicating that only integers can be negated

          item_free( &val )       -- free memory of popped Item
          RETURN FALSE
       ENDIF
       val.u.uint = - val.u.uint  -- negate the value
       val.type = integer type    -- (redundant) set type of result
       val.lineno = lineno        -- pass on line number of expression
       push( &val )               -- push value back on the stack
       BREAK
    DEFAULT:
       -- print a good syntax error message here
       RETURN FALSE    
    END SWITCH
    RETURN TRUE                   -- parsing succeeded
}

The equivalence of the Item typedef and the Token typedef makes coding easy.  The token structure returned by the Scanner can be directly assigned to an Item variable and pushed on the stack by the utility function.

Be careful to free the memory of anything popped off the stack that does not get pushed back on.

A Plus Expression - <plus>

Here is another augmented grammar production that is part of a translation scheme, with the semantic action symbols inserted:

    <plus> --> <pterm> '+' <pterm> {pop;pop;add if numeric;push}

static Boolean plus(void){
    Item val1;    -- structures containing type and value
    Item val2;
    Item val3;
    Boolean ret;  -- return value of this function

    CALL pterm(), return FALSE if it fails

    CALL match(PLUS), return FALSE if it fails

    CALL pterm(), return FALSE if it fails

    val2 = pop()           -- pop second operand from stack
    val1 = pop()           -- pop first operand from stack

    IF val1.type and val2.type are both integers THEN
       val3.lexeme = NULL            -- new item has no string value
   
    val3.type = integer type      -- set the type of the result
       val3.u.uint = val1.u.uint + val2.u.uint   -- add them
       push( &val3 )                 -- push the sum
       ret = TRUE
    ELSE
       -- type mismatch: print a good error message here
       -- indicating that only integers can be added

       ret = FALSE
    ENDIF

    free memory used by popped val1 and val2 Item structures

    RETURN ret
}

Note that we don't actually need to have three Item structures; two would be sufficient.  We could simply perform the arithmetic and put the result back into the first structure again, and push it back onto the value stack.  We'll show that in the next section.

A function to execute operations

We have seen that some productions have semantic action symbols that require operations to be performed on operands popped off the semantic expression value stack.  We can localize most of these operations into a single execute() function whose task is to pop operands and perform operations.  This takes all the stack manipulation code out of most of the parsing functions, and makes them much simpler.

The execute() function takes an argument that tells what type of operation has to be performed, and a line number giving the line on which the operation was found (for error messages).   For example, given the following translation scheme requiring addition:

<add_expression> --> <xpress> '+' <xpress> {pop;pop;add if #;push}

The simplified code/pseudocode for the recursive-descent parsing function might look like this:

   static Boolean
add_expression(void){
   lineno = line number of current token
   CALL xpress(), return FALSE if it fails
   CALL match(PLUS), return FALSE if it fails
   CALL xpress(), return FALSE if it fails
   CALL execute(PLUS,lineno), return FALSE if it fails
   RETURN TRUE
}

All the code for semantic value stack manipulation has been moved into the execute() function, simplifying the parsing function.  The line number passed to execute() is picked to be something that will be useful in error messages, in case execute() is unable to perform the requested operation.  The line number of the token that begins the add_expression is not a bad choice.  The line number of the PLUS token might also be good.

Unary Minus II

For the unary minus operator, the code given above for the simple expression xpress() simplifies to this when all the push/pop code is moved into the execute() function:

CASE MINUS:   -- grammar says this must be a Unary Minus
       lineno = current line number
       get next lookahead token
       CALL xpress(), return FALSE if it fails
       CALL execute(UNARY,lineno), return FALSE if it fails
       BREAK

The UNARY token type is not a token type returned by the Scanner.  The Scanner cannot tell a Unary minus from a two-operand Subtraction minus.  The Parser can tell them apart, and when the Parser recognizes a use of the single-operand unary minus, as above, it uses the UNARY token type as an argument to the execute() function instead of the MINUS token type.  This tells execute() to do a negation using one operand instead of a subtraction using two operands.

The execute() function performs the indicated action on operands that have been pushed on the stack. The execute() function pops the number of operands off the stack that are appropriate for the given operation, tests for compatibility of operands with the given operator, does the type conversions (if any), and then does the actual operation and pushes the result back on the value stack.  It returns TRUE if the operation worked, FALSE if it failed:

   static Boolean
execute(
   TokenType optype,     -- IN: type of operator
   long int lineno       -- IN: line number of expression
){
   Item op1;             -- First operand type and value
   Item op2;             -- Second operand type and value
   
   SWITCH optype
   CASE PLUS:            -- the '+' operation (two operands)
      op2 = pop()        -- op2 was pushed last 
      op1 = pop()        -- op1 was pushed first

      IF op1.type and op2.type are not both integers THEN
         -- type mismatch: issue a good error message here
         -- indicating that only integers can be added
         free memory in both op1 and op2 popped Item structures
         RETURN FALSE
      ENDIF

      op1.u.num += op2.u.num     -- do the PLUS operation
      op1.type = integer type    -- (redundant) set type of result
      op1.lineno = lineno        -- pass on line number of expression
      push( &op1 )               -- push the result back on stack
      free memory of op2 popped Item structure
      RETURN TRUE
   CASE UNARY:                   -- Unary Minus (one operand)
      op1 = pop()                -- pop the operand off the stack

      IF op1.type is not an integer type THEN
         -- type mismatch: print a good error message here
         -- indicating that only integers can be negated
         free memory of op1 popped Item structure
         RETURN FALSE
      ENDIF

      op1.u.uint = - op1.u.uint  -- negate the value
      op1.type = integer type    -- (redundant) set type of result
      op1.lineno = lineno        -- pass on line number of expression
      push( &op1 )               -- push value back on the stack
      RETURN TRUE
   CASE ... :                     -- other cases go here
      RETURN TRUE
   DEFAULT:
      -- should never get here unless parser is programmed incorrectly
      print an error message about the unknown operator
      RETURN FALSE (or exit the program)
   END SWITCH
}

Simple changes to execute() can handle each new operator.  The operation doesn't have to be an arithmetic operation; it might be a print statement that pops and prints values, or an assignment statement that pops and puts values into the symbol table.  All the semantic action code can be kept in the one execute() function.

Only operands that have been popped and not pushed back on the stack need to be freed by item_free().

Notes on stack handling


Ian D. Allen CST 8152 Home Page