CST 8152 - Jargon

This page last updated: Sunday September 27, 1998 01:07

This page connects many of the jargon words used in building a compiler or interpreter.

We want to recognize a LANGUAGE, which is made up of SYMBOLS or characters that constitute a SENTENCE in the language.

We write a CONTEXT-FREE GRAMMAR for the language.
The grammar is composed of PRODUCTIONS containing TERMINAL symbols and NON-TERMINAL symbols.
Terminal symbols are TOKENS. They are sequences of symbols (characters) having a collective meanting, e.g. identifier, integer, keyword, string, etc.
The symbol sequences (character sequences) are called LEXEMES. They are recognized and classified by a LEXICAL ANALYSER or SCANNER as it reads the input source language sentence.
The structure of the lexemes is recognized based on PATTERNS.
The patterns are often expressed concisely using REGULAR EXPRESSIONS and REGULAR DEFINITIONS.
Regular expressions are equivalent to FINITE AUTOMATA.
Finite automata can be either DETERMINISTIC or NON-DETERMINISTIC.
Deterministic finite automata can be constructed from non-deterministic finite automata; however, the construction may require an exponential number of states.
We can compile a regular expression into a recognizer by constructing the finite automaton to which it is equivalent.
Finite automata can be represented using STATE MACHINES that have one START state and one or more ACCEPTING STATEs.
State machines, like automata, may be DETERMINISTIC or NON-DETERMINISTIC.
State machines may be expressed using TRANSITION DIAGRAMS (deterministic) and TRANSITION GRAPHS (non-deterministic).
From the transition diagrams and graphs we can construct TRANSITION TABLES that tell what the next state is, based on the current state and the current input character.
Transition tables can be coded. If the state machine enters an accepting state while reading the symbols (characters) in the input sentence, we have recognized a lexeme in the language. Which lexeme we have recognized depends on which accepting state we entered.

The grammar of the language groups the input tokens into a hierarchy called a PARSE TREE. The tree imposes a structure on the input sentence. A grammar may be AMBIGUOUS, in which case there exist two different parse trees for some sentences in the language.
A node in the parse tree may have various ATTRIBUTES associated with it, e.g. a numeric value, a line number, a type, etc.
A node might compute an attribute by INHERITING the attribute from a parent node, or by SYNTHESIZING the attribute from one or more of the child nodes in the tree.
We specify a DEPTH-FIRST TRAVERSAL of a parse tree, which visits all the nodes in the tree in a depth-first, left-to-right order.
We can add SEMANTIC RULES to the productions of the grammar, to dictate how the attributes are computed for each node in the tree. The rules say how to compute the attributes; but, they don't specify when or in what order.
Alternately, we can add SEMANTIC ACTIONS to the productions of the grammar. The SEMANTIC ACTION SYMBOLS are small pieces of code or pseudocode inserted into the right sides of grammar productions at the exact location where we want our parser to perform some action.
The parser executes each semantic action symbol when it encounters it during the parsing process.
The actions typically have effects other than simply setting the node attribute; some actions involve doing arithmetic, some involve generating output.