CST 8152 - The String Token

The STRING token type for the Toy Language is a subset of a C-language double-quoted string. The string starts with a double-quote and ends with an unescaped double-quote. The quotes are not part of the collected string. The backslash is the escape character within the string. (Backslash has no special meaning outside of the string.) The character following the backslash inside the string is treated specially, in the manner of C: \n becomes a newline, \" becomes a double-quote, and \\ becomes a single backslash. Any other escape pair is not recognized and is stored unchanged, backslash included.

Examples (similar to C language strings):

"This is a simple string."
"This is line one.\nThis is line two.\nThis is line three.\n"
"You can \"quote\" within the string using escaped double-quotes."
"Unrecognized escape sequences such as \x and \y appear unchanged."
"To get a single backslash, use two, like this: \\ ."

DFA Design

The Deterministic Finite Automaton that recognizes these strings has exactly four states:

  1. one start state
  2. one collecting state
  3. one escape state
  4. one accept state

The collecting state is entered from the start state upon seeing the first double-quote character. The escape state is entered from the collecting state upon seeing a backslash; any character that follows returns the DFA to the collecting state. The accept state is entered from the collecting state upon seeing a (second) double-quote. While in the collecting state, all characters except backslash and double-quote cause transitions back to the collecting state.
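The transitions described above can be sketched as a single function. This is only an illustration: the enum names and next_state() are made up for this example, and REJECT is not one of the four DFA states but an extra marker, assumed here, for "no valid transition" (including EOF).

```c
#include <stdio.h>   /* for EOF */

/* The four DFA states, plus REJECT as an illustrative "no transition" marker. */
enum state { START, COLLECTING, ESCAPE, ACCEPT, REJECT };

/* Return the state reached from `cur` on input character `ch`. */
enum state next_state(enum state cur, int ch)
{
    if (ch == EOF)
        return REJECT;               /* input ended before the closing quote */

    switch (cur) {
    case START:
        return (ch == '"') ? COLLECTING : REJECT;
    case COLLECTING:
        if (ch == '\\') return ESCAPE;
        if (ch == '"')  return ACCEPT;
        return COLLECTING;           /* ordinary character */
    case ESCAPE:
        return COLLECTING;           /* any character ends the escape */
    default:
        return REJECT;               /* no transitions out of ACCEPT or REJECT */
    }
}
```

A scanner loop would call next_state() once per input character and stop when it returns ACCEPT or REJECT.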

Collecting the String

Inside the DFA, save a character into the string whenever both the current state and the next state are COLLECTING. Note that this means characters such as real newlines are valid inside strings; strings may span multiple lines. (Optional enhancement: Prohibit unescaped newlines inside strings. To insert a raw newline in a string, require that it be preceded by the escape character.)

On the state transition from escape back to collecting, switch on the character seen. This character is the character following the backslash. What you save in the current string depends on what the escaped character is:

switch( ch ){
case 'n':   *strptr++ = '\n'; break;   /* newline escape */
case '"':   *strptr++ = '"';  break;   /* double-quote escape */
case '\\':  *strptr++ = '\\'; break;   /* backslash escape */
default:
   /* unknown escape pair -- just put it in the string unchanged */
   *strptr++ = '\\';  /* put in the backslash escape */
   *strptr++ = ch;    /* put in the character after the escape */
   break;
}

Note: The above code is for illustration purposes only. It contains no string overflow checking and does not use a #define for the backslash escape character, so it would not be acceptable in a real scanner.
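One way to address both shortcomings is sketched below. It is not the required solution: the names ESCAPE_CHAR, put_char() and store_escaped() are invented for this example, and the buffer is assumed to be a fixed-size array that must keep one byte free for the terminating '\0'.

```c
#include <stddef.h>

#define ESCAPE_CHAR '\\'   /* hypothetical name for the backslash escape */

/* Append one character if room remains (one byte is kept for '\0').
 * Returns 1 on success, 0 if the buffer is full. */
static int put_char(char *buf, size_t cap, size_t *len, char ch)
{
    if (*len + 1 >= cap)
        return 0;
    buf[(*len)++] = ch;
    return 1;
}

/* Store the translation of the escape pair ESCAPE_CHAR + ch.
 * Returns 1 if everything was stored, 0 if it did not fit. */
int store_escaped(char *buf, size_t cap, size_t *len, int ch)
{
    switch (ch) {
    case 'n':         return put_char(buf, cap, len, '\n');
    case '"':         return put_char(buf, cap, len, '"');
    case ESCAPE_CHAR: return put_char(buf, cap, len, ESCAPE_CHAR);
    default:
        /* unknown escape pair: keep both characters, or neither */
        if (*len + 2 >= cap)
            return 0;
        buf[(*len)++] = ESCAPE_CHAR;
        buf[(*len)++] = (char)ch;
        return 1;
    }
}
```

The caller is expected to write the terminating '\0' at buf[*len] once the string is complete. A return value of 0 tells it that the string is being truncated.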

When the closing double-quote is seen, the scanner ends the string and returns it to the parser as a token of the new STRING type.

EOF and Error Handling

Your scanner must detect EOF occurring inside a string and report it as a warning or an error.

If your scanner uses static buffers and cannot handle arbitrary-length strings, it must still scan until it finds the closing double-quote of a string, even if it cannot store all of the scanned characters. If the scanner stopped scanning at the point where the string buffer became full, the next call to the scanner would resume reading characters in the middle of the string, and the parser would then report a cascade of false syntax errors.
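The following is a minimal sketch of a scanner routine that keeps consuming input after its buffer fills and that reports an unterminated string. To stay self-contained it reads from an in-memory C string and treats the terminating '\0' as a stand-in for EOF; a real scanner would read from a file instead. The names scan_string() and the SCAN_* result codes are assumptions made for this example.

```c
#include <stddef.h>

enum scan_result { SCAN_OK, SCAN_TRUNCATED, SCAN_EOF, SCAN_NOT_STRING };

/* Scan one string token. *src must point at the opening double-quote.
 * On return, *src points just past the last character consumed and
 * buf holds as much of the string as fit, always '\0'-terminated. */
enum scan_result scan_string(const char **src, char *buf, size_t cap)
{
    const char *p = *src;
    size_t len = 0;
    int truncated = 0;   /* set once a character could not be stored */
    int closed = 0;      /* set when the closing quote is seen */

    if (cap == 0 || *p != '"')
        return SCAN_NOT_STRING;
    p++;                                   /* skip the opening quote */

    while (*p != '\0') {                   /* '\0' stands in for EOF here */
        char ch = *p++;
        if (ch == '"') { closed = 1; break; }
        if (ch == '\\') {
            if (*p == '\0') break;         /* input ended inside an escape */
            ch = *p++;
            if (ch == 'n') {
                ch = '\n';
            } else if (ch != '"' && ch != '\\') {
                /* unknown escape pair: keep the backslash as well */
                if (len + 1 < cap) buf[len++] = '\\'; else truncated = 1;
            }
        }
        /* keep scanning even when the buffer is full; just stop storing */
        if (len + 1 < cap) buf[len++] = ch; else truncated = 1;
    }

    buf[len] = '\0';
    *src = p;
    if (!closed)
        return SCAN_EOF;                   /* caller reports the error */
    return truncated ? SCAN_TRUNCATED : SCAN_OK;
}
```

Because the loop runs to the closing quote regardless of buffer space, the next call starts at the character after the string. SCAN_TRUNCATED lets the caller warn about the lost characters without derailing the parser.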