C Language Unions - how and why

This page last updated: Sunday September 27, 1998 01:07

Unions are used in C Language to save storage space when you need to store any one of a variety of C language datatypes in a location.

The Token Structure

Consider the basic Token structure returned from a scanner. It contains the type of the token returned, and a pointer to a character string that is the lexeme recognized:
  typedef struct Token {
      TokenType type;
      char *lexeme;
  } Token;
A fully-functioning scanner doesn't just recognize character string lexemes; it must recognize numeric constants as well. Rather than have the Parser receive these character strings from the scanner and have to convert each of these strings to its proper type, e.g. convert "1.234" into a real floating-point number 1.234, we assign that job to the scanner. The scanner will convert the character string lexeme "1.234" into the float or double number 1.234 before returning it in the Token structure.

The scanner also has to return lexemes that are types other than character strings and floating-point numbers. It may also handle integers, long integers, single characters, etc. We need to redo the Token structure to handle all these different types of token values that might be returned.

We enhance the Token structure using an internal structure definition for the lexeme value, and add all the different types we need to return, as follows:
  typedef struct Token {
      TokenType type;
      char *lexeme;        /* the original string form of the lexeme */
      struct Value {
         int inum;         /* string converted to integer */
         long int lnum;    /* string converted to long int */
         float fnum;       /* string converted to float */
         double dnum;      /* string converted to double */
         char ch;          /* a single character lexeme */
      } value;
  } Token;

Conversion of strings of digits to numbers

The Scanner still returns the character string form of the token as token.lexeme; however, it also converts numbers and returns them in the new value structure. For example, the accepting state for unsigned integers in the Scanner might do this:
IF current state accepts an integer THEN
   token.value.inum = convert_str_to_int(token.lexeme,&status)
   IF status is not TRUE THEN
      -- print an error message saying the conversion failed
      token.type = unknown token type
   ELSE
      token.type = integer type              -- success!
   ENDIF
ENDIF
...
return the token to the caller
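The pseudocode leaves convert_str_to_int unspecified. As a minimal sketch, assuming it returns the converted value and sets status to 1 (TRUE) on success or 0 on failure, it could be built on the standard strtol function. The signature and the exact failure rules here are assumptions inferred from the call above, not the author's code:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper matching the call in the pseudocode above.
 * Returns the converted value and sets *status to 1 on success, or
 * returns 0 and sets *status to 0 on failure (no digits, trailing
 * junk, or a value that does not fit in an int). */
int convert_str_to_int(const char *lexeme, int *status)
{
    char *end;
    long n;

    assert(lexeme != NULL && status != NULL);

    errno = 0;
    n = strtol(lexeme, &end, 10);

    if (end == lexeme || *end != '\0' ||        /* nothing converted, or junk left over */
        errno == ERANGE || n < INT_MIN || n > INT_MAX) {
        *status = 0;
        return 0;
    }
    *status = 1;
    return (int)n;
}
```

Checking the end pointer as well as errno matters: strtol happily converts the leading digits of "12x" and stops, so a scanner that ignored the leftover character would accept a malformed lexeme.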
Back in the parser, a token of type integer is known to carry its converted integer value in token.value.inum:
token = scanner();
IF token.type is integer type THEN
   printf("Lexeme '%s' has integer value %d\n",
      token.lexeme, token.value.inum);
ELSEIF token.type is a string type THEN
   printf("Lexeme '%s' has string value %s\n",
      token.lexeme, token.lexeme);
ELSEIF ... other types ...
   ...
ENDIF
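The parser-side test above mixes C with pseudocode. A compilable sketch of the same dispatch might use a switch on token.type. The TokenType names T_UNKNOWN, T_INTEGER, and T_STRING and the function describe_token are hypothetical, since the page never lists the token types; the message is formatted into a buffer rather than printed so the caller decides what to do with it:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical token types; the page does not define TokenType. */
typedef enum TokenType { T_UNKNOWN, T_INTEGER, T_STRING } TokenType;

typedef struct Token {
    TokenType type;
    char *lexeme;        /* the original string form of the lexeme */
    struct Value {
        int inum;
        long int lnum;
        float fnum;
        double dnum;
        char ch;
    } value;
} Token;

/* Format a description of the token into buf, picking the Value
 * member that matches token->type. Returns what snprintf returns. */
int describe_token(const Token *token, char *buf, size_t size)
{
    assert(token != NULL && buf != NULL);

    switch (token->type) {
    case T_INTEGER:
        return snprintf(buf, size, "Lexeme '%s' has integer value %d",
                        token->lexeme, token->value.inum);
    case T_STRING:
        return snprintf(buf, size, "Lexeme '%s' has string value %s",
                        token->lexeme, token->lexeme);
    default:
        return snprintf(buf, size, "Lexeme '%s' has an unknown type",
                        token->lexeme);
    }
}
```

The important point is that token.type alone decides which member of value is read; the parser never inspects a member the scanner did not fill in.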

C Language Unions

Suppose we were to create a table of the tokens recognized by the scanner. The size of the table would be the number of entries in the table multiplied by the size of the Token structure. The size of the Token structure depends on the size of the Value structure inside it. Count one byte for a char, two bytes for an int (on a 16-bit PC compiler, and the TokenType is counted as an int), four bytes for a long int, a pointer, or a float, and eight bytes for a double. Ignoring any alignment padding the compiler may add, a table of 1000 tokens needs:
   Token tokenTable[1000];  --> 1000 * (2 + 4 + (2 + 4 + 4 + 8 + 1))
On a typical PC, this table uses 25,000 bytes of storage.
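The hand calculation above can be checked against a real compiler with sizeof. The following is a sketch only; the function names value_member_sum and value_struct_size are made up for illustration, and the numbers they return depend on the compiler and machine:

```c
#include <assert.h>
#include <stddef.h>

struct Value {
    int inum;
    long int lnum;
    float fnum;
    double dnum;
    char ch;
};

/* Sum of the individual member sizes: the figure the hand
 * calculation uses for the inner (2 + 4 + 4 + 8 + 1) term. */
size_t value_member_sum(void)
{
    return sizeof(int) + sizeof(long int) + sizeof(float)
         + sizeof(double) + sizeof(char);
}

/* Bytes the compiler actually reserves for one struct Value. */
size_t value_struct_size(void)
{
    size_t actual = sizeof(struct Value);

    /* Alignment padding can only add bytes, never remove them. */
    assert(actual >= value_member_sum());
    return actual;
}
```

On a modern 32-bit or 64-bit system the results differ from the 16-bit PC figures used on this page (an int is usually four bytes, for example), and padding often makes the structure larger than the plain sum, so the waste per token is typically greater than the arithmetic suggests.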

Note that any particular Token can be only one type at one time. If we have stored the number 1.23 as a floating-point value in tokenTable[0].value.fnum, we are not using any of the other structure elements of the tokenTable[0].value Value structure. If an element of this token table contains an integer, stored in tokenTable[1].value.inum, then tokenTable[1].value.fnum and all the other elements of the tokenTable[1].value Value structure remain unused. This is true for all 1,000 elements of the table; only one of the Value structure elements in each table entry is in use. The unused elements of the Value structure are using storage needlessly.

Since any single token returned by the scanner can have only one type, only one of the Value structure elements will ever be used in each token. We can tell the compiler to "overlap" all the elements of the Value structure onto the same memory location by using a C Language union instead of a struct:
   typedef struct Token {
      TokenType type;
      char *lexeme;        /* the original string form of the lexeme */
      union Value {
         int inum;         /* string converted to integer */
         long int lnum;    /* string converted to long int */
         float fnum;       /* string converted to float */
         double dnum;      /* string converted to double */
         char ch;          /* a single character lexeme */
      } value;
   } Token;
The union reserves memory space for the largest element in the Value union, which in this case would be the 8-byte double member dnum. On a typical PC, this table now uses only 14,000 bytes of storage:
   Token tokenTable[1000];   --> 1000 * (2 + 4 + (8))
All the elements inside the union share the same storage locations. This works very well as long as we only ever use one element at a time, which is exactly how the scanner works. To know which data type is stored in the memory space occupied by the union, we look at the TokenType field set by the scanner.
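To see the saving described above on a particular compiler, the struct and union forms of Value can be declared side by side and compared with sizeof. This is a sketch; the names ValueS, ValueU, and bytes_saved_per_value are hypothetical and exist only so both forms can live in one file:

```c
#include <assert.h>
#include <stddef.h>

/* The same five members, once laid out one after another... */
struct ValueS {
    int inum;
    long int lnum;
    float fnum;
    double dnum;
    char ch;
};

/* ...and once overlapped onto the same storage. */
union ValueU {
    int inum;
    long int lnum;
    float fnum;
    double dnum;
    char ch;
};

/* Bytes saved in every token by using the union form. */
size_t bytes_saved_per_value(void)
{
    /* A union is at least as large as its largest member... */
    assert(sizeof(union ValueU) >= sizeof(double));
    /* ...and no larger than a struct holding the same members. */
    assert(sizeof(union ValueU) <= sizeof(struct ValueS));

    return sizeof(struct ValueS) - sizeof(union ValueU);
}
```

Multiplying the result by the number of table entries gives the total saving for that compiler, which need not match the 11,000 bytes computed for the 16-bit PC.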

If you store something in the union using one member, e.g. the double member dnum, then retrieve the information using a different member, e.g. the long int member lnum, you usually get garbage:
token.value.dnum = 1.23456e12;
printf("Number garbage is %ld\n", token.value.lnum);
Some or all of the bytes of the union will be used to store the floating-point number 1.23456e12. Some or all of the same bytes will be used when printing the integer. The resulting integer won't contain anything meaningful, though it might relate to the integer value of some or all of the bytes used in the C Language internal representation of the floating point number.
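The following sketch shows where the garbage comes from. It assumes a long int is no larger than a double, which holds on common platforms but is not guaranteed by the language; the function name read_double_as_long is made up for illustration:

```c
#include <assert.h>
#include <string.h>

union Value {
    int inum;
    long int lnum;
    float fnum;
    double dnum;
    char ch;
};

/* Store a double, then read the same storage back as a long int.
 * Assumes sizeof(long int) <= sizeof(double). */
long int read_double_as_long(double d)
{
    union Value v;
    long int via_union;
    long int via_memcpy;

    v.dnum = d;
    via_union = v.lnum;     /* reinterprets the bytes; converts nothing */

    /* Copying the first bytes of the double by hand gives the same
     * result, which shows that no numeric conversion takes place. */
    memcpy(&via_memcpy, &d, sizeof via_memcpy);
    assert(via_union == via_memcpy);

    return via_union;
}
```

Calling read_double_as_long(1.23456e12) does not return 1234560000000; it returns whatever integer the raw bytes of the floating-point representation happen to spell, and that value varies from machine to machine.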

In general, the only safe way to retrieve data from a union is through the same member that was most recently used to store it.

Shorter Syntax for Unions

Having to type token.value.lnum to get the value out of the union is rather long. We'd prefer to say something more direct, such as token.lnum, but C before the 2011 standard won't allow this (C11 added anonymous unions, which do). The named union is an implementation necessity required by C Language syntax, and programmers over the history of C have used different ways to "hide" nested structures and unions.

Hide by Macro

#define l_num value.lnum is one trick that lets us write token.l_num and have the preprocessor rewrite that as token.value.lnum. Its disadvantage is that you need a lot of #define statements for a big structure or union, and the code becomes harder to relate to the structure definition.
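A short sketch of the macro trick in use follows. The TokenType values and the function store_and_fetch are hypothetical and only serve to make the fragment compile on its own:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token types; the page does not define TokenType. */
typedef enum TokenType { T_UNKNOWN, T_LONG } TokenType;

typedef struct Token {
    TokenType type;
    char *lexeme;
    union Value {
        int inum;
        long int lnum;
        float fnum;
        double dnum;
        char ch;
    } value;
} Token;

/* After this line, token.l_num is rewritten to token.value.lnum. */
#define l_num value.lnum

/* Store n through the macro name and read it back the same way. */
long int store_and_fetch(long int n)
{
    Token token;

    token.type = T_LONG;
    token.lexeme = NULL;
    token.l_num = n;                    /* really token.value.lnum = n */
    assert(token.value.lnum == n);      /* both spellings name one member */
    return token.l_num;
}
```

One more cost is worth keeping in mind: the preprocessor rewrites every later occurrence of the identifier l_num, including an unrelated variable that happens to share the name, so macro names of this kind need to be chosen carefully.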

Hide by Shortening

The simplest approach is to cut down the length of the name of the union or structure nested within the outer structure to one letter:
   typedef struct Token {
      TokenType type;
      char *lexeme;        /* the original string form of the lexeme */
      union Value {
         int inum;         /* string converted to integer */
         long int lnum;    /* string converted to long int */
         float fnum;       /* string converted to float */
         double dnum;      /* string converted to double */
         char ch;          /* a single character lexeme */
      } u;
   } Token;
This lets us write token.u.lnum, which is somewhat shorter. Many C programmers use this convention.