Generation of Executable Code
from Source Program Files


Source Program Files

"Source" program files contain text (characters) that describe the data elements and the processing required to perform some particular information processing task.

The form that this textual description takes depends upon the specific "language" being used. In general such "languages" can be divided into two major categories: "high-level" languages and "low-level" (or "assembler") languages. In either case, the textual description must be converted, from text (characters), into the binary patterns actually used by the computer's internal circuits to specify and control the activities required. A "compiler" is a program that translates "high-level" source program files into these binary patterns (often called "machine-level" code; a "compiler" translates single "high-level" statements into several (often, many) "machine-level" instructions. An "assembler" is a similar, but much simpler, translating program that translates each "low-level" instruction into a single "machine-level" instruction; each "machine-level" action is represented by a single, simple "low-level" textual code, called a "mnemonic" code. Mnemonic coding provides a reasonable way of controlling the computer at the machine level, although it would be a tedious method for developing very large programs or systems.

Understanding the "assembler" process is necessary as a starting point for understanding the translation process of higher level languages (using "compilers") into instructions the computer can process,

In addition to the use of mnemonic codes to represent machine-level actions, assembler (or mnemonic) coding typically uses symbolic labels to represent addresses (including data addresses) that at the machine-level are identified as (numeric) bit patterns. When a specific memory location is required for use in an instruction, a symbolic reference to the corresponding label is used. A label can only correspond to a single (low-level) memory address, however, there may be many symbolic references to this label.

Translation of "mnemonic code" using symbolic addresses into machine-level instruction requires that the mnemonic (source) code file be read twice (a "two-pass" assembly process):

  1. once to determine the memory address to associate with each symbolic reference (each label), and
  2. a second time to replace symbolic references with the associated addresses (and translate operation codes)

Steps in Assembler to Machine Language Translation

Modularization of Program Development

Even once mnemonic code (or high level language code) has been translated into machine level instructions, there is generally a lot of work that may need to be done before we have a working program.

Most programs are developed in a "modular" fashion and are composed of routines (or "subroutines") that have been assembled or compiled into (separate) independent "object" files. Some of these subroutine object files may have been developed as integral components of the specific program being developed, but others may be "general purpose" routines (perhaps even supplied with the "assembler" or "compiler") that may be stored in composite library file or object subroutines.

All these files need to be combined into one complete program. Addresses will not match where they were assumed to belong by the "assembler" (or "compiler") since there would be no way for these programs to tell the size of other routines that would need to appear in memory locations ahead of any specific subroutine. All such addresses, therefore, will need to be modified to compensate for where the routines have actually been loaded into real memory.

In order to more fully explore this process, we will first need to consider how the "Little Man" Computer could be expanded to permit handling of "subroutines".

Subroutine Management

Support for the use of subroutines requires additions/extensions to the basic LMC structure:

The original "Little Man" Computer posed some significant restrictions in implementing an "extension" that would permit management of subroutines. Specifically:

  1. Only one decimal digit (0) was not used by the original "Little Man" as a machine-level "operation code". This meant that, while it was possible to add an new instruction to "call" a subroutine, there were no numeric code values left to add a "return" instruction.

  2. Establishing a method for saving the "return address" (the location where control should resume after completion of the subroutine) is difficult. The "Little Man" Computer supplies only a single "working" register, the Accumulator, and no easy way to use the contents of this register as an address. Furthermore, if the Accumulator were used to hold the "return address" when a call were made to a subroutine, this would eliminate the possibility of using the Accumulator to pass values to or return values from the subroutine.

LMC Subroutine Implementation Method:

The method selected for the extended LMC (called the "sonOfLMC") is to use the unused numeric code 0xy as the CALL xy instruction in a method very similar to the use of 9xy as the JMP xy instruction.

The difference between the CALL xy and JMP xy is that the CALL xy first stores a "return address" trick in the memory location that immediately precedes the location being CALL'ed, i.e. at memory location xy - 1. That "trick" being stored is a JMP instruction that returns to the location physically after the CALL. In other words, the JMP being stored uses the address currently in the Counter when the CALL is being executed.

For CALL xy to work, each subroutine must be coded with a mailbox that is left unused, located immediately before the instruction that is being CALL'd in the subroutine. This unused mailbox will receive the JMP trick put there by the CALL instruction.

When a subroutine has completed its task and wishes to return to the "calling" routine, it must JMP to this trick location (that is to the mailbox containing the JMP put there by the CALL that entered the subroutine). The JMP to the JMP will cause control to return to the statement physically following the original CALL.

Parameter passing and value return for subroutines

The method for passing parameter values to a subroutine is left unspecified by the structure of this extended version of the LMC. Parameter passing is a protocol to be decided upon between the two routines involved, but, in general, information sharing (passing argument values to the subroutine and returning result values to the calling routine) is normally achieved through the use of "global" labelled memory locations.

The main routine will typically copy values into global variables in the subroutine before calling it, and the subroutine will store return values in global variables before returning.

This "normal" method assumes that, if separately assembled (or compiled modules (routines) are used, there will be some method for identifying shared, labelled memory locations.

Separate Module Assembly (to Object Files)

As has been already noted, modules (or "subroutines") are generally assembled (or compiled) from "source" files into separate "object" files. This results in three major problems:

  1. Lost Public Labels - As a source file is translated into machine code by the assembler program, all the label address information is lost. Machine code has no label information. Object files must therefore keep a separate table of Public Labels for use by the linker in identifying important address locations inside the module. Not all labels in the source code may be public. Local variable labels do not need to be preserved in the Public Label table. The label that is the start of a subroutine must be put in the Public Label table so that other programs can call the subroutine.

  2. Calls to External Routines - Calls to routines not included with the source can not have their actual memory addresses filled in to the machine code, since those addresses are not known at the time the source translation is done. The object file must keep a list of these "unresolved external references" for later resolution by the linker program when the linker knows where in memory these external routines have been placed. The object file must include the name of the called external routine and the location within the calling routine where the external routine is called, so that the linker knows what memory locations to fix when the actual memory addresses are known.

  3. Relocatable Address References - Instructions or data that were translated to machine code based on the assumption that the routine would be loaded starting at a specific address, usually zero, need to be modified if the routine is loaded at a different address. Not all machine code needs to be modified, only instructions or data that refer to addresses inside the routine need to be adjusted. The object file must contain a table of all locations containing addresses that must be relocated to reflect the actual load address used for the routine.

Object File Format

As a result of the above three problems, object files typically contain three tables to accompany the machine code. This means an object file contains four parts: machine code plus three tables:

  1. a Public Label table containing a list of all Public Labels defined within the module, and their corresponding mailbox (memory) addresses. "Public" labels are labels that should be made available to other modules. All other labels are private, local labels that are not visible to other modules.

  2. an External Reference table containing a list of all mailbox addresses that make references to public labels defined in other modules, plus the name of the actual label being referenced.

  3. a Relocation Address table containing a list of all the addresses containing instructions that include address references to addresses within the module (based on the false assumption that the module would be loaded starting at address 00). All relocatable memory address references will need to be adjusted to reflect the true memory address values once the true starting address of the module has been established by the linker.

All the addresses in the above three tables are themselves subject to relocation if the subroutine is loaded at a non-zero address.

LMC Relocation and Linking Example

Suppose we want a simple program that outputs pairs of numbers: the first number of each pair output will be a count value starting at 000 and stopping before reaching a MAX of 010; the second number of each pair will be double the value of the first number. In order to allow the user time to read the numbers output, the program will pause after each output for a certain amount of time (performing some time wasting loop). (Note that the program stops before reaching the MAX value; the MAX value is not output.)

We will code this as 3 separate source files: a MAIN program and two subroutines Pause() and Dble(). Here is the pseudocode for the three modules:

        // MAIN program
        int count = 1
        int max = 10
        do {
           output count
           call Pause()
           call Dble(count)
           count += 1
        } while ( count < max );
        stop

        // PAUSE subroutine
        subroutine Pause() {
           int count = 0
           do {
              count -= 1
           } while ( count != 0 );
        }

        // DBLE subroutine
        subroutine Dble(arg) {
           output arg+arg  
           call Pause()
        }

Here are the three modules, translated in to LMC assembly language and then assembled into three object files:

Mnemonic Code Object File - machine code and tables
; main program
Repeat  LDA Count  ; output count
        OUT
        CALL Pause
        LDA  Count
        STO  Arg   ; store subroutine argument
        CALL Dble
        LDA  Count ; count += 1
        ADD  One
        STO  Count
        SUB  Max   ; while ( count < max )
        SKP
        JMP  Repeat 
        HLT
Count   DAT  000
One     DAT  001
Max     DAT  010
Pause   EXT     ; Pause is an External
Arg     EXT     ; Arg is an External
Dble    EXT     ; Dble is an External
         
    00: 113    
    01: 600    
    02: 000    
    03: 113    
    04: 200    
    05: 000    
    06: 113    
    07: 314    
    08: 213    
    09: 415    
    10: 802    
    11: 900    
    12: 700    
    13: 000    
    14: 001    
    15: 010    
Public Label Location
(none, or possibly "MAIN")  
 
External Reference Location
Pause 02
Arg 04
Dble 05
 
Relocatable Address Locations
00, 03, 06, 07, 08, 09, 11
 
Mnemonic Code Object File - machine code and tables
; Pause subroutine
        DAT        ; CALL return address
Pause   LDA  Count ; start of subroutine
        SUB  One
        STO  Count
        SKZ        ; while ( count != 0 )
        JMP  Pause
        JMP  Pause-1
Count   DAT  000
One     DAT  001
        PUB  Pause ; Pause is a Public Label
         
    00: 000    
    01: 107    
    02: 408    
    03: 207    
    04: 801    
    05: 901    
    06: 900    
    07: 000    
    08: 001    
Public Label Location
Pause 01
 
External Reference Location
(none)  
 
Relocatable Address Locations
01, 02, 03, 05, 06
 
Mnemonic Code Object File - machine code and tables
; Dble subroutine
Arg     DAT       ; input argument to Dble
        DAT       ; CALL return address
Dble    LDA  Arg  ; start of subroutine
        ADD  Arg  ; double the input value
        OUT
        CALL Pause
        JMP  Dble-1  
        PUB  Arg  ; Arg  is a Public Label
        PUB  Dble ; Dble is a Public Label
Pause   EXT       ; Pause is an External
         
    00: 000    
    01: 000    
    02: 100    
    03: 300    
    04: 600    
    05: 000    
    06: 901    
Public Label Location
Arg 00
Dble 02
 
External Reference Location
Pause 05
 
Relocatable Address Locations
02, 03, 06

As is always the case with separately created modules, the main program and each subroutine have been coded as if none of the other code existed:

If we collect all the Public Labels from all modules, we have Public Labels for Pause, Arg, and Dble. If we collect all the External References from all modules, we have External References for Pause, Arg, and Dble. Every External Reference can be matched with a Public Label, which means all the modules are present.

Link-Edit Processing or "Linking"

Combining multiple object files together to form a single executable program (often stored as a file) is the job of a program called a link-editor, or "Linker". The link process can be viewed as a sequence of five steps:

  1. Copy the numeric codes from all object files into sequentially contiguous locations, keeping track of the address used for each routine's new memory start address. Typically the first object file is loaded starting at address zero and the other object files go at higher memory locations.

  2. The tables in each object file were built assuming the object file was loaded at location zero. Since we have just loaded many subroutines at non-zero start addresses, we have to fix the addresses in the tables for each routine to reflect the new location of the subroutine in memory. Adjust the memory location specifications in the Public Label, External Reference, and Relocatable Address Location tables in each object file by adding the new memory load address of the subroutine to each location specification in each table.

  3. Having loaded all the subroutines into memory, we now know exactly where in memory each public symbol resides and we can go back and fix up the external reference information in all the loaded routines. Adjust all external reference memory locations by adding the actual public address of the externally referenced public symbol to the address portion of instruction at that location.

  4. The code in each object file was built assuming that the object file was loaded at location zero. Since we have just loaded many subroutines at non-zero start addresses, we have to fix each relocatable instruction in each subroutine to account for the move to a new location. Using the relocatable address locations table in each object file, adjust all relocatable address locations in each subroutine by adding the new memory address of the corresponding routine to the address operand at each relocatable location.

  5. The program is now "linked" and ready to execute as a unit. Save the modified numeric code to an executable file.

We will expand each of the above five steps by running through this process with the previously generated object files:

  1. The Linker copies the numeric instruction codes from each of the object files into contiguous (mailbox) memory locations. The order that the modules are placed in memory is not usually under the control of the programmer - the linker chooses. We choose to load Main followed by Pause followed by Dble (with a small line showing the boundary between each):
    00: 113
    01: 600
    02: 000
    03: 113
    04: 200
    05: 000
    06: 113
    07: 314
    08: 213
    
    09: 415
    10: 802
    11: 900
    12: 700
    13: 000
    14: 001
    15: 010
    
    16: 000
    17: 107
    18: 408
    19: 207
    20: 801
    21: 901
    22: 900
    23: 000
    24: 001
    
    25: 000
    26: 000
    27: 100
    28: 300
    29: 600
    30: 000
    31: 901
    

    The main program loaded at zero; Pause loaded starting at memory address 16; Dble loaded starting at memory address 25. The linker builds a table indicating the actual location where each subroutine was loaded in memory:

    Routine Assembled to load at: Actually Loaded At:
    main routine 00 00
    Pause subroutine 00 16
    Dble subroutine 00 25

    The linker will use these new memory locations to "relocate" addresses inside each subroutine, below.

  2. We have put subroutines into memory at non-zero starting addresses, and we now have to fix the addresses in the object file tables to reflect the new memory location of each subroutine. We have three tables to fix in each of the object files: the Public Label table, the External Reference table, and the Relocatable Address Locations table.
  3. Having loaded all the subroutines into memory, we now know exactly where in memory each public symbol resides and we can go back and fix up the unresolved external reference information in all the loaded routines:

    All external references have now been updated to refer to their correct memory locations.

  4. Finally, the instructions at the locations identified by the (modified) relocatable address locations table inside each object file must be adjusted by the actual location where that specific subroutine was loaded. The contents of memory at the addresses specified in the (modified) relocation address locations table must be modified to compensate for the actual load address of the subroutine:

    This completes the relocation of all the instructions in all the subroutines.

The linker has resolved all external references and relocated all relocatable code. At this point all address operands should be correct and the combined block of code properly executable. The code is output to an executable file, ready for loading and execution by the system.

Link Process Summary

Load and Execute Operating System Functions