% Author: Ian! D. Allen - idallen@idallen.ca - www.idallen.com
% Date: Winter 2011 - January to April 2011 - Updated 2011-04-02 04:23 EDT
% Title: Linking and External “Object” Files and Libraries

- [Course Home Page]
- [Course Outline]
- [All Weeks]
- [Plain Text]

Loosely based on notes for DAT2343 by Alan T. Pinck.

Linking and External “Object” Files and Libraries
=================================================

Often programs are built from a combination of “new” code and existing routines which have already been “Assembled” or “Compiled” into object code (machine language). The working program contains the code of all the modules, with each module loaded into memory after the one preceding it. Modules make calls and references to each other, and the actual locations of the code and data items being called and referenced aren’t known until all of the modules have been collected and loaded together into memory. Resolving these module inter-dependencies when a program is being built is the job of the “link editor” or “linker”.

To be able to compile or assemble a routine into object code separately from the rest of the program and then later link it into a program, we need to create some data structures that permit the generated object code to be properly integrated and linked with the rest of the program by the linker when the time comes. To do this, in addition to the machine code itself, the object file needs at least three tables to assist the linker in doing its job:

1.  **Relocation Table**: A table of locations of relocatable addresses - addresses that must be modified when the module is loaded in a new memory location.
2.  **External References**: A table of symbols that are External References (references to symbols not inside this module), along with the locations inside the module where the references are made.
3.  **Public Symbols**: A table of symbols that are Public Symbols, visible and available to other modules, along with their locations inside the module.

Let’s examine why we need these three tables.

Table 1: The Relocation Table
-----------------------------

At link time, the linker builds our program from all its different component object modules. Many different separately-compiled/assembled object modules are collected and linked together into one executable file.

When we write and compile/assemble a single module into machine code, we don’t know where in memory the machine code for this module will be placed by the linker. If our module is to be linked into different programs, the module will likely be placed in different locations in memory in every program of which it is a part. There is no “right” starting memory address we can assume for our module that works in every possible program.

To make things simple, the compiler/assembler makes the *assumption* that an object module loads and runs at memory location zero. It creates machine code and all the memory references as if the module loaded at location zero, knowing full well that it probably won’t actually be put at location zero by the linker.

In fact, only the first module of a program can load starting at memory location zero. (This is usually the main program.) The rest of the modules must come after, in sequence, at higher memory locations. The linker must load each object module into successively higher starting locations in memory as it builds the executable program.

If an object module contains instructions that refer to memory locations inside the module itself (and most modules do!), then those memory references have all been created with the assumption that the module will load into memory starting at location zero. (All modules are compiled/assembled assuming a zero start location.)
Since we will be loading the module at some higher memory location, all the internal memory addresses in the object module will be incorrect if we don’t “fix” the memory references in the object module when we load the module starting at its new memory address.

For example, if a module built to work at location zero gets located at location 030, then the addresses used in memory references in the module (e.g. addresses used for LOAD, STORE, JUMP, CALL, etc.) must all have 030 added to them to work correctly at the new module starting address. A LOAD instruction compiled or assembled to reference address 015 when the module loaded at address zero, should now load from address 015+030=045 if the module is loaded to start at location 030.

Only some machine instructions refer to memory addresses that need relocating. Many instructions don’t refer to memory addresses and don’t need any modification when the module is loaded at an address other than zero. For example: An INPUT instruction doesn’t refer to memory; so, it needs no fixing; a LOAD does.

The linker needs to know, for every separately compiled/assembled object module, which machine instructions contain addresses that need fixing, and which machine instructions do not need fixing.

The compiler/assembler solves this problem by creating a table of the location of every instruction in an object module that has a memory reference, for later use by the linker. This is a table of addresses of instructions that will need to have their memory addresses “relocated” by the linker at the time the object module is actually put into memory and integrated into the program by the linker.

The linker uses the table of relocatable addresses in each object module to locate which instructions need to have their memory addresses relocated when that particular object module is loaded into memory at link time. This process of fixing object module addresses is called “relocation”.
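
The relocation step can be sketched in a few lines of Python. This is only an illustrative sketch, not the format of any real object file: the `Instruction` class, the `relocate` function, and the four-instruction module are hypothetical, and instructions are kept as (opcode, address) pairs rather than as encoded machine words. The numbers follow the LOAD example: address 015 in a module loaded at 030 becomes 045.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instruction:
    opcode: str
    address: Optional[int] = None  # None for instructions with no memory reference


def relocate(code, relocation_table, load_address):
    """Return a copy of code with load_address added at each relocatable location."""
    relocated = []
    for location, instr in enumerate(code):
        if location in relocation_table:
            # This instruction refers to memory inside the module: fix its address.
            relocated.append(Instruction(instr.opcode, instr.address + load_address))
        else:
            # No memory reference (e.g. INPUT, OUTPUT): copy unchanged.
            relocated.append(instr)
    return relocated


# Hypothetical module, assembled as if it loads at location zero.
module_code = [
    Instruction("INPUT"),
    Instruction("STORE", 15),
    Instruction("LOAD", 15),
    Instruction("OUTPUT"),
]
relocation_table = {1, 2}  # locations of the STORE and LOAD instructions

relocated = relocate(module_code, relocation_table, 30)
print(relocated[2])  # Instruction(opcode='LOAD', address=45)
```

Only the two locations listed in the table change; the INPUT and OUTPUT instructions are copied as they are.
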
For example, if the linker needs to load our machine code object module into memory starting at memory address 050, the object module must contain a table of which “relocatable” instructions need to have 050 added to them to make the memory references work when the object code is loaded starting at location 050. If the module loaded at address 060, the linker would add 060 to the memory address references of those relocatable instructions.

The Relocation Table is a table of locations inside the module that contain memory addresses that need to be relocated. Note that not every instruction in the object module is relocatable! Only instructions that refer to memory locations inside the object module itself will relocate when the object module is linked into memory at a particular memory address. For example, a LOAD or STORE instruction that referenced a memory location inside the object module would need to have its memory reference relocated. An INPUT or OUTPUT instruction would not need to be relocated, since it doesn’t refer to any memory locations. The Relocation Table contains only the addresses of instructions that need to be relocated.

The Relocation Table is built for the linker by the compiler/assembler as part of the compiled/assembled object module. Sometimes, instead of a table of addresses, there are “relocation bits” on each instruction that tell the linker whether or not this instruction needs to be relocated.

**Relocation Table**: A table of addresses of instructions inside our module that will need relocating if/when our module is linked into a program at an address that is not zero.

Table 2: The External References Table
--------------------------------------

The code for our module may make references to labels for things that aren’t defined or allocated in this module, whether they be separately-compiled subroutines, library functions, or global variables or data. These things are not included in our source when we compile/assemble our module.
They are compiled/assembled separately and actually have labels and occupy storage in some other module. Since only the linker (not the compiler or assembler) will be putting those other modules into memory, we don’t know exactly where the linker will put the things to which we are referring. We know the names of these things; but, we don’t know (and can’t know) their actual memory addresses at the time we compile/assemble our module’s object code. (Only the linker will know, and only when it links our program.)

At the time we compile/assemble our source into object code, we can’t know what memory addresses to use for references to external symbols. (“Symbols” is short for “symbolic references” and is a synonym for the labels that we give to memory locations in modules.) These symbols are considered “unresolved” until the linker can find out their addresses.

We solve this problem by keeping a table of the names of the external, unresolved symbols to which our code has made reference. The table keeps track of what the name of the symbol is, and exactly where in our object module we made reference to it. At the time the linker builds our executable, the linker will know exactly where in memory it put the things to which we refer, and the linker is expected to come back to our module and patch in the actual memory addresses where it put the things.

By keeping track of the names of the external symbols and the locations in our module where we refer to them, the External Reference table makes it possible for the linker to know which locations in our module to patch. If our module refers to the same symbol in several places, the External References table will contain several entries, one for each location in our module where we referred to the symbol.
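
One possible shape for this table, and for the patching the linker does with it, is sketched below in Python. Everything here is an assumption made for illustration: the `ExternalReference` type, the `patch_external` function, and the idea of holding the module’s address fields in a dictionary keyed by location. The sketch imagines a module that refers to an external routine named PRINTF at locations 23 and 29, and a linker that has placed that routine at location 67.

```python
from typing import NamedTuple


class ExternalReference(NamedTuple):
    symbol: str    # name of the external label
    location: int  # location in this module whose address field must be patched


def patch_external(address_fields, external_refs, symbol, actual_address):
    """Return a copy of address_fields with actual_address written into
    every location that the table says refers to symbol."""
    patched = dict(address_fields)
    for ref in external_refs:
        if ref.symbol == symbol:
            patched[ref.location] = actual_address
    return patched


# Address fields of a hypothetical module, keyed by location.
# None marks an address the compiler/assembler could not know.
address_fields = {23: None, 25: 15, 29: None}

# One table entry per place where the symbol is used.
external_refs = [
    ExternalReference("PRINTF", 23),
    ExternalReference("PRINTF", 29),
]

patched = patch_external(address_fields, external_refs, "PRINTF", 67)
print(patched)  # {23: 67, 25: 15, 29: 67}
```

The address field at location 25 is not in the table, so it is left alone; it refers to something inside the module and is the Relocation Table’s business.
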
For example, if at two locations 023 and 029 in our module we use CALL instructions to call an external function named “printf”, the External Reference table for our compiled/assembled object module would record both occurrences: “PRINTF” used at location 023 in our module, and “PRINTF” used at location 029 in our module.

When it builds our program, the linker will link our module with the library containing printf. It might load printf into memory starting at location 067. The linker will then look for any modules that have the external symbol “PRINTF” in their External References tables. Our module will be one of them. The linker will use our External References table to know that it must put the actual memory address of printf (067) into the CALL instructions at both locations 023 and 029.

Note that the memory addresses in the External Reference table refer to memory locations inside the object module and are therefore themselves relocatable - if our module loads starting at location 010 instead of location zero, the locations to patch with the location of “PRINTF” will be 023+010=033 and 029+010=039.

The table of External References is built for the linker by the compiler/assembler as part of the compiled/assembled object module.

**External References Table**: A table of unresolved symbol names and memory locations. The symbol names are the labels of external code or data to which our module referred. The memory locations are locations in our module that need to have the addresses of the external code or data inserted at link time.

Table 3: The Public Symbol Table
--------------------------------

When a source module is compiled/assembled into object code, the result (apart from these few tables we are discussing) is simply machine code - numbers. All the symbolic names from the source code are gone.
Looking at the machine code that is the output of a compiler/assembler doesn’t tell you anything about what labels or names were used in the original source code. If you were writing a subroutine that you named with a label called “PRINTF”, only the numeric machine instructions for your subroutine would appear in the generated object code; the label name “PRINTF” that you used would be gone.

To make it possible for some other module to refer to your printf subroutine by the name you gave it, the symbolic label “PRINTF” and the memory address inside your module that it stands for must somehow be kept with the generated object code. This is the function of the third table: the table of Public Symbols. This table preserves some of the symbol names (labels) and their associated memory addresses so that other modules can refer to them.

Not all of the symbols (labels) and associated addresses used inside a module should be published in the table of Public Symbols and thus made “visible” to other modules at link time. Internal variables and loop labels are not made public when the object code for the module is generated. Most functions only have their function names (e.g. “PRINTF”) visible after they are compiled/assembled; none of the internal variables and label names are made public.

For example, the printf library routine module may have been coded with many local variables and other labels for loops or constants; however, when the object code was produced, the programmer arranged that the only entry in the Public Symbol table for the printf module is the name “PRINTF” and the offset from the start of the module where the code for printf actually begins. (The offset is often zero if the code starts right at the beginning of the module.)

In fact, most assembler programmers would prefer that only a very few of the symbols that they use be visible in the resulting object code file.
By default, symbols used in assembler programs are considered hidden and private unless the assembler programmer explicitly makes them visible through use of some form of “public”, “external”, or “global” statement.

High-level languages have different rules for which symbols are made public and which are kept private. The C language makes all function names public by default; local variables declared inside the functions are private. You can remove a function’s name from the public table by preceding its name with the keyword “static”. A “static” function’s name is only visible to source code contained inside the file in which it appears. The name does not appear in the table of Public Symbols, and therefore cannot be found or used by the linker at link time.

As noted above, the External Reference table is a table of unresolved symbol names whose memory addresses the linker needs to find to finish linking a module into a program. The Public Symbol tables are tables of symbol names and the corresponding memory addresses used to satisfy those unresolved external symbols. To link the program, the linker must search all the available Public Symbol tables in all available modules and libraries to find a Public Symbol name to satisfy the name used in every External Reference.

The table of Public Symbols is built for the linker by the compiler/assembler as part of the compiled/assembled object module.

**Public Symbol Table**: A list of the programmer-chosen visible symbols (labels) and their corresponding memory locations in the object module.

Relocation and Resolving
========================

There are two types of modifications that must be made to a module as it is linked into a program: Relocation and Resolving. *Relocation* uses the **Relocation Table** inside the object module. *Resolving* matches up symbols in the **Public Symbol** and **External Reference** tables.
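
The matching-up of Public Symbols and External References can be sketched as a toy two-pass linker in Python. This is a sketch under stated assumptions, not how any real linker is implemented: the `ObjectModule` class, the `link` function, the `LinkError` exception, and the module names and sizes are all hypothetical. The numbers reuse the PRINTF example from earlier in these notes, with our module loaded at 010 and printf loaded at 067.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectModule:
    name: str
    size: int                                      # memory locations occupied
    public: dict = field(default_factory=dict)     # symbol -> offset in this module
    external: list = field(default_factory=list)   # (symbol, offset) pairs to patch


class LinkError(Exception):
    pass


def link(modules):
    """Place modules one after another, then resolve external references."""
    # Pass 1: assign load addresses and collect every Public Symbol.
    load_address = {}
    symbols = {}
    next_free = 0
    for m in modules:
        load_address[m.name] = next_free
        for sym, offset in m.public.items():
            if sym in symbols:
                raise LinkError(f"multiply defined symbol: {sym}")
            symbols[sym] = next_free + offset  # offset relocated to its final address
        next_free += m.size

    # Pass 2: for each External Reference, record which final location to patch
    # and the address to put there.
    patches = {}
    for m in modules:
        for sym, offset in m.external:
            if sym not in symbols:
                raise LinkError(f"unresolved symbol: {sym}")
            patches[load_address[m.name] + offset] = symbols[sym]
    return symbols, patches


# Hypothetical program: a 10-location startup module, our 57-location module,
# and a library module that publishes PRINTF at offset zero.
modules = [
    ObjectModule("start", 10),
    ObjectModule("ours", 57, external=[("PRINTF", 23), ("PRINTF", 29)]),
    ObjectModule("printf", 20, public={"PRINTF": 0}),
]
symbols, patches = link(modules)
print(symbols)  # {'PRINTF': 67}
print(patches)  # {33: 67, 39: 67}
```

Two passes are used because a module may refer to a symbol published by a module that is placed after it, as happens here; no address can be patched until every module has a load address.
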

Relocation
----------

Object code must have address references “relocated” by the linker if the code in the module doesn’t load starting at address zero. The addresses of the relocatable instructions that must be modified are kept with the object code in a **Relocation Table**, used by the linker.

Resolving external references
-----------------------------

Symbolic references to external symbols must be turned into actual addresses by the linker when referenced items are finally placed in memory. The available symbols and their addresses are made public in **Public Symbol** tables and the linker uses these names and addresses to “resolve” the references found in **External Reference** tables. Both tables are kept with their respective object code files.

### Unresolvable Symbols ###

Before the modules in a program are ready for execution, every symbolic reference in every External References table must be satisfied. If the linker discovers an External Reference to a name for which it cannot find a matching name in any Public Symbol table, the result is an “unresolved” reference, and most linkers will refuse to create an executable program.

Typical causes of unresolved symbols are: spelling errors in the references to the External Symbols or in the symbols themselves, and missing object files (and/or libraries of object files) containing the desired subroutines.

### Duplicate Symbols ###

If the linker loads into memory two modules that have one or more **Public Symbols** with the same names, the linker will abort with a “multiply defined symbol” error. Every public symbol must resolve to exactly one memory address inside one module. If two modules each try to claim that a public symbol resides inside that module, the linker can’t know which one to use to resolve External References. It has no choice but to abandon linking the program.

--
| Ian! D. Allen - idallen@idallen.ca - Ottawa, Ontario, Canada
| Home Page: http://idallen.com/ Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom: http://eff.org/ and have fun: http://fools.ca/

[Plain Text] - plain text version of this page in [Pandoc Markdown] format

[Course Home Page]: ..
[Course Outline]: course_outline.pdf
[All Weeks]: indexcgi.cgi
[Plain Text]: 373_LMC_object_file_format.txt
[Pandoc Markdown]: http://johnmacfarlane.net/pandoc/