From Source Code to Silicon
Understanding how the compiler toolchain works will make you a better programmer and a better debugger. Quite a bit happens from the time you hit the 'build' button until the target device is running your code. The process provides both opportunities and pitfalls. Below, we'll look at each step in the process and explain how the steps interact with one another.
C Source Files and Header Files
At the top of the toolchain are the source and header files where you describe to the compiler what it is that you hope to make the hardware do. Source and header files are essentially the human interface of the system. They are written to be readable by humans, yet understandable to the compiler program. They contain comments (or at least they should) and are usually formatted with human-readability in mind (indentation, white space, etc.).
The make utility is a program that controls the entire build process. It is responsible for calling the compiler, passing it all the files that need to be built, and everything else down to calling the linker. When you click on a build button in MPLAB® X IDE, you are actually launching the make program, which handles the entire build process in the background. Associated with the make program are makefiles. These files define which files are in a project, keep track of their dependencies, and ensure that only the parts of the program that have changed since the last build are rebuilt (unless make is explicitly told to rebuild everything). MPLAB X automatically generates makefiles based on your project's settings, so usually there will be no need to modify them. Some developers, however, like to take more control of the build process and add steps based on their unique needs. In those cases, a custom makefile may be used to drive the build process, but that is well beyond the scope of this class.
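The "only rebuild what changed" decision can be sketched in a few lines of C. This is a minimal illustration, assuming the usual timestamp rule (a target is out of date if it does not exist or if any dependency was modified more recently); the function name needs_rebuild and its parameters are hypothetical and are not part of make or MPLAB X.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: returns 1 if the target must be rebuilt,
 * 0 if it is up to date.
 * target_exists is 0 when the output file has never been built. */
int needs_rebuild(int target_exists, time_t target_mtime,
                  const time_t *dep_mtimes, size_t dep_count)
{
    assert(dep_count == 0 || dep_mtimes != NULL);

    if (!target_exists) {
        return 1;                       /* nothing built yet */
    }
    for (size_t i = 0; i < dep_count; i++) {
        if (dep_mtimes[i] > target_mtime) {
            return 1;                   /* a dependency is newer than the target */
        }
    }
    return 0;                           /* every dependency is older */
}
```

For an object file whose source and header were both last modified before the object file was written, the function returns 0 and that file would be skipped on the next build.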
The C compiler consists of three main components: the Preprocessor, the Parser, and the Code Generator. Each one performs a distinct task in the overall process.
The preprocessor's main job is to strip away all of the things that make a source file readable by humans and present the code in its most fundamental form to the parser.
- Comments and unnecessary white space are removed.
- Text substitution labels and macros are replaced by their actual values.
- Header file contents are merged into the C source files.
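The three preprocessor steps listed above can be seen in a small example. The macro names LED_COUNT and SQUARE and the function led_buffer_bytes are made up for illustration; the comment shows roughly what the parser would receive after preprocessing.

```c
#include <assert.h>   /* this line is replaced by the header's contents */

/* Hypothetical text-substitution label and macro */
#define LED_COUNT   8u
#define SQUARE(x)   ((x) * (x))

unsigned int led_buffer_bytes(void)
{
    /* This comment never reaches the parser. After substitution the
     * next statement reads roughly:
     *     unsigned int bytes = ((8u) * (8u));
     */
    unsigned int bytes = SQUARE(LED_COUNT);

    assert(bytes == 64u);   /* sanity check on the expansion */
    return bytes;
}
```

Most compilers can be asked to stop after preprocessing so that you can inspect this intermediate form; the exact option depends on the compiler you are using.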
The parser performs the bulk of the work in the compiler.
- Lexical Analysis: Creates tokens from code provided by the preprocessor. (Tokens are characters or a string of characters that together form something significant, such as a keyword, mathematical operator, or variable name).
- Syntactical Analysis: Makes sure that the tokens form expressions that conform to the rules of C.
- Semantic Analysis: Determines what actions must be taken by each expression and passes a list of these actions to the code generator.
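To make the first of these stages more concrete, here is a deliberately tiny lexical analyzer. It is only a sketch: it recognizes identifiers, decimal numbers, and single-character symbols, which is far less than a real C lexer handles. The names TokenKind, next_token, and count_tokens are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_END, TOK_IDENT, TOK_NUMBER, TOK_SYMBOL } TokenKind;

/* Reads one token starting at *cursor, copies its text into text[],
 * advances *cursor past the token, and returns the token's kind. */
TokenKind next_token(const char **cursor, char *text, size_t text_size)
{
    assert(cursor != NULL && *cursor != NULL);
    assert(text != NULL && text_size >= 2);

    const char *p = *cursor;
    size_t n = 0;
    TokenKind kind;

    while (isspace((unsigned char)*p)) {
        p++;                                 /* white space separates tokens */
    }

    if (*p == '\0') {
        kind = TOK_END;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        kind = TOK_IDENT;                    /* keyword or variable name */
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n < text_size - 1) text[n++] = *p;
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {
        kind = TOK_NUMBER;                   /* decimal literal only */
        while (isdigit((unsigned char)*p)) {
            if (n < text_size - 1) text[n++] = *p;
            p++;
        }
    } else {
        kind = TOK_SYMBOL;                   /* one-character operator or punctuation */
        text[n++] = *p++;
    }

    text[n] = '\0';
    *cursor = p;
    return kind;
}

/* Counts the tokens in a source string. */
size_t count_tokens(const char *source)
{
    char text[32];
    size_t count = 0;

    while (next_token(&source, text, sizeof text) != TOK_END) {
        count++;
    }
    return count;
}
```

Given the statement "count = count + 1;", this sketch produces six tokens: two identifiers, one number, and three symbols. Syntactical analysis would then check that those six tokens form a valid assignment statement.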
The code generator is unique to each microcontroller architecture. It takes a list of actions the program needs to perform from the parser and translates those into device specific assembly language instructions that are just one step removed from the machine code (binary code) that a microcontroller can understand. The generated assembly code is relocatable, which means nothing in it has been assigned to a physical address on the device. Determining where code and variables will reside in a device's memory space is the job of the linker, which we will discuss later.
The assembler takes what is essentially a human-readable form of machine code and translates it directly into binary machine code. Unlike C, assembly instructions have a one-to-one relationship with operations that the target microcontroller can perform. So while the assembler also makes use of a preprocessor, parser, and code generator, its parser is vastly simpler since it doesn't need to perform any semantic analysis. The assembly code is the list of actions to be taken. The output of the assembler is an object file. Object files contain the binary code in an almost executable format ready for processing by the linker.
Librarian or Archiver
The librarian (called the archiver by GCC based compilers) is a tool for placing object files into a container called the library (or archive by GCC based compilers). A library is nothing more than a collection of (usually related) object files that simplifies the task of reusing them in many projects.
The linker's job is to combine a set of object files and library files (which are themselves simply collections of object files) into a single executable file or, in the case of microcontrollers, a *.hex file (pronounced simply 'hex file').
The object files are relocatable, meaning that the code and data they contain may be placed anywhere in the memory map of the device. In other words, there are no hard-coded addresses in the original source (there are some exceptions, but we'll ignore those for now). This makes it possible to mix source files and libraries for the same microcontroller in a single project. There can be no address conflicts if the variables and code blocks are not being forced to a specific address in the device's memory map.
One consequence of being relocatable is that any reference to a variable or function in the code is nothing but a placeholder. For example, an instruction that tells the program to call MyFunction cannot make the call because MyFunction does not yet have an address. The instruction simply has a placeholder that tells the linker, "when you figure out where MyFunction is going to live in program memory, put its address here so I can jump to it when the program is running."
So, the main task of the linker is to figure out where all of the code and data will reside within the microcontroller's memory. Assisting it in this task is the linker script which defines the structure of the microcontroller's memory as well as which locations may or may not be assigned by the linker for a specific purpose. For example, some memory is reserved specifically for use by interrupts. With this information, the linker attempts to locate each block of code and each variable at a specific address in the device's memory map. If it finds a successful arrangement, the linker will go back to each placeholder and replace the reference to a variable name or function name with the address it was just assigned.
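The two passes described above, placing every block and then filling in the placeholders, can be sketched as a toy linker. Everything here is an assumption made for illustration: the Symbol and Relocation structures, the UNRESOLVED marker, the function toy_link, and the memory region 0x0200 to 0x0FFF do not correspond to any real linker, linker script, or device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNRESOLVED 0xFFFFu  /* hypothetical placeholder value */

typedef struct {
    const char *name;       /* e.g. "MyFunction" */
    uint16_t    size;       /* memory locations the block occupies */
    uint16_t    address;    /* filled in by the linker */
} Symbol;

typedef struct {
    size_t slot;            /* index in image[] that holds a placeholder */
    size_t symbol;          /* index in symbols[] of the referenced name */
} Relocation;

/* Places the symbols back to back inside [region_start, region_end]
 * and patches each placeholder. Returns 0 on success, -1 if the
 * symbols do not fit in the region. */
int toy_link(Symbol *symbols, size_t symbol_count,
             const Relocation *relocs, size_t reloc_count,
             uint16_t *image, size_t image_len,
             uint16_t region_start, uint16_t region_end)
{
    assert(symbols != NULL && image != NULL);
    assert(reloc_count == 0 || relocs != NULL);

    uint32_t next = region_start;

    /* Pass 1: assign an address to every block. */
    for (size_t i = 0; i < symbol_count; i++) {
        if (next + symbols[i].size > (uint32_t)region_end + 1u) {
            return -1;                      /* no successful arrangement */
        }
        symbols[i].address = (uint16_t)next;
        next += symbols[i].size;
    }

    /* Pass 2: replace each placeholder with the assigned address. */
    for (size_t i = 0; i < reloc_count; i++) {
        assert(relocs[i].slot < image_len);
        assert(relocs[i].symbol < symbol_count);
        image[relocs[i].slot] = symbols[relocs[i].symbol].address;
    }
    return 0;
}

/* Links a two-symbol program and returns the patched call target. */
uint16_t toy_link_demo(void)
{
    Symbol symbols[] = {
        { "main",       0x0010u, 0u },
        { "MyFunction", 0x0008u, 0u },
    };
    uint16_t image[] = { 0x1234u, UNRESOLVED, 0x5678u };
    const Relocation relocs[] = { { 1u, 1u } };  /* image[1] refers to MyFunction */

    int status = toy_link(symbols, 2u, relocs, 1u, image, 3u,
                          0x0200u, 0x0FFFu);
    return (status == 0) ? image[1] : UNRESOLVED;
}
```

In the demo, main is placed at 0x0200 and occupies 0x10 locations, so MyFunction lands at 0x0210 and that value replaces the placeholder. A real linker also handles alignment, multiple memory regions, and blocks pinned to fixed addresses, all of which this sketch ignores.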
At this point, the program is in its final form and the linker outputs a *.hex file that contains the binary image to be programmed into the device's flash memory. If you built the project for debugging, the linker will instead output an *.elf file (or a *.coff or *.cod file for older compilers and IDEs) which contains the same binary image as the *.hex file, but with special debug code that is required for our debug tools to interact with the device.
Finally, the linker may optionally produce a *.map file which will show how each variable and block of code was arranged into the device's memory space.
The *.hex file is the final, executable form of the program, conceptually equivalent to an *.exe on a DOS/Windows platform. This file is passed on to a programming tool, such as an MPLAB® ICD 4 or a PICkit™ 4 (among many others), which then "burns" the binary image into the target microcontroller's flash memory. When the microcontroller is released from reset or is powered up, the code from the *.hex file will run.
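The text above does not describe the layout of a *.hex file. Assuming it follows the common Intel HEX convention, each line is a record that starts with a colon and ends with a checksum byte chosen so that all bytes in the record sum to zero (modulo 256). The sketch below checks a single record under that assumption; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Converts one hexadecimal digit to its value, or -1 if invalid. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Returns 1 if the record is well formed and its checksum is
 * consistent, 0 otherwise. Assumes Intel HEX record layout:
 * ':' count(1) address(2) type(1) data(count) checksum(1). */
int intel_hex_record_ok(const char *record)
{
    assert(record != NULL);

    size_t len = strlen(record);

    /* ':' plus at least five bytes written as two hex digits each */
    if (len < 11 || record[0] != ':' || (len - 1) % 2 != 0) {
        return 0;
    }

    size_t byte_count = (len - 1) / 2;
    unsigned int sum = 0;
    unsigned int data_count = 0;

    for (size_t i = 0; i < byte_count; i++) {
        int hi = hex_digit_value(record[1 + 2 * i]);
        int lo = hex_digit_value(record[2 + 2 * i]);
        if (hi < 0 || lo < 0) {
            return 0;                   /* not a hexadecimal digit */
        }
        unsigned int value = (unsigned int)(hi * 16 + lo);
        if (i == 0) {
            data_count = value;         /* first byte is the data length */
        }
        sum += value;
    }

    if (byte_count != data_count + 5u) {
        return 0;                       /* length field disagrees with the record */
    }
    return (sum & 0xFFu) == 0u;         /* all bytes must sum to zero mod 256 */
}
```

For example, the record ":0300300002337A1E" carries three data bytes for address 0x0030; its bytes before the checksum sum to 0xE2, and adding the checksum 0x1E gives 0x100, so the check passes. A programming tool would typically reject a file containing a record that fails this test.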