LLVM Internals: Independent Code Generator

lahlali issam
6 min read · May 7, 2021

Like many other compilers, Clang is organized in three phases:

  • The front end parses the source code, checks it for errors, and builds a language-specific Abstract Syntax Tree (AST) to represent the input code.
  • The optimizer applies optimizations to the intermediate representation that the front end produces from the AST.
  • The back end generates the final code to be executed by the machine; it depends on the target.

What’s the Difference between Clang and the Other Compilers?

The most important difference in its design is that Clang is based on LLVM. The idea behind LLVM is to use the LLVM Intermediate Representation (IR), which plays a role similar to bytecode in Java.
LLVM IR is designed to host mid-level analyses and transformations that you find in the optimizer section of a compiler. It was designed with many specific goals in mind, including supporting lightweight runtime optimizations, cross-function/interprocedural optimizations, whole program analysis, and aggressive restructuring transformations, etc. The most important aspect of it, though, is that it is itself defined as a first-class language with well-defined semantics.

With this design, a big part of the compiler can be reused to create other compilers. For example, you can change only the front end to handle another language.
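The reuse idea can be sketched in a few lines. This is only an illustration under my own assumptions: `ToyIR`, `parseInfix`, `parsePrefix`, `emitToyAsm`, and the one-register machine are all hypothetical and have nothing to do with LLVM's real classes. Two front ends for two tiny "languages" produce the same intermediate form, and a single back end serves both.

```cpp
#include <sstream>
#include <string>

// Hypothetical miniature "IR": one binary operation on two integer constants.
struct ToyIR {
    char op;
    int lhs;
    int rhs;
};

// Front end for an infix language, e.g. "2 + 3".
ToyIR parseInfix(const std::string &source) {
    std::istringstream in(source);
    ToyIR ir{};
    in >> ir.lhs >> ir.op >> ir.rhs;
    return ir;
}

// Front end for a prefix language, e.g. "+ 2 3".
ToyIR parsePrefix(const std::string &source) {
    std::istringstream in(source);
    ToyIR ir{};
    in >> ir.op >> ir.lhs >> ir.rhs;
    return ir;
}

// Shared back end: prints the IR as assembly for an imaginary one-register
// machine. Only '+' is recognized explicitly; any other operator is printed
// as "sub" to keep the sketch short.
std::string emitToyAsm(const ToyIR &ir) {
    std::string mnemonic = (ir.op == '+') ? "add" : "sub";
    return "mov r0, " + std::to_string(ir.lhs) + "\n" +
           mnemonic + " r0, " + std::to_string(ir.rhs) + "\n";
}
```

Adding a third language here would mean writing one more parse function; the back end stays untouched.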

In this post, let’s focus only on the minimal interfaces needed to create an LLVM code generator, using the dependency graphs produced by CppDepend.

The LLVM target-independent code generator is a framework that provides a suite of reusable components for translating the LLVM internal representation (IR) to the machine code for a specified target, either in assembly form (suitable for a static compiler) or in binary machine code format (usable for a JIT compiler).

To make this concrete, here is a simple example of a .ll file:

define i32 @add1(i32 %a, i32 %b) {
entry:
  %tmp1 = add i32 %a, %b
  ret i32 %tmp1
}

This LLVM IR corresponds to this C code:

unsigned add1(unsigned a, unsigned b) {
  return a + b;
}
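One detail of this correspondence is easy to check by hand: the `add` instruction above carries no `nuw` or `nsw` flag, so it wraps around on overflow, which is also what C requires for unsigned arithmetic. A minimal sketch, assuming `unsigned` is 32 bits wide; I make that explicit with `uint32_t`, which is my choice and not part of the original example:

```cpp
#include <cstdint>

// Same computation as add1 above; uint32_t pins the width to the i32 used in the IR.
std::uint32_t add1(std::uint32_t a, std::uint32_t b) {
    return a + b;  // unsigned overflow wraps modulo 2^32
}
```

Calling `add1(0xFFFFFFFFu, 1u)` gives 0, which is the wrapped result the IR-level `add` is defined to produce as well.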

The role of a code generator is to translate the IR into code for a specific target.

LLVM ships with many code generators out of the box, for targets such as X86, ARM, AArch64, PowerPC, and MIPS.

How does the code generator work?

Only two interfaces, TargetMachine and DataLayout, are required to be defined for a backend to fit into the LLVM system, but the other target description classes must be defined if the reusable code generator components are going to be used.

1- TargetMachine

The TargetMachine class provides virtual methods that are used to access the target-specific implementations of the various target description classes via the get*Info methods (getInstrInfo, getRegisterInfo, getFrameInfo, etc.).

This class is designed to be specialized by a concrete target implementation (e.g., X86TargetMachine) which implements the various virtual methods.
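The pattern described here (an abstract machine with virtual accessors, specialized per target) can be sketched as follows. Everything in this block is hypothetical: `ToyTargetMachine`, `ToyRegisterInfo`, and `createToyTarget` are names I made up for illustration, and the real LLVM classes are far richer. The only target facts assumed are that a 32-bit integer is returned in `eax` on x86-64 and in `r0` on 32-bit ARM.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for one target description class (not LLVM's real API).
struct ToyRegisterInfo {
    std::string returnRegister;  // register holding a 32-bit integer return value
};

// Hypothetical abstract machine: generic code talks only to this interface.
class ToyTargetMachine {
public:
    virtual ~ToyTargetMachine() = default;
    virtual std::string getName() const = 0;
    virtual const ToyRegisterInfo &getRegisterInfo() const = 0;
};

// Concrete target, in the spirit of X86TargetMachine.
class ToyX86_64Machine : public ToyTargetMachine {
    ToyRegisterInfo info{"eax"};
public:
    std::string getName() const override { return "x86_64"; }
    const ToyRegisterInfo &getRegisterInfo() const override { return info; }
};

class ToyArm32Machine : public ToyTargetMachine {
    ToyRegisterInfo info{"r0"};
public:
    std::string getName() const override { return "arm"; }
    const ToyRegisterInfo &getRegisterInfo() const override { return info; }
};

// Hypothetical factory: returns nullptr for a target this sketch does not know.
std::unique_ptr<ToyTargetMachine> createToyTarget(const std::string &name) {
    if (name == "x86_64") return std::make_unique<ToyX86_64Machine>();
    if (name == "arm") return std::make_unique<ToyArm32Machine>();
    return nullptr;
}

// Target-independent code: written once, works for every concrete machine.
std::string describeReturn(const ToyTargetMachine &tm) {
    return tm.getName() + " returns i32 in " + tm.getRegisterInfo().returnRegister;
}
```

The point of the design is visible in `describeReturn`: it never mentions a concrete target, so supporting a new one only requires a new subclass.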

To see which classes TargetMachine collaborates with directly, let’s explore its dependency graph:

What’s the LLVMMC project reported in the above graph?

The MC Layer is used to represent and process code at the raw machine code level, devoid of “high level” information like “constant pools”, “jump tables”, “global variables” or anything like that. At this level, LLVM handles things like label names, machine instructions, and sections in the object file. The code in this layer is used for a number of important purposes: the tail end of the code generator uses it to write a .s or .o file, and it is also used by the llvm-mc tool to implement standalone machine code assemblers and disassemblers.

The MC layer includes the MCStreamer API which could be considered as an assembler API. It is an abstract API that is implemented in different ways (e.g. to output a .s file, output an ELF .o file, etc) but whose API corresponds directly to what you see in a .s file.

MCStreamer has one method per directive, such as EmitLabel, EmitSymbolAttribute, SwitchSection, EmitValue (for .byte, .word), etc, which directly correspond to assembly level directives. It also has an EmitInstruction method, which is used to output an MCInst to the streamer.

On the implementation side of MCStreamer, there are two major implementations: one for writing out a .s file (MCAsmStreamer), and one for writing out a .o file (MCObjectStreamer). MCAsmStreamer is a straightforward implementation that prints out a directive for each method (e.g. EmitValue -> .byte), but MCObjectStreamer implements a full assembler.
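To illustrate the split between the two implementations, here is a deliberately tiny sketch. The names (`ToyStreamer`, `ToyAsmStreamer`, `ToyObjectStreamer`, `emitExample`) are hypothetical and only mimic the shape of the idea; the real MCObjectStreamer does much more (fragments, fixups, relaxation) than recording bytes and label offsets.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical miniature of the streamer idea; these names are NOT LLVM's real API.
class ToyStreamer {
public:
    virtual ~ToyStreamer() = default;
    virtual void emitLabel(const std::string &name) = 0;
    virtual void emitByte(std::uint8_t value) = 0;
};

// Text flavour: prints one directive per call, like a .s file.
class ToyAsmStreamer : public ToyStreamer {
public:
    std::string text;
    void emitLabel(const std::string &name) override { text += name + ":\n"; }
    void emitByte(std::uint8_t value) override {
        text += ".byte " + std::to_string(static_cast<unsigned>(value)) + "\n";
    }
};

// Binary flavour: records raw bytes and resolves each label to a byte offset.
class ToyObjectStreamer : public ToyStreamer {
public:
    std::vector<std::uint8_t> bytes;
    std::map<std::string, std::size_t> labelOffsets;
    void emitLabel(const std::string &name) override { labelOffsets[name] = bytes.size(); }
    void emitByte(std::uint8_t value) override { bytes.push_back(value); }
};

// Client code is written once against the abstract interface.
void emitExample(ToyStreamer &out) {
    out.emitLabel("foo");
    out.emitByte(4);
}

// Usage helpers: run the same client code through each implementation.
std::string exampleAsText() {
    ToyAsmStreamer s;
    emitExample(s);
    return s.text;
}

std::vector<std::uint8_t> exampleAsBytes() {
    ToyObjectStreamer s;
    emitExample(s);
    return s.bytes;
}
```

The same `emitExample` call sequence yields assembly text in one case and raw bytes in the other, which is the property that lets the code generator stay unaware of the output format.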

For example, here are the methods called from the EmitLabel method:

As we can observe, the following classes are involved in this call:

The MCContext class:

The MCContext class is the owner of a variety of unique data structures at the MC layer, including symbols, sections, etc. As such, this is the class that you interact with to create symbols and sections. This class cannot be subclassed.

The MCSymbol class:

The MCSymbol class represents a symbol (aka label) in the assembly file. There are two interesting kinds of symbols: assembler temporary symbols, and normal symbols. Assembler temporary symbols are used and processed by the assembler but are discarded when the object file is produced. The distinction is usually represented by adding a prefix to the label, for example “L” labels are assembler temporary labels in MachO.

MCSymbols are created by MCContext and uniqued there. This means that MCSymbols can be compared for pointer equivalence to find out if they are the same symbol. Note that pointer inequality does not guarantee the labels will end up at different addresses though. It’s perfectly legal to output something like this to the .s file:

foo:
bar:
  .byte 4

In this case, both the foo and bar symbols will have the same address.
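The uniquing behaviour described above (one object per symbol name, so pointer comparison is enough) can be sketched like this. `ToyContext` and `ToySymbol` are hypothetical names of mine, not the MCContext API, and the "L" prefix is simply the MachO convention mentioned earlier, passed in as a parameter.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical miniature of a symbol-owning context; not LLVM's real API.
struct ToySymbol {
    std::string name;
    bool temporary;  // true for assembler temporary labels
};

class ToyContext {
    std::map<std::string, std::unique_ptr<ToySymbol>> symbols;
    std::string tempPrefix;

public:
    // tempPrefix is expected to be non-empty, e.g. "L".
    explicit ToyContext(std::string prefix) : tempPrefix(std::move(prefix)) {}

    // Returns the unique symbol for `name`, creating it on first use.
    ToySymbol *getOrCreateSymbol(const std::string &name) {
        std::unique_ptr<ToySymbol> &slot = symbols[name];
        if (!slot) {
            bool isTemp = name.compare(0, tempPrefix.size(), tempPrefix) == 0;
            slot = std::make_unique<ToySymbol>(ToySymbol{name, isTemp});
        }
        return slot.get();
    }
};

// Usage: the same name always yields the same pointer; different names do not.
bool demoUniquing() {
    ToyContext ctx("L");
    ToySymbol *foo1 = ctx.getOrCreateSymbol("foo");
    ToySymbol *foo2 = ctx.getOrCreateSymbol("foo");
    ToySymbol *bar = ctx.getOrCreateSymbol("bar");
    return foo1 == foo2 && foo1 != bar;
}
```

Note that `foo1 != bar` only says the two symbols are distinct objects; as in the foo/bar listing above, they may still be placed at the same address.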

The MCSection class:

The MCSection class represents an object-file specific section. It is subclassed by object file specific implementations (e.g. MCSectionMachO, MCSectionCOFF, MCSectionELF) and these are created and uniqued by MCContext. The MCStreamer has a notion of the current section, which can be changed with the SwitchSection method (which corresponds to a “.section” directive in a .s file).
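The "current section" notion is simple enough to sketch. Again, `ToySectionStreamer` is a hypothetical class of mine, and the `.text` default is an assumption made for this example only: every emitted byte lands in whichever section was selected last.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical miniature streamer that tracks a current section; not LLVM's real API.
class ToySectionStreamer {
    std::map<std::string, std::vector<std::uint8_t>> sections;
    std::string current = ".text";  // assumed default section for this sketch

public:
    // Counterpart of a ".section" directive: later bytes go to `name`.
    ToySectionStreamer &switchSection(const std::string &name) {
        current = name;
        return *this;
    }

    // Appends one byte to the current section.
    ToySectionStreamer &emitByte(std::uint8_t value) {
        sections[current].push_back(value);
        return *this;
    }

    // Number of bytes emitted into `name` so far (0 if never used).
    std::size_t sectionSize(const std::string &name) const {
        auto it = sections.find(name);
        return it == sections.end() ? 0 : it->second.size();
    }
};
```

For instance, emitting one byte, switching to `.data`, and emitting two more leaves one byte in `.text` and two in `.data`.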

2- DataLayout

The DataLayout class is the only required target description class, and it is the only class that is not extensible (you cannot derive a new class from it). DataLayout specifies information about how the target lays out memory for structures, the alignment requirements for various data types, the size of pointers in the target, and whether the target is little-endian or big-endian.

All TargetMachine methods related to size, alignment, or structure layout are redirected to the DataLayout class, which provides this information.
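To give a feel for the kind of question DataLayout answers, here is a small sketch of a struct layout computation driven by per-type size and alignment. The names (`ToyTypeInfo`, `ToyStructLayout`, `layOutStruct`) are hypothetical, and the rule used (align each field, then pad the tail to the largest alignment) is the common C-style layout, assumed here for illustration rather than taken from LLVM's implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one scalar type on a target, in bytes.
struct ToyTypeInfo {
    std::size_t size;
    std::size_t align;  // must be greater than zero
};

struct ToyStructLayout {
    std::vector<std::size_t> offsets;  // byte offset of each field
    std::size_t size;                  // total size, including tail padding
    std::size_t align;                 // alignment of the whole struct
};

// Rounds offset up to the next multiple of align.
std::size_t alignTo(std::size_t offset, std::size_t align) {
    return (offset + align - 1) / align * align;
}

// Places each field at the next offset that satisfies its alignment.
ToyStructLayout layOutStruct(const std::vector<ToyTypeInfo> &fields) {
    ToyStructLayout layout{{}, 0, 1};
    for (const ToyTypeInfo &field : fields) {
        layout.size = alignTo(layout.size, field.align);
        layout.offsets.push_back(layout.size);
        layout.size += field.size;
        if (field.align > layout.align) layout.align = field.align;
    }
    layout.size = alignTo(layout.size, layout.align);
    return layout;
}
```

With a 1-byte field, a 4-byte field aligned to 4, and another 1-byte field, the offsets come out as 0, 4, and 8, and the struct occupies 12 bytes. Change the alignment of the middle type and the whole layout shifts, which is why this information has to come from the target.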

To sum up, if you want to dig deeper into how the LLVM code generators behave, it is worth getting the Clang source code, building it, and debugging some functions of the ClangCodeGen project to see how the code is generated from the IR.
