The Plinth compiler turns source code into LLVM bitcode. This bitcode is a very versatile format, because it can be translated into machine code that runs on most processors. Not only that, but it can be used to store arbitrary metadata. Plinth takes advantage of this in several ways.
Metadata
Plinth bitcode files do not just store executable code; they also store metadata about how that code works. The high-level structure of a program is encoded in metadata and stored in the generated .pbc files.
The Plinth compiler reads code from source files and generates a single LLVM module for each type definition (class, interface, etc.)1. At this point, it also generates the metadata for that type definition. A class’s metadata consists of:
- The class’s fully-qualified name.
- The super-class’s fully-qualified name (if any).
- A list of fully-qualified names of its implemented interfaces.
- Whether or not the class is abstract or immutable, along with any other applicable modifiers.
- Lists of fields, constructors, and methods.
All of the relevant data about each field, constructor, and method, such as its type, is also stored in the metadata.
The reason for all of this metadata is to allow bitcode files to be used as libraries to import types from. Given the information in the metadata, the compiler can resolve a type and all of its members without ever seeing its source code. It can also generate references to things inside that type which will be filled in at link-time.
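As a rough illustration only, the information listed above might be modelled along the following lines. This is a hypothetical C sketch: the real encoding is a set of LLVM metadata nodes inside the .pbc file, and every name below is made up for the example.

#include <stddef.h>

/* Hypothetical mirror of the per-member metadata described above. */
typedef struct MemberMetadata
{
  const char *name;
  const char *type;   /* e.g. "uint" for a field, or a full signature for a method */
  /* ...plus whatever other per-member details the compiler needs... */
} MemberMetadata;

/* Hypothetical mirror of a class's metadata. */
typedef struct ClassMetadata
{
  const char     *qualifiedName;   /* the class's fully-qualified name       */
  const char     *superClassName;  /* NULL when there is no super-class      */
  const char    **interfaceNames;  /* fully-qualified implemented interfaces */
  size_t          interfaceCount;
  int             isAbstract;
  int             isImmutable;     /* plus any other applicable modifiers    */
  MemberMetadata *fields;
  size_t          fieldCount;
  MemberMetadata *constructors;
  size_t          constructorCount;
  MemberMetadata *methods;
  size_t          methodCount;
} ClassMetadata;

When a .pbc file is imported, the compiler reads this sort of information back out of the metadata to resolve the type, much as if it had parsed the source itself.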
During linking, LLVM resolves all of the function calls and global variable accesses to the right places so that the code can run. But in addition to this, it links all of the types’ metadata together. The result is a .pbc file which contains all of the metadata required to import any type that was linked into it, as well as the code to link against. This allows you to link all of the bitcode files from your library together, and import them all at once by using the compiler switch: -i mylibrary.pbc
Executable Code
The rest of what the Plinth compiler generates is executable code. What actually gets encoded in machine code is a set of functions and global variables. Each method becomes a function, and each static field becomes a global variable; but functions and globals have other uses too, such as run-time type information, virtual function tables, and string constants.
When you create an instance of a class, the code generator does two things: it calls the class’s allocator, which allocates enough memory for the object and fills in its run-time type information (RTTI) and virtual function table (VFT) fields; and it calls a constructor, which performs any user-defined construction that must be done. In order to understand how this works, let’s take a look at the native layout of an object:
class Foo
{
  uint fooField;

  void print()
  {
    stdout::println("Hello, World!");
  }
}

class Bar extends Foo
{
  string barField;

  uint add(uint x, uint y)
  {
    return x + y;
  }
}
Foo’s native representation: (layout diagram not reproduced)
Bar’s native representation: (layout diagram not reproduced)
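Roughly, the layouts that those diagrams show can be sketched in C like this. It is an illustration only: the names are made up, and the single vft field glosses over the fact that an object can carry more than one VFT pointer (see footnote 2).

#include <stdint.h>

typedef struct RTTI    RTTI;     /* run-time type information, described below          */
typedef struct Foo_VFT Foo_VFT;  /* function pointers for Foo's methods                 */
typedef struct Bar_VFT Bar_VFT;  /* starts with entries for Foo's methods, then Bar's   */

struct Foo_object
{
  const RTTI *rtti;      /* identifies the most-derived type of the object */
  Foo_VFT    *vft;       /* virtual function table for the run-time type   */
  uint32_t    fooField;
};

struct Bar_object
{
  const RTTI *rtti;      /* same offset as in Foo_object                   */
  Bar_VFT    *vft;       /* same offset as in Foo_object                   */
  uint32_t    fooField;  /* Foo's field keeps its offset                   */
  void       *barField;  /* stand-in for the string field barField         */
};

Because Bar_object begins with the same members at the same offsets, a pointer to a Bar_object can be used wherever a pointer to a Foo_object is expected, with no adjustment.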
The first thing to point out is that Foo’s native representation is compatible with the start of Bar’s native representation. The differences between Foo and Bar are:
- The RTTI pointer is different, so that it always identifies the most-derived type of the object.
- The VFT pointers are different. VFTs are recalculated at run-time for each class, so that they always contain the most-specific methods for their run-time type. This adjustment is what makes virtual function tables useful, and the fact that it is done at run-time allows for methods to be overridden without breaking binary compatibility. (See also: Binary Compatibility2)
In order to create an instance of Bar, its allocator must allocate enough memory to hold it, and then fill in the RTTI pointer and the VFT pointers. Since the allocator knows the most-derived type of the new object, it can just copy these pointers from some LLVM global variables. Once the object has been allocated, it can be passed into the constructor to be properly initialised by some user-defined code.
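A minimal sketch of that allocator, under the same assumptions as the layout sketch above; the global names are invented, and the real allocator is not necessarily plain malloc.

#include <stdint.h>
#include <stdlib.h>

/* Stand-ins matching the layout sketch above (hypothetical names). */
typedef struct RTTI    RTTI;
typedef struct Bar_VFT Bar_VFT;
typedef struct Bar_object
{
  const RTTI *rtti;
  Bar_VFT    *vft;
  uint32_t    fooField;
  void       *barField;
} Bar_object;

/* Globals the compiler emits for Bar (names are assumptions). */
extern const RTTI Bar_RTTI;
extern Bar_VFT    Bar_class_VFT;

/* Allocate enough memory for the whole object and fill in the pointers
 * that record its most-derived type. A constructor is called separately
 * afterwards to run the user-defined initialisation.
 * (Error handling is omitted from this sketch.) */
Bar_object *Bar_allocate(void)
{
  Bar_object *object = malloc(sizeof(Bar_object));
  object->rtti = &Bar_RTTI;
  object->vft  = &Bar_class_VFT;
  return object;
}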
Run-time Type Information
Run-time type information, as already mentioned, is a block of data (stored in a global variable) that describes a particular type. Every object on the heap stores a pointer to the RTTI block for its type; this includes arrays, normal objects, and values of any other type which have been cast to object.
Some of the data stored in RTTI is quite interesting. There is the obvious byte identifying the sort of type being represented (e.g. array, primitive, tuple), but there is also a special pointer to something called a VFT search list.
A VFT search list is a simple array of tuples, each containing a raw string ([]ubyte) and a pointer to a VFT. The string is the fully qualified name of a (super-)type of the actual run-time type, and the VFT pointer is for the VFT corresponding to that (super-)type. To explain why this is required, we must consider how interfaces are represented in code.
Interfaces are really just lists of (usually abstract) methods which are inherited by any classes deriving from them. The problem is that any class can implement an interface, and they do not have anything in common whatsoever in terms of object layout, so what is the native representation of an interface? The answer is that it must be a pointer to the object, combined with a VFT for all of the methods the interface declares. In LLVM bitcode, this is represented by a simple structure type (passed by value, not stored on the heap):
{ foo.bar.MyInterface_VFT* , object* }
So where does this VFT pointer come from? If we are converting from a type which we know implements this interface, we know exactly the offset to the interface’s VFT pointer, so we can just look it up. However, if we are casting from something else, such as object, we need to search somewhere for the pointer. This is where the VFT search list comes in: we know the fully qualified name of the interface we are casting to, so we can just look up the VFT in this search list. Of course, this is quite inefficient, but it is the only way to perform the cast correctly.
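As a hedged C sketch (every name here is hypothetical, and NUL-terminated C strings stand in for the raw []ubyte strings), the search list and the cast it enables might look something like this:

#include <stddef.h>
#include <string.h>

/* One entry per (super-)type of the object's actual run-time type. */
typedef struct VFTSearchEntry
{
  const char *qualifiedName;  /* e.g. "foo.bar.MyInterface"                 */
  void       *vft;            /* the VFT corresponding to that (super-)type */
} VFTSearchEntry;

typedef struct VFTSearchList
{
  size_t                length;
  const VFTSearchEntry *entries;
} VFTSearchList;

typedef struct RTTI
{
  unsigned char        sort;        /* array, primitive, tuple, object, ...    */
  const VFTSearchList *searchList;  /* used for interface casts and instanceof */
  /* ...other RTTI fields omitted... */
} RTTI;

typedef struct object
{
  const RTTI *rtti;
  /* VFT pointer(s) and fields follow; not needed for this sketch. */
} object;

/* Native representation of a foo.bar.MyInterface value: the interface's
 * VFT plus a pointer to the underlying object, passed around by value. */
typedef struct MyInterface_value
{
  void   *vft;   /* really a foo.bar.MyInterface_VFT* */
  object *obj;
} MyInterface_value;

/* Linear search for the VFT of a named (super-)type.
 * Returns NULL when the named type is not part of the object. */
void *vft_search(const VFTSearchList *list, const char *qualifiedName)
{
  for (size_t i = 0; i < list->length; ++i)
  {
    if (strcmp(list->entries[i].qualifiedName, qualifiedName) == 0)
    {
      return list->entries[i].vft;
    }
  }
  return NULL;
}

/* Cast from object to foo.bar.MyInterface. This version assumes the cast
 * is known to succeed; a checked cast would test the result for NULL. */
MyInterface_value cast_object_to_MyInterface(object *obj)
{
  MyInterface_value result;
  result.vft = vft_search(obj->rtti->searchList, "foo.bar.MyInterface");
  result.obj = obj;
  return result;
}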
There is another reason to have VFT search lists, or at least the string part of them, and that is so that the instanceof operator can check whether a given class/interface makes up part of an object.
Calling Conventions
In order to call functions in native code, all of the arguments must be passed from the caller to the callee, either using registers or on the stack. These calling conventions are handled automatically by LLVM (although it is possible to specify whether to use the C calling convention or one of several others).
Plinth doesn’t really care which low-level calling convention is used, as long as it is used consistently. However, it does care about the way that callees and arguments are passed into functions. One particular influence is the native representation of function types.
Function types are not just represented by a standard function pointer. This is because LLVM functions are essentially static, and passing just a function pointer would make it impossible to pass an instance method around. Instead, function types are actually a tuple of the function’s callee and the function to call on it. For example, if we translated the type of {uint -> boolean}, we would get the LLVM type:
{ opaque* , i1 (opaque*, i32)* }
Here, opaque* means a pointer to some structure we do not know the layout of. So now we can convert any object to an opaque*, and convert the method which takes it as a callee to a function taking an opaque* as its first argument. In order to call it, we pass the callee opaque* as the first argument to the function. This works for instance methods, but for static methods the cast won’t work unless we add an extra unused opaque* argument to the start of the argument list, and pass null as that argument whenever we call it; this is the calling convention for all static methods in Plinth.
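Here is a C sketch of the same idea, with void* playing the role of opaque* and all of the names invented for the example:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Native representation of a {uint -> boolean} function value: the callee,
 * plus a function that takes that callee as its first argument. */
typedef struct
{
  void *callee;
  bool (*function)(void *callee, uint32_t argument);
} Function_uint_to_boolean;

/* An instance method: the callee really is an object. */
static bool Foo_isBig(void *callee, uint32_t x)
{
  (void) callee;  /* a real method would read the object's fields */
  return x > 1000;
}

/* A static method has no callee, so it takes an unused one and callers
 * always pass null for it. */
static bool isEven(void *unusedCallee, uint32_t x)
{
  (void) unusedCallee;
  return (x % 2) == 0;
}

int main(void)
{
  void *someObject = NULL;  /* stands in for a real instance */
  Function_uint_to_boolean f = { someObject, Foo_isBig };
  Function_uint_to_boolean g = { NULL, isEven };  /* null callee: static method */

  /* Calling either function value looks exactly the same: */
  printf("%d %d\n", f.function(f.callee, 1234), g.function(g.callee, 1234));
  return 0;
}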
The most difficult calling convention to get right is the one for interface methods. You might expect the interface callee to be passed in the representation described above, as a tuple of a VFT pointer and an object pointer. Unfortunately, things are much more complicated in practice.
The main consideration is that interface methods can override methods from object or any super-interfaces, and can be overridden by methods from any sub-type. This means that in order to insert interface methods into a VFT, they must conform to the normal calling convention for instance methods: the callee must be a pointer to an object. This has the unfortunate side-effect that every interface method must begin with a cast from object to its own interface type, which (as mentioned above) requires a search through that object’s VFT search list.
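To make that concrete, here is a hedged sketch of what an interface method’s entry point looks like under this convention, reusing the hypothetical names from the earlier sketches:

/* Opaque stand-ins (hypothetical, as in the earlier sketches). */
typedef struct object object;
typedef struct MyInterface_value
{
  void   *vft;
  object *obj;
} MyInterface_value;

/* The search-list cast from the interface sketch above. */
extern MyInterface_value cast_object_to_MyInterface(object *obj);

/* The code that actually implements the method on the interface value. */
extern void MyInterface_describe_body(MyInterface_value self);

/* The function that is stored in VFTs. It conforms to the instance-method
 * calling convention (the callee is a plain object pointer), so it has to
 * start with the relatively expensive search-list cast described above. */
void MyInterface_describe(object *callee)
{
  MyInterface_value self = cast_object_to_MyInterface(callee);
  MyInterface_describe_body(self);
}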
A much more complicated system, which I may investigate in the future, might be able to avoid this search, at the cost of much more indirection for VFT pointers and method calls on interfaces. However, it could be precluded by concerns about binary compatibility.
1: If you run the compiler with the -d <output-dir> option, it will write them to files in that directory.
2: Since this post, the object representation has been changed to add RTTI, an object VFT for builtin methods, and interface VFTs.