Category Archives: Plinth

Casting and Instanceof

Last week, I implemented the instanceof operator. Doing this made me consider in much more detail how casting and instanceof should work, as there are several ways that they could.

The Unified Type System

A unified type system is one where every type can be cast to some all-encompassing super-type. In Plinth, the name of that type is ‘object’, or to be more specific ‘?#object’ in order to account for nullability and immutability.

When you cast, say, a double to an object, you are implicitly performing a heap allocation so that the double can be stored on the heap inside something that looks like an object. In fact, it is just an object with a double tacked onto the end, and with its virtual functions replaced by ones which extract that double and call the real functions. The same system is used to cast tuples, compounds, functions, and all of the other primitives to objects.

On the other hand, arrays and classes already look enough like objects to be cast to them directly (using the same pointer).

Because we might want to use instanceof on an object, or check whether to throw a CastError before we actually do a cast, every object stores a pointer to a block of run-time type information (RTTI). This RTTI holds data about the sort of type (e.g. ‘primitive’ or ‘array’), and the properties of the individual type (e.g. the kind of primitive it is, or an array’s base type). When we cast a double to an object, that object has an RTTI block for double.

Casting

Different casts are performed in very different ways. A cast from uint to long simply zero-extends the binary value from 32 bits to 64 bits, whereas a cast from a class Foo to object is just a matter of reinterpreting the same pointer as a different type.

Casting from an object to some other type is much more complicated. If the RTTI for the object does not match the destination type properly, we need to throw a CastError. If it does match, we might have to extract a value from inside the object, or maybe reinterpret that object as a class, or possibly search through the RTTI for a super-interface’s VFT and tuple it with the object pointer.

But what happens if we do the following:

long a = 5;
object obj = a;
int b = cast<int> obj;

The object is a long, but longs can be cast to ints easily – the cast is just a truncation from 64 bits to 32 bits.

The problem is knowing that obj is a long. It might be anything from a boolean to a string, in which case the result should definitely be a CastError. Since the RTTI for long doesn’t match the values we expect for an int, the cast won’t be allowed.

However, with a lot of work, it would be possible to allow it. We could write a really long line of checks for whether the run-time-type is int or ubyte or boolean or float or {uint -> string} or [][]double. We would actually only need to check the types that we could convert from, but it would require a huge amount of LLVM code for such a small amount of Plinth code.

This type of “transitive” cast gets especially cumbersome when you consider the fact that you can convert tuples like (ubyte, short, boolean) to (int, long, boolean). If this type of cast were allowed, casting from an object to a tuple would require looking through the RTTI for each of the tuple’s values (recursively) to first check whether the conversion is possible, and second find out how to perform it.

In Plinth, transitive casts are not allowed. In order to handle the sort of cast we tried to do above, we would have to do something like:

int b = cast<int> cast<long> obj;

Instanceof

The instanceof operator (which might be renamed to ‘is’ at some point) allows you to check whether a value is an instance of a given type. For example:

long a = 5;
object obj = a;
boolean isInt = obj instanceof int; // false
boolean isFoo = obj instanceof ?#Foo; // type error: cannot check against a nullable or immutable type

As shown, you cannot check whether something is nullable or data-immutable. This is because these are properties of the reference, not the value itself: even if the reference is immutable, the value behind the reference usually won’t be. Similarly, the value behind a nullable field won’t itself be nullable – it can’t be null. The exception to this is when checking nested types: []boolean is different from []?boolean.

There are several different ways in which instanceof could have worked. For example, it could return true whenever the value is assignable to the type without losing data, so a uint with value 3 would be an instance of ubyte, because it fits inside an 8 bit unsigned value. This scheme would have required the same sort of transitive checking described above.

Another way would have been to return true whenever the assignment would work in a single step without losing data (i.e. the same, but without the transitive checks), which would illustrate exactly when casting would work. This could give you true for (ubyte, Foo, string) instanceof (uint, Bar, string). It could even allow nullable values, returning true if a null value were checked against a nullable type. The problem is that this system is very often confusing and can’t tell you unambiguously whether this number you’ve got is a ubyte.

Plinth uses the simplest and hopefully the most obvious system: value instanceof type is only true when the value is already an instance of type. For classes and interfaces, this means checking against all super-types as well, because those super-types are part of the sub-type. But comparing a short to an int will result in false even if the cast is known at compile time to be just a sign-extension to 32 bits.

Bitcode Representation

The Plinth compiler turns source code into LLVM bitcode. This bitcode is a very versatile format, because it can be translated into machine code that runs on most processors. Not only that, but it can be used to store arbitrary metadata. Plinth takes advantage of this in several ways.

Metadata

Plinth bitcode files do not just store executable code; they also store metadata about how that code works. The high-level structure of a program is encoded in metadata and stored in the generated .pbc files.

The plinth compiler reads code from source files, and generates a single LLVM module for each type definition (class, interface, etc.)1. At this point, it generates the metadata for that type definition. A class’s metadata consists of:

  • The class’s fully-qualified name.
  • The super-class’s fully-qualified name (if any).
  • A list of fully-qualified names of its implemented interfaces.
  • Whether or not the class is abstract, immutable, and other applicable modifiers.
  • Lists of fields, constructors, and methods.

All of the relevant data about each field, constructor, and method is also stored in the metadata, for example their types.

The reason for all of this metadata is to allow bitcode files to be used as libraries to import types from. Given all of the information from the metadata, the compiler can resolve a type and any/all of its members without seeing its source code. It can also generate references to things inside that type which will be filled in at link-time.

During linking, LLVM resolves all of the function calls and global variable accesses to the right places so that the code can run. But in addition to this, it links all of the types’ metadata together. The result is a .pbc file which contains all of the metadata required to import any type that was linked into it, as well as the code to link against. This allows you to link all of the bitcode files from your library together, and import them all at once by using the compiler switch: -i mylibrary.pbc

Executable Code

The rest of what the plinth compiler generates is executable code. What actually gets encoded in machine code is a set of functions and global variables. Each method becomes a function, and each static field becomes a global variable; but there are other uses, such as run-time type information, virtual function tables, and string constants.

When you create a class, the code generator does two things: calls its allocator to allocate enough memory for it and populate its run-time type information (RTTI) and virtual function table (VFT) fields, and calls its constructor to perform any user-specific construction that must be done. In order to understand how this works, let’s take a look at the native layout of an object:

class Foo
{
  uint fooField;
  void print()
  {
    stdout::println("Hello, World!");
  }
}

class Bar extends Foo
{
  string barField;
  uint add(uint x, uint y)
  {
    return x + y;
  }
}
Foo’s native representation:

Foo RTTI pointer
object VFT pointer (adjusted for Foo)
Foo VFT pointer
fooField


Bar’s native representation:

Bar RTTI pointer
object VFT pointer (adjusted for Bar)
Foo VFT pointer (adjusted for Bar)
fooField
Bar VFT pointer
barField

The first thing to point out is that Foo’s native representation is compatible with the start of Bar’s native representation. The differences between Foo and Bar are:

  • The RTTI of the object is different. This means that it always gives the most-derived type of the object.
  • The VFT pointers are different. VFTs are recalculated at run-time for each class, so that they always contain the most-specific methods for their run-time type. This adjustment is what makes virtual function tables useful, and the fact that it is done at run-time allows for methods to be overridden without breaking binary compatibility. (See also: Binary Compatibility2)

In order to create an instance of Bar, its allocator must allocate enough memory to hold it, and then fill in the RTTI pointers and the VFT pointers. Since it knows the most derived type of this new object, it can just copy these pointers from some LLVM global variables. Once it has been allocated, it can be passed into the constructor to be properly initialised by some user-defined code.

Run-time Type Information

Run-time type information, as already mentioned, is a block of data (stored in a global variable) that describes a particular type. All objects on the heap store a pointer to the RTTI block for their type, including arrays, objects, and all other types which have been cast to an object.

Some of the data stored in RTTI is quite interesting. There is the obvious byte representing the sort of type being represented (i.e. array, primitive, tuple, etc.), but there is also a special pointer to something called a VFT search list.

A VFT search list is a simple array of tuples, each containing a raw string ([]ubyte) and a pointer to a VFT. The string is the fully qualified name of a (super-)type of the actual run-time type, and the VFT pointer is for the VFT corresponding to that (super-)type. To explain why this is required, we must consider how interfaces are represented in code.

Interfaces are really just lists of (usually abstract) methods which are inherited by any classes deriving from them. The problem is that any class can implement an interface, and they do not have anything in common whatsoever in terms of object layout, so what is the native representation of an interface? The answer is that it must be a pointer to the object, combined with a VFT for all of the methods the interface declares. In LLVM bitcode, this is represented by a simple structure type (passed by value, not stored on the heap):

{ foo.bar.MyInterface_VFT* , object* }

So where does this VFT pointer come from? If we are converting from a type which we know implements this interface, we know exactly the offset to the interface’s VFT pointer, so we can just look it up. However, if we are casting from something else, such as object, we need to search somewhere for the pointer. This is where the VFT search list comes in: we know the fully qualified name of the interface we are casting to, so we can just look up the VFT in this search list. Of course, this is quite inefficient, but it is the only way to perform the cast correctly.

There is another reason to have VFT search lists, or at least the string part of them, and that is so that the instanceof operator can check whether a given class/interface makes up part of an object.

Calling Conventions

In order to call functions in native code, all of the arguments must be passed from the caller to the callee, either using registers or on the stack. These calling conventions are handled automatically by LLVM (although it is possible to specify whether to use the C calling convention or one of several others).

Plinth doesn’t really care which low level calling convention is used, as long as it is used consistently. However, it does care about the way that callees and arguments are passed into functions. One particular influence is the native representation of function types.

Function types are not just represented by a standard function pointer. This is because LLVM functions are essentially static, and passing just a function pointer would make it impossible to pass an instance method around. Instead, function types are actually a tuple of the function’s callee and the function to call on it. For example, if we translated the type of {uint -> boolean}, we would get the LLVM type:

{ opaque* , i1 (opaque*, i32)* }

Here, opaque* means a pointer to some structure we do not know the layout of. So now we can convert any object to an opaque*, and convert the method which takes it as a callee to a function taking an opaque* as its first argument. In order to call it, we pass the callee opaque* as the first argument to the function. This works for instance methods, but for static methods the cast won’t work unless we add an extra unused opaque* argument to the start of the argument list, and pass null as that argument whenever we call it; this is the calling convention for all static methods in plinth.

The most difficult calling convention to get right is the one for interface methods. What you might expect is that the interface callee is passed in the representation described above, as a tuple of a VFT pointer and an object pointer. Unfortunately, things are much more complicated in practice.

The main consideration is that interface methods can override methods from object or any super-interfaces, and can be overridden by methods from any sub-type. This means that in order to insert interface methods into a VFT, they must conform to the normal calling convention for instance methods: the callee must be a pointer to an object. This has the unfortunate side-effect that every interface method must begin with a cast from object to their own interface type, which (as mentioned above) requires a search through that object’s VFT search list.

A much more complicated system that I may investigate in the future might be able to avoid this search, at the cost of much more indirection for VFT pointers and method calls on interfaces. However, this could be precluded by concerns about binary compatibility.

1: If you run the compiler with the -d <output-dir> option, it will write them to files in that directory.

2: Since this post, the object representation has been changed to add RTTI, an object VFT for builtin methods, and interface VFTs.

Plinth’s Type System

Plinth has an extensive type system. As described in last week’s post, there are several different sorts of user defined types, but there are also lots of different sorts of built-in types:

Primitives

There are three main sorts of primitive type: integers, floating point numbers, and booleans. Integers have both signed and unsigned variants.

ubyte byte boolean
ushort short float
uint int double
ulong long

As in several other languages (and in contrast to C/C++), bytes are 8 bits, shorts are 16 bits, ints are 32 bits, and longs are 64 bits.

One thing to note is that there is no char type. This is because it would just be an alias for an integer large enough to hold all of the possible character code points (i.e. uint), and we may as well use integers instead. Character literals will be allowed, but they will be equivalent to an integer literal for the character code they represent (so writing ‘a’ will be the same as writing 97).

Arrays

Array types are written slightly differently than they are in most other languages. Here are a few array declarations and creations:

[]uint arrayOfUnsignedIntegers = new [5]uint;
[]?float arrayOfNullableFloats = new [9]?float;
[]string arrayOfStrings = new []string {"Hello", "Arrays"};
?[]#object nullableArrayOfImmutableObjects = null;

As you can see, this syntax allows us to represent the nullability and immutability of each part of the type precisely and unambiguously. With the postfix notation that most languages use, this sort of type would be much more difficult to understand.

Array accesses, however, are still postfix. This makes much more sense for indexing, since for multidimensional arrays the indices then appear in the same order as they do in the type signature:

[][]double matrix = new [2][10]double;
double lastValue = matrix[1][9];
[]double lastRow = matrix[matrix.length - 1];
double sameLastValue = lastRow[lastRow.length - 1];

If an array is nullable, array element (or even length) accesses on it are not allowed. You must first cast it to a not-null array type.

Tuples

Tuple types are written by enclosing a list of types in parentheses. They can be declared and assigned to as follows:

(uint, string) test = foo();
(?[]object, boolean) asdf = null, false;

The assignment syntax is quite flexible, and allows you to declare multiple variables at once, while extracting their values from e.g. a function call:

uint x, y = 5, 4; // this is equivalent to: (uint, uint) x, y = 5, 4;
x, y = y, x;
([]string, double) str, d = function();

Tuples can also be nullable:

?(uint, string) test2 = test;
?(bigint, double) tuple = check ? bigint(2), 5.3 : null;

There are two different ways to extract a single value from a tuple:

(bigint, string) values = getValues();
bigint large = values ! 1;                 // (the index into the tuple must be a constant value - calculations are not allowed)
(bigint, string) _, text = values;
large, _ = values;

Of course, splitting up a nullable tuple in this way is not allowed; you would have to cast it to not-null first.

Object

The object type is the super-type of all other types. It can be used as follows:

object nothing = object();
object integer = 5;
object tuple = 7, "hello";
?object nullable = null;
#object immutableArray = new [4]short;

As with several other types, objects can be nullable and/or immutable.

Objects are always stored on the heap. This means that in order to convert primitives, compound types, tuples, and functions to objects, we must store them on the heap, boxed inside objects. This is unfortunate, since this conversion is often implicit and is not as noticeable as a normal heap allocation using ‘new’ would be, but it is required in order to have a unified type system, and it is the only way of doing generics without templated code.

Some weird situations start occurring when you convert a compound type to an object. Compound types are usually passed by value, but when we store them as objects on the heap, all references to them will use the same instance of that object, bypassing the usual pass-by-value semantics.

Functions

Function types are written as a list of parameter types followed by a return type:

{uint, string -> void} foo = someMethod;
#{ -> string} toStr = 9.81.toString;

All methods can be accessed as function-typed values in this way. The interesting thing about function values is that they store not only a function pointer, but also a pointer to the callee to call the function on. This allows you to pass around a method that changes some of an object’s state. For types which are not just pointers (i.e. primitives, compound types, tuples, functions), the callee is first stored as an object on the heap.

Functions can be nullable and/or immutable, but immutability means something different for functions than it does for objects and arrays. Objects and arrays have data-immutability, where an immutable value cannot have any of its data modified. In contrast, functions have method-immutability, in the sense that an immutable method cannot modify any global state. If you call an immutable function, you can be fairly sure that it won’t have any side-effects (modulo native methods and mutable fields).

Strings

string is actually just a compound type, but string values are special because they can be created by using string literals in double quotes, and because they can be concatenated using the + operator. Other than this, strings are an immutable compound type which has methods for common operations like finding the character at a given position, or finding its length.

Strings have full support for unicode; in fact, they are stored in UTF-8 by default because it usually uses much less memory than UTF-32. Since UTF-8 is a variable-length format, with a single character taking anywhere from one to four bytes, getting the length of a string is an O(n) operation. Under the hood, the string type lazily calculates the UTF-32 representation the first time it is needed (e.g. when finding the length), and stores it for later use.

Here is the source code for string, and some actual plinth code which isn’t just an example.


Next week’s post will be on plinth’s generated LLVM bitcode, and run-time type information.

User-Defined Types

Plinth has several different sorts of user-defined type: classes, interfaces, compound types, and (eventually) enums. In this post, I will talk about how each of them works.

Classes

Classes act much as they do in Java and C♯, with single inheritance of classes, but multiple inheritance of interfaces. They are allocated on the heap, and must be created with the new keyword, as follows:

class Foo extends Test implements Runnable
{
  uint value;
  static uint lastValue;
  this(uint value)
  {
    this.value = value;
    lastValue = value;
  }
  void run()
  {
    stdout::println(toString());
  }
}
// ...in a method somewhere...
Foo foo = new Foo(12);

Classes can have all of the different types of members: initialisers, constructors, fields, methods, and properties. All of these (except constructors) can optionally be static, which means they do not require an instance of the class to be used. Members of an instance of a class can be accessed using the . operator, but static members of a class must be accessed with the :: operator on the class’s type. For example:

foo.value = 8;
Foo::lastValue = 97;
foo.run();

Classes also inherit instance members from their super-classes and super-interfaces. All classes (and in fact, all types) get default implementations of certain methods, such as toString() and equals().

Something that all classes must implement is the set of abstract methods and properties which their super-types declare but do not define. One point that is worth noting here is that abstract methods may have implementations which are higher up in the class hierarchy than the abstract method itself (e.g. something may override toString() and make it abstract). In this case, any subclasses would have to provide their own implementation of that method (or use one from an interface – see the next section). This allows interfaces to specify that their implementations do not just use object.toString() or object.equals(...).

Interfaces

Interfaces are the feature I am currently working on. They will work quite differently from those in Java and C♯, in that they will be able to contain implementations of methods. This makes them very similar to mixins (thank you to William Berg for pointing me towards them).

All of their members will be abstract by default, unless an implementation of a given method is provided. They will not be able to contain any data (unless it is static), however they will be able to specify abstract properties that must be implemented by a subclass.

Interfaces will have full multiple inheritance, which is something that can be difficult to implement well. One reason for this is the well-known diamond problem, which is to decide where to look first for a method on an object of type D in the following class hierarchy (remember the method could be implemented on both B and C):

  A
 / \
B   C
 \ /
  D

This problem is fairly easy to solve by starting with the leftmost super-type and checking each parent from left to right in the list, giving the method resolution order of D, B, C, A.

A much more difficult problem is what happens in the following example:

A    B
| \ /|
|  /\|
| /  |\
|/   |/
C    D
 \  /
  E

It may not be clear from the ASCII art, but C extends A before B, whereas D extends B before A. Individually, C and D are well defined, but their subtype E is not. Suppose A and B both implement some method foo(), and something in C depends on inheriting A.foo(), but something else in D depends on inheriting B.foo(). In this case, whichever way around E inherits foo(), it will break something in either C or D.

To combat inconsistencies like this, I am using C3 linearisation to decide on a consistent order for inheritance, or to fail in cases where the class hierarchy is inconsistent. This is the same system used by Python, and it also has the side effect that all super-classes of a class are searched before any of its inherited interfaces.

Compound Types

This is a type whose name I have still not finalised, although its behaviour has been decided. Compound types are very similar to value types, but I don’t want to make ‘value’ a keyword, so I will probably change the keyword from ‘compound’ to ‘value_type’ at some point.

Their main property is that they are stored on the stack (or directly embedded in another object when stored as a field), and because of this, their values are routinely copied into new variables, rather than passed into a function by reference. The only exception to this is when they are the value which the function is being called on, in which case they must be passed by reference in order to allow the function to modify the value.

Due to their fixed size and object layout, they cannot have any form of inheritance (even interfaces). However, they can have all of the normal members that are allowed in classes.

Enums

Enums are a well known concept, but they can be implemented in very different ways. In C, C++, and C♯, they are syntactic sugar for integers, whereas in Java, they are a constant set of Objects. In plinth, they will be more similar to Java’s representation, which allows enum constants to store data.

However, there is a problem: each enum constant must be created statically, which means that there must be an order in which the enum constants are created. What happens if the constructor for the first enum constant tries to use the value for the second, before it has been created? This is a case where using C-style enums would be much easier. (In Java, the JRE ignores the problem and just allows the constants to be null before they are created – something I would like to avoid in plinth.)

The system I am currently considering for plinth is to disallow enums from storing any data which would need to be initialised (similarly to how static variables are already handled). This would mean that the enum constants could be allocated early, and if necessary they could be used safely before any of their constructors are called.

This scheme would make an enum extending a class impossible, because classes need to guarantee that they are initialised before use. However, interface inheritance (including mixins) would be possible.


Next week’s post will be about the rest of the type system, and how it all interacts.

Binary Compatibility

Binary compatibility is a feature which is often overlooked in programming languages. These days, many languages use some form of interpreter or virtual machine (e.g. the JVM, Python’s interpreter, the CLR), which means they don’t have to deal with it.

However, in languages which compile to native code (e.g. C, C++, D, Plinth), there are often many restrictions on what a programmer can change in a library without breaking binary compatibility.

What is binary compatibility?

Binary compatibility relates to backwards compatibility for libraries. For a library to be backwards compatible, programs which were compiled against an old version of it should still work with a newer version, without crashing.

In practice, this means that a newer version of the library has to contain all of the same methods and fields that the old one contained. But not only that: the binary representation of every object must also be compatible, so for example the order of fields inside the binary object may not change. To illustrate the sorts of things you can and can’t do while maintaining binary compatibility, there is a list of rules for C++/Qt here: KDE Binary Compatibility Dos and Don’ts

How does plinth handle binary compatibility?

Plinth tries to allow as much as possible. However, there will always be some constraints. For example, it will always be forbidden to remove a function which existed in a previous version. Here is a list of some of the things which can and can’t be done while maintaining binary compatibility:

You can:

  • Add new static methods
  • Add new static fields
  • Add new non-static (virtual) methods, if you use a since(…) specifier
  • Override a virtual method from a superclass, if you use a since(…) specifier
  • Add new non-static fields to a sealed class (one which cannot be extended), if you use a since(…) specifier
  • Add new type definitions (classes, compound types, interfaces, enums)
  • Change an existing method into a native upcall or downcall
  • Make a field or method more visible (e.g. change it from private to public)
  • Reorder the fields or methods in a class
  • Remove a private static field or method, if it is never used

You can not:

  • Add a new non-static field to a class which is not sealed
  • Change a since(…) specifier in a way that would change the order of fields or virtual methods
  • Change a since(…) specifier on a constructor or a static method
  • Change the type signature of a method
  • Decrease the visibility of a field or method (e.g. change it from public to private)
  • Remove a static field which could have been used externally
  • Remove a static method which could have been used externally
  • Remove a non-static field
  • Remove a non-static method
  • Remove a public class, or change it from public to package visibility
  • Rename any types, fields, or methods

One important thing to point out in this list is that you can always add new virtual methods, and override existing ones. This is impossible in languages like C++, because they generally only have one virtual function table per object.

Virtual Function Tables (VFTs)

A virtual function table is just a list of pointers to functions. In languages like C++, every object which has virtual functions has a VFT to look them up in; inside the object is a pointer to that object’s VFT.

In order to call a virtual function, you must have a pointer to an object, and you must know the function’s index in the VFT. Given this, you can find the VFT by looking inside the object, and find the function by looking up that index in the VFT. Here is an example:

class Foo
{
  uint fooField;
  void print()
  {
    stdout::println("Hello, World!");
  }
}

class Bar extends Foo
{
  string barField;
  uint add(uint x, uint y)
  {
    return x + y;
  }
}
Foo’s VFT:

Foo.print


Bar’s VFT:

Foo.print
Bar.add

As you can see, Bar’s VFT has the same layout as Foo’s VFT, but with some extra methods added on the end. This makes it impossible to add virtual methods to Foo later on and maintain binary compatibility, because anything which extends Foo has to know exactly how many virtual methods Foo has in order to put its own methods after the end of that list.

So how does plinth manage to allow adding virtual methods? The key is to have several different VFTs, one for its actual class, and one for each class which it inherits from:

Foo’s object representation:

  Pointer to Foo’s VFT
  fooField

Foo’s VFT:

  Foo.print

Bar’s object representation:

  Pointer to Foo’s VFT (adjusted for Bar)
  fooField
  Pointer to Bar’s VFT
  barField

Foo’s VFT (adjusted for Bar):

  Foo.print

Bar’s VFT:

  Bar.add

Now, since the VFTs are separate, Bar’s does not have to start with Foo’s. This means that as long as new virtual functions are always added at the end of a VFT, adding them is binary compatible. This is where since(…) specifiers come in.

Since Specifiers

Whenever you use a since(…) specifier, you provide a version number in brackets, such as since(8.4.2) or since(4.10). The compiler uses this to sort the methods before they are added to the VFT, so that methods with the largest since specifier will always go at the end of the VFT (methods without a since specifier always go at the start). So if you use a since specifier to add methods at the end of the VFT, adding methods is binary compatible.

Fields are more difficult, because each object must start with its superclass’s fields and VFT pointers. This makes it impossible to add new fields to a class which might be extended. However, if a class is sealed (i.e. it cannot be extended), then new fields can be added to the end of it without breaking binary compatibility. This is done in the same way as for virtual methods, by using a since(…) specifier.

Adjusting superclass VFTs

In the example above, there is a “Pointer to Foo’s VFT (adjusted for Bar)”. In that example the adjusting process had no effect, but when methods are overridden, this step is important as it ensures the correct method is always called. Consider this example:

class Foo
{
  uint fooField;
  void print()
  {
    stdout::println("Hello, World!");
  }
  since(1.8) uint add(uint x, uint y)
  {
    stderr::println("Foo.add()");
    return x + y;
  }
}

class Bar extends Foo
{
  string barField;
  uint add(uint x, uint y)
  {
    stderr::println("Bar.add()");
    return x + y;
  }
}
Foo’s VFT:

  Foo.print
  Foo.add

Foo’s VFT (adjusted for Bar):

  Foo.print
  Bar.add

Bar’s VFT:

  Bar.add

Without the adjusting process, if we had an object which we knew was a Foo but was actually a Bar as well, we would look in Foo’s VFT for the add method and call Foo.add(), even though the actual object was a Bar – defeating the purpose of virtual functions entirely. With the adjusting process, we would call Bar.add() through the adjusted VFT.

But what if we compiled Bar against version 1.7 of Foo, and then ran it with version 1.8? If the adjustment happened at compile time, the adjusted VFT would only have one element: “Foo.print” (since that’s all that existed in version 1.7), and so calling Foo.add() would make us crash.

Clearly, the adjustment needs to happen at run time, and this is exactly what plinth does. Before execution of any code, plinth will go through each superclass method and search for any overrides of it in subclasses. It does this using VFT descriptor tables which are stored along with the original VFT for that class.

Static members and constructors

During compilation, static members are translated into global variables and functions. This means that accessing them never depends on the layout of an object. Because of this, we can easily have more than one version of a static method in the same class.

Imagine you have a static method with a poor implementation, which you want to deprecate and replace with a newer version that does things slightly differently. Usually, you would have to change the name or the type signature of the method, so that code compiled against the old version of your library would still work while new code was forced to use the new version. In plinth, however, you can just create a new version of the method with a more recent since specifier. Both versions of the method then exist in the library, but anything compiled against this version will always bind to the most recent one; older versions of the method cannot be used once a newer one is available.

The same thing is possible with constructors, since they are called in the same way as static methods, without using a VFT. Static fields are not affected at all by since specifiers (although they are allowed), since only one field can exist with any given name.

Other considerations

Plinth is built with binary compatibility in mind throughout, and features which could cause compatibility problems are avoided where possible.

For example, my last post was about nullability and initialisation. In the Freedom Before Commitment scheme outlined in the Summers & Müller paper, the check for whether an object had been initialised depended on which of the fields inside that object had been initialised. The problem with this is that when you call a constructor, you do not necessarily know about all of the fields of the object. In particular, new fields could have been added to the end of a sealed class, which could allow a new object to be used before all of the new fields were initialised. To combat this, the proposal outlined in last week’s post did not allow not-null fields to be first initialised outside a constructor – constructors must always give values to all not-null fields.

Next week’s post will be about interfaces and how they are implemented.

Nullability

Nullability is a feature that allows you to specify whether or not something can be null. This is useful because it allows the language to enforce that access to a field or method is via a non-null value. This should make NullPointerExceptions a thing of the past!

In fact, the only place in plinth where you might get a run-time error because of a null value is when you cast from a nullable type to a not-null type.

Let’s start by looking at some of the things we can and can’t do with nullable values (the ?: and ?. operators are based on those in languages like Groovy):

// casting between null and not-null values:
Foo foo = new Foo();
?Foo nullable = foo;
Foo notNull = cast<Foo> nullable; // throws a run time error if nullable is null

// accessing fields and methods:
stdout::println(nullable.someText); // error: cannot access a field on a nullable object
nullable.doSomething(); // error: cannot call a method on a nullable object

// the null coalescing operator (giving a default value):
?uint index = null;
uint notNullIndex = index ?: 0;

// the null traversal operator:
// (the result is always null if the left-hand-side is null)
?string str = null;
?string subString = str?.substring(0, 2); // method call (the arguments are only evaluated if str is not null)
?#[]ubyte bytes = str?.bytes; // field access
?{string -> boolean} equalsFunc = str?.equals; // method access

Before I implemented nullability in Plinth, every type was not-null, including custom type definitions. The main thing you need to consider with not-null fields is that they have no default value; this makes plinth’s default behaviour very different from most major object oriented languages.

Because some fields have no default value, they must be initialised before the end of the constructor. In fact, they must be initialised before the newly allocated object could possibly be accessed from outside the constructor. This means that, until all of the fields of this have been given values (or have default ones like zero or null), the value of this cannot escape the constructor. In plinth, making sure of this amounts to stopping people from using this (except for things like this.field = value;) and stopping people from calling non-static methods.

So far this is all fine, but what happens when we have two (or more) classes which need not-null references to each other? This problem is difficult to solve, for a number of reasons. I’ll use the following as a very simple example:

class A
{
  B b;
  this(B b) // <-- constructor
  {
    this.b = b;
  }
}
class B
{
  A a;
  this(A a)
  {
    this.a = a;
  }
}

Obviously, in this scenario, one of the constructors must always be run first. This seems like it makes it impossible to create either of them, but let’s suppose that we have a way of allocating an object without calling its constructor. When an object has just been allocated, it’s still ‘uninitialised’ until its constructor has been run. Now, if we could pass an uninitialised object into another constructor, we might be able to do something like this:

uninitialised A allocA = alloc A;
uninitialised B allocB = new B(allocA);
A a = new (allocA) A(allocB); // creates an A inside allocA, like C++'s placement-new operators
B b = allocB;

But now we have more problems. B’s constructor can’t let this escape until all of the fields are initialised, but now after it’s done this.a = a; it can do whatever it wants, including letting ‘this’ escape and trying to access data inside a (which still hasn’t been initialised yet). We need to mark that the constructor can take certain uninitialised values, and then make sure that it can’t let ‘this’ escape even after everything’s been initialised. We could do this with an ‘uninitialised’ modifier:

class B
{
  A a;
  uninitialised this(uninitialised A a)
  {
    this.a = a;
  }
}

Now, we know exactly what isn’t allowed to escape: ‘this’ and ‘a’ (we only allow uninitialised parameters in uninitialised constructors). Then, once we’ve created allocB, we know that it being initialised depends only on allocA being initialised, since it is the only uninitialised argument.

Next, we call the constructor on allocA, to produce a (A’s constructor will have to be modified the same way as B’s). We now also know that allocA being initialised depends only on allocB being initialised. We know the dependency cycle! All that we need to do is find the solution which makes the most things initialised (i.e. the greatest fixpoint of these simultaneous equations). Then we will know that both of them are initialised, thus allowing the assignments to a and b.

This scheme is based loosely on some discussions with and papers recommended to me by James Elford, in particular the Freedom Before Commitment paper[1]. However, it is not yet anywhere near implemented in plinth, for several reasons:

  • I’ve just come up with it recently, and I have lots of other things before it on my to-do list (interfaces! properties! exceptions!).
  • Processing the type system for uninitialised values would be extremely complicated, and would involve each type having a list of things which its initialisation state is conditional upon. I’d like to fill in the rest of the type system before adding this complexity.
  • It can be easily added later. The only things which would not be backwards compatible with the current language are the keywords.
  • There are workarounds which can be used in the meantime (see below).

That said, this cyclic initialisation scheme is something I would very much like to implement eventually, once all of the problems with uninitialised variables have been well-thought-out.

The slightly ugly workaround which can be used until this scheme is implemented is to have a nullable field and a property to access it. This will give a run-time error whenever a field is accessed on a not-properly-initialised object:

class A
{
  ?B realB;
  property B b
    getter { return cast<B> realB; }
    setter { realB = value; };

  this()
  {
  }
  void setB(B b)
  {
    // this method is only necessary if we make the property's setter private
    this.b = b; // calls property setter
  }
}

Next week’s post will talk about binary compatibility, and how objects in plinth are represented in LLVM bitcode.

References:
[1] Summers, Alexander J., and Peter Müller. “Freedom Before Commitment.” (2010).

Plinth and Immutability

This post was going to be about immutability in Plinth, but I’ve just realised that I haven’t actually announced Plinth here yet. And so…

The Plinth Programming Language

Plinth is the name for a language I have been designing in my spare time for a while. Several of my posts in 2010 and 2011 were about it (then under the working name “Language X”). This year, I started work on an actual compiler for it that could produce executable machine code right from the start. I am currently in the process of building it up to a full language, and one of the more recent things I’ve implemented is immutability.

If you want to try out the compiler, or see updates as it’s being written, the github project is here: https://github.com/abryant/Plinth

And now back to the original topic:

Immutability

A few weeks ago, I was wrestling with how to implement immutability in a way that’s both safe and easy to use. The basic idea of immutability is similar to const in C++:

  • Types can be marked as immutable using a # prefix.
    i.e. #Foo is an immutable Foo
    Any object f with this type cannot have any of its fields modified, so if f had a field b of type Bar, then f.b = null would be illegal.
    Plinth has “deep” immutability, which means that changing any fields of f.b would also be illegal.
  • Methods can be marked as immutable using the immutable modifier.
    When a method has this modifier, it cannot make any changes to anything that might be accessible elsewhere. This means that changing member variables and static variables is forbidden, as is calling any other methods which are not immutable.
  • Constructors must also be marked as immutable if they are to be called from immutable methods. An immutable constructor can modify member variables, but not static variables.

There are several types which can be marked as immutable:

  • Arrays, e.g. #[]uint
  • Class types, e.g. #Foo
  • Compound types, e.g. #bigint
  • Function types, e.g. #{uint -> string}
  • The object type: #object

For arrays, classes, compound types, and the object type, being immutable means that the values inside that variable cannot be changed (e.g. an array element cannot be overwritten); for these four, casting away immutability is forbidden.

However, for function types, it means the opposite: that the function does not modify any external state, so it is an immutable function in the same way methods can be immutable. Because of this difference, function types can be cast away from immutability, but they cannot be cast towards immutability. For example:

#{string -> void} func = someMethod; // fine if someMethod is immutable
{string -> void} f2 = func; // fine, func doesn't change anything
#{string -> void} f3 = f2; // error: f2 might change something

Aside: With this type of immutability, making an object cache the result of some calculations would be impossible. For this reason, fields can be marked as ‘mutable’, which means that they can be modified (assigned to or have their state changed) from within an immutable context. As an example, this type of caching is used in Plinth’s string type, to lazily find its UTF codepoints.

Now, consider what happens when we write a getter (properties will exist eventually, but suppose we want to write a method):

class Foo
{
  // ... constructor ...
  Foo next;
  immutable Foo getNext()
  {
    return next; // error: next is immutable in this context, and cannot be returned
  }
}

So we can’t write an immutable getter that would extract a non-immutable member. We would have to write two getters, one which returns immutable values, and another for non-immutable values. This is extremely inconvenient, both in terms of writing duplicated getters and disambiguating between the two of them.

One way to try to get around these inconveniences is to have a concept of contextual immutability, which is added whenever you access a member in an immutable context. Immutable methods could then return contextually immutable things, but not explicitly immutable things, allowing us to return next as a Foo rather than a #Foo in this case and working around the duplication. The result of the method call would then have the same immutability as the thing we are calling the method on.

Unfortunately, this presents a problem for factory functions:

class Factory
{
  immutable Foo createFoo()
  {
    return new Foo();
  }
}
// in some method:
#Factory factory = ...;
Foo created = factory.createFoo(); // error: the result is immutable, because factory is immutable

Here, it is impossible to have an immutable Factory create a Foo which is not immutable. This is because we assumed that immutable methods would always return members, but in cases like this they do not.

In order for this to work properly, we would need to add another modifier to the method, stating that it can return contextually immutable values. In my opinion, this would overcomplicate method calls, so I am leaving them to return either always-immutable or never-immutable values.

However, the confusion here does not apply to properties, since they are designed to be accessed as if they were fields. So we can allow property getters to return values despite contextual immutability:

class Foo
{
  Foo realNext;
  property Foo next
    getter { return realNext; } // works, despite realNext being a #Foo in this context
    setter { realNext = value; };
  // NOTE: we could just do "property Foo next;" instead
}
// in some method:
Foo foo = ...;
Foo next = foo.next; // getter can return a contextually immutable value
#Foo immFoo = foo;
#Foo immNext = immFoo.next; // immutable callee forces immutable result

This gives us the same functionality as it would in methods, except that property getters cannot be used as function values (but even then, you can wrap the access in a closure if necessary).

Hopefully I’ll write about Plinth development more often from now on, since I’ve been getting a lot of the core language implemented recently. The next topic will be something that Plinth does differently from a lot of mainstream languages: nullability.