Category Archives: Plinth

Plinth Isn’t Dead!

Update 2018-02-26: This was written several years ago, and Plinth development is essentially dead now. (In retrospect, although it was true at the time, this was a bad title and it’s been annoying me ever since.)


I thought I’d write a status update, since it’s been almost four months since I last posted anything here.

Firstly, Plinth isn’t dead! I have been working on it, albeit slightly more slowly than I’d have liked, for the last four months. At first, that was because Generics took me a while to get right, as it required major structural changes throughout the compiler. More recently, I’ve just been away on a couple of holidays. For the last few days, though, development has sped up significantly.

Now, since this is a status update, here’s a list of all of the major things I’ve added since my last post:

Generics [commit]

This took much longer than it should have – partly because it changed so many of the compiler’s data structures, but also because code generation for it was much more difficult than I expected (see last post).

A Test Framework [commits: 1 2 3 4]

There is now a suite of tests that covers everything up to name resolution. I still need to write tests for: cycle checking, type checking, control flow checking, and exception checking.

If you want to run all of the tests, just run `ant run-tests` from the project root directory.

A new array-initialisation syntax [commit]

Before this change, it was impossible to create an array of anything that didn’t have a default value, unless you knew all of its elements in advance. Here’s an example:

[]string foo = new [n]string;

^ This is impossible – you can’t create an array of not-null strings unless you know all of their values beforehand. Here’s what you could do before:

[]string foo = new []string {"foo", "bar"}; // only has two elements, not n
[]?string bar = new [n]?string;             // all elements are nullable

But now, with this new syntax, you can create them much more easily:

[]string foo = new [n]string("hello"); // initialises all of the strings to "hello"

Or, if you define a function somewhere first:

string makeString(uint index)
{
  return "string number " + index;
}
[]string foo = new [n]string(makeString);

This will become much easier to use once closures are added, which will hopefully be soon.

Wildcard types [commits: 1 2]

Wildcards are useful if you want to store an object but don’t care about its generic type arguments. For example, a List<?> can only store one type inside it, but we don’t care what that type is. We can’t store new elements in it, but we can retrieve them as objects.
Adding wildcards required a lot of changes to the way run-time type information works, mainly because of cases like this one:

ArrayList<?> list = new ArrayList<Bar>();
List<? extends Foo> list2 = cast<List<? extends Foo>> list;

To do this sort of check, we need to go through all of the super-types of the value being tested (list) for the target type (List), and then make sure the constraints on the wildcard type argument encompass the actual type argument we get from the object’s run-time type information. This gets especially complicated when you cast to a type like List<? extends Foo super Bar>, or when you have multiple type arguments, so the runtime’s code for checking this is a bit long (although that’s partly because it’s written in LLVM bitcode).
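As a rough outline, the check does something like this (a sketch of the logic only, not the actual runtime code):

// checking 'cast<List<? extends Foo>> value' at run-time, in outline:
// 1. walk value's RTTI to find its List super-type, e.g. List<Bar>
// 2. check that the wildcard's constraints encompass the actual type argument:
//    here, that Bar is a subtype of Foo
// 3. repeat for every type argument; if anything fails, throw a CastError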

A copyright notice [commit]

This should have been added a while ago, but the code is now officially under the three-clause BSD license, so you can use it for almost anything you want.

Array bounds checking [commit]

Now, if you index past the end of an array, you won’t get a segmentation fault any more – an IndexError will be thrown instead. Also, IndexError has been added to the standard library.
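For example (a sketch – I’m assuming the usual square-bracket indexing syntax here):

[]uint arr = new []uint {1, 2, 3};
try
{
  stdout::println("" + arr[7]); // past the end: throws an IndexError
}
catch IndexError e
{
  stderr::println("index out of bounds");
}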

Various standard-library improvements [commits: 1 2 3 4 5 6]

Some new methods have been added to string, such as trim(), which trims all Unicode whitespace characters.
string now has a read-only property called “length” instead of a method, which will be the standard way of exposing the length of something in the standard library (no size() methods on Lists!).
Also, the stdin::readLine() method can now throw EOFException.
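In use, these look something like this (a sketch):

string padded = "  hello  ";
string trimmed = padded.trim(); // trims Unicode whitespace: "hello"
uint len = trimmed.length;      // read-only property, not a method call
// stdin::readLine() can now throw EOFException, so callers must either
// catch it or declare it in their throws clause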

ArrayList<T> and LinkedList<T> implementations [commits: 1 2 3]

These both implement a new List<T> interface, which extends Iterable<T>.
ArrayList<T> is actually implemented as a ring buffer in an array. I’m not sure why this isn’t more standard in ArrayList implementations, but it makes insertions to and deletions from the start of the list O(1) with hardly any overhead.
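To illustrate the ring-buffer idea, here is a sketch of the technique (not Plinth’s actual ArrayList code – the real implementation also handles resizing, bounds checks, etc.):

class RingBufferSketch<T>
{
  []?T data;     // physical storage
  uint capacity; // data's length
  uint start;    // physical index of logical element 0
  uint size;

  create(uint initialCapacity)
  {
    data = new [initialCapacity]?T;
    capacity = initialCapacity;
  }

  ?T get(uint index)
  {
    return data[(start + index) % capacity];
  }

  void addFirst(?T element)
  {
    // O(1): step 'start' backwards (wrapping around) instead of
    // shifting every existing element up by one
    start = (start + capacity - 1) % capacity;
    data[start] = element;
    ++size;
  }
}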

For-each loops [commit]

These loops are extremely flexible: they don’t just iterate through arrays and Iterables, they also work on integers (signed and unsigned) and Iterators. This means you can do things like this:

for uint x in 5
{
  stdout::print(" " + x); // prints: 0 1 2 3 4
}
for short y in value
{
  stdout::print(" " + y);
  // if value == -5, prints: 0 -1 -2 -3 -4
  // if value == 0, prints nothing
  // if value == 3, prints: 0 1 2
}

And this:

Iterator<Foo> iter = list.iterator();
if iter.hasNext()
{
  iter.next(); // skip the first element
}
for Foo foo in iter
{
  stdout::println(foo); // starts at the second element of the list
}

Auto-assign parameters [commit]

These are something I’ve wanted Java to have for a while, but they’re purely syntactic sugar. Basically, they let you define a constructor/method(/property setter) parameter which, instead of creating a local variable, just assigns to a field. It’s easier to show it than to describe it. Instead of writing this:

class Item
{
  string name;
  string description;
  uint price;
  create(string name, string description, uint price)
  {
    this.name = name;
    this.description = description;
    this.price = price;
  }
  // ... getters and setters ...
}

You can write this (also, you can use properties instead of writing getters and setters!):

class Item
{
  property string name;
  property string description;
  property uint price;
  create(@name, @description, @price);
}

The auto-assign parameters get assigned just before the body of whatever you’re calling.
With this change, anything that uses auto-assign parameters gets a body added automatically if you don’t write one, so you can put a semicolon instead of an empty body.
You can also do some cool things like:

void setHalfPrice(@price)
{
  price /= 2;
}

This works because the parameter doesn’t create a local variable, so when you say “price” you’re actually referencing the property.

 

That’s as far as I’ve got right now. Hopefully I’ll add default arguments and varargs soon. However, I do need to do some preparation for job interviews, so I’m not sure when my next post will be.

Native Types and Generics

First, a status update: I have finished most of the runtime code required for generics, which involves several changes to how virtual function tables (VFTs) work, and some changes to run-time type information (RTTI) which will allow you to do instanceof checks and casts. What I have not yet done is to decide on a way of representing generic types in native code; there are several obstacles to overcome when devising this native representation, which I will try to describe in this post.

Simple Native Types

Before we start on generics, here is a brief overview of how native types work.

Each type is given a standard representation, which is always the same. For example, uint becomes a 32-bit integer, or an i32 in LLVM bitcode. Knowing the compile-time types allows us to know exactly which run-time type a value will have. Here are some examples of various native types:

(Each entry gives the Plinth type(s) → the LLVM type, followed by a description.)

  • uint, int → i32: A 32-bit integer (signed/unsigned doesn’t make a difference).
  • boolean → i1: A 1-bit integer.
  • ?uint → {i1, i32}: A tuple of a 1-bit and a 32-bit integer. If the i1 is false, the value is null.
  • object, ?object → {%RTTI*, %VFT*}*: An object is just a pointer to a combination of run-time type information and a virtual function table.
  • []ubyte, ?[]ubyte → {%RTTI*, %VFT*, i32, [0 x i8]}*: Arrays are pointers to memory where the elements are stored sequentially. They store RTTI and a VFT for easy conversions to/from object, along with the length of the array (i32) and the elements themselves ([0 x i8]).
  • (short, ulong) → {i16, i64}: Tuples are structures which store each of their element values in turn.
  • string → {%AB*, %AI*}: string is a compound type which stores two values: an []ubyte (%AB*) and a ?[]uint (%AI*). Compound types are stored in the same way as tuples: using pass-by-value structures.
  • class Foo { uint x; } → {%RTTI*, %VFT*, %VFT*, i32}*: Classes are represented as pointers, the same way as objects. They have the standard RTTI and object VFT, plus their own VFT for every class and interface in their inheritance tree. They also store each of their fields.
  • {long -> boolean} → {%RTTI*, %opaque*, i1 (%opaque*, i64)*}: Functions are actually tuples of three values: an RTTI pointer, for easy conversion to/from object; an opaque pointer, which is the callee for the function (we don’t care what this points to, just that the function expects it); and a function pointer, which takes the opaque pointer as the first argument, followed by the rest of the arguments, and returns the return type.

This is the pre-generics design for native types. It supports all of the types which can be expressed in the language (apart from void, which just maps to the LLVM void type for function return types).

Simple Type Conversions

One interesting thing about Plinth’s type system is that it is unified, so that we can cast any value to an object.

Obviously, some of these types are pointers that start with %RTTI*, %VFT* just like object; these types don’t need to be converted, we can just take the same pointer and treat it as an object value.

Other types, such as uint, do need to be converted. Unfortunately, this involves a heap allocation: we create an object with an extra field on the end (in this case i32), store the value inside that object, and cast it to a normal object. The same thing applies to tuples, compound types, and functions.
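At the source level, the conversions that trigger (and undo) this boxing look something like this sketch:

uint value = 42;
object boxed = value;            // boxing: heap-allocates an object with an i32 on the end
uint unboxed = cast<uint> boxed; // unboxing: checks the RTTI, then reads the i32 back out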

Let me just note that dynamic languages (e.g. Python) have this much easier – they could just say that all types look like objects, and not worry so much about converting between them. The tradeoff is that allocating everything on the heap like that is much less efficient.

Generic Native Types

In Plinth, generic types can represent any other type, including tuples, functions, arrays, etc. Because of this, we need their native representation to look like something which all other types can be converted to: object.

Now, because all type parameters are represented as objects, we already know how to convert to and from them: just do the normal conversion to/from object, and we’re done! Or are we…?

Generic Functions

Generic functions are functions which take generic arguments, or return generic results. Here’s an example of a class that uses them:

class Comparator<T>
{
  {T, T -> int} f;
}
// ... somewhere else ...
Comparator<string> comp = ...;
{string, string -> int} cf = comp.f; // A
comp.f = cf;                         // B

This Comparator class stores a generic function in a field. The function takes two values of an arbitrary type T, and returns an int. Now consider line A: this function’s native representation takes two objects and returns an i32, so how do we convert it into one that takes two strings and returns an i32? Obviously some conversion needs to be done, but we can’t do it until the function is called.

The only solution here is to create a proxy function that takes two strings, does some conversion from string to object, and calls the original function. (In native code we do this by packing the original function into an object, and making that object the opaque pointer that gets stored alongside the function pointer.) In effect, we now have something like this:

proxy function (string)
       |
       V
original function (object)

But now we get to line B, and this is where things get interesting. Obviously here, we have to do the opposite conversion, and since we don’t know that the function we’re converting is a proxy function, we have to wrap it again in the opposite conversion, as follows:

proxy function (object)
       |
       V
proxy function (string)
       |
       V
original function (object)

Due to two opposing conversions, we now have a chain of functions. If we somehow did these conversions in a loop, we would accidentally create lots of entries in this chain, and then wonder why function calls are so slow. The only way around this is to detect that there’s a chain of proxy functions when we do the conversion, and find out whether the function we were about to add to the top is also somewhere lower down in the chain, and if so, just remove the proxies above it. Quite complicated for an implicit conversion!
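In pseudocode, the conversion might do something like this (all of the helper names here are hypothetical):

// converting function 'f' to targetType:
current = f;
while isProxy(current)
{
  current = proxiedFunction(current);
  if typeOf(current) == targetType
  {
    return current; // an equivalent function already exists lower in the chain
  }
}
return createProxy(f, targetType); // no match: wrap as usual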

But even that isn’t the worst thing we have to consider.

Generic Arrays

Generic arrays are arrays of type parameters (or arrays of tuples containing type parameters). Let’s consider the following class:

class Foo<T>
{
  []T array;
  void set([]T array)
  {
    this.array = array;
  }
}
// ... somewhere else ...
Foo<uint> foo = ...;
foo.set(new []uint {2, 3, 4, 5});

What happens when we want to convert from []uint to []T?

The first thing that springs to mind is that this is an O(n) implicit conversion. This MUST be avoided. Even if the performance penalty wasn’t an issue (which it definitely is), performing this conversion would create a second array, which would completely change the semantics of the function call (from pass-by-pointer to pass-by-value).

So the main constraint is that there must be exactly one array of values. What if the array itself was an []uint? Could we still use it as an []T? Actually, we can – we just need to know how to convert between the real type inside the array and T. If we know (a) whether the array elements are pointers, and (b) the size of each array element, then we can treat the array as a byte array and do some index calculations to get to a specific element, and if necessary box that element up in an object.
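In outline, a generic element read then becomes some pointer arithmetic plus an optional boxing step (the names below are hypothetical):

// reading element 'index' from an array whose element type is only known via RTTI:
offset = headerSize + index * elementSize; // elementSize comes from the array's RTTI
raw = readBytes(arrayPointer + offset, elementSize);
if elementsArePointers
{
  return raw;    // already an object pointer: no conversion needed
}
return box(raw); // primitive value: wrap it up in an object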

This system works for the simple cases like []T, but starts to break down when you consider things like arrays of tuples or functions: converting an []{T, T -> int} to an []{string, string -> int} is much more difficult, but can be done. What we need is some sort of proxy array, which instead of storing the elements itself has a reference to a base array, and provides two functions, ‘get’ and ‘set’, for accessing the array elements. These ‘get’ and ‘set’ functions can do the conversions between the base array’s element type and the proxied array’s element type. So for the example of []{T, T -> int} we might have:

%ProxyArray = type {%RTTI*, %VFT*, i32, %BaseArray*, %GetFunc*, %SetFunc*}
                                             |
                                             V
%BaseArray = type {%RTTI*, %VFT*, i32, [0 x %StringFunction]}

%GetFunc = type %ObjectFunction (%BaseArray*, i32)
%SetFunc = type void (%BaseArray*, i32, %ObjectFunction)

%ObjectFunction = type {%RTTI*, %opaque*, i32 (%opaque*, %object*, %object*)*}
%StringFunction = type {%RTTI*, %opaque*, i32 (%opaque*, %string, %string)*}

Then, whenever the proxy array wants to get/set a value in the base array, it converts the array element between a {T, T -> int} and a {string, string -> int} as described in the function section above (which involves generating a proxy function for every array access/store).

For this to work efficiently, we’d like to know the native type of the array at compile time. To achieve this, we can say that arrays which contain anything generic will be proxy arrays, and arrays of non-generic types will be normal arrays.

The system, while getting increasingly complicated, looks like it’s coming together. However, there’s yet another problem…

Creating Generic Arrays

Our current model says that arrays of generic things will be proxy arrays, and arrays of non-generic things will be normal arrays. If all you want to do is pass arrays around, this works great. However, what happens if you want to create an array of type []{T, T -> int}?

In this model, the answer is: first, create the base array which stores the concrete type (functions which take strings), then build a proxy on top of it so that we have a valid reference to a generic array. But this is impossible. Let’s say we can create the base array (it’s not too hard: it only contains function pointers, and we don’t need to know the argument types to allocate memory for a pointer). How do we create the proxy array? We need to, at run-time, create get/set functions which know how to convert between {T, T -> int} and {X, X -> int}, where X is the concrete type. This involves writing a proxy function that takes an argument of type X and converts it to a T. We can’t generate code at run-time, and we don’t know what X is at compile time, so we can’t generate code for the required proxy function at all.

In fact, the only way that I can imagine doing this is to allow only one sort of array: an array which has get/set functions. We could generate a base array which uses the get/set functions to store its own values, or we could generate an array which just proxies another array and does some type conversion. Then, in order to create a generic array like []{T, T -> int}, we could just create an array of the generic functions, and proxy it to an array of concrete functions where necessary. This will be slower. Much slower for arrays of primitive types, since we need to do an indirect function call on every access/store. But at least it will work in all cases.

If anyone (a) understands this, and (b) can think of a better solution, please comment and let me know.

Calling Generic Functions

This week, I finished most of the inheritance checking for generics. However, while moving on to expression checking, I discovered a major syntax problem. Consider this:

<A, B> B foo(?A a)
{
  // foo uses type parameters A and B, takes a ?A, and returns a B
}
// ... somewhere else:
x = foo<uint, string>(7);

This is the syntax I was planning for calls to generic functions. The type arguments are specified after the function’s name and before the actual arguments, which makes for quite a readable syntax. The problem is that it can be parsed in two different ways:

x = a < b, c > (d + e);

Here, x could be either a tuple of two booleans, containing the results of two comparisons, or the result of calling a function called a<b, c> with the argument (d + e). Obviously, this syntax is not going to work.

Here are some of the ways that other languages specify generic types:

  • C++ and C♯ use the syntax that I was going to use, but they don’t have native support for tuples, so they don’t have this problem.
  • Java (which doesn’t natively support tuples either) forces the call to be qualified, and puts the type arguments before the name of the method:
    this.<String, String>foo("hi");

I considered solving this problem by using the same syntax as Java. However, there is another complicating factor: function types. Function types can be generic, for example we can make bar() return foo:

{<A, B>: ?A -> B} bar()
{
  return foo;
}

// now, we want to write the following, but how do we specify A and B?
bar()(null);

Now, if we were using Java’s syntax, this would make it impossible to specify generic type arguments: doing <uint,string>bar()(null) would be ambiguous: the type arguments could refer to bar or the function it returns.

The way that I’ve decided to handle this is to put the generic type arguments inside the parameter list:

foo(<uint, string>: null);
bar()(<bigint, List<bigint>>: null);

This is slightly harder to read, but completely unambiguous, and works well with the syntax for the type of a generic function. Hopefully, Plinth will be able to infer the type arguments in most cases, so that they do not need to be specified at all.

Creating Value Types

There is another problem with this syntax: it isn’t very well suited to creating value types (or compound types, as the syntax currently calls them). Ideally, when creating something, its type arguments should be specified the same way as they are in the type itself. For example:

Foo<uint, string> foo = new Foo<uint, string>();

But creating a value type is actually treated as a function call during parsing, because they look the same to the parser:

bigint b = bigint(4);

Unfortunately, with this new syntax for generics, constructor calls for value types will look nothing like the type being created:

foo<string, uint> a = foo(<string, uint>: ); // (no normal arguments, just type arguments)

As we have already discussed, just moving the type arguments outside the argument list will not work. The only way to fix this is to use a similar syntax to a class creation: put a keyword at the start of the expression. This keyword should not be ‘new’, because ‘new’ implies that a new value is only created at that one point – in fact, a value type is copied every time you pass it into a method. Instead, I have opted to use ‘create’, as it better illustrates the differences between class and value types:

string s = create string(new []uint {89, 69, 83});
foo<string, uint> a = create foo<string, uint>(s, s.length());

A nice side effect of this new keyword is that it also makes sense for constructors, and is shorter than the current ‘constructor’ keyword. So once this is implemented, constructors will be defined as follows:

class Foo
{
  uint x;
  create(uint x)
  {
    this.x = x;
  }
}

A similar change can be made for property constructors, which eliminates the need for ‘constructor’ to be a keyword at all.

Hiding Methods

This week, I was trying to rewrite the inheritance checker for generics. The inheritance checker is the compiler pass which makes sure that there aren’t any problems with overridden methods, and that only abstract classes can have abstract methods.

The question I was faced with was: Can something inherit from the same generic interface twice, with different type arguments? For example:

interface Processor<T>
{
  void process(T t);
}
class MultiProcessor implements Processor<int>, Processor<string>
{
  void process(int i) {}
  void process(string s) {}
}

This makes conceptual sense, and doesn’t have any ambiguities, so it really should be allowed. Unfortunately, there are two problems that can arise if this is allowed. The first is with the low-level implementation of method overriding, which can be changed; but the second is an ambiguity that can’t be solved without substantial changes to the language. Here’s an example that should illustrate the second problem:

interface List<T>
{
  void add(T t);
  T get(uint index);
  uint size();
}
class MultiList implements List<int>, List<string>
{
  void add(int i) { ... }
  void add(string s) { ... }
  int get(uint index) { ... }
  string get(uint index) { ... }
  // what about size()?
}

MultiList tries to be two different lists at the same time. The add() and get() methods should work fine, because they have different signatures (in Plinth, the return type is part of a method’s signature). However, the size() method has the same signature no matter which list it is from. Obviously we don’t want both lists to have the same implementation of size() – we might have 2 ints and 7 strings. In order for everything to still work when we cast from MultiList to List<int> or List<string>, we need to have two different implementations.

But with the current system, if you override a method, it overrides all of the super-type methods with the same signature, making it impossible to have two different size() implementations. The solution is to change the language to allow methods to be hidden instead of overridden. For the unfamiliar, here’s a quick description of the difference:

  • Hiding is what happens to fields: when you make a new one with the same name in a subtype, the one in the supertype still exists, but you can’t access it from the subtype (if you need to access it, you can cast to the supertype first and access it there).
  • Overriding is what happens to methods: when you make a new one with the same signature in a subtype, the one in the supertype gets overwritten with your new one, so that everywhere you access it you are referring to the same thing.

What we actually need to do to solve this sort of problem is to hide a method while providing an implementation for it, i.e. override it and hide it at the same time. We could provide implementations for both size() methods to do different things, and then hide one (or both) of them, so that you get two different answers when you do:

(cast<List<int>> multi).size()
(cast<List<string>> multi).size()

When you do multi.size(), you get whichever of them is not hidden, or a compiler error if both of them are hidden.

The syntax I am considering for this is similar to C♯’s “explicit implementation” syntax for interfaces, but more flexible:

class MultiList implements List<int>, List<string>
{
  // ... add() and get() methods ...
  hiding uint List<int>::size() { ... } // overrides List<int>::size()
  uint size() { ... } // overrides List<string>::size()
}

This gives us implementations for each of the size() methods separately, and hides List<int>::size() so that only List<string>::size() is accessible from MultiList. If we want to call List<int>::size(), we must first cast the MultiList to a List<int>.

This syntax will also allow you to hide a method and provide a new method with the same signature, without overriding it. For example:

class Bar
{
  uint parse(string s) { ... }
}
class Foo extends Bar
{
  hiding uint Bar::parse(string s);

  uint parse(string s) { ... }
}

So, in order to hide a super-type’s method, we declare it with the ‘hiding’ modifier, and instead of just giving its name, we use SuperType::name.

Note: I am not completely decided on the keyword ‘hiding’. While ‘hidden’ might make more sense, I do not expect this feature to be used very often, so I don’t want to use a keyword which people often use as a variable name.

Low-Level Hiding

The other problem I mentioned was with the low-level implementation of overriding. Currently, each method has a “disambiguator”, which is a string representation of its signature, including its name and its parameter and return types. Methods are only overridden by the Plinth runtime if their disambiguator strings are equal.

One problem with this way of doing things is that it doesn’t cope with inheriting members from the same super-type twice. To override both List<int>::get() and List<string>::get(), you would need to provide two methods which have different disambiguators, but the super-type only has one disambiguator for List<T>::get(), so they cannot override it properly with the current system.

Another problem is that it doesn’t allow a method to not override a superclass method, unless it picks a deliberately non-matching disambiguator (and doing so would be bad for binary compatibility).

To solve these problems, the way the runtime generates VFTs will have to be dramatically changed in a way I haven’t fully thought out yet. Luckily, this part of the runtime is only run once when the program is starting up, so the minor efficiency tradeoff that this will probably necessitate is almost certainly worth it.

 

For now, I’ll get back to implementing generics without support for inheriting from the same type multiple times, and support for these features will be added later on.

Generics, Nullability, and Immutability

The last few weeks, I have been working on Generics. Last week, with some awkward grammar hacks, I implemented a parser for them that allows the ‘>>’ token to close two layers of type argument at a time. But this week I have come to a more difficult problem: how do nullability and immutability (the “type modifiers”) work with generics?

I knew that this would be difficult, so the first thing I considered was what problems I was solving by not just disallowing nullable/immutable type arguments. It quickly becomes obvious that doing this would be a bad idea, as it would prevent you from making a List<?Foo> or a Set<#Foo> (which would make generic collections inferior to arrays!).

Before I go into any more detail, I should explain how generics will work in Plinth. Other languages use several different approaches:

  • C++ makes each generic type a template, which is instantiated whenever it is needed. This means re-generating the low-level code for the generic type whenever it is used.
  • Java type-checks everything at compile time, and then erases all of the type information at runtime. This makes it impossible to do several things, like create generic arrays.
  • C♯ uses a system somewhat similar to Java’s, but keeps the run-time-type-information around, which makes it much more flexible.

The C++ approach makes generic nullability and immutability much easier for the compiler, because before it generates code for a generic function, it already knows exactly what type it is working on. The main problem with this approach is that it requires code to be generated at the point where the type is used. This means that any changes made to a generic type will not take effect until all uses of it are recompiled. This type of system would not work for Plinth, because Plinth makes much stricter guarantees about binary compatibility.

Plinth will use a system very similar to C♯’s, but with the type system expanded to support nullability and immutability. One very minor difference is how Plinth will treat value types: it will implicitly cast them all to object, making them pointer types rather than value types. This is necessary so that Plinth only has to generate code for each generic type/method once.

Nullability

The difficult thing about mixing generics with nullability is that in general you can’t know whether or not a type argument is nullable. The following example should illustrate this:

class Foo<T>
{
  T field;
  void foo()
  {
    object o = field; // error: T might be nullable
    field = null; // error: T might not be nullable
  }
}

Both of the errors would actually become a problem when we provide certain type arguments. For example, the first error is set off when we make a Foo<?object>, and the second is set off by Foo<object>.

The compiler represents this type internally as a “possibly-nullable” type, which exists in between nullable and not-nullable. The type hierarchy says that not-nullable is a subtype of possibly-nullable, and possibly-nullable is a subtype of nullable. This allows you to, for example, convert from a T to a ?object, or to convert from a not-null subtype of T to a T.
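For example (a sketch, reusing the auto-assign parameter syntax):

class Box<T>
{
  T value;
  create(@value);
  ?object asNullableObject()
  {
    return value; // fine: the possibly-nullable T converts to the nullable ?object
  }
}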

This subtype relationship gives an obvious way of specifying to the compiler that you don’t want a type to be nullable: say it extends a (not-nullable) object. Unfortunately, there is no easy way to say that a type parameter should always be nullable, but if this is ever necessary it can be worked around by always using ?T instead of T. For example:

class Bar<T extends object, S>
{
  T notNullableField;
  ?S alwaysNullableField;
}

The full inheritance hierarchy for nullability is:

         nullable
            ^
            |
possibly-nullable
            ^
            |
     not-nullable

A <-- B means A is a super-type of B

Immutability

At first glance, immutability looks very similar to nullability where generics are concerned. It has a very similar subtype relationship: not-immutable is a subtype of possibly-immutable, and possibly-immutable is a subtype of immutable. Immutability just has the added restriction that you can’t downcast from immutable to possibly-immutable or from possibly-immutable to not-immutable.
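In code, the one-way nature of these conversions looks like this (a sketch, assuming some class Foo):

Foo f = new Foo();
#Foo imm = f;     // fine: not-immutable is a subtype of immutable
Foo broken = imm; // error: immutability cannot be cast away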

This model is extremely close to complete, and almost works just as well as the one for nullability. It is only a tiny edge case which makes it completely different in practice.

An edge case

Inside an immutable function, any fields you access (such as a property’s backing variable) are implicitly considered both final and immutable. So normally, from an immutable method, you wouldn’t be able to return one of your object’s fields without the result type being ‘#Foo’ instead of just ‘Foo’. Also, by default, a property’s getter is an immutable function.

With just these rules, a getter would either be unable to return a non-immutable object, or it could be marked as ‘mutable’, which allows it to return a non-immutable object but stops it from being called from inside an immutable function.

Luckily, there is an extra rule: when you access a property (as with a field), the resulting value is implicitly immutable if the object you are accessing it on is immutable.

This extra rule doesn’t change anything about what you can do with immutable values, but it does allow us to add a special case to property getters. If a property getter is an immutable function, and the property is of some type ‘Foo’, and the getter tries to return a field of type ‘Foo’ (which is considered implicitly immutable because the getter is an immutable function), then the getter is allowed to return the value even though it is immutable.

To model this type of scenario, the compiler has a concept of ‘contextual immutability’. This is an extension to the type system which cannot be represented explicitly in code, but represents what happens if something gets made immutable purely because you are accessing it in an immutable context.

Generic Immutability

On top of contextual immutability, generics adds the concept of a possibly-immutable value. What happens if we combine them?

Say we tried to access a field with a possibly-immutable type in an immutable context. If the type argument turns out to be immutable, then the value should be immutable, but if the type argument turns out to be not-immutable, the result should be contextually immutable. This situation could be called something like ‘immutable-or-contextually-immutable’. Obviously, this makes the type system much more complicated, but it is required in order to allow properties to deal with generic types properly.

The full inheritance hierarchy for immutability is:

                     immutable
                      ^     ^
                     /       \
                    /         \
                   /        immutable-or-contextually-immutable
                  /            ^
contextually-immutable         |
                 ^             |
                  \         possibly-immutable
                   \           ^
                    \         /
                     \       /
                   not-immutable

A <-- B means A is a super-type of B

Of course, the only ones that you can explicitly specify are ‘immutable’ and ‘not-immutable’.

Plinth Status Update

This week, I thought I’d give an update on which parts of Plinth are finished and what still needs doing.

Last Tuesday, I finished implementing properties. They work as described in the last few blog posts, with a few extra considerations for initialising inherited properties. Since then, I have been fixing various small omissions and other problems, such as an edge case in inheritance checking, and making sure value-typed variables do not alias each other.

I have also added support for “super.foo” to access fields and methods on a super-class/interface. For methods, this results in a non-virtual method call.
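For example (a sketch):

class Base
{
  void greet()
  {
    stdout::println("base");
  }
}
class Derived extends Base
{
  void greet()
  {
    super.greet(); // non-virtual call: always runs Base's implementation
    stdout::println("derived");
  }
}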

There are several major things which still need implementing:

  • Generics
  • Enums
  • Default and Variadic Arguments
  • For-each loops
  • Closures
  • Access Specifiers (public, protected, package, and private)
  • Standard Library
  • Garbage Collector

The next thing I will be working on is Generics, which will support using any type as a type argument, including things like primitives and functions. Generic types will not be erased at run-time; instead, run-time-type-information pointers will be passed into functions and stored inside objects, so that the type can still be used at run-time.

After Generics, the biggest barriers to the language being useful are the standard library and the garbage collector. The standard library includes (at a minimum) various data structures and APIs for using files and network sockets, which should not be too difficult to implement. On the other hand, the garbage collector is a difficult problem to solve efficiently in theory, especially in concurrent scenarios, so I am expecting it to take quite a while.

While these features are being implemented, I am writing documentation for the language, including a language specification. It is very much a work-in-progress, but it is located here: http://bryants.eu/plinth/

If you want to try out the language, some of the commits over the last week have made it much easier. There’s now an ant build script which will compile the compiler, runtime, and standard library. There’s also a bash script which greatly simplifies running the compiler. Everything you need to know should be in the readme on GitHub, but if there’s anything else you’d like to know, please ask.

Property Initialisation

Since my last post, I have been implementing all of the details of properties. They’re almost done now, I just have a bit of code generation to finish.

While I was implementing them, I was thinking about the system for initialisation I described in my last post, and discovered how broken it was.

Escape Analysis

While an object is being initialised, the value of ‘this’ is not allowed to escape from the initialiser or the constructor. For example, you can’t do the following:

class Foo
{
  string uninitialised;
  static ?Foo stored;
  constructor() throws Exception
  {
    stored = this;
    throw new Exception();
  }
  static uint main([]string args)
  {
    try
    {
      Foo f = new Foo();
    }
    catch Exception e
    {
      if stored != null
      {
        stdout::println((cast<Foo> stored).uninitialised);
      }
    }
    return 0;
  }
}

If ‘this’ was allowed to escape the constructor, the stdout::println() would have undefined behaviour, probably causing a crash. So obviously escape analysis is necessary; usually we enforce it by stopping anyone from using ‘this’ or calling an instance method before the object is fully initialised, but how do we cope with it for properties?

One thing to note about properties is that some of them have to be initialised in the constructor. For example, properties which do not have a default value for their backing variable must always be initialised, and final properties must always be initialised exactly once.

The problem is how to do this initialisation. If we call the setter, it could allow ‘this’ to escape (and we can’t just check that it doesn’t, because it could be overridden by a subclass). We could check setters much more strictly, but this would stop us from ever doing a lot of the things we might like to do in a setter, such as using the old value of the property.

Property Constructors

To solve this problem, Plinth allows properties to have constructors, which can only run once during object initialisation, and never again. This is the much more strictly checked version of the setter: it is not allowed to use ‘this’ or depend on anything in the object being initialised, or even depend on anything definitely not yet being initialised (so it can neither read from nor write to final variables).

To reduce the amount of unnecessary code written, if a constructor is required but not implemented, it defaults to using the setter’s implementation. (Of course, this doesn’t always work, and in such cases a separate implementation must be provided).

When you are initialising an object, you are allowed to call a property’s constructor by assigning to it for the first time. However, once the constructor has been called, you are not allowed to call its setter until the object is fully initialised. This can lead to interesting situations where we can’t decide between the constructor and the setter:

class Asdf
{
  property string text
  // (use the default getter)
  setter(value)
  {
    stdout::println("setter");
    text = value;
  }
  constructor(value)
  {
    stdout::println("constructor");
    text = value;
  };

  // constructor for Asdf:
  selfish constructor(boolean b)
  {
    if b
    {
      this.text = "blah"; // prints "constructor"
    }
    this.text = text; // control flow error: impossible to decide between setter and constructor
    this.text = "Asdf"; // prints "setter" (allowed because the object is fully initialised)
  }
}

Static and Final Properties

Properties don’t only work for instances of objects, and they are allowed to be final.

The definition of ‘final’ in Plinth is: ‘must be assigned to exactly once’, with the extra restriction for instance members: ‘must be assigned to before it can be read’. (The same definition is used for both fields and properties).

So what does this mean for the property functions? As you might guess, final properties never have a setter, and must have a constructor. However, depending on what you’re expecting, non-final properties might look slightly weird.

In the case of static non-final properties, they must have a setter, but they cannot ever have a constructor. This is because you could never enforce that the constructor would be called before the getter or setter: a static initialiser from another class might call the setter before you get a chance to call the constructor. Because of this, a constructor would be meaningless.

For non-static, non-final properties, constructors are actually optional. If you declare one, it definitely exists; if you don’t declare one, it *might* not exist, but it’s still possible. A property which has a backing variable which doesn’t have a default value has to have a constructor, for example:

property string s
getter { return s; }
setter(value) { s = value; };

This property has a constructor which defaults to using the setter’s implementation, even though it wasn’t declared. This allows a subclass to override the constructor to do something different during initialisation if necessary.

The truth table for when getters, setters, and constructors exist is as follows:

static  final   getter  setter  constructor
no      no      yes     yes     ?
no      yes     yes     no      yes
yes     no      yes     yes     no
yes     yes     yes     no      yes

The ‘?’ means that there is a constructor if either a) one is declared, or b) the property is not marked as unbacked and has a type which does not have a default value.

Properties

Properties are a special way of storing and accessing data. They usually act exactly like fields, but you can also define custom getters and setters to get/set a property’s value in a special way.

The normal way of defining a property is:

property Foo someFoo;

This defines a property called someFoo of type Foo, with a default getter and setter. The default getter and setter simply assign and retrieve a backing variable. To access this property (assuming it is in a class Bar), we treat it as if it were a field:

Bar bar = new Bar();
Foo oldFoo = bar.someFoo;
bar.someFoo = createNewFoo();

Custom Getters and Setters

Sometimes, it is useful to do some processing in the getter or the setter of a property. In fact, the whole reason for having properties is so that we can override this behaviour if we want to. Here’s how we do it in Plinth:

property string name
getter
{
  return name;
}
setter(newName)
{
  logNewName(newName);
  name = newName;
};

Inside the getter and setter, the references to name actually refer to the backing variable for the property. But they don’t have to be used. In fact, you can declare a property without a backing variable:

string actualName;
unbacked property string name
getter
{
  return actualName;
}
setter(newName)
{
  actualName = newName;
};

In this case, you must define a custom getter and setter, because the default ones cannot be used.

A property can also be abstract, and properties in interfaces are abstract by default. Abstract properties are always unbacked, and have no implementation for their getter and setter. Any types inheriting an abstract property must always define it, even if they do not provide custom getters and setters, as the backing variable must be defined in order to generate the default getter and setter.

Immutability and Properties

By default, properties have a normal setter and an immutable getter. This allows you to read the variable from an immutable context (e.g. inside an immutable method), but modify it only from a normal (mutable) context.

However, this behaviour can be changed using modifiers. If you want a mutable getter, it can be marked as such using the mutable modifier, but the property will no longer be readable from an immutable context:

property uint x
mutable getter
{
  ++x;
  return x;
};
immutable uint getThriceX()
{
  return x * 3; // error: x cannot be read from an immutable context
}

Another thing you can do is mark a property’s setter as immutable. This allows you to call the setter from inside an immutable method:

unbacked property uint y
immutable setter(value)
{
  stderr::println("y = " + value);
}
getter
{
  return 17;
};

Of course, it is usually impossible to alter anything from inside an immutable function, and a setter is no exception. The only way to modify something from within an immutable function in Plinth is to mark the variable being modified as mutable. So we could use our own backing variable and mark it as mutable, or we could mark the property itself as mutable, which accomplishes the same thing. Making a property mutable only affects the backing variable (so mutable unbacked properties are impossible). We can define a property with an immutable setter as follows:

mutable property uint z
immutable setter
immutable getter;

immutable void quadrupleZ()
{
  z *= 4; // works, because the getter and setter are both immutable
}

Here, we declare the property’s setter as immutable. This allows the setter to be called from an immutable context, such as quadrupleZ(). We also declare the getter as immutable for consistency, but this is not necessary since getters are immutable by default.

Returning from a Getter

Forget about properties for a moment. When you read from a field, the value is immutable if the thing you are accessing it on is immutable. For example:

class Foo
{
  Foo next;
}
// in some method:
#Foo myFoo = new Foo();
Foo nextFoo = myFoo.next; // error: myFoo.next is a #Foo, so it can't convert to a Foo

This happens because Plinth has deep immutability, meaning that accessing data on an immutable object results in more immutable data.

The same is true when we use properties: reading a property on an immutable object results in an immutable piece of data. However, reading a property on a normal object results in a normal piece of data. For example:

class Foo
{
  property Foo next;
}
// in some method:
#Foo myFoo = new Foo();
Foo nextFoo = myFoo.next; // error: myFoo.next is a #Foo, so it can't convert to a Foo
Foo otherFoo = new Foo();
Foo otherNextFoo = otherFoo.next; // fine, otherFoo is not immutable, so neither is otherFoo.next

This is fine so far, but now imagine that the getter is a normal method:

class Foo
{
  Foo next;
  immutable Foo getNext()
  {
    return next; // error: next is a #Foo in this context, which can't convert to a Foo
  }
}

As you can see, it is not possible for an immutable method to return one of its fields without the result being immutable.

So how do getters work in this situation? They cheat.

A property’s getter can return values “against contextual immutability”. This means that if it returns a value which is only immutable because it was accessed inside an immutable context, then returning that value works. For example:

class Foo
{
  Foo realNext;

  unbacked property Foo next
  setter(value)
  {
    realNext = value;
  }
  immutable getter 
  {
    if true
    {
      // this works, because although realNext is immutable in this context,
      // immutable getters can return values against contextual immutability
      return realNext;
    }
    #Foo otherValue = new Foo();
    return otherValue; // error: otherValue is explicitly immutable, not just contextually immutable
  };
}

Initialisers and Backing Variables

As with fields, properties can have initialisers. Initialisers simply call the setter with some value, for example:

property uint x = 45
setter(value)
{
  stdout::println("new x=" + value);
  x = value;
};

This will print out “new x=45” during initialisation.

However, there is a subtlety about initialisation which involves the property’s backing variable: accessing the old value from the setter:

property string y = "hello, world!"
setter(value)
{
  string old = y; // error: y may not have been initialised
  stdout::println("old y=" + old + " new y=" + value);
  y = value;
};

If the property’s type does not have a default value (e.g. null or zero), then the backing variable must be initialised before it is read. This means the setter cannot access the old value, since before the setter is called for the first time, the backing variable’s value is undefined.

There is a way around this, but it requires slightly more work. It involves using a custom backing variable and an unbacked property:

string backingStr = "initial";
unbacked property string str = "hello, world!"
setter(value)
{
  string old = backingStr;
  stdout::println("old=" + old + " new=" + value);
  backingStr = value;
}
getter
{
  return backingStr;
};

This will print "old=initial new=hello, world!" during initialisation.

In theory, it would be possible to design the language to allow access to the backing variable from outside the getter and setter. However, the confusion this would cause as to whether or not the setter is called in a given assignment makes it much clearer to use a custom backing variable if access to the old value is needed.

More Exceptions: Semantics

Last week, I discussed how the internals of the stack unwinder work during exception handling. This week, I am going to talk about how Plinth’s exceptions look from a programmer’s point of view, and their semantics.

Exceptional Type System

Everything which can be thrown in Plinth implements an interface called Throwable. If something implements Throwable, it can be thrown and caught using throw statements and try-catch blocks.

Underneath Throwable, there are two classes: Error and Exception. The distinction between these is slightly fuzzy, but the main idea is:

  • Anything which a programmer should expect to have to handle should be an Exception.
  • Anything which a programmer shouldn’t expect should be an Error.

For example, a programmer shouldn’t expect a cast to fail. Since the value being cast is completely under their control, they should check the value for themselves before casting. Thus, a cast failure results in a CastError (which extends Error).

In another case, a programmer might be interacting with some system they don’t have control over, such as the file system. If they try to open a non-existent file, the I/O library might throw a FileNotFoundException (which extends Exception).

Checked Exceptions

This is a feature a lot of languages either skip (C♯) or get slightly wrong (Java).

A checked exception is an exception that the compiler checks the propagation of. If something throws some checked FooException, then that FooException must either be caught or be declared to be thrown onwards.

In Java, this ensures that if something might throw an exception, you have to declare it as being rethrown at every stage until it’s caught. This can be a big problem if you want to use higher-order functions (e.g. as described here), since the higher-order function usually has no idea what exceptions might be thrown through it. You can get around this problem by tunnelling: boxing a checked exception inside an unchecked exception and re-throwing that. The problem with this is that in order to catch the boxed exception again, you must write a catch block which matches any instance of the unchecked exception, and do instanceof checks on the one boxed inside it.

In my opinion, Plinth handles cases like this much better than Java. Instead of having certain types which are checked and others which are unchecked, Plinth considers every Throwable to be checked. However, it allows you to rethrow them as unchecked:

void mightThrow() throws Error, unchecked FooException
{
  if someTest()
  {
    throw new FooException();
  }
  throw new Error();
}

void caller() throws Error
{
  mightThrow();
}

Here, mightThrow() has to declare all of the Throwables it might throw. Because it can throw both FooException and Error, and doesn’t catch either of them, it must have both in its throws clause.

The point here is that it declares FooException as unchecked. Because of this, caller() doesn’t have to declare that it could throw FooException. Nevertheless, it must declare that it throws Error, since it doesn’t catch it.

One point to mention here is that any method can declare that it throws any Throwable type without generating compiler warnings for unnecessary declared exceptions. This is because it could always be rethrowing an unchecked exception as a checked exception.

This system still has the useful property that you have to deal with every exception somewhere, the difference is that now you can explicitly decide to ignore it without writing a whole try-catch statement.

Try-Catch-Finally

Try statements are very useful not just in exception handling, but in general for cleanup code. They function very similarly in Plinth to how they do in other languages.

The basic semantics are that the try block tries to execute, and if it fails at any point by throwing an exception, the language checks each of the catch blocks to see if any of them match the exception. If any do, the first matching catch block is run. The finally block is run after all of the try/catch blocks have finished, even if any of the try or catch blocks throw an exception, return, or try to break out of or continue through an enclosing loop. After the finally block has been run, execution continues wherever it was going before it entered the finally, for example it could continue throwing an exception, or it could branch back to the start of a loop.

An example of a try-catch-finally statement is as follows:

try
{
  callSomething();
}
catch IOException | FooException exc
{
  stderr::println("Caught an IOException or a FooException!");
}
catch Error err
{
  stderr::println("Caught an Error!");
}
finally
{
  doSomeCleanup();
}

As you can see, multiple exceptions can use the same catch block. In the first catch block above, the type of exc is the lowest common super-type of IOException and FooException (or Throwable if there are two equally-close super-types).

Control Flow of Try-Catch-Finally

Control flow checking is a compiler pass which tracks which variables are initialised where. It primarily maintains two sets: the set of initialised variables, and the set of possibly-initialised variables. These sets are used to check that variables are not used before they are initialised, and that final variables are not initialised twice. During a constructor, the control flow checker checks some extra things, such as member variables.
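As a small example of those two sets in action (someCondition is just a stand-in boolean):

uint x;
if someCondition
{
  x = 1; // after the if: x is possibly-initialised, but not definitely initialised
}
// reading x here would be a compile error
x = 2;
// now x is definitely initialised, so reading it is fine
stdout::println("" + x);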

Try-catch-finally statements are easily the most complicated statements in terms of control-flow. If a finally block is involved, the semantics involved in tracking which variables are initialised are very tricky to define. To illustrate this, here is a sketch of the threads of control flow that can happen in a try-catch-finally statement (ignoring break and continue):

[Diagram: the possible threads of control flow through a try-catch-finally statement, showing join-up points (a), (b), and (c), and two copies of the finally block: finally [1] and finally [2].]

The arrow coming from the top is where control begins, and the one exiting at the bottom is where it goes on to any statements after the try-catch-finally statement.

The first thing to notice is that on the way to join-up point (a), the try statement may not be executed. This models the fact that an exception may be thrown at any point during the try statement, causing it to terminate abruptly. In fact, any lines of control flow in the try statement that stop execution (i.e. with a break/continue/throw/return) are also combined into join-up point (a). This allows that point to model accurately which variables have been initialised after the try statement halts abnormally.

After point (a), control goes through the catch blocks. Since each of these can also terminate with a break/continue/throw/return, we must also short-circuit all of the catches on the way to the finally block, as we did with the try block. We then reach join-up point (b).

Next, we must process the finally block. There is only one finally block, but the diagram contains it twice. Finally [1] is there to do a control-flow check on the finally block with an accurate model of all possible control flow states entering it, whereas finally [2] is there to help us find out which variables are initialised when control continues on to the statement immediately after the end of the try-catch-finally statement.

At join-up point (b), we have an accurate model of which variables are definitely initialised and which might be initialised just before the finally block. We use this to check that the finally block doesn't break any control-flow rules about variables.

In order to model the control flow after the try-catch-finally statement, we combine the states at the end of each try and catch block that completes normally into join-up point (c). If a try/catch block never completes normally (because every path through it breaks, continues, throws, or returns), we don't combine it into that join-up point, because it can't affect the control flow after the finally block. Once finally [2] has been processed, we can continue on to whatever statements come next.

But what happens after finally [1]? The answer is that it can go to many different places, depending on what caused us to enter the finally block in the first place. If someone broke out of an enclosing loop through a finally block, control would branch to just after that loop. If someone returned, the function might return; on the other hand, there might be another finally block higher up in the function that we also have to process before returning. In general, execution can go to lots of different places. To implement this, the code generator creates an internal integer variable which records where to jump once the finally block finishes.
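
To make that dispatch concrete, here is a rough C sketch of the lowering; the names are made up for illustration, and Plinth actually generates LLVM bitcode rather than C. Each path into the finally block first records a destination in the internal variable, and a switch after the cleanup resumes the corresponding control path.

#include <stdio.h>

/* A sketch of lowering "try { ... return ... } finally { ... }" using an
   internal destination variable. */
int example(int returnEarly)
{
  int result = 0;
  int finallyDest; /* where to resume once the finally block finishes */

  /* the try block */
  if (returnEarly)
  {
    result = 1;
    finallyDest = 1; /* 1 = return result */
    goto finallyBlock;
  }
  finallyDest = 0;   /* 0 = fall through to the statement after the try */

finallyBlock:
  printf("cleaning up\n"); /* the finally block's body */
  switch (finallyDest)
  {
  case 1:
    return result;   /* the try block returned */
  default:
    break;           /* a real lowering has one case per
                        break/continue/return/throw destination */
  }

  /* the statement after the try-catch-finally */
  return result + 10;
}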

Exception Handling

Before this week, I had no idea how exception handling worked. I assumed languages which use virtual machines do a lot of bookkeeping in order to maintain some data structure to jump back through, but C++ exceptions may as well have been magic.

During the last week, I’ve come to understand a lot more about how they work, and also struggled a lot at working out how to implement them in Plinth.

There are two main ways to do exception handling: use the Itanium ABI, or use calls to setjmp() and longjmp().

Itanium ABI for Exception Handling

The Itanium ABI, along with the variations of it used on other architectures, is often referred to as "zero-cost" exception handling: it costs no CPU time or stack space on the normal path, and only does any work when an exception is actually thrown.

The core of this form of exception handling is implemented by a library called an unwinder. The unwinder is what you call to raise an exception in the first place, and it knows how to unwind the stack and call something called a “personality function” for each function call it unwinds.
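
As a rough illustration, a throw in this scheme eventually bottoms out in a call into the unwinder. Here is a hedged C sketch using the standard <unwind.h> interface; the PlinthException wrapper, the zero exception class, and throwException are all made up for illustration.

#include <stdint.h>
#include <stdlib.h>
#include <unwind.h>

/* a made-up language-level exception wrapper around the unwinder's header */
struct PlinthException
{
  struct _Unwind_Exception unwindHeader; /* must be the first member */
  void *thrownObject;                    /* the language-level exception */
};

static void cleanupException(_Unwind_Reason_Code reason,
                             struct _Unwind_Exception *exc)
{
  free(exc); /* called if e.g. a foreign handler swallows our exception */
}

void throwException(void *thrownObject)
{
  struct PlinthException *exc = calloc(1, sizeof(struct PlinthException));
  exc->unwindHeader.exception_class = 0; /* normally an 8-byte language/vendor tag */
  exc->unwindHeader.exception_cleanup = cleanupException;
  exc->thrownObject = thrownObject;

  /* walks up the stack, calling each frame's personality function */
  _Unwind_RaiseException(&exc->unwindHeader);

  abort(); /* only reached if no frame had a matching handler */
}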

A personality function is a language-specific function which works out what to do when the unwinder unwinds into a given stack frame; it is passed the exception being thrown and the unwinder’s context for this frame. From this context, it can find out what the instruction pointer was at the time of the call, and it uses this in combination with some language-specific data about the current function to decide whether to continue unwinding past this frame, or to jump into a catch or finally block.

It then sets some registers, including the instruction pointer, and execution jumps to a landing pad, which does whatever cleanup the personality function signalled it to do. For example, if the personality function decided that one of the catch blocks was triggered, one of the registers it set would indicate which one; the landing pad branches on the contents of that register and executes that catch block.
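
To give an idea of what this looks like, here is a heavily simplified, hypothetical personality function skeleton in C; a real one parses the function's language-specific data to find the landing pad and the matching catch clause, a lookup which is elided here.

#include <stdint.h>
#include <unwind.h>

_Unwind_Reason_Code examplePersonality(int version, _Unwind_Action actions,
                                       uint64_t exceptionClass,
                                       struct _Unwind_Exception *exceptionObject,
                                       struct _Unwind_Context *context)
{
  /* the instruction pointer at the time of the call being unwound */
  uintptr_t ip = _Unwind_GetIP(context);

  /* a real personality function searches this function's exception table
     for ip; that lookup is not shown here */
  uintptr_t landingPad = 0; /* 0 means this frame has no handler for ip */
  int catchIndex = 0;       /* tells the landing pad which catch matched */
  (void) ip;

  if (actions & _UA_SEARCH_PHASE)
  {
    /* phase 1: just report whether this frame can handle the exception */
    return landingPad != 0 ? _URC_HANDLER_FOUND : _URC_CONTINUE_UNWIND;
  }
  if (landingPad == 0)
  {
    return _URC_CONTINUE_UNWIND;
  }

  /* phase 2: pass the exception and the matched catch index to the landing
     pad in the registers it expects (__builtin_eh_return_data_regno is a
     GCC/Clang builtin giving those register numbers), then jump to it */
  _Unwind_SetGR(context, __builtin_eh_return_data_regno(0),
                (uintptr_t) exceptionObject);
  _Unwind_SetGR(context, __builtin_eh_return_data_regno(1),
                (uintptr_t) catchIndex);
  _Unwind_SetIP(context, landingPad);
  return _URC_INSTALL_CONTEXT;
}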

If a finally block was encountered, and it didn't throw another exception or return in the meantime, the landing pad would eventually call back into the unwinder (via _Unwind_Resume) to resume the unwinding process from where it left off.

setjmp() / longjmp() Exception Handling

setjmp() and longjmp() are C functions which have the interesting semantics that when you call longjmp(jmp_buf, value), execution jumps directly to the original call to setjmp(jmp_buf), which then returns for a second time, this time returning the value passed into longjmp().
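
Here is a minimal, self-contained C demonstration of that behaviour (nothing Plinth-specific about it):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf buffer;

static void fail(void)
{
  longjmp(buffer, 42); /* jump back to the setjmp() call below */
}

int main(void)
{
  int value = setjmp(buffer); /* returns 0 when called directly */
  if (value == 0)
  {
    printf("first return from setjmp()\n");
    fail();
    printf("never reached\n");
  }
  else
  {
    printf("second return from setjmp(), value = %d\n", value); /* prints 42 */
  }
  return 0;
}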

The most important part of this form of exception handling is its linked-list structure. Each function which needs to handle exceptions allocates an entry in this linked list on the stack; in it, the function stores a pointer to the previous entry, along with a buffer for the setjmp() and longjmp() functions to use. The current top of the list is stored in a thread-local variable.

Before calling a function, the caller must first have saved its execution state into its entry's buffer using setjmp(), so that if the callee throws an exception, execution can begin again at the relevant catch or finally block.
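
Putting those two pieces together, here is a hedged C sketch of the whole SJLJ structure; the names are invented, and a real scheme (such as LLVM's) also stores a personality function and language-specific data in each entry.

#include <setjmp.h>
#include <stddef.h>

struct ExceptionFrame
{
  struct ExceptionFrame *previous; /* the next entry down the stack */
  jmp_buf landingPadBuffer;        /* where to longjmp() to on a throw */
  void *thrownException;           /* filled in by throwException() */
};

/* the innermost active frame for this thread (__thread is a GCC/Clang
   extension for thread-local variables) */
static __thread struct ExceptionFrame *topFrame = NULL;

void throwException(void *exception)
{
  struct ExceptionFrame *frame = topFrame;
  topFrame = frame->previous; /* pop the entry before jumping */
  frame->thrownException = exception;
  longjmp(frame->landingPadBuffer, 1);
}

/* a function containing a try block pushes an entry, saves its state with
   setjmp(), and pops the entry again on the normal path */
void tryExample(void (*callSomething)(void))
{
  struct ExceptionFrame frame;
  frame.previous = topFrame;
  topFrame = &frame;

  if (setjmp(frame.landingPadBuffer) == 0)
  {
    callSomething();           /* the try block */
    topFrame = frame.previous; /* normal exit: pop our entry */
  }
  else
  {
    /* callSomething() threw: frame.thrownException holds the exception,
       and our entry has already been popped; catch/finally code goes here */
  }
}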

Apart from the linked-list allocated on the stack, this method of exception handling is very similar to the Itanium ABI. In particular, each linked-list entry contains a pointer to a personality function, which does the same sorts of processing (using the same sort of language-specific data) that the Itanium ABI’s personality function does. The main difference is that it unwinds using longjmp() rather than setting the program counter directly.

As you might have noticed, setjmp() / longjmp() exception handling is much more expensive than the Itanium ABI on the normal, non-throwing path, as it spends CPU time and stack space maintaining the linked-list structure even when nothing is thrown. It can be faster when actually propagating exceptions, but since throwing is meant to be rare, the Itanium ABI is the preferred method most of the time.

LLVM’s Exception Handling

LLVM supports both the zero-cost method and setjmp() / longjmp(). The problem is that it doesn’t support either of them on all platforms.

One example of this is ARM. On ARM, the zero-cost exception handling ABI (EHABI) differs slightly from the Itanium ABI: the unwinder still calls the personality function once per stack frame, but it also delegates the unwinding of that stack frame to the personality function (i.e. the personality function must know how to unwind out of the current function).

From what I can tell, LLVM doesn't yet emit the correct DWARF exception tables (or language-specific data areas, as they are called in the Itanium ABI) necessary to support ARM's EHABI (http://llvm.org/bugs/show_bug.cgi?id=7187), which is why LLVM recommends using setjmp()/longjmp() on ARM.

Another problem is that setjmp() / longjmp() exception handling is broken on most platforms which are not ARM: http://llvm.org/bugs/show_bug.cgi?id=1642

Plinth’s Exception Handling

With Plinth, I am trying as hard as I can to keep the generated bitcode files platform-independent, which means generating platform-independent code to handle exceptions in a given function. The problem is that, from what I can tell, neither of LLVM's methods for exception handling works on all platforms.

In order to get around this, I will probably have to implement my own exception handling system. The design I am currently working on is very similar to the normal setjmp() / longjmp() system, but without personality functions and without any form of DWARF exception tables (since I cannot generate DWARF myself).

Hopefully, next week I will have a basic exception handling system implemented.

Update 2013/02/12: After asking the LLVM developers about exactly how setjmp() / longjmp() exceptions are written in bitcode, it turns out that the bitcode is the same no matter which scheme you use (email link). Since this allows me to generate bitcode that will work on all platforms (given a compatible unwinding library and a personality function), I will be using zero-cost exceptions where possible, and falling back to SJLJ exceptions everywhere else.