Native Types and Generics

First, a status update: I have finished most of the runtime code required for generics, which involves several changes to how virtual function tables (VFTs) work, and some changes to run-time type information (RTTI) which will allow you to do instanceof checks and casts. What I have not yet done is to decide on a way of representing generic types in native code; there are several obstacles to overcome when devising this native representation, which I will try to describe in this post.

Simple Native Types

Before we start on generics, here is a brief overview of how native types work.

Each type is given a standard representation, which is always the same. For example, uint becomes a 32 bit integer, or an i32 in LLVM bitcode. Knowing the compile-time types allows us to know exactly which run-time type a value will have. Here are some examples of various native types:

Plinth Type(s) LLVM Type Description
uint, int i32 A 32 bit integer (signed/unsigned doesn’t make a difference).
boolean i1 A 1-bit integer.
?uint {i1, i32} A tuple of a 1-bit and a 32-bit integer. If the i1 is false, the value is null.
object,
?object
{%RTTI*, %VFT*}* An object is just a pointer to a combination of run-time type information and a virtual function table.
[]ubyte,
?[]ubyte
{%RTTI*, %VFT*,
i32,
[0 x i8]}*
Arrays are pointers to memory where the elements are stored sequentially.
They store RTTI and a VFT for easy conversions to/from object, along with
the length of the array (i32) and the elements themselves ([0 x i8]).
(short, ulong) {i16, i64} Tuples are structures which store each of their element values in turn.
string {%AB*, %AI*} string is a compound type which stores two values:
an []ubyte (%AB*) and a ?[]uint (%AI*). Compound types
are stored in the same way as tuples: using pass-by-value structures.
class Foo
{
  uint x;
}
{%RTTI*, %VFT*,
 %VFT*, i32}*
Classes are represented as pointers, the same way as objects.
They have the standard RTTI and object VFT, plus their own VFT
for every class and interface in their inheritance tree.
They also store each of their fields.
{long -> boolean} {%RTTI*, %opaque*,
 i1 (%opaque*, i64)*}
Functions are actually tuples of three values:
* An RTTI pointer, for easy conversion to/from object.
* An opaque pointer, which is the callee for the function. We don’t care what this points to, just that the function expects it.
* A function pointer, which takes the opaque pointer as the first argument, followed by the rest of the arguments, and returns the return type.

This is the pre-generics design for native types. It supports all of the types which can be expressed in the language (apart from void, which just maps to the LLVM void type for function return types).

Simple Type Conversions

One interesting thing about plinth’s type system is that it is unified, so that we can cast any value to an object.

Obviously, some of these types are pointers that start with %RTTI*, %VFT* just like object; these types don’t need to be converted, we can just take the same pointer and treat it as an object value.

Other types, such as uint, do need to be converted. Unfortunately, this involves a heap allocation: we create an object with an extra field on the end (in this case i32), store the value inside that object, and cast it to a normal object. The same thing applies to tuples, compound types, and functions.

Let me just note that dynamic languages (e.g. Python) have this much easier – they could just say that all types look like objects, and not worry so much about converting between them. The tradeoff is that allocating everything on the heap like that is much less efficient.

Generic Native Types

In plinth, generic types can represent any other type, including tuples, functions, arrays, etc. Because of this, we need their native representation to look like something which all other types can be converted to: object.

Now, because all type parameters are represented as objects, we already know how to convert to and from them: just do the normal conversion to/from object, and we’re done! Or are we…?

Generic Functions

Generic functions are functions which take generic arguments, or return generic results. Here’s an example of a class that uses them:

class Comparator<T>
{
  {T, T -> int} f;
}
// ... somewhere else ...
Comparator<string> comp = ...;
{string, string -> int} cf = comp.f; // A
comp.f = cf;                         // B

This Comparator class stores a generic function in a field. It takes two of an arbitrary type T, and returns an int. Considering that this function’s native representation takes two objects, and returns an i32. How do we convert that function into one that takes two strings and returns an i32? Obviously some conversion needs to be done, but we can’t do it until the function is called.

The only solution here is to create a proxy function that takes two strings, and make it do some conversion from string to object, and call the original function. (In native code we do this by packing the original function into an object, and making that object the opaque pointer that gets stored alongside the function pointer. In effect, we now have something like this:

proxy function (string)
       |
       V
original function (object)

But now we get to line B, and this is where things get interesting. Obviously here, we have to do the opposite conversion, and since we don’t know that the function we’re converting is a proxy function, we have to wrap it again in the opposite conversion, as follows:

proxy function (object)
       |
       V
proxy function (string)
       |
       V
original function (object)

Due to two opposing conversions, we now have a chain of functions. If we somehow did these conversions in a loop, we would accidentally create lots of entries in this chain, and then wonder why function calls are so slow. The only way around this is to detect that there’s a chain of proxy functions when we do the conversion, and find out whether the function we were about to add to the top is also somewhere lower down in the chain, and if so, just remove the proxies above it. Quite complicated for an implicit conversion!

But even that isn’t the worst thing we have to consider.

Generic Arrays

Generic arrays are arrays of type parameters (or arrays of tuples containing type parameters). Let’s consider the following class:

class Foo<T>
{
  []T array;
  void set([]T array)
  {
    this.array = array;
  }
}
// ... somewhere else ...
Foo<uint> foo = ...;
foo.set(new []uint {2, 3, 4, 5});

What happens when we want to convert from []uint to []T?

The first thing that springs to mind is that this is an O(n) implicit conversion. This MUST be avoided. Even if the performance penalty wasn’t an issue (which it definitely is), performing this conversion would create a second array, which would completely change the semantics of the function call (from pass-by-pointer to pass-by-value).

So the main constraint is that there must be exactly one array of values. What if the array itself was an []uint? Could we still use it as an []T? Actually, we can – we just need to know how to convert between the real type inside the array and T. If we know (a) whether the array elements are pointers, and (b) the size of each array element, then we can treat the array as a byte array an do some index calculations to get to a specific element, and if necessary box that element up in an object.

This system works for the simple cases like []T, but starts to break down when you consider things like arrays of tuples or functions: converting an []{T, T -> int} to an []{string, string -> int} is much more difficult, but can be done. What we need is some sort of proxy array, which instead of storing the elements itself has a reference to a base array, and provides two functions, ‘get’ and ‘set’, for accessing the array elements. These ‘get’ and ‘set’ functions can do the conversions between the base array’s element type and the proxied array’s element type. So for the example of []{T, T -> int} we might have:

%ProxyArray = type {%RTTI*, %VFT*, i32, %BaseArray*, %GetFunc*, %SetFunc*}
                                             |
                                             V
%BaseArray = type {%RTTI*, %VFT*, i32, [0 x %StringFunction]}

%GetFunc = type %ObjectFunction (%BaseArray*, i32)
%SetFunc = type void (%BaseArray*, i32, %ObjectFunction)

%ObjectFunction = type {%RTTI*, %opaque*, i32 (%opaque*, %object*, %object*)*}
%StringFunction = type {%RTTI*, %opaque*, i32 (%opaque*, %string, %string)*}

Then, whenever the proxy array wants to get/set a value in the base array, it converts the array element between a {T, T -> int} and a {string, string -> int} as described in the function section above (which involves generating a proxy function for every array access/store).

For this to work efficiently, we’d like to know the native type of the array at compile time. To achieve this, we can say that arrays which contain anything generic will be proxy arrays, and arrays of non-generic types will be normal arrays.

The system, while getting increasingly complicated, looks like it’s coming together. However, there’s yet another problem…

Creating Generic Arrays

Our current model says that arrays of generic things will be proxy arrays, and arrays of non-generic things will be normal arrays. If all you want to do is pass arrays around, this works great. However, what happens if you want to create an array of type []{T, T -> int}?

In this model, the answer is: first, create the base array which stores the concrete type (functions which take strings), then build a proxy on top of it so that we have a valid reference to a generic array. But this is impossible. Lets say we can create the base array (it’s not too hard, it only contains function pointers, and we don’t need to know the argument types to allocate memory for a pointer). How do we create the proxy array? We need to, at run-time, create get/set functions which know how to convert between {T, T -> int} and {X, X -> int} where X is the concrete type. This involves writing a proxy function that takes argument X and converts it to a T. We can’t generate code at run-time, and we don’t know what X is at compile time, so we can’t generate code for the required proxy function at all.

In fact, the only way that I can imagine doing this is to allow only one sort of array: an array which has get/set functions. We could generate a base array which uses the get/set functions to store its own values, or we could generate an array which just proxies another array and does some type conversion. Then, in order to create a generic array like []{T, T -> int}, we could just create an array of the generic functions, and proxy it to an array of concrete functions where necessary. This will be slower. Much slower for arrays of primitive types, since we need to do an indirect function call on every access/store. But at least it will work in all cases.

If anyone (a) understands this, and (b) can think of a better solution, please comment and let me know.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.