XLR type system

16 views
Skip to first unread message

Christophe de Dinechin

unread,
Mar 13, 2011, 2:50:03 PM3/13/11
to xlr-...@googlegroups.com
Let me try to put together a few thoughts on the XLR type system.

1) XLR is homoiconic, i.e. source code is represented as a data structure. There are 8 basic parse tree node types (integer, real, text, name/symbol, infix, prefix, postfix, block). These are internally represented as C++ classes that derive from the "Tree" type, and are named Integer, Real, Text, Name, Infix, Prefix, Postfix and Block.

2) All XLR data has a canonical form based on its source representation. For example, a list like 1,2,3,"ABC" has a source code representation made of infix "," nodes, and that is the canonical representation for the list. Any such data is represented in the C++ code as a Tree * and is managed by the XLR garbage collector.

3) Data can also have non-canonical "optimized" representations, typically intended to more efficient. For example, an integer value like 3 can be represented using a machine integer instead of a Tree*. Similarly, a list like 1,2,3,"ABC" can be represented more efficiently with a data type equivalent to struct {int x, y, z; char *t; }.

4) The canonical representation includes dynamic type information, i.e. it is possible to tell at runtime if a given tree pointer points to an "Integer" or a "Block": It also includes other information that is not necessary for the proper execution of the XLR program (e.g. source code position). Optimized data representations elides such information either when it's not used or when it can be computed at compile-time.

5) Types in XLR represent shapes of trees. For example, a type can indicate trees that look like "if X then Y else Z". The same value can belong to a multiplicity of types. For example, 1,2,3 belongs to types 'tree', 'infix' but also to more specialized types, e.g. a type matching only 1,2,3.

6) Special type names are predefined to identify basic node types:
- The 'tree' type covers any value, i.e. it corresponds to Tree*
- The 'integer', 'real' and 'text' types correspond to integer, real and text values
- The 'symbol', 'name' and 'operator' types correspond to the Name class with different values:
+ 'symbol' covers any Name
+ 'name' covers symbols beginning with alphabetical characters, such as X or Foobar_42
+ 'operator' covers other symbols, such as + or ->
- The 'infix', 'prefix', 'postfix' and 'block' types correspond to the similarly named C++ classes
- The 'boolean' type corresponds to values 'true' or 'false'

7) Type specifications have the form "X : T", where X is an expression and T is a type, and tell the compiler that expression X must have type T. For example, A:integer indicates that the name A represents an integer value. This is used for example on the left-hand side of a rewrite, e.g. "sin x:real -> ..." defines a 'sin' function that takes a real argument named x.

8) Types are defined the same way as other entities in XLR, using the -> operator. For example, "number -> integer | real" defines a union type named "number" that matches either an integer or a real.

9) Type expressions can have the following form:
- A litteral matches exactly the given value, e.g. 0 matches only the value 0, "ABC" only the text "ABC"
- A type name (referring to the named type expression)
- T|U is a union type for values matching either type T or type U.
- {P} is the type matching the pattern P (similar to patterns on the left of ->)
- (T) is the same as T

For example, the following defines a comma-separated type named "vector3" composed of three real numbers called x, y, and z with a positive product:
vector3 -> {x:real, y:real, z:real when x*y*z >= 0}

10) When evaluating expressions, all possible definitions are tested in program order. In other words, any definition in XLR is "virtual" by default. For example, consider:
color 0 -> "red"
color 1 -> "blue"
color N:integer -> "yellow"
color N:real -> "green"
color T:text -> "pink"
color Other -> "black"
The expression "color X" will evaluate to one of the candidates depending on the shape of X.

11) XLR uses an extension of the Dalmas-Hindley-Milner type inference algorithm to deduce type information from the program. See http://en.wikipedia.org/wiki/Type_inference for an overview. The general idea is to be able to restrict the possible types for a given expression so as to optimize the generated code.

For example, consider the following computation of the Fibonacci sequence:
(x:integer+y:integer):integer -> opcode Add
(x:integer-y:integer):integer -> opcode Sub
fib 0 -> 1
fib 1 -> 1
fib "A" -> 27
fib N -> (fib(N-1) + fib(N-2))
fib 25

With this code, the XLR compiler can deduce that all expressions have the "integer" type, and rule out that fib "A" is ever called. As a result, all operations can be performed using an optimized integer representation, with performance on a par with C.

12) The optimized LLVM type corresponding to the predefined types are as follow:
- "integer" maps to an i64 (64-bit integer type)
- "real" maps to a double data type
- "boolean" maps to an i1 (1-bit integer type)
- "text" maps to a char *

13) [Not implemented yet] The optimized LLVM type corresponding to a pattern where all variables are optimized types is a structure with the corresponding optimized types. For example, {x:real, y:integer} will be represented by a struct containing a double and an i64. Structure types will be represented by pointers, e.g. {x:vector3} will be represented by a vector3* (pointer to vector3).

14) [Not implemented yet] It is not yet clear how dynamic storage for optimized values will work. The XLR garbage collector is currently template-based, although the core classes are size-based. So it's probably not that difficult to leverage that GC.

15) The conversion between optimized and canonical data representations is automatic. Conversion code is inserted at the transition points between code where the types are known and code where the types are dynamic. This conversion process is called "autoboxing" in the compiler.

16) [Not implemented yet] It's possible to optimize cases such as "integer|real" to represent the value with a 64-bit field and a type tag. This might be particularly useful on 64-bit machines, where a pointer would also be 64-bit.


Food for thought...

Reply all
Reply to author
Forward
0 new messages