On 11/23/2022 12:33 PM, Bart wrote:
> On 23/11/2022 14:38, James Harris wrote:
>> Any advice on what node types should be in an IR tree? There seem to
>> be many options and I am not sure where to strike the balance.
>
> My comments will be about an AST. I don't know whether you are using
> 'IR' as a synonym for the AST; I think it's generally used for the next
> stage beyond AST.
>
Yes, typically they are different stages.
Typically the AST is a tree structured representation of the program.
IR is typically either a stack machine (often also called a "bytecode"
or "IL");
Something along the lines of 3AC (Three Address Code) or SSA (Static
Single Assignment).
In my compilers, I have mostly ended up going from the AST from a stack
machine IL; and then from this IL into a 3AC IR.
There is a difference in that while an AST is usually tree structured,
something like a 3AC IR is typically organized into "basic blocks"
containing a series of operations operating on temporaries or similar.
I have mostly ended up keeping the stack IL as:
AST -> Stack: Easy
AST -> 3AC/SSA: Not as Easy
Along with:
Stack -> 3AC / SSA: Relatively easy;
Serializing an IR in SSA form: Massive pain.
As for AST structure, I have tried a few major strategies:
S-Expressions: Used in my early VM projects (Scheme based);
Made a return for most of my "BGBScript 1.x" VMs (~ 2006..2013).
XML Based, used in the first version of BGBScript;
Still used in BGBCC, which was a fork off the first VM.
Later on, BGBScript switched back to S-Expressions, using a core VM
based on my original Scheme interpreter.
JSON style: Used in my BGBScript2 VM (~ 2015..2018). This technically
worked well, and had a "good compromise" between XML nodes and
S-Expressions. But, this VM was short lived (mostly used in my
"BGBTech2" 3D engine, and mostly died off along with it).
Where, syntactically:
BGBScript was along similar lines to JavaScript and ActionScript.
BGBScript2 was a constrained BGBScript, and mostly went over to a more
Java-like syntax (but retains some elements of its predecessors).
BGBCC was originally a case of, when I was younger, it didn't seem like
that huge of a stretch to go from a "terrible JavaScript knock off" over
to trying to compile C (I was pretty wrong about this, but did
eventually end up with a C compiler...).
Currently I am working on my "BJX2 CPU ISA" project, which mostly uses
BGBCC (my C compiler, but also compiles a variant of BGBScript2). In its
present form, BGBCC is using a similar structure to that of the
dictionary-object approach, but still representing things in terms of
XML abstractions.
In this case, nodes consist of a collection of key/value pairs, where:
Each 'key' is a 16-bit tag (4-bit type, 12-bit symbol ID or index);
Each 'value' is 64 bits, with a meaning depending on the 4b type.
Each node only directly holds 14 key/value pairs, if the node needs more
than this, it expands out B-Tree style, so:
1 node, 1 level, 14 keys;
15 nodes, 2 levels, 196 keys;
211 nodes, 3 levels, 2744 keys;
...
All the keys are kept sorted, so access time is "O(log2 n)".
Most calls to get/set a key in a node supply both a name and a pointer
to a variable to cache the ID number. Typically, the ID is primary, but
the string is used if the ID number hasn't been initialized yet
(originally, everything used strings, but this was not ideal for
performance).
Also originally, nodes were organized into linked lists (with internal
linkage), but this has been eliminated.
But, it sort of awkwardly fakes the original DOM style interface mostly
because otherwise I would have needed to rewrite both the parser and
compiler to use the dictionary objects directly.
There is some trickery, say, something at the AST level like:
<if>
<cond>
<binary op=">">
<left> <ref name="x"/> </left>
<right> <int value="0"/> </right>
</binary>
</cond>
<then> ... </then>
</if>
Is internally represented as:
{ tag: #if,
cond:
{tag: #binary, op: ">",
left: {tag: #ref, name: "S"},
right: {tag: #int, value: 0}}
then: { ... }
}
With named keys holding a node implicitly assumed to represent a
"virtual node" as far as the XML is concerned.
There are no arrays, rather array-like cases are handled via
index-number keys and some additional hackery to deal with nodes with a
large number of children.
For nodes which represent a body section containing an ordered list of
child nodes, these are given a sequentially assigned index number
(almost always, the newest node is added to the end of the list).
For a small to moderate number of indexed child nodes, key ranges could be:
000..3FF: Named Keys
400..7FF: Radix Space
800..FFF: Indexed Keys (reverse numbered)
Where, say, the "radix" hack could be used to significantly extend the
max number of indexed keys (via dividing the index space into multiple
levels of nested nodes).
Though, a case could be made for skipping the whole XML thing...
A case could also be made for using tagged references as values rather
than type-tagging the key.
>
>> My guess is that it's best to keep the number of different node types
>> reasonably small. That should mean less code is needed to walk the
>> tree. But when does that number become too small?
>>
>> For example, consider expressions.
>>
>> There /could/ be one node type for "+" and another for "*", as in a
>> node with these fields:
>>
>> +
>> left
>> right
>>
>> Alternatively, both operators could be handled by a single node type
>> of "dyadic operation". Such a node would have four fields:
>>
>> nodetype = dyadic
>> operator (e.g. "+")
>> left
>> right
>
> The problem is, there are loads of ways to do this, they will all work.
> There is no right or wrong way.
>
> I use both of the above methods across three active compilers. With the
> first, its easy to dispatch directly the "+" node when you need to deal
> with "add".
>
> With the second, you have to dispatch first on "dyadic" and then on
> "operator". But it's not a big deal.
>
I used a "binary" node for all of the binary operators.
My ASTs hold "nearly everything", though for C, things like struct
bodies and typedefs are folded off into their own special lists.