Compiler IR tree or AST - what node types to include?

James Harris

unread,

Nov 23, 2022, 9:40:23 AM11/23/22

to

Any advice on what node types should be in an IR tree? There seem to be
many options and I am not sure where to strike the balance.

My guess is that it's best to keep the number of different node types
reasonably small. That should mean less code is needed to walk the tree.
But when does that number become too small?

For example, consider expressions.

There /could/ be one node type for "+" and another for "*", as in a node
with these fields:

+
left
right

Alternatively, both operators could be handled by a single node type of
"dyadic operation". Such a node would have four fields:

nodetype = dyadic
operator (e.g. "+")
left
right

That could be further generalised by having one type of node which would
be usable for both monadic and dyadic operators:

nodetype = operation
operator
number of arguments: 1 or 2
arguments

Further, why limit the number of arguments to 1 or 2? If one were to
allow an arbitrary number of arguments then such a type of node could
also be used for function calls.

In fact, that form is effectively an s expression of the form

(operator arguments)

So for expressions, which of the above schemes is best to use for nodes
in an IR tree?

(A potential fly in the ointment is whether such a node is suitable for
functions which modify some of their parameters.)

Then there are other (non-expression) nodes.

To allow for things like code hoisting I think I need the tree initially
to be at a fairly high level, containing things such as loops and
selections though it would be lowered later to use labels and
conditional branches.

So what other node types are needed? One could have

program
function
identifier: constant or variable or parameter
assignment
iteration
sequential selection
switch
break iteration
break sequential selection
break switch

Is that enough? Too many? Too few? What types are missing?

That's an idea of the kinds of node I am thinking about but I'd
appreciate your suggestions on what types to use in reality.

--
James Harris

Dmitry A. Kazakov

unread,

Nov 23, 2022, 10:39:43 AM11/23/22

to

On 2022-11-23 15:38, James Harris wrote:
> Any advice on what node types should be in an IR tree? There seem to be
> many options and I am not sure where to strike the balance.
>
> My guess is that it's best to keep the number of different node types
> reasonably small. That should mean less code is needed to walk the tree.
> But when does that number become too small?
>
> For example, consider expressions.
>
> There /could/ be one node type for "+" and another for "*", as in a node
> with these fields:
>
> +
> left
> right
>
> Alternatively, both operators could be handled by a single node type of
> "dyadic operation". Such a node would have four fields:
>
> nodetype = dyadic
> operator (e.g. "+")
> left
> right

I have an abstract node type with derived node types for expression
terms like identifiers and literals and for branches. The branch node
type is like:

operation - an enumeration
location - the source code location
operands - the nodes list (1..n)

operation can be for example "+" or "call." E.g.

f(a,b,c)

produces a node

operation => call
operands => f,a,b,c

a + b + c

produces

operation => +
operands => a,b,c

(a,b,c)

produces

operation => parenthesis
operands => a,b,c

etc.

> Further, why limit the number of arguments to 1 or 2? If one were to
> allow an arbitrary number of arguments then such a type of node could
> also be used for function calls.
>
> In fact, that form is effectively an s expression of the form
>
> (operator arguments)

Yes, this is what I use in my parser. Furthermore the parser combines
associative operations into n-arguments like + above. Inverse when added
to also combines - or /. E.g.

a - b + c

produces

operation => +
operands => a,-b,c

> So what other node types are needed? One could have
>
> program
> function
> identifier: constant or variable or parameter
> assignment
> iteration
> sequential selection
> switch
> break iteration
> break sequential selection
> break switch

operator a+b
call f(a,b)

Then various ligatures that are not operators, but some meta-operators
e.g. if you have built-in ranges a..b, components a.b, namespaces a::b
etc. Usually you cannot treat them as first-class operators, because a
in a::b might not be an expression.

I do not create one tree for all program, just for individual
expressions. IMO there is no need to generate a tree for a switch or a
loop, unless you have switch-operator and loop-operator, of course.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Bart

unread,

Nov 23, 2022, 1:33:57 PM11/23/22

to

On 23/11/2022 14:38, James Harris wrote:
> Any advice on what node types should be in an IR tree? There seem to be
> many options and I am not sure where to strike the balance.

My comments will be about an AST. I don't know whether you are using
'IR' as a synonym for the AST; I think it's generally used for the next
stage beyond AST.

> My guess is that it's best to keep the number of different node types
> reasonably small. That should mean less code is needed to walk the tree.
> But when does that number become too small?
>
> For example, consider expressions.
>
> There /could/ be one node type for "+" and another for "*", as in a node
> with these fields:
>
> +
> left
> right
>
> Alternatively, both operators could be handled by a single node type of
> "dyadic operation". Such a node would have four fields:
>
> nodetype = dyadic
> operator (e.g. "+")
> left
> right

The problem is, there are loads of ways to do this, they will all work.
There is no right or wrong way.

I use both of the above methods across three active compilers. With the
first, its easy to dispatch directly the "+" node when you need to deal
with "add".

With the second, you have to dispatch first on "dyadic" and then on
"operator". But it's not a big deal.

> That could be further generalised by having one type of node which would
> be usable for both monadic and dyadic operators:
>
> nodetype = operation
> operator
> number of arguments: 1 or 2
> arguments
>
> Further, why limit the number of arguments to 1 or 2? If one were to
> allow an arbitrary number of arguments then such a type of node could
> also be used for function calls.

How practical this is depends on your language, the arrangements for
matching an arbitrary length list to the node.

Currently I allow 3 operands for one language, and 2 for another.
Remember this not just about expressions, it has to be able to represent
anything.

For variable-length lists, any of the operands (although usually just
the first) is interpreted as the root of a linked list. (With some
compiler versions in dynamic code, each operand could directly be a list.)

> In fact, that form is effectively an s expression of the form
>
> (operator arguments)
>
> So for expressions, which of the above schemes is best to use for nodes
> in an IR tree?
>
> (A potential fly in the ointment is whether such a node is suitable for
> functions which modify some of their parameters.)
>
> Then there are other (non-expression) nodes.
>
> To allow for things like code hoisting I think I need the tree initially
> to be at a fairly high level, containing things such as loops and
> selections though it would be lowered later to use labels and
> conditional branches.
>
> So what other node types are needed? One could have
>
> program
> function
> identifier: constant or variable or parameter
> assignment
> iteration
> sequential selection
> switch
> break iteration
> break sequential selection
> break switch
>
> Is that enough? Too many? Too few? What types are missing?

My ASTs are unusual (although I only discovered this recently) in that
they only represent executable code. That is, the bodies of functions,
or init data for a variable.

Others use the AST also for non-executable elements, like function
declarations. I couldn't see the point of this, unless this is a
scripting where declarations /are/ executed from the top of the module.

However even without declaration nodes, I use either 130 or 200 node
types (depending on whether ADD, SUB etc have dedicated tags, or grouped
under one BINOP tag). My C compiler uses 80.

But you must have already done all this in your compiler; didn't that
use an AST?

James Harris

unread,

Nov 23, 2022, 2:38:13 PM11/23/22

to

On 23/11/2022 15:39, Dmitry A. Kazakov wrote:
> On 2022-11-23 15:38, James Harris wrote:

>> Any advice on what node types should be in an IR tree? There seem to
>> be many options and I am not sure where to strike the balance.

...

> I have an abstract node type with derived node types for expression
> terms like identifiers and literals and for branches. The branch node
> type is like:
>
>    operation - an enumeration
>    location - the source code location
>    operands - the nodes list (1..n)
>
> operation can be for example "+" or "call." E.g.
>
>    f(a,b,c)
>
> produces a node
>
>    operation => call
>    operands => f,a,b,c

Yes, I might do something which is very much like that for two reasons:

1. In my language f is just as much an expression as a, b and c.

2. I may at some point implement different types of call so a 'call'
node would help with later changes.

>
>    a + b + c
>
> produces
>
>    operation => +
>    operands => a,b,c

...

>
>    a - b + c
>
> produces
>
>    operation => +
>    operands => a,-b,c

Both of those are curious. Why have more than two operands?

>
>> So what other node types are needed? One could have
>>
>>    program
>>    function
>>    identifier: constant or variable or parameter
>>    assignment
>>    iteration
>>    sequential selection
>>    switch
>>    break iteration
>>    break sequential selection
>>    break switch
>
>    operator a+b
>    call      f(a,b)
>
> Then various ligatures that are not operators, but some meta-operators
> e.g. if you have built-in ranges a..b, components a.b, namespaces a::b
> etc. Usually you cannot treat them as first-class operators, because a
> in a::b might not be an expression.

I think I understood your post up until that paragraph. What do you mean
by ligatures and meta operators? And aside from the semantics what's
wrong with "a in a::b"?

>
> I do not create one tree for all program, just for individual
> expressions. IMO there is no need to generate a tree for a switch or a
> loop, unless you have switch-operator and loop-operator, of course.

Again, a curious comment. My sources are naturally hierarchical so a
tree seems to be a good way to represent them. Are you saying another
form could be better?

--
James Harris

Dmitry A. Kazakov

unread,

Nov 23, 2022, 3:06:05 PM11/23/22

to

On 2022-11-23 20:38, James Harris wrote:
> On 23/11/2022 15:39, Dmitry A. Kazakov wrote:
>>
>>     a - b + c
>>
>> produces
>>
>>     operation => +
>>     operands => a,-b,c
>
> Both of those are curious. Why have more than two operands?

Because it is easier to fold constants and do other optimizations for
operations declared commutative. Consider:

1 - x + 3

with

+(1,-x,3)

you could fold it into

+(4,-x) == 4-x

>>> So what other node types are needed? One could have
>>>
>>>    program
>>>    function
>>>    identifier: constant or variable or parameter
>>>    assignment
>>>    iteration
>>>    sequential selection
>>>    switch
>>>    break iteration
>>>    break sequential selection
>>>    break switch
>>
>>     operator a+b
>>     call      f(a,b)
>>
>> Then various ligatures that are not operators, but some meta-operators
>> e.g. if you have built-in ranges a..b, components a.b, namespaces a::b
>> etc. Usually you cannot treat them as first-class operators, because a
>> in a::b might not be an expression.
>
> I think I understood your post up until that paragraph. What do you mean
> by ligatures and meta operators? And aside from the semantics what's
> wrong with "a in a::b"?

Because it might evaluate nothing. Consider

System::IO::Print

as an example. You might wish to recognize such things (meta expression)
in the AST and handle them before semantic analysis. Normally, you have
System namespace/package/module visible in the context. When not, it is
a syntax error. Then you look into System names for IO and so on.

>> I do not create one tree for all program, just for individual
>> expressions. IMO there is no need to generate a tree for a switch or a
>> loop, unless you have switch-operator and loop-operator, of course.
>
> Again, a curious comment. My sources are naturally hierarchical so a
> tree seems to be a good way to represent them. Are you saying another
> form could be better?

Because you do not need AST to represent control flow statements or
declarations. You can generate intermediate code or interpret straight
away during recursive descend.

James Harris

unread,

Nov 23, 2022, 3:29:48 PM11/23/22

to

On 23/11/2022 18:33, Bart wrote:
> On 23/11/2022 14:38, James Harris wrote:

>> Any advice on what node types should be in an IR tree? There seem to
>> be many options and I am not sure where to strike the balance.
>
> My comments will be about an AST. I don't know whether you are using
> 'IR' as a synonym for the AST; I think it's generally used for the next
> stage beyond AST.

Does it matter much? Either is a tree representation of the program so
wouldn't they have similar node types, at least initially?

...

>
> The problem is, there are loads of ways to do this, they will all work.
> There is no right or wrong way.

OK, that's reassuring!

...

>> Further, why limit the number of arguments to 1 or 2? If one were to
>> allow an arbitrary number of arguments then such a type of node could
>> also be used for function calls.
>
> How practical this is depends on your language, the arrangements for
> matching an arbitrary length list to the node.

Why would one want to match a list to the node? Wouldn't a "node type"
field provide all the matching needed?

...

>> So what other node types are needed? One could have
>>
>>    program
>>    function
>>    identifier: constant or variable or parameter
>>    assignment
>>    iteration
>>    sequential selection
>>    switch
>>    break iteration
>>    break sequential selection
>>    break switch
>>
>> Is that enough? Too many? Too few? What types are missing?
>
> My ASTs are unusual (although I only discovered this recently) in that
> they only represent executable code. That is, the bodies of functions,
> or init data for a variable.

That sounds reasonable to me. Declarations would be stored in the symbol
table so would probably not need tree nodes.

>
> Others use the AST also for non-executable elements, like function
> declarations. I couldn't see the point of this, unless this is a
> scripting where declarations /are/ executed from the top of the module.

Yes, I allow for forward references such as calling a function which is
defined later in the source file without having to declare it before the
call is seen. That already works for any global (all functions are
global) but I will need other forward references so I'll have to parse
some identifiers purely as names and interpret them later.

>
> However even without declaration nodes, I use either 130 or 200 node
> types (depending on whether ADD, SUB etc have dedicated tags, or grouped
> under one BINOP tag). My C compiler uses 80.

Wow, those numbers sound high! Maybe I have been trying to cut the
numbers down unnecessarily.

>
> But you must have already done all this in your compiler; didn't that
> use an AST?

My compiler doesn't currently have any tree at all, although adding one
has long been in the plan.

--
James Harris

Bart

unread,

Nov 23, 2022, 6:16:11 PM11/23/22

to

I'm sure I posted something like this before. But here is the set of
node tags I currently use in that 130-node compiler, reduced to 125 as
some I saw were obsolete:

const
null
name
block
assem # for inline assembly
assemmacro
assemreg
assemxreg
assemmem
strinclude
andl
orl
notl
istruel
makelist
makerange
makeset
makedict
makeslice
returnmult
keyword
keyvalue
assign
assignmm # m = multiple
assignms
assignmdrem
copy
callfn
cmp
cmpchain
bin
unary
binto
unaryto
incr
inrev
inrange
inset
clamp
index
slice
dot
dotindex
dotslice
ptr
addrof
addroffirst
convert
shorten
autocast
typepun
typeconst
operator
upper
bitwidth
bytesize
typeof
typestr
bitfield
minvalue
maxvalue
cvlineno # compiler variables; these still have dedicated tags
cvstrlineno
cvmodulename
cvfilename
cvfunction
cvdate
cvtime
cvversion
cvtypename
cvtargetbits
cvtargetsize
cvtargetcode
cvctarget
cvwindows
cvlinux
cvnil
cvpi
cvinfinity
cvtrue
cvfalse
whenthen
fmtitem
nogap
space
callproc
return
syscall
to
if
forup
fordown
forall
forallrev
while
repeat
goto
labeldef
redo
next
exit
do
case
docase
switch
doswitch
swap
select
recase
print
println
fprint
fprintln
sprint
sfprint
read
readln
sread
sreadln
stop
eval
empty
emitc
infinity

BGB

unread,

Nov 27, 2022, 4:32:06 AM11/27/22

to

On 11/23/2022 12:33 PM, Bart wrote:
> On 23/11/2022 14:38, James Harris wrote:
>> Any advice on what node types should be in an IR tree? There seem to
>> be many options and I am not sure where to strike the balance.
>
> My comments will be about an AST. I don't know whether you are using
> 'IR' as a synonym for the AST; I think it's generally used for the next
> stage beyond AST.
>

Yes, typically they are different stages.

Typically the AST is a tree structured representation of the program.

IR is typically either a stack machine (often also called a "bytecode"
or "IL");
Something along the lines of 3AC (Three Address Code) or SSA (Static
Single Assignment).

In my compilers, I have mostly ended up going from the AST from a stack
machine IL; and then from this IL into a 3AC IR.

There is a difference in that while an AST is usually tree structured,
something like a 3AC IR is typically organized into "basic blocks"
containing a series of operations operating on temporaries or similar.

I have mostly ended up keeping the stack IL as:
AST -> Stack: Easy
AST -> 3AC/SSA: Not as Easy
Along with:
Stack -> 3AC / SSA: Relatively easy;
Serializing an IR in SSA form: Massive pain.

As for AST structure, I have tried a few major strategies:

S-Expressions: Used in my early VM projects (Scheme based);
Made a return for most of my "BGBScript 1.x" VMs (~ 2006..2013).

XML Based, used in the first version of BGBScript;
Still used in BGBCC, which was a fork off the first VM.
Later on, BGBScript switched back to S-Expressions, using a core VM
based on my original Scheme interpreter.

JSON style: Used in my BGBScript2 VM (~ 2015..2018). This technically
worked well, and had a "good compromise" between XML nodes and
S-Expressions. But, this VM was short lived (mostly used in my
"BGBTech2" 3D engine, and mostly died off along with it).

Where, syntactically:
BGBScript was along similar lines to JavaScript and ActionScript.
BGBScript2 was a constrained BGBScript, and mostly went over to a more
Java-like syntax (but retains some elements of its predecessors).

BGBCC was originally a case of, when I was younger, it didn't seem like
that huge of a stretch to go from a "terrible JavaScript knock off" over
to trying to compile C (I was pretty wrong about this, but did
eventually end up with a C compiler...).

Currently I am working on my "BJX2 CPU ISA" project, which mostly uses
BGBCC (my C compiler, but also compiles a variant of BGBScript2). In its
present form, BGBCC is using a similar structure to that of the
dictionary-object approach, but still representing things in terms of
XML abstractions.

In this case, nodes consist of a collection of key/value pairs, where:
Each 'key' is a 16-bit tag (4-bit type, 12-bit symbol ID or index);
Each 'value' is 64 bits, with a meaning depending on the 4b type.

Each node only directly holds 14 key/value pairs, if the node needs more
than this, it expands out B-Tree style, so:
1 node, 1 level, 14 keys;
15 nodes, 2 levels, 196 keys;
211 nodes, 3 levels, 2744 keys;
...

All the keys are kept sorted, so access time is "O(log2 n)".

Most calls to get/set a key in a node supply both a name and a pointer
to a variable to cache the ID number. Typically, the ID is primary, but
the string is used if the ID number hasn't been initialized yet
(originally, everything used strings, but this was not ideal for
performance).

Also originally, nodes were organized into linked lists (with internal
linkage), but this has been eliminated.

But, it sort of awkwardly fakes the original DOM style interface mostly
because otherwise I would have needed to rewrite both the parser and
compiler to use the dictionary objects directly.

There is some trickery, say, something at the AST level like:
<if>
<cond>
<binary op=">">
<left> <ref name="x"/> </left>
<right> <int value="0"/> </right>
</binary>
</cond>
<then> ... </then>
</if>

Is internally represented as:
{ tag: #if,
cond:
{tag: #binary, op: ">",
left: {tag: #ref, name: "S"},
right: {tag: #int, value: 0}}
then: { ... }
}

With named keys holding a node implicitly assumed to represent a
"virtual node" as far as the XML is concerned.

There are no arrays, rather array-like cases are handled via
index-number keys and some additional hackery to deal with nodes with a
large number of children.

For nodes which represent a body section containing an ordered list of
child nodes, these are given a sequentially assigned index number
(almost always, the newest node is added to the end of the list).

For a small to moderate number of indexed child nodes, key ranges could be:
000..3FF: Named Keys
400..7FF: Radix Space
800..FFF: Indexed Keys (reverse numbered)

Where, say, the "radix" hack could be used to significantly extend the
max number of indexed keys (via dividing the index space into multiple
levels of nested nodes).

Though, a case could be made for skipping the whole XML thing...

A case could also be made for using tagged references as values rather
than type-tagging the key.

>
>> My guess is that it's best to keep the number of different node types
>> reasonably small. That should mean less code is needed to walk the
>> tree. But when does that number become too small?
>>
>> For example, consider expressions.
>>
>> There /could/ be one node type for "+" and another for "*", as in a
>> node with these fields:
>>
>>    +
>>    left
>>    right
>>
>> Alternatively, both operators could be handled by a single node type
>> of "dyadic operation". Such a node would have four fields:
>>
>>    nodetype = dyadic
>>    operator (e.g. "+")
>>    left
>>    right
>
> The problem is, there are loads of ways to do this, they will all work.
> There is no right or wrong way.
>
> I use both of the above methods across three active compilers. With the
> first, its easy to dispatch directly the "+" node when you need to deal
> with "add".
>
> With the second, you have to dispatch first on "dyadic" and then on
> "operator". But it's not a big deal.
>

I used a "binary" node for all of the binary operators.

My ASTs hold "nearly everything", though for C, things like struct
bodies and typedefs are folded off into their own special lists.