When implementing an AST for some language, each AST node typically
holds information about the language construct it represents and
pointers to children nodes (such as a binary op node pointing to its
left-hand and right-hand operands).
Is it common / useful to supply a pointer to the node's parent as
well?
In favor:
* This can simplify some AST processing tasks, especially when using
the visitor pattern - we may get to an interesting node and then need
to look at its ancestors to do the required analysis.
Against:
* Maintaining parent nodes makes the AST creation code more complex
* Wastes space (another pointer for each node)
What are your thoughts?
Thanks in advance,
Eli
> When implementing an AST for some language, each AST node typically
> holds information about the language construct it represents and
> pointers to children nodes (such as a binary op node pointing to its
> left-hand and right-hand operands).
>
> Is it common / useful to supply a pointer to the node's parent as
> well?
I don't think it's common.
One compiler that keeps parent pointers in the tree is GNAT
(the gcc Ada compiler).
> In favor:
> * This can simplify some AST processing tasks, especially when using
> the visitor pattern - we may get to an interesting node and then need
> to look at its ancestors to do the required analysis.
I find that code that follows parent pointers tends to be confusing.
Same reasons that global variables cause trouble. It's usually better
to pass information down the tree as parameters during the tree walk
-- as opposed to wandering off into totally unrelated parts of the
tree.
Parent pointers also raise the possibility of malformed trees -- what if
Parent(X) = Y, but Y has no child X?
If you need to keep track of where you are in a tree walk, you can
keep a stack of pointers. This gives you the parent of the current
node (and grandparent, and so on) -- but not the parent of some
arbitrary node.
> Against:
> * Maintaining parent nodes makes the AST creation code more complex
> * Wastes space (another pointer for each node)
The stack I mentioned above wastes less space.
- Bob
I have no idea in general (haven't looked at that many compiler source
codes in detail yet.)
> In favor:
> * This can simplify some AST processing tasks, especially when using
> the visitor pattern - we may get to an interesting node and then need
> to look at its ancestors to do the required analysis.
>
> Against:
> * Maintaining parent nodes makes the AST creation code more complex
> * Wastes space (another pointer for each node)
With good design the programmer-time cost for maintaining the parent
nodes is basically a one-off time cost. In that case, I don't find
"making the AST creation code more complex" something that would deter
me. If one has an actual AST transform that this helps and have no
specification issues preventing it, I'd proceed regardless of the
listed negatives.
What I don't know, is how common backtracking to parent nodes is in
actual implementations. (Yes, I'm lazy -- I haven't actually
inspected the source code for the obviously open source compilers like
GCC, CLang, and TCC.)
I know I have never had to do that so far in implementing my vaporware
compiler Z.C++ for C and C++. Z.C++'s AST nodes [
http://svn.berlios.de/wsvn/zcplusplus/trunk/ParseTree.hpp ] are
"flat": one or two signature tokens, and as many prefix, infix, and
postfix nodes as memory allows. As the "signature tokens" and
existence/absence of prefix/infix/postfix nodes define interesting
nodes, backtracking is designed out of this AST representation.
The DMS Software Reengineering Toolkit implements AST with parent
nodes. It is convenient to down the tree to find things, and
occassionaly walk back up again to a parent to find another child.
The overhead of managing the parent link isn't that high. If you
don't have such links, you will have to store parent links on you way
down the tree, so it isn't as though parent-links are always just
extra cost.
In fact, DMS actually stores AST nodes as "hypergraph" nodes, where
each node has a set of ports numbered 1..n, and each port may
reference 0, 1, or a list of other nodes (actually, other ports from
which the node can be easily computed). Links between nodes are
bi-directional on all ports, so you always navigate both ways.
This allows DMS to represent arbitrary graphs of which ASTs are an
easy subset (implemented via an API to offer an AST feel). Having
lists as children ports allows lists to be easily represented. Having
lists as parents allows ambigous trees (with DAG sharing) to be easily
represented, and this is a foundation for our GLR parser
implementation, which in turn I think is fundamental to DMS's success
at parsing difficult like C++.
The node types are indexed and easily enumerated; this allows pattern
matchers to go straight to potential matching nodes (rather than
navigating the tree from the root in a search), and once you get a
pattern match hit, now you may want to visit the parent. But you
don't have the parent-list, so you need a parent link.
This also makes it easy to patch the trees (graphs) locally.
Node management these days is more about how many memory touches you
have. If it fits in a cache line, you have only one touch, for reads
and writes, and since memory reads/writes are hundreds of CPU cycles,
it is better to avoid touching memory :-} Since we process trees with
hundreds of millions of nodes, this matters. We generally trade
caching a bit of extra information whereever we can, to avoid
additional memory touches.
We're pretty happy with this design.
--
Ira Baxter, CTO
www.semanticdesigns.com
> When implementing an AST for some language, each AST node typically
> holds information about the language construct it represents and
> pointers to children nodes (such as a binary op node pointing to its
> left-hand and right-hand operands).
>
> Is it common / useful to supply a pointer to the node's parent as
> well?
It depends on your later usage of the AST. Tree transformations may need
the parent node of an subtree.
When all child nodes are kept in an array, that array may have a
reference to its owning (parent) node. Or a sentinel (dummy node) may be
added to every child list, referring to the parent node. In the latter
model no bidirectional node links are required, since parent, children
and sentinel reside in a closed loop.
DoDi
> In favor:
> * This can simplify some AST processing tasks, especially when using
> the visitor pattern - we may get to an interesting node and then need
> to look at its ancestors to do the required analysis.
I've looked at my own code and AST nodes don't use a parent pointer,
presumably because I never do anything exciting with the AST other than just
traverse it top-down and left-right.
> Against:
> * Maintaining parent nodes makes the AST creation code more complex
> * Wastes space (another pointer for each node)
I don't think the space is an issue, assuming it's running on reasonable
hardware. (How many AST nodes will exist at any one time? 100,000 max?
That's less than 1MB extra. It's even possible the AST node might have
padding bytes that can absorb the pointer)
Just put in the pointer if you think it might be useful. (Although, if a
subtree is shared between different parents, you might have to choose which
one you like best...)
--
Bartc
> When implementing an AST for some language, each AST node typically
> holds information about the language construct it represents and
> pointers to children nodes (such as a binary op node pointing to its
> left-hand and right-hand operands).
>
> Is it common / useful to supply a pointer to the node's parent as
> well?
>
> In favor:
> * This can simplify some AST processing tasks, especially when using
> the visitor pattern - we may get to an interesting node and then need
> to look at its ancestors to do the required analysis.
I find the visitor pattern non-intuitive for AST processing. A set of
mutually recursive functions (one for each syntactic category, i.e., one
for expressions, one for statements etc.) that each do case analysis
over the node type and calls recursively on children nodes is much more
intuitive.
> Against:
> * Maintaining parent nodes makes the AST creation code more complex
Even more of a problem is that rewriting the AST becomes more complex,
and you can't share subtrees between several parents (unless you use a
list of parent pointers).
> * Wastes space (another pointer for each node)
That is hardly an issue. Having to maintain an extra invariant is more
of an issue.
> What are your thoughts?
I can't really see any cases where it would be worth the effort.
Torben
It's a very poor idea.
> In favor:
> * This can simplify some AST processing tasks, especially when using
> the visitor pattern - we may get to an interesting node and then need
> to look at its ancestors to do the required analysis.
You can easily pass down a chain of parent link pointers during
traversal as an extra argument, which give you a path all the way
to the root.
> Against:
> * Maintaining parent nodes makes the AST creation code more complex
> * Wastes space (another pointer for each node)
* No easy functional programming:
Given a set of AST's, you can't construct a new AST which has those AST's as
its children, without destructively manipulating the parent pointers of these
children.
This is a non-starter if the children already have existing parents.
Which means that you can't functionally transform one AST into another, such
that the new one re-uses pieces of the original. You must do a full copy of
every AST, just so that it can have its own parent.
By the way, why would you manipulate AST's in a language that needs the visitor
pattern to emulate double dispatch? Don't tear your hair out; arm yourself with
a decent dynamic language.
If you have multiple dispatch, the visitor pattern pops out of a simple
functional traversal of the tree structure. The tree walker simply knows
how to walk the tree; it applies a caller-supplied function to every node.
That function can be a lambda function which invokes a generic function
on the node, plus some additional arguments. The generic function dispatches
on the run-time type of all generic arguments, and there goes the
visitor pattern.
(walk-tree ast (lambda (node) (frobnicate node context-object)))
walk-tree visits every node of ast, invoking (lambda (node) ...) on it,
which invokes the function (frobnicate node context-object).
This may be a generic function which is dispatched based on the
class of node and the class of context-object.
Visitor Pattern disappears into one line of code.
So why would you even think about what kinds of compromises to make
in your data structure design for the sake of some pattern?
> I find the visitor pattern non-intuitive for AST processing. A set of
> mutually recursive functions (one for each syntactic category, i.e., one
> for expressions, one for statements etc.) that each do case analysis
> over the node type and calls recursively on children nodes is much more
> intuitive.
For my own work with post-parse trees (AST or otherwise), I find that
a single visitor with registered node-type passengers proves to be the
most efficient. This allows for a simultaneous (for some value of
"simultaneous") top-down, bottom up, breadth-first/last pass, with
each passenger, and reduces overall processing time accordingly. It
does, however, require passengers to be registered in the right order
if visitation order dependencies are important and is especially
opaque if a passenger actually modifies the tree during its visit to a
node.
But overall, "it works here."
Quinn
> When implementing an AST for some language, each AST node typically
> holds information about the language construct it represents and
> pointers to children nodes (such as a binary op node pointing to its
> left-hand and right-hand operands).
>
> Is it common / useful to supply a pointer to the node's parent as
> well?
>
Without a parent pointer the AST would be fairly worthless and it would
force the tree walking algorithms used to be overly complex. By the way,
for the sake of your sanity transform the AST into a high level IR as soon
as possible.... and use visitor patterns sparingly because they get big and
clumsy for an AST.
-- Gary Oblock
--
--
Bronze Dreams
http://www.bronzedreams.com