
concrete vs. abstract syntax trees


eliben

Feb 13, 2009, 11:40:32 AM
Hello,

I'm trying to nail down the difference between concrete and abstract
syntax trees. In my C parser (http://code.google.com/p/pycparser/) I
construct an AST directly from the parser (which uses the yacc-like
PLY Python module), and I'm not sure where the concrete syntax tree is
in that process.

Is this correct to say that if I semi-mechanically translate the BNF
of a language to a yacc grammar, what I'll get from parsing is a
concrete syntax tree? If so, this tree is almost useless for any kind
of interesting interaction (especially for languages like C where some
things like types don't fit hierarchically in a way that resembles the
concrete syntax). Is this right?

Also, do compilers more commonly go through the concrete-syntax stage
or do they usually build the AST directly in the parser?

Thanks in advance,

Eli

Hans Aberg

Feb 14, 2009, 8:29:47 AM
eliben wrote:
> I'm trying to nail down the difference between concrete and abstract
> syntax trees. In my C parser (http://code.google.com/p/pycparser/) I
> construct an AST directly from the parser (which uses the yacc-like
> PLY Python module), and I'm not sure where the concrete syntax tree is
> in that process.

This link says that a "concrete syntax tree" is the same thing as a
parse tree: http://en.wikipedia.org/wiki/Parse_tree

The parse tree, a function of the grammar only, contains the details of
the grammar components, and is what the grammar actions construct
(abstractly, in the most general construction). By contrast, the AST
just retains (by choice of the implementer) the parts important for
translation, by limiting the construction in the explicitly implemented
actions. For example, the expression a + b*c might be represented by the AST

        +
       / \
      a   *
         / \
        b   c

The parse tree, for a grammar E -> T | E + T, T -> i | T * i, is

          E
        / | \
       E  +  T
       |    /|\
       T   T * i
       |   |
       i   i
where actions of the token i return different values a, b, c from the lexer.
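To make the contrast concrete, here is a small Python sketch encoding
both trees as nested tuples; the encoding is hypothetical, just one
possible representation, not what any particular tool produces:

```python
# AST for a + b*c: only operators and operands survive.
ast = ('+', 'a', ('*', 'b', 'c'))

# Parse tree (CST) for the same input under the grammar
#   E -> T | E + T,   T -> i | T * i
# Note the unary chains E -> T -> i that the AST has dropped.
cst = ('E',
       ('E', ('T', ('i', 'a'))),
       '+',
       ('T', ('T', ('i', 'b')), '*', ('i', 'c')))

def ast_leaves(node):
    """Collect identifier leaves of the AST in left-to-right order."""
    if isinstance(node, str):
        return [node]
    op, left, right = node
    return ast_leaves(left) + ast_leaves(right)

print(ast_leaves(ast))  # -> ['a', 'b', 'c']
```

The AST walker above is three lines because the tree shape matches the
semantics; walking the CST would have to see through the E -> T and
T -> i wrapper nodes first.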

Hans Aberg

Ira Baxter

Feb 14, 2009, 1:02:30 PM
"eliben" <eli...@gmail.com> wrote in message

> I'm trying to nail down the difference between concrete and abstract
> syntax trees.

A concrete syntax tree is one that matches the productions that drive
the parsing process.

> Is this correct to say that if I semi-mechanically translate the BNF
> of a language to a yacc grammar, what I'll get from parsing is a
> concrete syntax tree?

No. YACC style tree construction typically does not construct one
node per production (although the opportunities to construct nodes are
presented once per production).

> If so, this tree is almost useless for any kind
> of interesting interaction (especially for languages like C where some
> things like types don't fit hierarchically in a way that resembles the
> concrete syntax). Is this right?

I don't see it that way. (You should manage type information
in the symbol table). The problem with CSTs is that they
technically contain nodes for tokens that control the syntax but
don't typically contribute to the semantics. In the grammar rule,
STMT = LHS ":=" EXP ";"
a concrete syntax tree for STMT would have 4 children, one
for each grammar token on the right hand side, notably
ones for ":=" and ";". These latter tokens help distinguish
the assignment statement from other statements, but don't
contribute to the semantic interpretation once a STMT has
been detected. So the concrete syntax tree tends to contain
many tree nodes that aren't interesting to visit after parsing.
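As an illustration, here is what the two styles of yacc-like action for
that rule might look like in Python; the Node class and the action
signatures are hypothetical, not from any particular generator:

```python
class Node:
    """Minimal illustrative tree node: a kind tag plus child list."""
    def __init__(self, kind, *children):
        self.kind = kind
        self.children = list(children)

# CST-style action for  STMT = LHS ":=" EXP ";"
# keeps all four right-hand-side children, punctuation included.
def cst_action(lhs, assign_tok, exp, semi_tok):
    return Node('STMT', lhs, assign_tok, exp, semi_tok)

# AST-style action keeps only the two semantically meaningful children.
def ast_action(lhs, assign_tok, exp, semi_tok):
    return Node('Assign', lhs, exp)

print(len(cst_action('x', ':=', 'y', ';').children))  # -> 4
print(len(ast_action('x', ':=', 'y', ';').children))  # -> 2
```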

OTOH, CSTs are easy and brainless to construct from the grammar. If
you constructed one, and simply didn't visit the dull nodes, you'd
have pretty much all the advantages of an AST. People don't do this
with YACC because they are lazy: they'd have to write a node
constructor for every grammar token, and they (rightfully) don't see
the point of doing this.

However, if you intend to do general purpose program transformation,
as we do with our DMS product, having representatives of the CST nodes
around allows you to retain position information to regenerate the
source text with high degrees of accuracy, including white space gaps.
And, if you have automated tree matchers, they can automatically avoid
visiting such keyword nodes.
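A toy sketch of the "keep the CST, skip the dull nodes" idea, using
tuple-shaped nodes and a hypothetical set of keyword tokens (nothing
here reflects DMS's actual machinery):

```python
# Punctuation leaves stay in the tree (useful for regenerating source
# text) but the walker treats them as invisible.
KEYWORD_TOKENS = {':=', ';', '(', ')', ','}

def interesting_children(node):
    """Yield children of a tuple-shaped CST node, skipping keyword leaves."""
    _kind, *children = node
    for child in children:
        if isinstance(child, str) and child in KEYWORD_TOKENS:
            continue  # present in the tree, just never visited
        yield child

stmt = ('STMT', ('LHS', 'x'), ':=', ('EXP', 'y'), ';')
print(list(interesting_children(stmt)))
# -> [('LHS', 'x'), ('EXP', 'y')]
```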

A second downside to CSTs is their size; they're simply bigger than
ASTs in which such nodes aren't present. If you are going to
manipulate only a single "source file", this probably won't matter
with 32 or 64 bit address spaces. If you're going to process
thousands of files, then size matters. If nothing else, if you have
fewer nodes, you have fewer memory touches, and we work on a scale
where the caches easily overflow with nodes, and so memory touches
materially affect performance.

> Also, do compilers more commonly go through the concrete-syntax
> stage or do they usually build the AST directly in the parser?

The YACC style tries to produce AST nodes directly while parsing. The
price is that the parser-writer has to invent the entire AST node
vocabulary, and write node-constructing code in all the relevant
grammar productions. For many modern languages with large grammars,
this is a modest but real amount of work. Worse, as the language
grammar evolves (they do!) somebody has to continually go back and
revise the AST constructing part. Granted, not very hard.

We've tried to get the best of both worlds. Our parsers *logically*
construct CSTs straight from the grammar, using a GLR parser. But we
automatically compress out "non-value carrying" tokens (leaves such as
":=" and ";"), and unary productions that have no semantic import,
which we detect by a) inspecting attribute evaluators over the
grammar (if an attribute grammar rule simply copies values, you don't
need the CST node for that rule) or b) an explicit "keep this"
annotation on the grammar rule. Finally, we introduce list nodes
wherever we detect list-forming productions in the grammar (various
generalizations of L = e ; and L = L "," e ; for list elements e and
non-value-carrying comma tokens).
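The list-forming compression can be sketched as follows, assuming
tuple-shaped CST nodes for the left-recursive rule L = e | L "," e;
the encoding is illustrative, not DMS's actual representation:

```python
def flatten_list(node):
    """Collapse a left-recursive  L = e | L ',' e  CST into a flat list."""
    if node[0] != 'L':
        return [node]            # a bare element e
    if len(node) == 2:           # production L -> e
        return [node[1]]
    _kind, sublist, _comma, elem = node   # production L -> L ',' e
    return flatten_list(sublist) + [elem]

# CST for the input "a, b, c": a tower of L nodes with comma leaves.
cst = ('L', ('L', ('L', 'a'), ',', 'b'), ',', 'c')
print(flatten_list(cst))  # -> ['a', 'b', 'c']
```

The compressed form is a single list node, which is both smaller and
far more pleasant to iterate over than the tower of binary L nodes.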

The result looks amazingly like an AST you might construct by hand,
but at no effort to do so. A nice consequence is that all the
"syntax" hacking (tree matching, tree splicing) is done in terms
of virtual CSTs, with the DMS machinery covering up all the
differences. There's even a low-level API called "GetNthGrammarChild"
that allows procedural code to walk around the tree as if the CST
existed.

In practice, we can turn the various tree compression accelerators
(eliminate terminals, eliminate unary, form lists) on and off. This
lets us debug the machinery when we occasionally break it.

--
Ira Baxter, CTO
www.semanticdesigns.com

Chris F Clark

Feb 14, 2009, 4:12:13 PM
eliben <eli...@gmail.com> writes:

> Is this correct to say that if I semi-mechanically translate the BNF
> of a language to a yacc grammar, what I'll get from parsing is a
> concrete syntax tree?

Yes, a concrete syntax tree is intended to have a direct (1-1)
relationship with the grammar. The concrete part of the name is
describing that relationship.

> If so, this tree is almost useless for any kind
> of interesting interaction (especially for languages like C where some
> things like types don't fit hierarchically in a way that resembles the
> concrete syntax). Is this right?

Concrete syntax trees tend to over-promote certain artifacts of the
grammar in the shape of the tree. If the grammar is left-recursive,
the trees tower one way. A right-recursive grammar (for the same
language) would have trees that tower in the opposite direction. An
ELR grammar, using regular expressions, might not have a tree at all
but rather a flat list.
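A quick Python illustration of the "towering" point, with hypothetical
tuple encodings of the same three-element list under each recursion
style:

```python
# The same list "a, b, c" parsed under the two recursion styles:
left_recursive  = ((('a',), 'b'), 'c')   # L -> L elem | elem
right_recursive = ('a', ('b', ('c',)))   # L -> elem L | elem

def left_spine(node):
    """Count nested nodes going down the first child."""
    n = 0
    while isinstance(node, tuple):
        n += 1
        node = node[0]
    return n

def right_spine(node):
    """Count nested nodes going down the last child."""
    n = 0
    while isinstance(node, tuple):
        n += 1
        node = node[-1]
    return n

print(left_spine(left_recursive), right_spine(left_recursive))    # -> 3 1
print(left_spine(right_recursive), right_spine(right_recursive))  # -> 1 3
```

Same language, mirror-image trees; code that walks one shape conveniently
is awkward on the other, which is exactly the CST artifact in question.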

> Also, do compilers more commonly go through the concrete-syntax stage
> or do they usually build the AST directly in the parser?

You need the concrete syntax in the parser itself. However, does it
ever exist as a memory (data structure) image other than on the stack
of the parser? I think generally not, except when the user is lazy in
a particular way. It's generally easier to build an AST that
captures only the information one needs (e.g. parameters go in a list,
not a tree, but expressions make trees) than to make a CST-shaped
one and work on that. The CST format only occurs when one can't
conveniently build a correctly shaped representation (or is too lazy
to think about the downstream uses). For example, it is sometimes
easier to build a list like the parameter list backwards--that's a
CST type of choice and may be acceptable in some places. However,
with more effort, one can build the parameter list in "correct" order,
if that is important.
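The backwards-list point can be sketched like so; the action names are
illustrative, not from any real generator:

```python
# For a left-recursive rule  PARAMS -> PARAMS "," p | p,  the cheapest
# action conses each new parameter onto the front, leaving the list
# reversed; appending instead (a little more thought) keeps source order.

def lazy_action(params_so_far, new_param):
    """Cons to the front -- cheap, but the result comes out backwards."""
    return [new_param] + params_so_far

def careful_action(params_so_far, new_param):
    """Append at the end -- the parameters stay in source order."""
    return params_so_far + [new_param]

lazy, careful = [], []
for p in ['a', 'b', 'c']:          # parameters as they appear in the source
    lazy = lazy_action(lazy, p)
    careful = careful_action(careful, p)

print(lazy)     # -> ['c', 'b', 'a']
print(careful)  # -> ['a', 'b', 'c']
```

Reversing once at reduction time is the other standard fix when the
per-step append is too expensive for the data structure in use.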

Hope this helps,
-Chris

******************************************************************************
Chris Clark Internet: christoph...@compiler-resources.com
Compiler Resources, Inc. or: com...@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
------------------------------------------------------------------------------

Chris F Clark

Feb 14, 2009, 7:58:45 PM
Chris F Clark <c...@shell01.TheWorld.com> writes:

> Concrete syntax trees tend to over-promote certain artifacts of the
> grammar in the shape of the tree. If the grammar is left-recursive,
> the trees tower one way. A right-recursive grammar (for the same
> language) would have trees that tower in the opposite direction. An
> ELR grammar, using regular expressions, might not have a tree at all
> but rather a flat list.

That said, parser generators often make it easy to generate CSTs.
With a parser generator, the simplest and safest course is to create a
CST. Optimizations to transform it toward an AST can easily be
applied. If one is using a generator and because of that is working
with a CST (that was easy to get from the tool), no shame should be
inferred. As Ira said, certain traversals over CSTs are not that
inefficient compared to ASTs, and if you've saved some programmer time,
then it may balance out.

If you are starting from scratch, either find a tool that helps you
construct a tree (whether CST or AST), or write a layer over your tool
to do so. If you have to create a hand-written structure because the
tree your tool creates won't make that shape, no shame in that either.
However, don't waste time creating a hand-written structure that
exactly matches the grammar. That kind of structure is too easy to
automate.

Still more opinions,
