Code generation from AST

Lucas S. Silva

unread,

Nov 10, 2007, 5:09:57 PM11/10/07

to

Dear compiler enthusiasts,

I am writing a compiler for a OO language and I am looking for an
algorithms that I can use to generate code from an AST.

The algorithm I am using doesn't deal very well with recursive
expression such as:

a = f.g.h(a , x.y.o(1,2) )

I am using the visitor to process the AST, so when I get to node f it
could be a simple function call or a compound which have parameters
that may also be a function call, an so on.

Is there any good algorithm to deal with this type of situation? I am
planning to use stack but I am not sure if it is the most appropriate
method.

Thanks in advance,
Lucas S. Silva

Dmitry A. Kazakov

unread,

Nov 11, 2007, 5:02:11 AM11/11/07

to

On Sat, 10 Nov 2007 23:09:57 +0100, Lucas S. Silva wrote:

> The algorithm I am using doesn't deal very well with recursive
> expression such as:
>
> a = f.g.h(a , x.y.o(1,2) )
>
> I am using the visitor to process the AST, so when I get to node f it
> could be a simple function call or a compound which have parameters
> that may also be a function call, an so on.
>
> Is there any good algorithm to deal with this type of situation? I am
> planning to use stack but I am not sure if it is the most appropriate
> method.

I make "." and "()" operations. The above becomes (depending on
priorities), the following tree:

"="
{ a,
"()"
{ "." { "." { f, g }, h }, -- The first argument is f.g.h
a, -- The second argument of "()"
"()" -- The third argument of "()"
{ "." { "." { x, y }, o },
1,
2
} } }

Then "()" can be overloaded by "function call" and "array indexing", at
the semantics analysis stage you will decide what it is depending on the
first argument of. So if f.g.h renders to a function then "()" is a call.
When x.y.o is an array, then its "()" is indexing.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Hans-Peter Diettrich

unread,

Nov 10, 2007, 11:49:41 PM11/10/07

to

Lucas S. Silva wrote:

> The algorithm I am using doesn't deal very well with recursive
> expression such as:
>
> a = f.g.h(a , x.y.o(1,2) )

You can transform this statement (i.e. the AST) into Reverse Polish
Notation (RPN), like

1 2 x y . o . () a f g . h . () a =

Now the sequence of the required actions (machine code) should be quite
obvious.

DoDi

Paul B Mann

unread,

Nov 11, 2007, 12:23:48 AM11/11/07

to

Does anyone know of a lexer generator whose input is a BNF grammar
instead of regular expressions ?

Paul Mann

[DFA's aren't adequate to recognize BNF, which is why parser
generators use a DFA and a stack or other more powerful machines. I
suppose you could limit it to a BNF subset equivalent to regexps, but
what would be the point? -John]

Chris F Clark

unread,

Nov 11, 2007, 5:15:14 PM11/11/07

to

Perhaps, let me explain. Yacc++ allows one to write lexers in [E]BNF
grammars, i.e. yacc notation extended by regular expression operators.
In Yacc++, you can mix LR matching with regular expressions at either
the lexing or parsing level. However, if you are asking if it will
also collapse left-linear or right-linear grammars down to regular
expressions, it doesn't do that. If you use a recursive production,
Yacc++ uses a stack to resolve it. Now, it wouldn't be hard to
reconize the linear cases and transform them. However, I'm not aware
of any advantage to doing so, so yacc++ doesn't do it. In fact,
because of the difference in memory performance (one can use a fixed
sized memory (but one needs to "cons" the output list when desired
immediately) if one has no right recursion), I would be cautious and
make it an optional feature even if it were to be done.

Are you thinking of using BNF based lexers?

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------

George Neuner

unread,

Nov 12, 2007, 12:01:00 AM11/12/07

to

On Sat, 10 Nov 2007 23:23:48 -0600, "Paul B Mann" <p...@oct.net> wrote:

>Does anyone know of a lexer generator whose input is a BNF grammar
>instead of regular expressions ?

Antlr uses a combination of EBNF and regex to define its lexers. It
does _not_ produce a DFA, though, Antlr's lexers are recursive decent.

George
--
for email reply remove "/" from address

Paul B Mann

unread,

Nov 16, 2007, 5:39:58 PM11/16/07

to

>> Does anyone know of a lexer generator whose input is a BNF grammar
>> instead of regular expressions ?
>>

>> [DFA's aren't adequate to recognize BNF, which is why parser
>> generators use a DFA and a stack or other more powerful machines. I
>> suppose you could limit it to a BNF subset equivalent to regexps, but
>> what would be the point? -John]

> I'm not aware of any advantage to doing so, so yacc++ doesn't do it.
>

> Are you thinking of using BNF based lexers?

I know of one product, DFA 1.0, that used to come with LALR 4.3 from
Autom.

It accepted a BNF grammar and decomposed it into a DFA, if it were
possible to remove all nonterminal transitions. If not, it would make
a list of those nonterminals not removed.

If a BNF grammar contains nonterminals, then without some kind of
optimization to remove them from the finite-state machine, it cannot
be processed by a DFA at runtime.

But some grammars can be optimized and have all nonterminals removed.
This concept interests me, because it is a challenge in graph theory,
at least graph theory as it pertains to finite-state machines for
parsing and lexing.

If one can remove nonterminals and also the states they make a
transition to, then one can improve the performance and decrease the
size of the state machine.

My current LRGen 8.2 generates an LALR lexer with it's nonterminal
transitions included. It even removes chain-reductions for unit
reductions.

Removing the non-unit reductions, requires a change in the classical
parsing algorithm, so I haven't done that yet (not sure if it's
possible or useful, since almost every reduction would have to create
a node in the AST anyway in a real-world compiler front end).

I recently compared it's performance to a DFA lexer created by DFA 1.0
and found the DFA lexer to be over 5 times the speed of the LALR
lexer.

The LALR lexer was using up 43% of the runtime of the language
processor which included building an AST in memory.

The DFA lexer is using up only 12% of the runtime.

So I decided to rewrite the decomposition algorithm, since I lost it
many years ago. It is definitely an interesting exercise.

I wonder if one could apply this same decomposition to parser tables
as well and invent an improved algorithm for parsing with these
optimized or partially optimized parser tables? It would require a
different algorithm than the classical push-down stacking operation
currently known.

Since I originally did this in 1986, I thought by now someone else
would be doing it. That's why I asked the question.

Now there is a new troublesome issue that has raised it's ugly head.
The LR(0) state-building algorithm merges states that should not be
merged for easy construction of DFA's.

It is now a another challenge that I face, to undo the merging of
some states that are preventing the removal of certain nonterminal
transitions.

Any comments? Has anybody else delt with this before?

Paul B Mann

Robert A Duff

unread,

Nov 16, 2007, 8:28:41 PM11/16/07

to

"Paul B Mann" <p...@oct.net> writes:

The point (of using BNF notation for a lexer) I think would be to use a
uniform notation for both lexer and parser. Seems reasonable, since the
one is a subset of the other. Some parser tools do this..

- Bob

idba...@semdesigns.com

unread,

Nov 17, 2007, 1:11:34 PM11/17/07

to

On Nov 10, 11:23 pm, "Paul B Mann" <p...@oct.net> wrote:
> Does anyone know of alexergenerator whose input is aBNFgrammar

> instead of regular expressions ?
>
> Paul Mann
>

> [DFA's aren't adequate to recognizeBNF, which is why parser

> generators use a DFA and a stack or other more powerful machines. I

> suppose you could limit it to aBNFsubset equivalent to regexps, but

> what would be the point? -John]

Another perspective is, "why bother using DFAs?" The usual, and good
argument, is "speed". But with modern computers, there seems to be
less emphasis on the latter, and if you aren't processing enormous
amounts of source, then perhaps convenience is the driver.

The ASF+SDF project (also known as "XT" and shipped as part of the
Stratego program transformation tool) uses "scannerless GLR parsers",
a single unified GLR parsing algorithm to lex and parse. See
www.cs.uu.nl/research/techreps/repo/CS-2005/2005-052.pdf In
particular, if I understand it right, they simply let you write the
grammar using BNF rules all the way down to tokens and whitespace.
This is an extremely clean way to specify a langauge.

But the apparant advantage goes beyond that. In many real world
situations, often one embeds one language into another (SQL in COBOL,
Javascript and PHP in HTML, ...). If one has essentially separate GLR
grammars for each, composing this is close to trivial; you simply add
a rule to the containing langauge that introduces an escape indication
and allows a nonterminal from the contained langauge. My personal
opinion is that this is a truly pretty solution.

If the escapes are carefully designed, you really can't end up with
any ambiguities. Most escapes used in real langauges with embedded
sublanguages already have this property; otherwise they wouldn't be
useful as escapes.

This gets one back to the efficiency vs convenience question. We
build a tool (DMS Software Reengineering Toolkit) that also uses GLR
parsing. But we use conventional lexical definitions because our
interest is in large scale applications, where parsing speed matters.
So we give up some convenience in specification for speed. [Of
course, we can always simply use the GLR parser the character
level. And we'll probably go back someday and look at unifying the two
ideas.]

Ira Baxter, CTO
www.semanticdesigins.com