
Is LALR(1) or LL(k) parser better for C++


Kari Ikonen

Jan 22, 1997

So I may ask from wiser people than myself:
Is LALR(1) or LL(k) based parser better for parsing C++ code?
Which one of these is easier to handle?
--
kik...@cs.joensuu.fi /Kari Ikonen, Peltolankatu 11A2,80220 Jns. P:0400 917667
[I suspect the correct answer to this question is "no". -John]
--
Send compilers articles to comp...@iecc.com,
meta-mail to compiler...@iecc.com.

John Lilley

Jan 22, 1997

Kari Ikonen wrote:
> Is LALR(1) or LL(k) based parser better for parsing C++ code?
> Which one of these is easier to handle?

Personally, I am using an LL(1) grammar with backtracking and
predicates. Both of these questions are complex and don't have any
pat answers. It comes down to several factors:

1) How important is performance? I believe that an LL(1) C++ parser
will be forced to backtrack (or perform some other hack such as
deferred ambiguity resolution) more often than LR(1). Backtracking is
only mildly expensive when done well, but a small performance hit may
be significant in a production compiler. More importantly,
backtracking across productions that require actions (such as managing
the scope context) can be tricky. That being said, most commercial
C++ compilers (so I hear) use hand-crafted recursive-descent parsers,
which are more closely related to LL than to LR, so go figure...
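To be concrete about the backtracking itself, here is a minimal
sketch of the mark/rewind approach (all names hypothetical, not any
particular tool's output). The speculative attempts must run without
side effects, which is exactly what makes actions like scope
management tricky:

    #include <cstddef>

    struct TokenStream {
        std::size_t pos = 0;
        std::size_t mark() const { return pos; }   // remember where we are
        void rewind(std::size_t m) { pos = m; }    // rewinding tokens is cheap
    };

    // Speculative parses: side-effect free, report success or failure only.
    bool tryParseDeclaration(TokenStream&);
    bool tryParseExpression(TokenStream&);

    bool parseStatement(TokenStream& ts) {
        std::size_t m = ts.mark();
        if (tryParseDeclaration(ts))    // C++ prefers the declaration reading
            return true;
        ts.rewind(m);                   // ...but any actions that already fired
        return tryParseExpression(ts);  // (scope entries, etc.) cannot be
    }                                   // rewound so easily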

2) What paradigm are you more comfortable with? I find LL to be much
easier to grasp and debug than LALR, but that's because LALR makes my
brain hurt. Someone smarter than I may find LALR easier to work with,
because fewer ambiguities will be encountered. But an LALR(1)
parser-generator with no backtracking may prove to be unworkable. LL
parsers have a single method to implement each rule, and repetition
will generate while{} loops instead of recursion, so the generated LL
parser will resemble the grammar rather closely. LALR parsers usually
generate a giant state table for the PDA machine, which is difficult
to debug without special tools.
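For a feel of the difference, here is roughly what an LL generator
emits for a rule with repetition (a sketch with hypothetical names,
not PCCTS's actual output): one function per rule, with the (...)* 
compiled to a loop, so the code reads like the grammar:

    #include <cstddef>
    #include <vector>

    enum TokenKind { IDENT, COMMA, END };

    struct Parser {
        std::vector<TokenKind> toks;   // token stream, END-terminated
        std::size_t pos = 0;

        TokenKind peek() const { return toks[pos]; }
        void match(TokenKind k) { if (peek() == k) ++pos; /* else report error */ }

        // arg : IDENT ;
        void parseArg() { match(IDENT); }

        // arg_list : arg ( ',' arg )* ;
        void parseArgList() {
            parseArg();
            while (peek() == COMMA) {  // repetition becomes while{}, not recursion
                match(COMMA);
                parseArg();
            }
        }
    };

An LALR generator would encode the same two rules as anonymous
entries in the state table, which is the debugging difference
described above.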

3) How good are the respective tools? I use PCCTS and find it to be
fabulously good (albeit somewhat quirky). PCCTS generates C++ classes
for the lexers and parsers, and makes it easy to attach multiple
parsers to a lexer, hook up streams of tokens as input to a parser,
run more than one parser at the same time, etc. This is important in
C++, because you reparse tokens in at least three cases (unless you
are very clever):

1) The expressions of #if/#elif in the preprocessor
2) The bodies of inline methods declared inside class definitions.
3) Template expansion.

There are some LALR tools that generate parsers as C++ classes (e.g.,
Visual Parse++). There are also some LALR tools that backtrack (e.g.,
btyacc). I'm not sure if any do both.

Obviously, this is not a simple question. To get an idea for the
complexity involved, get Jim Roskind's analysis of an LALR(1) C++
grammar from the comp.compiler archives or
http://www.empathy.com/pccts/roskind.html. You can get a sample LL(2)
C++ grammar that I've been using from http://www.empathy.com/pccts.
Neither of these grammars is complete. The LALR(1) grammar does not
deal with templates at all. My LL(2) grammar is weak in a number of
ways (although I'm working on an improvement).

I think that in the end, the LL vs LR decision may pale in comparison
to the overall size and complexity of the task. My parser is now 30k
lines of code (and that is low because it uses a lot of STL containers).
I suspect that size will double before the parser is complete, and
will double again when enough semantic analysis is added to perform
complete error reporting.

> [I suspect the correct answer to this question is "no". -John]

:) Right -- neither straight LL(k) nor LR(k) works for any finite k.
C++ is such a horrible language to parse.
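For the record, two of the standard examples (both well known from
the C++ literature) of why no fixed lookahead k suffices:

    a * b;       // a declaration of pointer b if 'a' names a type,
                 // a multiplication expression otherwise; the answer
                 // lives in the symbol table, not in any k tokens

    T f(U(u));   // if U names a type, this declares a function f taking
                 // a parameter u of type U (redundant parens); otherwise
                 // it declares an object f initialized with the call U(u).
                 // The parenthesized region can be arbitrarily long, so
                 // even with type information the decision can require
                 // unbounded lookahead.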

john lilley

David L Moore

Jan 25, 1997

Kari Ikonen wrote:
> Is LALR(1) or LL(k) based parser better for parsing C++ code?
> Which one of these is easier to handle?

John Lilley wrote:
> 2) What paradigm are you more comfortable with? I find LL to be much
> easier to grasp and debug than LALR, but that's because LALR makes my
> brain hurt. Someone smarter than I may find LALR easier to work with,
> because fewer ambiguities will be encountered. But an LALR(1)
> parser-generator with no backtracking may prove to be unworkable.

I've been enhancing an LALR (YACC) grammar for a (relatively small)
subset of C++. Enhancing it was often a considerable pain. I
inherited the original parser, so one might be able to do better by
carefully crafting a grammar oneself.

I have written parsers for other languages in the past, including
Pascal and Modula-2, using recursive descent. A Pascal
recursive-descent parser is pretty much a no-brainer - you just
generate your first sets and then follow the railroad tracks. You can
probably manage without actually generating the first sets.
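For anyone who hasn't written one, the railroad-tracks style looks
roughly like this (a sketch, all names hypothetical): the FIRST set
of each alternative picks the branch, and Pascal's grammar keeps
those sets disjoint, so LL(1) dispatch is a simple switch:

    enum TokenKind { IF, WHILE, IDENT, END };

    TokenKind peek();                       // supplied by the lexer
    void parseIf();                         // one function per rule;
    void parseWhile();                      // bodies omitted here
    void parseAssign();
    void error(const char* msg);

    // statement : ifStatement | whileStatement | assignment ;
    void parseStatement() {
        switch (peek()) {
        case IF:    parseIf();     break;   // FIRST(ifStatement)    = { IF }
        case WHILE: parseWhile();  break;   // FIRST(whileStatement) = { WHILE }
        case IDENT: parseAssign(); break;   // FIRST(assignment)     = { IDENT }
        default:    error("statement expected");
        }
    }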

The C++ subset was the first time I had used YACC for a serious
grammar.

Several issues stand out. First, it is much harder to debug LALR
grammars. You can get your average debugger to track through semantic
code embedded in the grammar, but you cannot (at least I cannot) get
it to print out those variables with names like $$. Also, single
stepping will insist on stepping through the parser code as well. Some
of this is a feature of YACC rather than LALR parsing in general.

Second, controlling the parse in situations where a symbol must be,
for example, a type-id in one context and an undefined-id in another
is hard. Either you have weird productions like <var_decl> =
<type_id> <type_id>; or you have global flags flying back and forth
between the lexer and parser, neither of which is a great solution.
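For concreteness, the flag hack usually has this shape (a sketch, all
names hypothetical -- in real life the test sits in the .l file):

    #include <set>
    #include <string>

    enum { IDENT = 258, TYPE_ID = 259 };   // token codes shared with the parser

    std::set<std::string> typeNames;       // maintained by parser actions
    bool expectTypeId = false;             // the global flag flying back and forth

    // Called from the lexer: same spelling, different token kind,
    // depending on state the *parser* owns.
    int classifyIdent(const std::string& text) {
        if (expectTypeId || typeNames.count(text))
            return TYPE_ID;
        return IDENT;
    }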

Third, I see ambiguities on productions that I think should not be
ambiguous in a full LR grammar. I might be wrong about this since I
have not tried to run the grammar through a full LR parser nor have I
sat down and tried to grind out the states by hand but my suspicion is
that the power you lose with the LALR simplification is significant.

We shall probably replace this parser with a recursive descent parser
at some time in the fairly near future ("real soon now"). If we
proceed, it will be interesting to compare the ease of producing the
two parsers.

Mike Stump

Jan 26, 1997

> I think that in the end, the LL vs LR decision may pale in
> comparison to the overall size and complexity of the task. My
> parser is now 30k lines of code (and that is low because it uses a
> lot of STL containers). I suspect that size will double before the
> parser is complete, and will double again when enough semantic
> analysis is added to perform complete error reporting.

Gee, ouch! Ours is by no means the perfect C++ parser (g++), but we
have quite a bit of the language in there, and it is only 4,135 lines,
with another 441 lines of random support code. 120k lines for a
parser seems like a lot, and that is low?

Hum, can I ask what others do for parsing C++? How many lines and at
what level of completeness?

Scott Stanchfield

Jan 26, 1997

> Several issues stand out. First, it is much harder to debug LALR
> grammars. You can get your average debugger to track through
> semantic code embedded in the grammar, but you cannot (at least I
> cannot) get it to print out those variables with names like
> $$. Also, single stepping will insist on stepping through the parser
> code as well. Some of this is a feature of YACC rather than LALR
> parsing in general.

Just a note here -- the "$$ variables" in yacc actually get turned into
references like yypvt[0]. Depends on the yacc you're using. In some, the
$1, $2, $3 are represented by (if you just reduced something like a: b c d)
yypvt[0] -- most recent ($3)
yypvt[-1] -- next back ($2)
yypvt[-2] -- next back from that ($1)

and so on. Basically, yypvt is a pointer into the stack where all the
fun happens. But beware -- this can change based on the yacc you're
using...
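To make that concrete: a rule and action like

    expr : expr '+' term   { $$ = mkAdd($1, $3); }

(mkAdd being a made-up user function) comes out of an AT&T-style yacc
looking roughly like the C below -- a sketch; the exact names and
offsets vary with the yacc, as noted above -- and this is what the
debugger actually shows you:

    case 3:                                   /* expr : expr '+' term */
        yyval = mkAdd(yypvt[-2], yypvt[0]);   /* $$ = mkAdd($1, $3)   */
        break;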

That aside, I still think table-driven parsers (whether they are LL,
LR or LALR) are just plain hard to debug. Recursive-descent is much
easier and it's really possible to see a correlation between the code
and the grammar.

> Second, control of parsing in those situations where a symbol must
> be (for example), a type-id in one context or an undefined-id in
> another are hard to handle. Either you have weird productions like
> <var_decl> = <type_id> <type_id>; or you have globals flags flying
> back and forth between the lexer and parser, neither of which is a
> great solution.

This is handled well by a parser-generator such as PCCTS, where you
can add semantic predicates to choose alternatives based on a
condition. Rather than set up a test in the lexer and return
different tokens, you return a single type of token in either case and
predicate the grammar to choose rules based on the token AND a
condition. Glad to see someone who is as dead against parser-->lexer
backtalk as I...
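Roughly what that looks like in PCCTS notation (quoted from memory of
the 1.x syntax, so treat it as a sketch; isTypeName is a user-supplied
function, not something PCCTS provides):

    stat : <<isTypeName(LATEXT(1))>>? declaration
         | expression
         ;

The first alternative is tried only when the predicate holds on the
lookahead token; both alternatives still see the same ID token kind
from the lexer.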

> Third, I see ambiguities on productions that I think should not be
> ambiguous in a full LR grammar. I might be wrong about this since I
> have not tried to run the grammar through a full LR parser nor have
> I sat down and tried to grind out the states by hand but my
> suspicion is that the power you lose with the LALR simplification is
> significant.

> We shall probably replace this parser with a recursive descent
> parser at some time in the fairly near future ("real soon now"). If
> we proceed, it will be interesting to compare the ease of producing
> the two parsers.

Have a look at PCCTS -- look at
http://java.magelang.com/antlr/entry.html

for some great starting info on it. PCCTS (ANTLR/DLG) generates
recursive-descent parsers that are really cool. I also have a
tutorial in progress at
http://www.scruz.net/~thetick/pcctstut

As for C++ with PCCTS, see John Lilley's work (via the entry page above).
--
Scott Stanchfield
Santa Cruz, CA

David L Moore

Jan 29, 1997

Scott Stanchfield wrote:

> Just a note here -- the "$$ variables" in yacc actually get turned into
> references like yypvt[0]. Depends on the yacc you're using. In some, the
> $1, $2, $3 are represented by (if you just reduced something like a: b c d)
> yypvt[0] -- most recent ($3)
> yypvt[-1] -- next back ($2)
> yypvt[-2] -- next back from that ($1)
>
> and so on. Basically, yypvt is a pointer into the stack where all the
> fun happens. But beware -- this can change based on the yacc you're
> using...

Indeed, but you have to work out the mapping and then TYPE IT IN
rather than just double clicking.

An alternative is to remove all the #line directives from the x.tab.c
file. Your debugger will then throw up x.tab.c rather than thing.y as
the source code. The only trouble with this is that now it is hard to
work out what production you are in. (If you leave the #line
directives in as comments, you can bring up the .y file in a separate
window and go to that line to find the production.)

Those $1, $2, etc. are a real pain when you change a production - you
have to go through and renumber. It would be much easier if you could
add a label to each item in a production. In an MSP parser I wrote
many years ago (pre-yacc), you could write "Ident.A = TypeId.type
Ident.B" and then use A and B rather than $0 and $2. (A:Ident might
have been better.) This might solve the debug problem too.

I wonder how well MSP would work on C++ :-) ?

> Have a look at PCCTS -- look at
> http://java.magelang.com/antlr/entry.html

Funny you should mention that.....
