
Why do LL(1) parsers not support left recursion?


DevInstinct
Jul 16, 2006, 1:05:40 AM
Hi, first of all, I'm not an expert in the theory of computation.

I've read about LL(1) parsers and I have seen that they do not support
left recursion, because it is said that it would lead to infinite
recursion.

My question: why is that? In which case can an LL(1) parser enter
infinite recursion?

I can't find a good example that demonstrates that.

Thanks!

Quinn Tyler Jackson
Jul 16, 2006, 10:44:06 AM

expr ::= expr "+" expr;
expr ::= '[0-9]+';

OK, using a top-down approach, parse:

5+2

As soon as you enter expr, you enter expr, which brings you to expr, which
brings you to expr, after which you enter expr, ....

And it only gets deeper.
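To see the trap in code: below is a minimal, hypothetical Java sketch (the names are mine, not from any real generator) of what a naive recursive-descent procedure for the first production does. The first symbol of the chosen alternative is expr itself, so the procedure calls itself before consuming anything:

```java
public class LeftRec {
    // expr ::= expr "+" expr | '[0-9]+'
    // A naive top-down parser tries the first alternative by calling
    // expr() before it has consumed a single token, so it recurses forever.
    static void expr() {
        expr(); // first symbol of the first alternative: recurse immediately
        // ... the "+" and the right-hand expr are never reached
    }

    public static void main(String[] args) {
        try {
            expr();
        } catch (StackOverflowError e) {
            System.out.println("infinite left recursion: stack overflow");
        }
    }
}
```

No input is ever examined; the stack overflows before the parser looks at even one character.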

That can be removed by factoring*, which is an exercise I leave to the
reader.


--
Quinn Tyler Jackson

http://press.ChevalierEditions.com

* Factoring gets you around the issue in a known and legitimate way, but
produces parse trees with many intermediate nodes that must be traversed
before you get to the node type of the actual expression. Something like:

expr
  power
    mult
      div
        add
          number
            integer

Where what you really only need is:

expr
  integer

I have found that trimming the junk intermediate nodes "sooner rather than
later" makes for easier traversal during tree interpretation. YMMV.

Tom Copeland
Jul 18, 2006, 1:17:24 AM
On Sun, 2006-07-16 at 10:44 -0400, Quinn Tyler Jackson wrote:
> * Factoring gets you around the issue in a known and legitimate way, but
> produces parse trees with many intermediate nodes that must be traversed
> before you get to the node type of the actual expression. Something like:
>
> expr
>   power
>     mult
>       div
>         add
>           number
>             integer
>
> Where what you really only need is:
>
> expr
>   integer
>
> I have found that trimming the junk intermediate nodes "sooner rather than
> later" makes for easier traversal during tree interpretation. YMMV.

JJTree (comes with JavaCC) uses a notation for doing this called
conditional node descriptors, e.g. the "#AdditiveExpression(>1)" below:

void AdditiveExpression() #AdditiveExpression(>1) :
{}
{
  MultiplicativeExpression() ( ( "+" | "-" ) MultiplicativeExpression() )*
}

ensures that AdditiveExpression nodes will only be created if they have
more than one child node.

I wrote a fair bit of ugly AST trimming code - which I shamefacedly
deleted after re-reading the documentation and discovering this
feature :-)

Yours,

Tom

Hans-Peter Diettrich
Jul 18, 2006, 1:17:09 AM
Quinn Tyler Jackson wrote:

>> My question: why is that? In which case a LL(1) parser can enter
>> infinite recursivity?
>>
>> I can't find a good example that demonstrates that.
>
> expr ::= expr "+" expr;
> expr ::= '[0-9]+';
>
> OK, using a top-down approach, parse:
>
> 5+2
>
> As soon as you enter expr, you enter expr, which brings you to expr, which
> brings you to expr, after which you enter expr, ....

This example applies to both LL and LR parsers. It should be clear that
a parser generator should handle such cases appropriately.

In the case of an LL parser, a recursive call can occur only after a
token has been consumed. An LR parser likewise should not perform an
epsilon move if something else is possible.

DoDi

SM Ryan
Jul 18, 2006, 1:18:40 AM
"DevInstinct" <martin_...@videotron.ca> wrote:
# Hi, first of all, I'm not an expert in the theory of computation.
#
# I've read about LL(1) parsers and I have seen that they do not support
# left recursion, because it is said that it would lead to infinite
# recursivity.

LL(k) grammars make parsing decisions based on the first k symbols.

For
S ::= Sb | a
FIRST(Sb,1) = FIRST(S,1) + FIRST(a,1) = {a}
FIRST(a,1) = {a}

So both alternatives have the same FIRST(1) sets, and you can't decide
which alternative to use. It's not an LL(1) grammar.
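One way to make that conflict concrete is to imagine filling in the LL(1) parse table. This hypothetical Java sketch (names are mine) puts each production of S into the table cell for its FIRST set and finds both landing in [S, 'a']:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FirstConflict {
    // S ::= S b | a
    // FIRST(Sb,1) = FIRST(S,1) = {a} and FIRST(a,1) = {a}, so both
    // productions land in the same table cell [S, 'a'].
    static Map<Character, List<String>> tableRowForS() {
        Map<Character, List<String>> row = new HashMap<>();
        row.computeIfAbsent('a', k -> new ArrayList<>()).add("S -> S b"); // FIRST = {a}
        row.computeIfAbsent('a', k -> new ArrayList<>()).add("S -> a");   // FIRST = {a}
        return row;
    }

    public static void main(String[] args) {
        List<String> cell = tableRowForS().get('a');
        System.out.println(cell.size() > 1
            ? "conflict in [S, a]: not LL(1)" : "LL(1)");
    }
}
```

A cell holding more than one production means the parser cannot pick an alternative from one symbol of lookahead, which is exactly the LL(1) conflict described above.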

--
SM Ryan http://www.rawbw.com/~wyrmwif/
I'm not even supposed to be here today.

Rahul Chaudhry
Jul 18, 2006, 2:06:05 PM
I think the real problem here can be appreciated with a slightly more
complex example:

S ::= Ab | Ac
A ::= Aa | <epsilon>

Here, the grammar cannot be parsed by an LL(k) parser for any finite k,
as I can always construct a sentence in the language that needs more
than k symbols of lookahead to make the decision on the first
production from 'S'.

A sentence beginning with k a's needs to look at at least (k+1) symbols
to make the decision on which rule to use to expand 'S'. And expanding
'S' to one of the alternatives is the first step a top-down parser has
to do.

The grammar above is just the regular expression a*(b|c). It can easily
be re-written to make it friendly to LL parsing.
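For example, one LL-friendly rewrite of a*(b|c) is S ::= 'a' S | 'b' | 'c': each 'a' is consumed before the (tail) recursion, so a single symbol of lookahead always decides. A hypothetical Java sketch, with the tail call flattened into a loop:

```java
public class AStarBC {
    // Language a*(b|c), rewritten LL(1)-style as:
    //   S ::= 'a' S | 'b' | 'c'
    // Each 'a' is consumed before "recursing", so one lookahead
    // symbol suffices at every step.
    static boolean parse(String s) {
        int i = 0;
        while (i < s.length() && s.charAt(i) == 'a') i++; // S ::= 'a' S
        return i == s.length() - 1
            && (s.charAt(i) == 'b' || s.charAt(i) == 'c'); // S ::= 'b' | 'c'
    }

    public static void main(String[] args) {
        System.out.println(parse("aaab")); // true
        System.out.println(parse("aaa"));  // false: missing final b or c
    }
}
```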

An LR (or LALR) parser does bottom-up parsing. The reduction to 'S'
from one of the alternatives is the last step for it, and hence it can
handle such a grammar just fine. (Try it with yacc.)

Hope this helps,
Rahul

Chris F Clark
Jul 19, 2006, 2:33:56 PM
"DevInstinct" <martin_...@videotron.ca> asked:

> My question: why is that? In which case a LL(1) parser can enter
> infinite recursivity?
>
> I can't find a good example that demonstrates that.

Quinn Tyler Jackson answered:


> expr ::= expr "+" expr;
> expr ::= '[0-9]+';
>
> OK, using a top-down approach, parse:
>
> 5+2
>
> As soon as you enter expr, you enter expr, which brings you to expr, which
> brings you to expr, after which you enter expr, ....

Hans-Peter Diettrich <DrDiet...@aol.com> challenged:


> This example applies to both LL and LR parsers. It should be clear that
> a parser generator should handle such cases appropriately.
>
> In the case of an LL parser, a recursive call can occur only after a
> token has been consumed.

An LL parser generator cannot handle the case correctly, as the FIRST
of both alternatives is 0-9. Thus, while it would be nice if an LL
parser would wait to consume a token before making a recursive call,
the LL algorithm requires one to make the recursive call before
consuming the token, because making the call is necessary to allow the
called procedure to consume the token. It also isn't a solution
simply not to make the recursive call, for if you don't make the
recursive call, you simply have eliminated the recursive rule from the
grammar. If you simply wait to make the recursive call until after
the token has been consumed, you have changed the grammar in another
way.

The point is that in an LL parser, you need to determine by looking at
the lookahead (FIRST sets in this case) how many calls must be made,
because you must make exactly the right series of calls before
consuming the token.

However, Dodi's assertion that LR parsers should (will) properly
handle this case is correct.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------

Hans-Peter Diettrich
Jul 21, 2006, 2:06:51 PM
Chris F Clark wrote:

> An LL parser generator cannot handle the case correctly, as the FIRST
> of both alternatives is 0-9.

What's "correct" handling of such a case? LR parser generators also use
heuristics in ambiguous situations.

> Thus, while it would be nice if an LL parser would wait to consume a
> token before making a recursive call, the LL algorithm requires one
> to make the recursive call before consuming the token, because
> making the call is necessary to allow the called procedure to
> consume the token.

But the procedure already *has* been called, so it IMO is allowed to
consume the token.

> It also isn't a solution simply not to make the recursive call, for
> if you don't make the recursive call, you simply have eliminated the
> recursive rule from the grammar. If you simply wait to make the
> recursive call until after the token has been consumed, you have
> changed the grammar in another way.

I agree that parser generators tend to change the grammar, in order to
resolve conflicts. The sample grammar contains no indication about the
grouping of the operation, but what does that *mean*? Does it mean
that the grammar is useless, invalid, or simply ambiguous? Provided
that the grammar is considered to be ambiguous, a parser generator
either can reject it, or do whatever it likes.

Let me give a simpler grammar:

expr ::= expr;
expr ::= '[0-9]+';

Here an LL parser might enter infinite recursion *before* consuming a
token, whereas an LR parser might recurse *after* consuming a token.

>
> The point is that in an LL parser, you need to determine by looking at
> the lookahead (FIRST sets in this case) how many calls must be made,
> because you must make exactly the right series of calls before
> consuming the token.
>
> However, Dodi's assertion that LR parsers should (will) properly
> handle this case is correct.

It's not a matter of the parser, but instead of the parser generator. I
see no reason why a parser generator should be able to create an LR
parser for the original grammar, but not an LL parser. And, for
completeness, a parser generator IMO should reject the grammar anyhow.

DoDi

Andru Luvisi
Jul 21, 2006, 4:01:39 PM
>>>>> "DevInstinct" == DevInstinct <martin_...@videotron.ca> writes:

DevInstinct> My question: why is that? In which case a LL(1)
DevInstinct> parser can enter infinite recursivity?

DevInstinct> I can't find a good example that demonstrates that.

Consider the grammar

term ::= '[0-9]+';
expr ::= expr '+' term |
         term;

and the expression

1 + 2 + 3 + 4 + 5

Ideally, this should be parsed into an abstract syntax tree that looks
something like:

add(add(add(add(1, 2), 3), 4), 5)

Here's the problem. When you're handling the 1, you don't know
whether to recurse into the "expr '+' term" or terminate on the
"term". You actually should recurse four times and then terminate,
but you can't know that without looking way ahead to the end of the
expression, which can be arbitrarily far away, and you're only allowed
one token of lookahead in LL(1).
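The standard fix replaces the left recursion with iteration, e.g. expr ::= term ('+' term)*, and folds inside the loop so the tree still nests to the left. A hypothetical Java sketch (the tokenizer and node-printing are mine, just for illustration):

```java
public class ExprParser {
    // expr ::= term ('+' term)*   -- left recursion replaced by a loop.
    // Folding inside the loop keeps the AST left-associative:
    // add(add(add(add(1, 2), 3), 4), 5)
    private final String[] tokens;
    private int pos = 0;

    ExprParser(String src) {
        // split "1+2" into ["1", "+", "2"], keeping the delimiters
        tokens = src.replace(" ", "").split("(?<=[+])|(?=[+])");
    }

    private String term() {           // term ::= '[0-9]+'
        return tokens[pos++];
    }

    String expr() {
        String left = term();
        while (pos < tokens.length && tokens[pos].equals("+")) {
            pos++;                    // consume '+'
            left = "add(" + left + ", " + term() + ")";
        }
        return left;
    }

    public static void main(String[] args) {
        System.out.println(new ExprParser("1 + 2 + 3 + 4 + 5").expr());
    }
}
```

Running it prints add(add(add(add(1, 2), 3), 4), 5): the loop consumes a token before each "recursive" step, which is exactly what the naive left-recursive procedure could not do.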

Andru
--
Andru Luvisi

Quote Of The Moment:
"If you give someone Fortran, he has Fortran.
If you give someone Lisp, he has any language he pleases."
-- Guy Steele

Chris F Clark
Jul 21, 2006, 5:34:02 PM
Preamble: excuse this note for being excessively harsh. Dodi normally
posts correct information, and as such deserves to be held to a high
standard, cognizant of the respect he has earned. However, he pushed
a hot button of mine by suggesting that left recursion was some
heuristic thing. At the same time, I made a mistake when originally
writing this by assuming that Quinn's grammar was unambiguous, which it
is not. I read Quinn's grammar and assumed it was Andru's. It is a
mistake I commonly make, because there are heuristics that change
Quinn's grammar into Andru's, and these heuristics are commonly
applied by LR parser generators and camouflage the fact that Quinn's
grammar is actually ambiguous.

Quinn wrote (q>):

q> expr ::= expr "+" expr;
q> expr ::= '[0-9]+';

Andru corrected (a>):

a> term ::= '[0-9]+';
a> expr ::= expr '+' term |
a>          term;

I wrote (> >):

> > An LL parser generator cannot handle the case correctly, as the FIRST
> > of both alternatives is 0-9.

Dodi replied:


> What's "correct" handling of such a case? LR parser generators also use
> heuristics in ambiguous situations.

Now, I mistook Quinn's ambiguous grammar for Andru's correct grammar.
Thus, for Quinn's grammar you are correct. For Andru's grammar, the
situation is different.

Left recursion is not an inherently ambiguous situation. It is
perfectly valid in LR parsing to use left recursion and it does not
require heuristics to do so. The correct handling of non-ambiguous
left recursive cases is perfectly well defined. Ambiguous grammars do
require heuristics and considering them invalid is perfectly rational.

However, even for unambiguous left-recursive grammars the LL algorithm
fails to construct a parser, while the LR algorithm will properly
construct a parser.

> > Thus, while it would be nice if an LL parser would wait to consume a
> > token before making a recursive call, the LL algorithm requires one
> > to make the recursive call before consuming the token, because
> > making the call is necessary to allow the called procedure to
> > consume the token.

> But the procedure already *has* been called, so it IMO is allowed to
> consume the token.

Yes, but now it must make a recursive call. It must make the
recursive call because, if it were the name of a different
non-terminal it would have to call that non-terminal (and that might
lead indirectly back to a recursive call on this non-terminal).

> > It also isn't a solution simply not to make the recursive call, for
> > if you don't make the recursive call, you simply have eliminated the
> > recursive rule from the grammar. If you simply wait to make the
> > recursive call until after the token has been consumed, you have
> > changed the grammar in another way.

> I agree that parser generators tend to change the grammar, in order to
> resolve conflicts. The sample grammar contains no indication about the
> grouping of the operation, but what does that *mean*? Does it mean
> that the grammar is useless, invalid, or simply ambiguous? Provided
> that the grammar is considered to be ambiguous, a parser generator
> either can reject it, or do whatever it likes.

The problem here is there is no conflict in Andru's LR case, because
the LR algorithm correctly handles left-recursion. In Quinn's case,
you are right, there is no indication of the grouping of the operation
and the grammar is ambiguous. (So, if you are chastising me for
seeing Andru's grammar while reading Quinn's, I stand perfectly
chastised.)

By the way, the definition of correct parse is also not technology-
specific. Every grammar is either ambiguous or has exactly one correct
parse. Now, some parsing technologies cannot find the correct parse
for some grammars: that includes some grammars for which there are no
LL or LR parsers, some which have LR parsers but not LL ones, and some
with LL(k) parsers but not LR(1) ones, etc. However, that doesn't
change the definition of correct parse; it's a rigorous, precise,
mathematical concept.

> > The point is that in an LL parser, you need to determine by looking at
> > the lookahead (FIRST sets in this case) how many calls must be made,
> > because you must make exactly the right series of calls before
> > consuming the token.
> >
> > However, Dodi's assertion that LR parsers should (will) properly
> > handle this case is correct.
>
> It's not a matter of the parser, but instead of the parser generator.

Ok, let me make that more clear. No parser generator executing the LL
algorithm for constructing a parser, can construct a parser from a
grammar with left-recursion. A parser generator running any LR
(e.g. LR(0), SLR, LALR, canonical LR, minimal state LR, etc.)
algorithm can construct parsers for left-recursive grammars, presuming
that grammar has no conflicts (and Andru's grammar is conflict-free).

There is a fundamental difference in the LL and LR parser generator
algorithms which allows LR parsers to be constructed for left-recursive
cases, while LL parsers are fundamentally limited from handling such
cases.

And that's the nit which is a hot button for me. It's fundamentally
easier to write an LL parser generator because the algorithm is
simpler, but it handles fewer cases; it's not that writing an LR parser
generator is fundamentally that hard. Moreover, recursive descent
parsers are popular because their output is so "readable". Being able
to use left recursion (and the oft-maligned precedence and
associativity declarations) can make the grammar itself simpler and
more readable. Those are some of the reasons to prefer LR parsing.
Therefore, I get overly defensive when those advantages are
misunderstood or belittled.

Quinn Tyler Jackson
Jul 21, 2006, 9:17:51 PM
Chris noted:

> Quinn's grammar into Andru's, and these heuristics are commonly
> applied by LR parser generators and camouflage the fact that Quinn's
> grammar is actually ambiguous.

Whoops. Yes. I knew it was also ambiguous, but was only looking at the left
recursive aspect of it. That's what I get for running off half-loaded with
my example. A better example might have been:

list ::= list COMMA item | item;
item ::= NUMBER | STRING_LITERAL;

Or some such. This could be expressed right recursively by changing list to:

list ::= item COMMA list | item;

The left recursive form might be the preferred form for an LR parser,
since it could, in theory, parse a very long list without producing a
horrendously deep stack in the process, whereas the right recursive
form will tend to produce a deep stack. The left recursive form
precludes parsing by LL methods; the right form is acceptable to LL
(and still results in a very deep stack with very long lists). A
typical workaround in the LL world is extended BNF like this:

list ::= item (COMMA item)*

Sometimes written

list ::= item {COMMA item}

This produces an iterative parser that doesn't build a deep stack, and is
also not recursive. (There's nothing to stop an LR parser generator author
from using those extensions to do the same thing, but since the left
recursive form does the same thing, there's no perceived need for such
extensions, perhaps.)
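As a small illustration of that point, a loop-driven parser for list ::= item (COMMA item)* keeps the stack flat no matter how long the list is. A hypothetical sketch with token handling reduced to the bare minimum:

```java
public class ListParser {
    // list ::= item (COMMA item)*  parsed iteratively: the stack stays
    // flat however many items arrive, unlike either recursive form.
    static int countItems(String[] toks) {
        int i = 0, items = 0;
        i++; items++;                              // first item
        while (i < toks.length && toks[i].equals(",")) {
            i += 2;                                // COMMA plus next item
            items++;
        }
        return items;
    }

    public static void main(String[] args) {
        String[] toks = "1,2,3,4".split("(?<=,)|(?=,)"); // keep commas as tokens
        System.out.println(countItems(toks));            // prints 4
    }
}
```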

Now, in a similar tone to "heuristics are commonly applied ... to camouflage
the fact...", Dodi said (in another message in this thread):

> It's not a matter of the parser, but instead of the parser generator. I
> see no reason why a parser generator should be able to create an LR
> parser for the original grammar, but not an LL parser. And, for
> completeness, a parser generator IMO should reject the grammar anyhow.

It's easy to catch (either direct or indirect) left recursion in an LL
generator and then generate a build error. This, IMO, is the preferred
method.

Although in some cases, it's also possible to generate a left-factored
version of the rule and quietly generate an equivalent grammar that doesn't
contain the left-recursion, I've always been against invisible fixes of what
are fundamental grammar construction issues. (Such modifications can produce
inexplicably deep parse trees that contain nodes that were manufactured by
the generator to bypass the issue, for example, and I prefer nodes to come
from explicitly stated rules rather than something the generator dreamed
up.)

Adaptive grammars that use an LL algorithm to effect a parse run into
another problem with left recursion. It may not be knowable until run
time (in the middle of a parse) whether or not a grammar is left-recursive:

x ::= (y|z) COLON list
y ::= /* something that sets foo to non-lambda */
z ::= /* something that sets foo to lambda */
list ::= foo list COMMA item | item;

If x follows the z path, list is left-recursive at run time. Although
it is technically possible to strictly detect this at grammar-compile
time "in the simple case", one cannot do it "in the general case"
because of Turing power, and foolproof predetermination that foo
can/cannot be empty implies a solution to the Halting Problem. The
adaptive(k) algorithm allows list but fails at run time if foo is
lambda. (It does this by using a sanity check on stack depth when list
is entered; if foo is empty, that sanity check will fail when the stack
quickly hits the upper limit.)

(Please note that I wrote this email in haste -- all blunders my own.)

Hans-Peter Diettrich
Jul 22, 2006, 6:23:09 PM
Chris F Clark wrote:

> Preamble: excuse this note for being excessively harsh. Dodi normally
> posts correct information, and as such deserves to be held to a high
> standard, cognizant of the respect he has earned. However, he pushed
> a hot button of mine by suggesting that left recursion was some
> heuristic thing...

Let me summarize my points (John, you may drop my long answer ;-)

1. LL parsers cannot handle left recursion, whereas LR parsers cannot
handle right recursion.

2. Most languages are ambiguous at the syntax level, so that implicit
disambiguation (longest match...) or explicit semantical constraints
must be introduced. (see: dangling else...).

3. Only LL(1) recursive descent parsers are readable, that's why no
LL(k) parser generators exist, in contrast to LR(k) parser generators.

4. When at least one terminal must be consumed, before a recursive
invocation of a rule, no infinite recursion can occur. (every loop will
terminate after all terminals in the input stream have been consumed)

Any objections, so far?


Ad (1+2): We should keep grammars apart from languages. Most languages
require recursive grammars, but allow for both left or right recursive
grammars.
Languages with "expr '+' expr" or "list ',' list" can be parsed in
multiple ways. Unless there exist additional (semantical) restrictions,
correct and equivalent left or right recursive grammars can be
constructed for such languages.
And when a human is allowed to disambiguate a grammar for such a
language himself, a parser generator should be allowed to do the same ;-)

Ad (3): This is the only reason for the preference of LR parsers. Why
spend time with tweaking a grammar to LR(1)/LL(1), when a parser
generator for LR(k) is available?

DoDi

Max Hailperin
Jul 22, 2006, 6:23:43 PM
Chris F Clark <c...@shell01.TheWorld.com> writes:
....

> > It's not a matter of the parser, but instead of the parser generator.
>
> Ok, let me make that more clear. No parser generator executing the LL
> algorithm for constructing a parser, can construct a parser from a
> grammar with left-recursion. A parser generator running any LR
> (e.g. LR(0), SLR, LALR, canonical LR, minimal state LR, etc.)
> algorithm can construct parsers for left-recursive grammars, presuming
> that grammar has no conflicts (and Andru's grammar is conflict-free).
....

You seem to be conceding too much here. The essential difference is
in the parsers, not just the parser generators. Recall (as I
explained in http://compilers.iecc.com/comparch/article/06-04-136 )
that an LR parser is a shift/reduce parser whereas an LL parser is a
confirm/expand parser. Each also has a control that decides at each
step which of the two possible actions to take (whether to shift or
reduce in an LR parser, whether to confirm or expand in an LL parser).
The parser generator only influences that control function, not what
the two possible actions are. Yet the issue with left recursion stems
directly from the available actions, not from the control. For a
left-recursive grammar, no control -- no matter how it works or how it
was generated -- can possibly decide between confirming and expanding
given only the previous input and a bounded amount of lookahead into
the future input. Deciding between shifting and reducing, on the
other hand, is possible.

-Max Hailperin
Professor of Computer Science
Chair, Mathematics and Computer Science Department
Gustavus Adolphus College
http://www.gustavus.edu/+max/

Etienne Gagnon
Jul 22, 2006, 9:30:23 PM
Hans-Peter Diettrich wrote:
> Let me summarize my points (John, you may drop my long answer ;-)
>
> 1. LL parsers cannot handle left recursion, whereas LR parsers cannot
> handle right recursion.

I'm not sure I understand clearly... What do you mean by "LR parsers
cannot handle right recursion"?

There is no LALR(1) conflict in the following "right recursive" grammar
(SableCC-like syntax):

rht_rec_rule =
  {one} some token |
  {many} some token rht_rec_rule;


As far as I remember, LR parsers can deal with both left and right
recursive grammars. Maybe you meant something else?

Etienne

--
Etienne M. Gagnon, Ph.D. http://www.info2.uqam.ca/~egagnon/
SableVM: http://www.sablevm.org/
SableCC: http://www.sablecc.org/
Tokens
  some = ; // dummy
  tokens = ; // dummy

Productions

rht_rec_rule =
  {one} some tokens |
  {many} some tokens rht_rec_rule;

SM Ryan
Jul 23, 2006, 4:18:29 PM
# 1. LL parsers cannot handle left recursion, whereas LR parsers cannot
# handle right recursion.

A left recursive grammar is not an LL(k) grammar, but the grammar can
be mechanically transformed to rid it of left recursion. The resulting
grammar might still not be LL(k).

LR(k) can handle any deterministic grammar. Left recursive productions
only need one production on the stack; right recursion piles up the
entire recursive nest on the stack and then reduces all of them at
once: right recursion requires a deeper stack.

# 2. Most languages are ambiguous at the syntax level, so that implicit
# disambiguation (longest match...) or explicit semantical constraints
# must be introduced. (see: dangling else...).

Only poorly designed programming languages are ambiguous. (Natural
languages are ambiguous and don't use deterministic language theory.)

Many programming language grammars are ambiguous because the people
writing the grammars are lazy and/or uneducated in these matters. The
dangling else nonproblem was solved some 40 years ago. Anyone who
thinks this is still a problem is just too lazy to write a well known
solution.

# 3. Only LL(1) recursive descent parsers are readable, that's why no
# LL(k) parser generators exist, in contrast to LR(k) parser generators.

Recursive descent is not LL(k). Recursive descent is an implementation
technique, not a language class. There are recursive ascent parsers for
LR(k) grammars. LR(k) parsers can be written as recursive finite state
transducers, with right recursion and embedding requiring recursion
and left recursion merely looping; if a language uses left recursion
only (type 3), the LR(k) state diagram is easily convertible to a
finite transducer for the language.

# 4. When at least one terminal must be consumed, before a recursive
# invocation of a rule, no infinite recursion can occur. (every loop will
# terminate after all terminals in the input stream have been consumed)

Confusing implementation with language class. Any grammar that
includes a rule such as A -> A | empty is ambiguous therefore
nondeterministic therefore not LR(k) therefore not LL(k). There are
many other things that keep a grammar or language from being LL(k);
LL(k) does not include all deterministic languages.

# Ad (1+2): We should keep grammars apart from languages. Most languages
# require recursive grammars, but allow for both left or right recursive
# grammars.

A language definition that does not depend on vague handwaving (one or
two such definitions actually exist) bases the semantics on the parse
tree. Since right and left recursion build different parse trees, this
issue is very important in definitions with rigorous semantics.

# Languages with "expr '+' expr" or "list ',' list" can be parsed in
# multiple ways. Unless there exist additional (semantical) restrictions,
# correct and equivalent left or right recursive grammars can be
# constructed for such languages.

Or just write an unambiguous production. It's not any harder to do it
right than to do it lazy and wrong.

If the semantics of a subtract production are that the value of the
right subtree is subtracted from the value of the left subtree, then

    3 - 2 - 1

with left recursion is

    (3 - 2) - 1 = 1 - 1 = 0

and with right recursion is

    3 - (2 - 1) = 3 - 1 = 2

Rather different answers, but unless your software is controlling a
satellite to Venus, I guess sloppiness can be repaired in the next
patch release.
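The two groupings correspond to a left fold versus a right fold over the operand list; a hypothetical Java sketch (helper names are mine):

```java
import java.util.List;

public class Assoc {
    // Left-associative grouping: ((3 - 2) - 1)
    static int foldLeft(List<Integer> xs) {
        int acc = xs.get(0);
        for (int i = 1; i < xs.size(); i++) acc -= xs.get(i);
        return acc;
    }

    // Right-associative grouping: (3 - (2 - 1)), built by recursing
    // from position i to the end of the list
    static int foldRight(List<Integer> xs, int i) {
        if (i == xs.size() - 1) return xs.get(i);
        return xs.get(i) - foldRight(xs, i + 1);
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(3, 2, 1);
        System.out.println(foldLeft(xs));     // prints 0
        System.out.println(foldRight(xs, 0)); // prints 2
    }
}
```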

Algol-60 used left recursion everywhere except exponentiation, so that
the value could be determined from the parse tree without a lot of
misreadable prose.

# And when a human is allowed to disambiguate a grammar for such a
# language himself, a parser generator should be allowed to do the same ;-)

Why bother inserting ambiguity and then removing it again with obscure
rules? Eschew ambiguity from the outset.

You face forward, or you face the possibility of shock and damage.

Max Hailperin
Jul 23, 2006, 4:20:19 PM
Hans-Peter Diettrich <DrDiet...@aol.com> writes:
....

> 1. LL parsers cannot handle left recursion, whereas LR parsers cannot
> handle right recursion.
>
> 2. Most languages are ambiguous at the syntax level, so that implicit
> disambiguation (longest match...) or explicit semantical constraints
> must be introduced. (see: dangling else...).
>
> 3. Only LL(1) recursive descent parsers are readable, that's why no
> LL(k) parser generators exist, in contrast to LR(k) parser generators.
>
> 4. When at least one terminal must be consumed, before a recursive
> invocation of a rule, no infinite recursion can occur. (every loop will
> terminate after all terminals in the input stream have been consumed)
>
> Any objections, so far?

Yes.

Point 1 is incorrect; LR parsers can handle right recursion in
addition to left recursion.

Point 2 is incorrectly stated, at the least. You seem to have meant
not that the languages are ambiguous, but rather that particular
grammars that are commonly used for those languages are ambiguous.
Taking the example of the dangling else problem, there are unambiguous
grammars for this as well as ambiguous ones. The published reference
grammar for Java, for example, is unambiguous, rather than relying
upon an extragrammatical disambiguation rule. (Note that such a
disambiguation rule is not semantic, contrary to what you say, just a
portion of the syntax specification that may, in some cases, be
presented extragrammatically.) Therefore, it seems that your point 2
boils down to a claim that "most" of the time the specifiers of
programming language syntaxes find it more convenient to use an
ambiguous grammar coupled with extragrammatical disambiguation rules,
rather than to use an unambiguous grammar. This is an empirical
question about the frequency with which different notational
approaches are used, which I would be hesitant to try to answer just
based on what I personally have seen. It certainly is not an
implausible claim, however.

Point 3 can be read two ways; both are incorrect. You may be claiming
that recursive descent parsers are only readable when based on
LL(1) grammars, or you may be making the stronger claim that no parser
other than an LL(1)-recursive-descent parser is readable, including no
non-recursive-descent parser. By refuting the weaker claim, I can
show that both are false. There are readable recursive descent
parsers that are not based on LL(1) grammars, both ones that are based
on LL(k) grammars for k>1, and ones that are not based on any LL(k)
grammar, but rather require general unbounded backtracking. Taking
the simpler case of k>1, consider, for example, the following grammar,
which is LL(2) but not LL(1):

S -> a b S | a c

The recursive descent parser for this can be written quite readably.
As some evidence of that, I append to the end of this posting a Java
program that embodies such a parser. (It isn't the world's greatest
Java style, but I think it still makes the point.)

> ...


> Ad (3): This is the only reason for the preference of LR parsers. Why
> spend time with tweaking a grammar to LR(1)/LL(1), when a parser
> generator for LR(k) is available?

No, the preference for LR parsers over LL has little to do with point
3, not only because point 3 is incorrect, but also because you are
here addressing the k>1 case, whereas most practical parsers use k=1
in any case. The issue is that LR(1) parsers are more powerful than
LL(1); the grammars suitable for LR(1) parsing are a proper superset
of those suitable for LL(1) parsing.

-Max Hailperin
Professor of Computer Science
Chair, Mathematics and Computer Science Department
Gustavus Adolphus College
http://www.gustavus.edu/+max/

====== a Java LL(2) recursive descent parser follows ==========

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LL2 {
    // a simple LL(2) recursive descent parser
    // using individual bytes of input as tokens (no lexical analysis)

    private static boolean S(BufferedReader input) throws IOException {
        input.mark(2); // prepare to put back up to 2 characters

        // try production 1
        if (input.read() == 'a' && input.read() == 'b')
            return S(input);
        else
            input.reset();

        // try production 2
        if (input.read() == 'a' && input.read() == 'c')
            return true;
        else
            input.reset();

        // no more productions; failure
        return false;
    }

    public static void main(String[] args) {
        try {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            if (S(in) && in.read() == -1)
                System.out.println("Input successfully parsed.");
            else
                System.out.println("Input was not in the language.");
        } catch (Exception e) {
            System.err.println(e);
        }
    }
}

Carl Barron

Jul 24, 2006, 1:23:10 PM
SM Ryan <wyr...@tsoft.org> wrote:

> # 3. Only LL(1) recursive descent parsers are readable, that's why no
> # LL(k) parser generators exist, in contrast to LR(k) parser generators.

What about ANTLR, or Spirit in Boost? Both generate parsers:
ANTLR in the classical sense, in that it generates code to be compiled
like yacc/bison does, and Spirit is an embedded template-based parser
generator that lets the C++ compiler itself generate the parser from
the EBNF. There are other LL(k)-for-fixed-k parser generators as
well, but these are the ones I have some familiarity with.

Hans-Peter Diettrich

Jul 25, 2006, 12:40:16 AM
SM Ryan wrote:

> # 1. LL parsers cannot handle left recursion, whereas LR parsers cannot
> # handle right recursion.
>
> A left recursive grammar is not an LL(k) grammar, but the grammar can
> be mechanically transformed to rid it of left recursion. The resulting
> grammar might still not be LL(k).
>
> LR(k) can handle any deterministic grammar. Left recursive productions
> only need one production on the stack; right recursion piles up the
> entire recursive nest on the stack and then reduces all of them at
> once: right recursion requires a deeper stack.
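
The mechanical transformation mentioned in the quote is the standard
left-recursion removal schema (a sketch; A, x, y stand for an arbitrary
nonterminal and symbol strings):

```
A -> A x | y          (left recursive)

becomes

A  -> y A'
A' -> x A' | epsilon  (right recursive, usable by an LL parser)
```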

Thanks for your explanations, but I'm still not fully convinced ;-)

So far I have only used CoCo/R to create recursive descent parsers, so
I had to handle all non-LL(1) cases manually. Perhaps in doing so I
applied some modifications to the grammar, e.g. when I made my C
parser work without problems with left-recursive rules.

That's why I thought that LR parsers have similar problems with right
recursion (epsilon moves?), where a parser generator would also have
to apply some built-in rules in order to resolve these problems. But
this is only a feeling; I'm not very familiar with LR parsers, because
I couldn't yet find a working parser generator with Pascal output.

At least I understand now, that right recursion should be *avoided* in
LR grammars, in order to keep the stack size low.


> # 2. Most languages are ambiguous at the syntax level, so that implicit
> # disambiguation (longest match...) or explicit semantical constraints
> # must be introduced. (see: dangling else...).
>
> Only poorly designed programming languages are ambiguous. (Natural
> languages are ambiguous and don't use deterministic language theory.)

Counterexample:
The argument list of a subroutine is nothing but a list, which can be
expressed using either left or right recursion in a grammar.

Perhaps I misused the term "ambiguous" here, when I meant that different
parse trees can be constructed for a sentence of a language?


> Many programming language grammars are ambiguous because the people
> writing the grammars are lazy and/or uneducated in these matters. The
> dangling else nonproblem was solved some 40 years ago. Anyone who
> thinks this is still a problem is just too lazy to write a well known
> solution.

The dangling else problem can be solved by adding implicit general rules
to the interpretation of a language (or grammar?). Of course there exist
ways to prevent the occurrence of such problems, just in the language
specification. But AFAIR it's impossible to prove, in the general case,
that a language is unambiguous.


>
> # 3. Only LL(1) recursive descent parsers are readable, that's why no
> # LL(k) parser generators exist, in contrast to LR(k) parser generators.
>
> Recursive descent is not LL(k). Recursive descent is an implementation
> technique not a language class.

Okay, but what's the relationship between leftmost/rightmost derivation
and a language?

> There are recursive ascent parsers for
> LR(k) grammars. LR(k) parsers can be written as recursive finite state
> transducers, with right recursion and embedding requiring recursion
> and left recursion merely looping; if a language uses left recursion
> only (type iii), the LR(k) state diagram is easily convertible to a
> finite transducer for the language.

I'm not sure what you want to tell me. AFAIR LR(k) (languages?
grammars?) can be transformed into LR(1), for which a finite state
machine can be constructed easily. I assume that a transformation from
LL(k) into LL(1) would be possible as well, using essentially the same
transformation algorithms.

My point is that table driven parsers are unreadable, due to the lost
relationship between the parser code and the implemented language or
grammar. Do there exist LR parsers or parser generators at all, which
preserve the relationship between the parser code and the grammar?


>
> # 4. When at least one terminal must be consumed, before a recursive
> # invocation of a rule, no infinite recursion can occur. (every loop will
> # terminate after all terminals in the input stream have been consumed)
>
> Confusing implementation with language class. Any grammar that
> includes a rule such as A -> A | empty is ambiguous therefore
> nondeterministic therefore not LR(k) therefore not LL(k).

I wanted to present a proof of whether a given grammar is (non-)
deterministic, regardless of the reason for that property.

> There are many other things that keep a grammar or language from
> being LL(k); LL(k) does not include all deterministic languages.

> # Ad (1+2): We should keep grammars apart from languages. Most languages
> # require recursive grammars, but allow for both left or right recursive
> # grammars.
>
> A language definition that does not depend on vague handwaving (one or
> two such definitions actually exist) bases the semantics on the parse
> tree. Since right and left recursion build different parse trees, this
> issue is very important in definitions with rigorous semantics.

With regards to programming languages, there exist many constructs that
do not impose or require the construction of a specific (unique) parse
tree. As long as the parser need not know about the definition of an
identifier, the placement of the definitions in a parse tree is
irrelevant, as long as it doesn't change the semantics of the language
(visibility constraints...).

>
> # Languages with "expr '+' expr" or "list ',' list" can be parsed in
> # multiple ways. Unless there exist additional (semantical) restrictions,
> # correct and equivalent left or right recursive grammars can be
> # constructed for such languages.
>
> Or just write an unambiguous production. It's not any harder to do it
> right than to do it lazy and wrong.
>
> If the semantics of a subtract production are the value of the right
> subtree is subtracted from the value of left subtree, then
> 3 - 2 - 1
> with left recursion is
> = (3 - 2) - 1 = 1 - 1 = 0
> with right recursion is
> = 3 - (2 - 1) = 3 - 1 = 2

This is a property of the asymmetric subtraction operation, which
doesn't apply to the symmetric addition or multiplication operations. Of
course it's a good idea to enforce a unique sequence of *numerical*
operations in program code, whereas in mathematical formulas such
additional restrictions should *not* be built into a grammar.
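
For what it's worth, the two associations of 3 - 2 - 1 quoted above
correspond to a left fold versus a right fold over the operand list; a
small self-contained Java sketch (illustrative code, not from the thread):

```java
import java.util.List;

// Left vs. right association of subtraction as folds over [3, 2, 1].
public class Folds {
    static int foldLeft(List<Integer> xs) {          // ((3 - 2) - 1)
        int acc = xs.get(0);
        for (int i = 1; i < xs.size(); i++) acc -= xs.get(i);
        return acc;
    }
    static int foldRight(List<Integer> xs) {         // (3 - (2 - 1))
        int acc = xs.get(xs.size() - 1);
        for (int i = xs.size() - 2; i >= 0; i--) acc = xs.get(i) - acc;
        return acc;
    }
    public static void main(String[] args) {
        List<Integer> xs = List.of(3, 2, 1);
        System.out.println(foldLeft(xs) + " " + foldRight(xs));  // prints "0 2"
    }
}
```

For a commutative and associative operator the two folds agree; for
subtraction they do not, which is the point being made.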


>
> Rather different answers but unless your software is controlling a
> satellite to Venus, I guess sloppiness can be repaired in the next
> patch release.

No doubt, but IMO you want to introduce more restrictions than required.
A compiler is allowed to apply certain *valid* transformations on a
parse tree, so I cannot see a reason why any grammar or parser
for that language must enforce the construction of one-and-only-one
valid parse tree.


> Algol-60 used left recursion except exponentiation so that the value
> could be determined from the parse tree without a lot of misreadable
> prose.
>
> # And when a human is allowed to disambiguate a grammar for such a
> # language himself, a parser generator should be allowed to do the same ;-)
>
> Why bother inserting ambiguity and then removing it again with obscure
> rules? Eschew ambiguity from the outset.

Here you're talking about the construction of languages, not about the
construction of parsers ;-)

DoDi

Hans-Peter Diettrich

Jul 25, 2006, 12:40:36 AM
Max Hailperin wrote:

>> 1. LL parsers cannot handle left recursion, whereas LR parsers cannot
>> handle right recursion.
>>
>> 2. Most languages are ambiguous at the syntax level, so that implicit
>> disambiguation (longest match...) or explicit semantical constraints
>> must be introduced. (see: dangling else...).
>>
>> 3. Only LL(1) recursive descent parsers are readable, that's why no
>> LL(k) parser generators exist, in contrast to LR(k) parser generators.
>>
>> 4. When at least one terminal must be consumed, before a recursive
>> invocation of a rule, no infinite recursion can occur. (every loop will
>> terminate after all terminals in the input stream have been consumed)
>>
>> Any objections, so far?
>
> Yes.
>
> Point 1 is incorrect; LR parsers can handle right recursion in
> addition to left recursion.

Shame on me :-(


> Point 2 is incorrectly stated, at the least. You seem to have meant
> not that the languages are ambiguous, but rather that particular
> grammars that are commonly used for those languages are ambiguous.

As already stated in a parallel answer, a language can allow for
multiple different parse trees of the same (valid) input, which has
been considered "ambiguous" by several contributors. Can you suggest a
better wording?

> Taking the example of the dangling else problem, there are unambiguous
> grammars for this as well as ambiguous ones. The published reference
> grammar for Java, for example, is unambiguous,

Just looked it up, and the Java grammar not only looks horrible to
me, it can also serve as an example of IMO inappropriate techniques in
the design of a language. I just don't know the order of this
procedure (exponential?) when it has to be applied to disambiguate more
than one production :-(

IMO the maintenance of such a bloated grammar is at least very
inconvenient and error prone.

At least a much simpler Java grammar could have been constructed, by
dropping certain compatibility with the C language syntax in the first step.

> rather than relying
> upon an extragrammatical disambiguation rule. (Note that such a
> disambiguation rule is not semantic, contrary to what you say, just a
> portion of the syntax specification that may, in some cases, be
> presented extragrammatically.)

I'm just thinking about the implications, built into any kind of grammar
or grammar notation. Do they exist, beyond the syntax of the meta
language? Or are they only introduced by specific parser generators, for
the sake of shorter grammar notation?


> Therefore, it seems that your point 2
> boils down to a claim that "most" of the time the specifiers of
> programming language syntaxes find it more convenient to use an
> ambiguous grammar coupled with extragrammatical disambiguation rules,
> rather than to use an unambiguous grammar. This is an empirical
> question about the frequency with which different notational
> approaches are used, which I would be hesitant to try to answer just
> based on what I personally have seen. It certainly is not an
> implausible claim, however.

Just as we use high-level programming languages and let a compiler
translate them into detailed assembly code, we prefer to write compact
grammars and leave it to a parser generator to produce the equivalent
code or automaton. In both cases we should work at the highest (most
compact) level of notation, in order to keep the sources maintainable.
I'm not really happy with macros and similar extensions when they are
used only to make source code shorter, at the cost of transparency.
OTOH I really dislike bloated grammars, with a lot of similar
productions introduced for more or less obvious purposes, as in the
aforementioned Java grammar.


> consider, for example, the following grammar,
> which is LL(2) but not LL(1):
>
> S -> a b S | a c
>
> The recursive descent parser for this can be written quite readably.

Provided that this grammar really is not LL(1), it can be transformed
easily into LL(1) form. IMO this example clearly demonstrates the
advantages of using higher level notations, like EBNF:

S -> a ( b S | c )

instead of:

S -> a S2
S2 -> b S | c
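
Either spelling leads to the same LL(1) recursive-descent code, with a
single token of lookahead deciding between the alternatives after the
'a'. A minimal self-contained Java sketch in the spirit of Max's LL2
example (single-character tokens; class and field names are illustrative):

```java
// LL(1) recursive-descent parser for the factored grammar
//   S -> a ( b S | c )
// using one character of lookahead over an input string.
public class LL1 {
    static String input;
    static int pos;

    static boolean S() {
        // every alternative starts with 'a'
        if (pos >= input.length() || input.charAt(pos++) != 'a') return false;
        // one-token lookahead: 'b' predicts "b S", anything else tries "c"
        if (pos < input.length() && input.charAt(pos) == 'b') {
            pos++;
            return S();
        }
        return pos < input.length() && input.charAt(pos++) == 'c';
    }

    public static void main(String[] args) {
        input = "ababac";
        pos = 0;
        System.out.println(S() && pos == input.length());  // prints "true"
    }
}
```

Note there is no backtracking and no mark/reset, which is exactly what
the factoring buys over the LL(2) version.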

>> Ad (3): This is the only reason for the preference of LR parsers. Why
>> spend time with tweaking a grammar to LR(1)/LL(1), when a parser
>> generator for LR(k) is available?
>
> No, the preference for LR parsers over LL has little to do with point
> 3, not only because point 3 is incorrect, but also because you are
> here addressing the k>1 case, whereas most practical parsers use k=1
> in any case.

Just a question: is your above grammar LR(1)?
I'm somewhat clueless in answering this question myself :-(


> the grammars suitable for LR(1) parsing are a proper superset
> of those suitable for LL(1) parsing.

Thanks for this enlightenment :-)

DoDi

Dmitry A. Kazakov

Jul 25, 2006, 5:58:49 PM
On 25 Jul 2006 00:40:16 -0400, Hans-Peter Diettrich wrote:

> SM Ryan wrote:


>
>> If the semantics of a subtract production are the value of the right
>> subtree is subtracted from the value of left subtree, then
>> 3 - 2 - 1
>> with left recursion is
>> = (3 - 2) - 1 = 1 - 1 = 0
>> with right recursion is
>> = 3 - (2 - 1) = 3 - 1 = 2
>
> This is a property of the asymmetric subtraction operation, which
> doesn't apply to the symmetric addition or multiplication operations. Of
> course it's a good idea to enforce a unique sequence of *numerical*
> operations in program code, whereas in mathematical formulas such
> additional restrictions should *not* be built into a grammar.

Do you mean that association should be handled after parsing? That would be
a strange language. However, if you merely argue that 3 - 2 - 1 should be
parsed as:

      "-"
     / | \
    3  2  1

or (maybe better) as

        "+"
      /  |  \
     /  "-" "-"
    /    |    \
   3     2     1

then I would certainly agree with you. But this is also "built into the
grammar" to me.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Hans-Peter Diettrich

Jul 25, 2006, 6:01:48 PM
Carl Barron wrote:

Unfortunately none of these generate Pascal code. If I were happy with
C/C++, I could use a broad range of tools immediately. But since I'm not
happy with these languages, not even with C# or Java, I decided to write
a C-to-Pascal translator first. Having finished that tool, I'll come back
to those parser generators ;-)

BTW, a template based parser generator for C++ is a very interesting
approach :-)

DoDi
[I see you've never tried typing pascal yacc into Google. John]

Arthur J. O'Dwyer

Jul 25, 2006, 6:03:01 PM
On Tue, 25 Jul 2006, Hans-Peter Diettrich wrote:
> SM Ryan wrote:
[Hans-Peter wrote:]

>> # 2. Most languages are ambiguous at the syntax level, so that implicit
>> # disambiguation (longest match...) or explicit semantical constraints
>> # must be introduced. (see: dangling else...).
>>
>> Only poorly designed programming languages are ambiguous. (Natural
>> languages are ambiguous and don't use deterministic language theory.)
>
> Counterexample:
> The argument list of a subroutine is nothing but a list, which can be
> expressed using either left or right recursion in a grammar.

That's a true statement, but I don't see what it's a "counterexample"
to.

> Perhaps I misused the term "ambiguous" here, when I meant that different
> parse trees can be constructed for a sentence of a language?

Well, that's obvious. I mean, look at the following sentence:
"The dog chases the cat." Now, we can parse that into the tree

       chases                               dog
       /    \                              /   \
     dog    cat   or we can make the     The   the
     /      /           tree                  /   \
   The    the                            chases   cat

Obviously, one of those parse trees is more "natural", and, viewed
through human eyes, expresses something useful about the structure
of the sentence (namely, that a sentence has a verb, and nouns are
more important than articles, and so on).
However, we could probably write one grammar to make the parse tree
on the left, and another grammar to make the parse tree on the right.
Does this mean that something is "wrong" with English, or that the
sentence "The dog chases the cat" admits of "more than one possible
parse tree"? No, not in any useful sense.

In the same way, it's possible to create many different "parse trees"
for a C-language "sentence", indicated here with indentation levels:

   if (x)                if (x)                if (x)
       if (y)                if (y)            if (y)
           foo();                foo();        foo();
       else              else                  else
           bar();            bar();            bar();

Only the leftmost parse tree is useful in understanding the semantics
of the C program; therefore, we call it "correct", and design our
grammars to create this parse tree in preference to the other two.
The middle parse tree demonstrates the "dangling else" non-problem.
It is a /wrong/ parse tree. No correct grammar for C will generate it.
The rightmost parse tree demonstrates the "can't distinguish between
'if' and 'foo'" non-problem. It is also a /wrong/ parse tree; wrong
enough that it is instantly recognizable as nonsensical to human eyes.
Again, no correct grammar for C will generate it.

There is absolutely no qualitative difference between the wrongness
of the middle parse tree and the wrongness of the rightmost parse tree.


In the same way, C has this kind of false "ambiguity" in its lexer.
The string "+++" parses as ++, then +, even though a human might think
it could parse as +, then ++. But any grammar that would let "+++" parse
that way would be a /wrong/ grammar, and wouldn't qualify as a grammar
for C.
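
Java's lexer follows the same maximal-munch rule, so the effect is easy
to demonstrate (illustrative snippet, not from the thread):

```java
// Maximal munch: "a+++b" is tokenized as a ++ + b, i.e. (a++) + b,
// never as a + (++b).
public class MaximalMunch {
    public static void main(String[] args) {
        int a = 1, b = 2;
        int c = a+++b;                     // c = (a++) + b
        System.out.println(c + " " + a);   // prints "3 2"
    }
}
```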


>> Many programming language grammars are ambiguous because the people
>> writing the grammars are lazy and/or uneducated in these matters. The
>> dangling else nonproblem was solved some 40 years ago. Anyone who
>> thinks this is still a problem is just too lazy to write a well known
>> solution.
>
> The dangling else problem can be solved by adding implicit general rules
> to the interpretation of a language (or grammar?). Of course there exist
> ways to prevent the occurrence of such problems, just in the language
> specification. But AFAIR it's impossible to prove, in the general case,
> that a language is unambiguous.

The sentence above is ambiguous. :) If you mean "there's no general
algorithm that will take a grammar G and decide whether it is ambiguous",
I think you're right, because that turns out to be an undecidable
problem.
However, if you mean "nobody has ever proven that any grammar is
unambiguous", you're wrong. It's easy to prove that /some/ specific
grammars are unambiguous. I'm sure all the major programming languages'
reference grammars have been vetted for correctness by now, for example.


Those are my main points. Following are nitpicks.

> My point is that table driven parsers are unreadable, due to the lost
> relationship between the parser code and the implemented language or
> grammar. Do there exist LR parsers or parser generators at all, which
> preserve the relationship between the parser code and the grammar?

Again, you seem to confuse human-centered subjective terms like
"preserve the relationship" with measurable terms. Obviously there is
/some/ kind of special relationship between a yacc-generated parser
and its target language --- something that makes it parse C and not
Java or pig Latin. You're complaining that the relationship isn't
close enough to the surface for humans to see --- but why is that
important?
After all, the point of table-driven parsers is that humans never
/need/ to see the generated code. It "just works."


[SM Ryan wrote:]


>> A language definition that does not depend on vague handwaving (one or
>> two such definitions actually exist) bases the semantics on the parse
>> tree. Since right and left recursion build different parse trees, this
>> issue is very important in definitions with rigorous semantics.

I see where you're coming from, but that's slightly too dogmatic IMHO.
What would you say to a language definition that defined a comma-separated
list as "a list of N expressions, separated by commas" (or some such)?
That's unambiguous, but doesn't imply left- or right-recursion, or in
fact any recursion at all.
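
Such a definition maps naturally onto an iterative implementation; a
self-contained Java sketch over single-character "expressions" (names
illustrative, well-formed input assumed):

```java
import java.util.ArrayList;
import java.util.List;

// Parsing a comma-separated list with a loop: neither left nor right
// recursion, matching a definition like "N expressions separated by commas".
public class CommaList {
    static List<Character> parseList(String s) {
        List<Character> items = new ArrayList<>();
        int i = 0;
        items.add(s.charAt(i++));            // first item
        while (i < s.length() && s.charAt(i) == ',') {
            i++;                              // consume ','
            items.add(s.charAt(i++));         // next item
        }
        return items;
    }
    public static void main(String[] args) {
        System.out.println(parseList("a,b,c"));  // prints "[a, b, c]"
    }
}
```

The resulting "parse tree" is a flat list of N children, which is
exactly what such a definition suggests.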

[...]


> No doubt, but IMO you want to introduce more restrictions than required.
> A compiler is allowed to apply certain *valid* transformations on an
> parse tree, in so far I cannot see a reason why any grammar or parser
> for that language must enforce the construction of one-and-only-one
> valid parse tree.

Certainly any language must ensure that all interpretations of a given
sentence /mean the same thing/. The easiest way to do that is to give
an unambiguous grammar, so that there's only one interpretation of each
sentence in the language. You could do weirder things, but AFAIK pretty
much nobody's ever tried, because the easy way is so much easier. :)

-Arthur

SM Ryan

Jul 28, 2006, 6:42:58 PM
# > Only poorly designed programming languages are ambiguous. (Natural
# > languages are ambiguous and don't use deterministic language theory.)
#
# Counterexample:
# The argument list of a subroutine is nothing but a list, which can be
# expressed using either left or right recursion in a grammar.

This is not about syntax ambiguity, but operator associativity. The
language definition can define
a@b@c = (a@b)@c = a@(b@c)
for some binary operator @. The syntax should still be unambiguous,
but the language lets the compiler reorder operations.

It's important to keep associativity distinct from syntax. Some
operators are associative, some are anti-associative, and some are
not associative in any form. For example integer adds are usually
associative, but real adds are not. The precise order the operands
are summed can be critical in the stability of real arithmetic.
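
A concrete Java illustration of that point, using the classic
cancellation example (illustrative snippet, not from the thread):

```java
// Floating-point addition is not associative: the same three operands
// summed in a different order give different results.
public class FloatAssoc {
    public static void main(String[] args) {
        double a = 1e16, b = -1e16, c = 1.0;
        System.out.println((a + b) + c);  // prints "1.0"
        System.out.println(a + (b + c));  // prints "0.0" (c is lost in b + c)
    }
}
```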

# The dangling else problem can be solved by adding implicit general rules
# to the interpretation of a language (or grammar?). Of course there exist
# ways to prevent the occurrence of such problems, just in the language
# specification. But AFAIR it's impossible to prove, in the general case,
# that a language is unambiguous.

LR(1) parser generation can only succeed for an unambiguous grammar,
in which case the grammar is proven unambiguous. If you avoid bison
shortcuts and resolve all shift/reduce and reduce/reduce conflicts,
you will have an unambiguous LALR grammar and your language is
unambiguous.

The dangling else was solved shortly after it was discovered by defining
open and closed statements. The solution is simple.
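
The open/closed (also called matched/unmatched) formulation can be
sketched as follows; the rule names are illustrative:

```
stmt   -> open | closed
closed -> "if" expr "then" closed "else" closed
        | non-if-statement
open   -> "if" expr "then" stmt
        | "if" expr "then" closed "else" open
```

An "else" can only follow a closed statement, so it always pairs with
the nearest unmatched "if", and the grammar is unambiguous.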

# > Recursive descent is not LL(k). Recursive descent is an implementation
# > technique not a language class.
#
# Okay, but what's the relationship between leftmost/rightmost derivation
# and a language?

None. It's the order in which you replace nonterminals in string
derivation. You end up with the same parse tree for the same grammar
in either case.

# I'm not sure what you want to tell me. AFAIR LR(k) (languages?
# grammars?) can be transformed into LR(1), for which a finite state
# machine can be constructed easily. I assume that a transformation from
# LL(k) into LL(1) would be possible as well, using essentially the same
# transformation algorithms.

No. LL(k+1) includes languages which cannot be LL(k). Also I don't think
there's a mechanical process to convert LR(k) into LR(1), merely that it
is possible.

# My point is that table driven parsers are unreadable, due to the lost
# relationship between the parser code and the implemented language or
# grammar. Do there exist LR parsers or parser generators at all, which
# preserve the relationship between the parser code and the grammar?

Any parser can be considered table driven, with the CPU interpreting
object code. I don't concern myself with the code the parser generator
creates. I concern myself with the grammar representation I give the
parser generator.

# With regards to programming languages, there exist many constructs that
# do not impose or require the construction of a specific (unique) parse

And many that do. The language definition can declare some operators
commute or associate and some do not; that if a program's evaluation
changes when converted into what the language definition says is
equivalent, then it is the program that is at fault. Thus the
unambiguous parse trees can be transformed in a number of ways by the
compiler.


Chris F Clark

Jul 28, 2006, 6:43:08 PM
SM Ryan wrote:

> > Only poorly designed programming languages are ambiguous. (Natural
> > languages are ambiguous and don't use deterministic language theory.)

Hans-Peter Diettrich <DrDiet...@aol.com> (Dodi) writes:
> Counterexample:
> The argument list of a subroutine is nothing but a list, which can be
> expressed using either left or right recursion in a grammar.

Or with iteration (e.g. one of the closure operators, such as + and
*). I think Dodi's point here is well-taken. The notation in a
grammar should make the problem being solved clearer and not introduce
any extraneous details. If there is no specific associativity in the
language, I would prefer not to introduce artificial associativity
into the grammar. (It's pretty ovbvious that SM Ryan disagrees in
that he doesn't think a grammar shoulod be ambiguous, and he makes
that point several times in his posting. Whether he would consider
the rest of what I said implying accepting ambiguity, I will have to
wait for his judgement.)

To my mind, one specifically chooses to write:

expr: expr "+" term
| term;

or:

right "+", "-";
right "*", "/";
left "**";
expr: expr "+" expr;

or:

expr: term ("+" term)*

depending on what one intends to say.

I would write explicit precedence if I had a specific parsing problem
and I wanted the fact that I was expressing that problem explicitly in
the grammar to emphasize the fact.

I would write ambiguous rules with precedence and associativity
declarations if I wanted to emphasize the fact that the items were
simply expressions and that the precedence and associativity were
associated with the operators.

I would write regular expressions if I wanted to emphasize the fact
that the item corresponds to a "list" and the operators have no
precedence at all and are merely separators between the items of
interest. In fact, we have a pair of binary closure operators &* and
&+ that we specifically support in Yacc++ to make that even more
clear.

expr: term &* "+"; // same as above, but more terse

Dodi again:


> Perhaps I misused the term "ambiguous" here, when I meant that different
> parse trees can be constructed for a sentence of a language?

That is the definition of ambiguous. An ambiguous grammar admits two
or more parses (i.e. parse trees) for a given sentence. An unambiguous
grammar admits only one parse.

SM Ryan again:


> > Many programming language grammars are ambiguous because the people
> > writing the grammars are lazy and/or uneducated in these matters. The
> > dangling else nonproblem was solved some 40 years ago. Anyone who
> > thinks this is still a problem is just too lazy to write a well known
> > solution.

However, I don't believe there is an LL grammar (in the strict
technical sense) that solves the dangling else problem. One can only
solve the dangling else problem with a "hack" to the LL solution.
Moreover, if one wants to do that, the "best" grammar to give to the
LL generator (that supports the hack) is the ambiguous one.

Note, that I have nothing against that "hack". It is hack in a good
sense of the word (i.e. an elegant solution that takes a non-obvious
short-cut that correctly handles the cases of interest).

Dodi again:


> I'm not sure what you want to tell me. AFAIR LR(k) (languages?
> grammars?) can be transformed into LR(1), for which a finite state
> machine can be constructed easily. I assume that a transformation from
> LL(k) into LL(1) would be possible as well, using essentially the same
> transformation algorithms.

Unfortunately not true. Moreover, I don't believe there is an
algorithm for transforming an LR(k) grammar into an LR(1) grammar,
just a (non-constructive) proof that such a grammar exists. Thus, if
you have an LR(k) grammar and only an LR(1) tool, you are also
out-of-luck.

> My point is that table driven parsers are unreadable, due to the lost
> relationship between the parser code and the implemented language or
> grammar. Do there exist LR parsers or parser generators at all, which
> preserve the relationship between the parser code and the grammar?

Table driven parsers (at least LR ones) are more unreadable than
recursive descent ones in the same sense that the drawing (or a table
describing the drawing) of a DFA is less readable than the
corresponding regular expression. I think there are some cognitive
effects that allow a person's mind to ignore the parallelism in a
linear presentation (e.g. a recursive descent routine or a regular
expression) and to focus on the depth-first properties which are
brought to the surface in such representations. People do not seem to
think as well in breadth-first representations. I think this has to do
with some anthropomorphising, where one inserts oneself into the
current location in the code as it progresses through one case. This
works the same way people often single-step through code in a
debugger.

That being said, one can learn to deal quite effectively with the DFAs
of an LR machine. Moreover, I have found that the single-step approach
often blinds one to things happening in other paths that one "should"
be considering. In fact, one of my biggest complaints about
hand-generated recursive descent is that it lends itself to ad-hack
approaches (hack used pejoratively here), where one twiddles with
local parts of the parser to get it to accept the case one is
interested in, without considering whether one has a consistent
grammar (i.e. that the same code works in another context).

Hope this doesn't confuse the issues any more,
-Chris

*****************************************************************************
Chris Clark Internet : com...@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)

------------------------------------------------------------------------------

Chris F Clark

Jul 28, 2006, 6:43:40 PM
Max Hailperin <m...@gustavus.edu> writes:

> You seem to be conceding too much here. The essential difference is
> in the parsers, not just the parser generators. Recall (as I
> explained in http://compilers.iecc.com/comparch/article/06-04-136 )
> that an LR parser is a shift/reduce parser whereas an LL parser is a
> confirm/expand parser.

...

Point well-taken and well-made. Moreover, anyone who wants to
understand LL and LR parsing would be well advised to read Max's
article that he has kindly pointed us to. To my mind, it was a clear
and lucid explanation of the principles behind how the parsing
algorithms work that can help one have an informed intuition.

<In fact, John, if it isn't referenced in the FAQ, I would recommend
adding a pointer to it in the FAQ.>

-Chris


------------------------------------------------------------------------------

Chris F Clark

unread,
Jul 28, 2006, 6:44:36 PM7/28/06
to
Hans-Peter (Dodi) wrote:
> > My point is that table driven parsers are unreadable, due to the lost
> > relationship between the parser code and the implemented language or
> > grammar. Do there exist LR parsers or parser generators at all, which
> > preserve the relationship between the parser code and the grammar?

"Arthur J. O'Dwyer" <ajon...@andrew.cmu.edu> writes:
> Again, you seem to confuse human-centered subjective terms like
> "preserve the relationship" with measurable terms. Obviously there is
> /some/ kind of special relationship between a yacc-generated parser
> and its target language --- something that makes it parse C and not
> Java or pig Latin. You're complaining that the relationship isn't
> close enough to the surface for humans to see --- but why is that
> important?
> After all, the point of table-driven parsers is that humans never
> /need/ to see the generated code. It "just works."

Arthur, if that were only true!!! When it is true, life is wonderful.
When it isn't true, because your parser generator has given you a
conflict message and one needs to understand why your grammar is
incorrect and what you should do about it, life is painful. In that
case, one really does need to be able to read the tables and
understand them. For many it seems, that task is akin to visiting the
dentist. For that reason alone, I don't think recursive descent will
ever die (and nor should it, at least not in the machine generated
version).

Note, this in fact gives me mixed feelings (because I've worked hard
to make Yacc++ accept a wide class of grammars, wider than traditional
LALR(1)). In general, the more powerful the parsing method, the more
difficult it is to fix problems when the parser is misbehaving. (I'm
reminded of a quote where Mr. Scott (of Star Trek) is standing with 4
ball-bearings in hand looking at an incapacitated starship with
trans-warp drive, saying roughly, "the more complicated the plumbing,
the easier it is to break".) My feeling is that when GLR parsers
break, they are harder to fix than simple LR ones, just as LR ones are
harder to fix than LL ones. Moreover, I'm more certain that it is
harder to "know" that a GLR parser is broken than an LR one, at least
if by broken one means unanticipated ambiguity, because the whole
point of GLR parsing is to allow certain ambiguities, and I think that
it would be hard to prove that one has only allowed the specific
allowed ambiguities.

Matthias Blume

unread,
Jul 28, 2006, 6:44:53 PM7/28/06
to
Hans-Peter Diettrich <DrDiet...@aol.com> writes:

>> Point 2 is incorrectly stated, at the least. You seem to have meant
>> not that the languages are ambiguous, but rather than particular
>> grammars that are commonly used for those languages are ambiguous.
>
> As already stated in a parallel answer, a language can allow for
> multiple different parse trees of the same (valid) input, which has been
> considered "ambiguous" by several contributors. Can you suggest a
> better wording?

A language in the technical sense is just a set of strings. There is
no notion of a parse tree until you provide a grammar. However, a
language can be "ambiguous" in the sense that there are no
non-ambiguous grammars for it (while there are ambiguous ones). Such
languages are called "inherently ambiguous".
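
Matthias's definition can be made concrete with the textbook example
L = {a^i b^j c^k : i = j or j = k}, which is known to be inherently
ambiguous (the usual proof shows that, in any grammar for L, infinitely
many of the strings a^n b^n c^n must have two parses). A minimal
membership-check sketch in Python (the function name is mine):

```python
import re

def in_L(s):
    """Membership test for L = { a^i b^j c^k : i == j or j == k },
    the textbook inherently ambiguous context-free language."""
    m = re.fullmatch(r"(a*)(b*)(c*)", s)
    if not m:
        return False
    i, j, k = (len(g) for g in m.groups())
    return i == j or j == k

print(in_L("aabbc"), in_L("aabcc"))  # True False
```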

Matthias

Hans-Peter Diettrich

unread,
Jul 28, 2006, 6:51:40 PM7/28/06
to
Dmitry A. Kazakov schrieb:

> Do you mean that association should be handled after parsing?

If no such restrictions exist, many things can be done after
parsing. All optimizations which a compiler for a certain language is
allowed to apply obviously must be made after parsing. Such
transformations must respect the semantics of the language, but not
its syntax, which is gone once parsing is done.

[...]


> But this is also "built in grammar" to me.

When a strict formal description is given for a language, e.g. in the
form of a grammar, these details of course must be reflected in any
grammar for that language. Otherwise a grammar need only conform to the
informal description of the language.

DoDi

Hans-Peter Diettrich

unread,
Jul 28, 2006, 6:54:20 PM7/28/06
to
Hans-Peter Diettrich schrieb:

> [I see you've never tried typing pascal yacc into Google. John]

A few months ago none of the free tools was in a really usable state. I
just looked again, and found none but the well-known tply, dyacclex and
dcg, whose development was abandoned many years ago.

Did I miss something?

DoDi
[I see something called pyacc that comes with Debian. -John]

Hans-Peter Diettrich

unread,
Jul 28, 2006, 6:54:54 PM7/28/06
to
Arthur J. O'Dwyer schrieb:

>>> # 2. Most languages are ambiguous at the syntax level, so that implicit
>>> # disambiguation (longest match...) or explicit semantical constraints
>>> # must be introduced. (see: dangling else...).
>>>
>>> Only poorly designed programming languages are ambiguous. (Natural
>>> languages are ambiguous and don't use deterministic language theory.)
>> Counterexample:
>> The argument list of a subroutine is nothing but a list, which can be
>> expressed using either left or right recursion in a grammar.
>
> That's a true statement, but I don't see what it's a "counterexample"
> to.

With regards to "poorly designed programming languages".

After all I agree that a language with a stricter definition leaves
less room for different opinions about the implementation. Okay?


> In the same way, it's possible to create many different "parse trees"
> for a C-language "sentence", indicated here with indentation levels:
>
> if (x) if (x) if (x)
> if (y) if (y) if (y)
> foo(); foo(); foo();
> else else else
> bar(); bar(); bar();
>
> Only the leftmost parse tree is useful in understanding the semantics
> of the C program; therefore, we call it "correct", and design our
> grammars to create this parse tree in preference to the other two.

How will you determine, in this special case or in general, which
grammars or parse trees are "correct"?


> Those are my main points. Following are nitpicks.
>
>> My point is that table driven parsers are unreadable, due to the lost
>> relationship between the parser code and the implemented language or
>> grammar. Do there exist LR parsers or parser generators at all, which
>> preserve the relationship between the parser code and the grammar?
>
> Again, you seem to confuse human-centered subjective terms like
> "preserve the relationship" with measurable terms. Obviously there is
> /some/ kind of special relationship between a yacc-generated parser
> and its target language --- something that makes it parse C and not
> Java or pig Latin. You're complaining that the relationship isn't
> close enough to the surface for humans to see --- but why is that
> important?

Consider that typically you want to add actions to a grammar. Then you
must know about the context (stack contents...) in which your code will
execute.

Finally you may want to debug your grammar and your code. In the case of
a recursive descent parser, the call stack contains much information
about the parser state, whereas debugging the state transitions of an
automaton requires additional debugging features.


> After all, the point of table-driven parsers is that humans never
> /need/ to see the generated code. It "just works."

I found some yacc clones, which all just do *not* work. Now tell me how
somebody should figure out what makes these tools misbehave :-(

DoDi

SM Ryan

unread,
Jul 29, 2006, 2:05:03 PM7/29/06
to
Hans-Peter Diettrich <DrDiet...@aol.com> wrote:
# Arthur J. O'Dwyer schrieb:
#
# >>> # 2. Most languages are ambiguous at the syntax level, so that implicit
# >>> # disambiguation (longest match...) or explicit semantical constraints
# >>> # must be introduced. (see: dangling else...).
# >>>
# >>> Only poorly designed programming languages are ambiguous. (Natural
# >>> languages are ambiguous and don't use deterministic language theory.)
# >> Counterexample:
# >> The argument list of a subroutine is nothing but a list, which can be
# >> expressed using either left or right recursion in a grammar.
# >
# > That's a true statement, but I don't see what it's a "counterexample"
# > to.
#
# With regards to "poorly designed programming languages".
#
# After all I agree, that a language with a stricter definition leaves
# less room for different opinions about the implementation. Okay?

If the grammar is ambiguous then you don't have a guarantee what a
construct means. The compiler is allowed to translate your program
into different object programs. In a few rare cases, the different
object programs produce the same results; in these cases the ambiguity
is acceptable.

If your expression syntax is ambiguous, 3-2-1 has two very different
meanings, (3-2)-1 and 3-(2-1), which produce very different results.
Because it is trivial to avoid this kind of ambiguity, there is no good
reason to make the syntax ambiguous. (In fact, people who write this
kind of doofus syntax are not intending to write ambiguous code: rather
they are depending on bison to straighten things out for them, thus
locking themselves into one out-of-date tool, becoming lazy in the
process.)
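
The two readings of 3-2-1 can be made concrete by folding the operand
list both ways; a Python sketch (not the output of any particular
parser generator):

```python
def left_assoc(operands):
    # (3-2)-1: fold from the left, the conventional reading of "-"
    acc = operands[0]
    for x in operands[1:]:
        acc = acc - x
    return acc

def right_assoc(operands):
    # 3-(2-1): fold from the right
    acc = operands[-1]
    for x in reversed(operands[:-1]):
        acc = x - acc
    return acc

print(left_assoc([3, 2, 1]), right_assoc([3, 2, 1]))  # 0 2
```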

# > In the same way, it's possible to create many different "parse trees"
# > for a C-language "sentence", indicated here with indentation levels:
# >
# > if (x) if (x) if (x)
# > if (y) if (y) if (y)
# > foo(); foo(); foo();
# > else else else
# > bar(); bar(); bar();
# >
# > Only the leftmost parse tree is useful in understanding the semantics
# > of the C program; therefore, we call it "correct", and design our
# > grammars to create this parse tree in preference to the other two.
#
# How will you determine, in this special case or in general, which
# grammars or parse trees are "correct"?

This issue was solved a long time ago. Apparently the solution
has been forgotten.

<statement> ::= <closed statement> | <open statement>
<closed statement> ::= <assignment> | <block> | <closed if> | <closed while>
<open statement> ::= <open if> | <open while>
<open if> ::=
if <predicate> then <statement>
| if <predicate> then <closed statement> else <open statement>
<open while> ::= while <predicate> do <open statement>
<closed if> ::= if <predicate> then <closed statement> else <closed statement>
<closed while> ::= while <predicate> do <closed statement>
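
A recursive-descent sketch of how a parser realizes this resolution:
the version below collapses the open/closed split into the equivalent
greedy rule "attach each else to the nearest unmatched if". The token
names ("a" for <assignment>, "p" for <predicate>) are simplifications
of mine, and <block> and the while forms are omitted:

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        assert self.peek() == tok, f"expected {tok!r}, got {self.peek()!r}"
        self.pos += 1
        return tok

    def statement(self):
        if self.peek() == "if":
            self.eat("if"); self.eat("p"); self.eat("then")
            then_part = self.statement()
            if self.peek() == "else":      # greedily claim the else:
                self.eat("else")           # it binds to this (nearest) if
                return ("if", "p", then_part, self.statement())
            return ("if", "p", then_part)
        return self.eat("a")               # <assignment>

# if p then if p then a else a
tree = Parser(["if", "p", "then", "if", "p", "then", "a", "else", "a"]).statement()
# the inner if gets the else: ('if', 'p', ('if', 'p', 'a', 'a'))
```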


Arthur J. O'Dwyer

unread,
Jul 29, 2006, 2:06:22 PM7/29/06
to
On Fri, 28 Jul 2006, Hans-Peter Diettrich wrote:
> Arthur J. O'Dwyer schrieb:
>
>>>> # 2. Most languages are ambiguous at the syntax level, so that implicit
>>>> # disambiguation (longest match...) or explicit semantical constraints
>>>> # must be introduced. (see: dangling else...).
>>>>
>>>> Only poorly designed programming languages are ambiguous. (Natural
>>>> languages are ambiguous and don't use deterministic language theory.)
>>> Counterexample:
>>> The argument list of a subroutine is nothing but a list, which can be
>>> expressed using either left or right recursion in a grammar.
>>
>> That's a true statement, but I don't see what it's a "counterexample"
>> to.
>
> With regards to "poorly designed programming languages".

(This has been better addressed elsewhere in this thread; but when
(SM Ryan?) said "Only poorly designed programming languages are
ambiguous," he was probably talking about /inherently ambiguous/
languages. Comma-separated lists can be parsed by at least two different
/unambiguous/ grammars, so they're not an example of an inherently
ambiguous language.)

> After all I agree, that a language with a stricter definition leaves
> less room for different opinions about the implementation. Okay?

Yes, I think we agree there. :)

[...]


>> In the same way, it's possible to create many different "parse trees"
>> for a C-language "sentence", indicated here with indentation levels:
>>
>> if (x) if (x) if (x)
>> if (y) if (y) if (y)
>> foo(); foo(); foo();
>> else else else
>> bar(); bar(); bar();
>>
>> Only the leftmost parse tree is useful in understanding the semantics
>> of the C program; therefore, we call it "correct", and design our
>> grammars to create this parse tree in preference to the other two.
>
> How will you determine, in this special case or in general, which
> grammars or parse trees are "correct"?

The "correct" parse tree is the one that helps you solve your problem.
If you're compiling C, then I think it's (subjectively) obvious that the
leftmost parse tree is best suited for that task, and so you'd use a
grammar that was fleshed out enough to produce that kind of tree.
If you were just counting the number of tokens in the file, you might
use a "dumber" grammar that only knew about how to parse integer and
string literals, splice backslashes, and so on.
If you were just counting the number of commas in the file, you would
hardly need a grammar at all.

My point is, and was, that you were complaining about languages
admitting of "different parse trees" or "different grammars" as if that
was unexpected and/or a bad thing. It's not.


[then, on debugging yacc-alike parser generators...]


>> Again, you seem to confuse human-centered subjective terms like
>> "preserve the relationship" with measurable terms. Obviously there is
>> /some/ kind of special relationship between a yacc-generated parser
>> and its target language --- something that makes it parse C and not
>> Java or pig Latin. You're complaining that the relationship isn't
>> close enough to the surface for humans to see --- but why is that
>> important?
>
> Consider that typically you want to add actions to a grammar. Then you
> must know about the context (stack contents...), in which your code will
> execute.
>
> Finally you may want to debug your grammar and your code. In case of an
> recursive descent parser, the call stack contains much information about
> the parser state, whereas debugging the state transitions of an
> automaton requires additional debugging features.

...Which the tool should provide. (If it doesn't, then you have a bad
tool. Which I guess brings us back to "why are yacc-alikes so hard to
use?", which is code for "why are yacc-alikes targeted at system hackers
instead of normal people?", which practically answers itself: hackers
wrote yacc in the first place, and yacc is mainly used by hardcore
programmers who are going to spend a lot of time debugging anyway. :)

>> After all, the point of table-driven parsers is that humans never
>> /need/ to see the generated code. It "just works."
>
> I found some yacc clones, which all just do *not* work. Now tell me how
> somebody should figure out, what makes these tools misbehave :-(

Ah, well, there you have a problem with the /tool/. That's the
tool-maker's fault, not yours, and you shouldn't concern yourself with
it. ;)

I think I see where you're coming from, but the complaint in this
half of this subthread seems to be "yacc is hard," which I already agreed
with. When you say "yacc is hard because my code has bugs and yacc won't
help me find them," I find myself unsympathetic... maybe because I haven't
spent enough time debugging my own code yet! And when you say "yacc is
hard because /its/ code is full of bugs," I'm sympathetic, but don't see
how that's indicative of anything except laziness on /their/ part. (And
you're still better off than if the tools didn't exist at all, right?)

</software-engineering metadiscussion>

-Arthur

Hans-Peter Diettrich

unread,
Jul 29, 2006, 7:37:26 PM7/29/06
to
Arthur J. O'Dwyer schrieb:

> [then, on debugging yacc-alike parser generators...]
[...]


> Which I guess brings us back to "why are yacc-alikes so hard to
> use?", which is code for "why are yacc-alikes targeted at system hackers
> instead of normal people?", which practically answers itself: hackers
> wrote yacc in the first place, and yacc is mainly used by hardcore
> programmers who are going to spend a lot of time debugging anyway. :)

You mean I should simply use a different tool?
Sounds good, but can you recommend one that can be used to create a
parser in e.g. Pascal?

I don't really insist on a tool written in Pascal, but it should at
least be able to output the parser tables and code (including the
semantic actions) in any appropriate language.

My dream would be a tool which reads in a grammar, possibly in
precompiled form, and then interprets some input according to that
grammar. It would be sufficient, in the first place, if the engine
created a parse tree, which could then be inspected and evaluated
without language restrictions. Furthermore it would be nice to have
hooks, so that parts of the final parse tree could be processed, and
possibly discarded, as appropriate for the actual application.

> (And you're still better off than if the tools didn't exist at all,
right?)

Well, yes and no. Yes, I'm happy if I don't have to write a parser
generator myself, but no, I'm not really happy with a possibly
unreliable tool. We spend so much time discussing and constructing
"safe" (unambiguous...) grammars, and in the next step we should rely on
unsafe tools?

> </software-engineering metadiscussion>

DoDi

Hans-Peter Diettrich

unread,
Jul 29, 2006, 7:33:38 PM7/29/06
to
Hans-Peter Diettrich schrieb:

> A few months ago none of the free tools was in a really usable state. I
> just looked again, and found none but the well known tply, dyacclex and
> dcg, whose development has been abandoned many years ago.
>
> Did I miss something?

> [I see something called pyacc that comes with Debian. -John]

IMO pyacc is an (earlier or later?) version of tply, which also is part
of the fpc distribution.

Perhaps a common problem of these tools is their incompatibility with
newer yacc (grammar) versions, and there might be problems in
removing the C code from "original" yacc grammars.

Can somebody give me hints about required or recommended procedures, to
remove version and language specific parts from yacc grammars?

DoDi

Hans-Peter Diettrich

unread,
Jul 29, 2006, 7:36:44 PM7/29/06
to
SM Ryan schrieb:

> If the grammar is ambiguous then you don't have a guarentee what a
> construct means. The compiler is allowed to translate your program
> into different object programs.

I agree that there should be no ambiguities with regard to code
generation. But a programming language covers more than computational stuff.


> # How will you determine, in this special case or in general, which
> # grammars or parse trees are "correct"?
>
> This issue was solved a long time ago. Apparently the solution
> has been forgotten.

I didn't ask for a solution, but rather for the specification that
allows one to determine the "correct" procedure, parse tree, grammar, or
whatever.

BTW, do you have a corresponding general solution for "+++"?


> <statement> ::= <closed statement> | <open statement>
> <closed statement> ::= <assignment> | <block> | <closed if> | <closed while>
> <open statement> ::= <open if> | <open while>

I assume that further statements will typically go into <closed
statement>. Can you give a rule for what classifies statements as open,
closed, or possibly both? Without a general rule it will be difficult
to extend your solution for use in a different language. I can only
guess that every statement ending in another <statement> has to be
split into an open and a closed form. But what about a possible
embedded <statement>?

As already mentioned with regard to the Java grammar, I think that a
grammar with unrolled implied rules is a maintenance nightmare. How
should any new kind of statement be added correctly, if the reason and
the criteria for the splitting into open and closed statements are not
documented explicitly, but instead are only built into an existing grammar?

That's why I would prefer an explicit syntactic termination of an
if-statement, by e.g. an "endif" or "fi" terminal, over your solution.
This would IMO eliminate the dangling else problem, without the risk of
collisions with other rules, in the current or any future extended grammar.

DoDi
[Every useful programming language I know is defined with a
combination of formal and informal language. Both the C and C++
standards say that the languages are tokenized using the longest
prefix, so +++ means ++ + and any compiler that treats it otherwise
isn't a C or C++ compiler. You can have a separate argument about
whether it's a good idea to define languages that make +++ a synonym
for ++ +. -John]
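
John's maximal-munch rule can be sketched in a few lines; a toy lexer
over "+" operators only (not a real C tokenizer):

```python
def lex_pluses(s):
    """At each position take the longest operator that matches,
    as the C/C++ standards require ("maximal munch")."""
    ops = ["++", "+"]          # try the longest operator first
    toks, i = [], 0
    while i < len(s):
        for op in ops:
            if s.startswith(op, i):
                toks.append(op)
                i += len(op)
                break
        else:
            raise ValueError(f"unexpected character at {i}")
    return toks

print(lex_pluses("+++"))  # ['++', '+'], i.e. "++ +"
```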

Chris F Clark

unread,
Jul 29, 2006, 11:55:26 PM7/29/06
to
Hans-Peter Diettrich <DrDiet...@aol.com> writes:

> My dream were a tool which reads in a grammar, possibly in precompiled
> form, and then interprets some input according to that grammar. It would
> be sufficient, in the first place, if the engine would create an parse
> tree, which then could be inspected and evaluated without language
> restrictions. Furthermore it were nice to have hooks, so that parts of
> the final parse tree can be processed, and possibly discarded, as
> appropriate for the actual application.

I think something pretty close to your dream exists and is called
GrammarForge (and was previously called meta-S) written by Quinn Tyler
Jackson. His goal is to have an engine where one can do everything at
the grammar level. Moreover, I think he borrowed the Visual-Parse++
IDE as a basis and has a pretty good interpretive solution.

There are other tools which approach your ideal from different angles,
such as CodeWorker, Harmony ISTB, and the DMS toolkit.

> Arthur J. O'Dwyer schrieb:

> > [then, on debugging yacc-alike parser generators...]
> [...]
> > Which I guess brings us back to "why are yacc-alikes so hard to
> > use?", which is code for "why are yacc-alikes targeted at system hackers
> > instead of normal people?", which practically answers itself: hackers
> > wrote yacc in the first place, and yacc is mainly used by hardcore
> > programmers who are going to spend a lot of time debugging anyway. :)
>
> You mean, I simply should use a different tool?

...


> > (And you're still better off than if the tools didn't exist at all,
> > right?)
>
> Well, yes and no. Yes, I'm happy if I don't have to write an parser
> generator myself, but no, I'm not really happy with a possibly
> unreliable tool. We spend so much time in discussing and constructing
> "safe" (unambiguous...) grammars, and in the next step we should rely on
> unsafe tools?
>
> > </software-engineering metadiscussion>

Not all yacc-alikes are targeted at super-human programmers. Still,
on the one hand it is pretty easy to write a yacc-alike with one or
two bells-and-whistles as a student project. As far as I can tell,
many of the yacc-alikes are simply that. I think it is even easier to
write a recursive descent parser generator, and so there are even more
of them around as student projects.

On the other hand, having put about 15 years' worth of effort into
Yacc++ and still not being happy with the results in all
dimensions -- correction: hardly being happy with the results in any
dimension -- I think I can say that it isn't only laziness that makes
the yacc-alike tools hard to use.

I think the sad truth is after doing the easy part, e.g. implementing
what one can read in the Dragon book or Holub, the terrain gets steep
fairly quickly. Thus, most compiler-compilers get the easy part done,
but leave one struggling with the more subtle points.

However, I agree with Dodi on the point, one shouldn't be stuck with a
poorly debugged tool and no avenue of support.

Hans-Peter Diettrich

unread,
Jul 31, 2006, 2:39:18 AM7/31/06
to
Chris F Clark schrieb:

> I think something pretty close to your dream exists and is called
> GrammarForge (and was previously called meta-S) written by Quinn Tyler
> Jackson. His goal is to have an engine where one can do everything at
> the grammar level. Moreover, I think he borrowed the Visual-Parse++
> IDE as a basis and has a pretty good interpretive solution.

I've participated in the beta test of Meta-S, and this may have
influenced my opinion about useful parser generators. It's really a
great tool, with only few limitations :-)

> There are other tools which approach your ideal from different angles,
> such as CodeWorker, Harmony ISTB, and the DMS toolkit.

Thanks for the links. I understand that most of the mentioned tools have
their price, based on the efforts for development or support, which
unfortunately often is too high for nonprofessional use.

DoDi

Quinn Tyler Jackson

unread,
Jul 31, 2006, 11:33:36 PM7/31/06
to
Chris Clark:


> > I think something pretty close to your dream exists and is called
> > GrammarForge (and was previously called meta-S) written by Quinn Tyler
> > Jackson. His goal is to have an engine where one can do everything at
> > the grammar level.

...

DoDi:


> I've participated in the beta test of Meta-S, and this may have
> influenced my opinion about useful parser generators. It's really a
> great tool, with only few limitations :-)

As I mentioned previously, The Grammar Forge, and all related IP (the Meta-S
parsing engine) were acquired by Thothic Technology Partners some time ago.
I am a principal in TTP, and am the Chief Scientist, but no longer
personally own the technology.

Yes, I believe everything should be do-able at the grammar level. My general
rule of thumb has been based on Miller's 7 +/- 2 theory that human beings
can grok things with 7 +/- 2 items to keep in mind. So, if some semantic
construction can be parsed in 7 +/- 2 productions, I'll do my best to put it
in the grammar, rather than "fake it" in the code. I consider a grammar
broken (until proven otherwise) the moment it *must* call code in a
reduction to know whether or not it should accept something as legal. Call
me old-fashioned. ;-) To date, I have encountered no decidable language that
cannot be expressed entirely in a $-grammar (and that without becoming
opaque*). (I've also found none that take longer than O(n^m) where m < 3 to
parse input, now that so many post-acquisition optimizations and additions
have been made to the engine.)

--
Quinn Tyler Jackson
http://www.thothic.com

* Opaqueness is in the eye of the beholder -- I use Miller's heuristic to
define opaqueness. YMMV. Nothing fun is trivially obvious. ;-)

SLK Parsers

unread,
Jul 31, 2006, 11:34:28 PM7/31/06
to
>As already mentioned with regards to the Java grammar, I think that a
>grammar with unrolled implied rules is a maintenance nightmare. How
>should any new kind of statement be added correctly, if the reason and
>the criteria for the splitting into open and closed statements is not
>documented explicitly, but instead is only built into an existing grammar?

This grammar solution is just an exercise of theoretical
interest. Yacc and most other tools can handle the ambiguous grammar
correctly.

>That's why I would prefer an explicit syntactical termination of an
>if-statement, by e.g. an "endif" or "fi" terminal, over your solution.
>This would IMO eliminate the dangling else problem, without the risk of
>collissions with other rules, in the current or any future extended
>grammar.

There is no need to modify the ambiguous grammar in any way to get a
correct parser. All that is needed is to choose one or the other of
the two possible parses. Yacc does this by preferring the shift over
the reduce. SLK does this by using the first production of the two
alternates as the parse table entry.

SM Ryan

unread,
Aug 1, 2006, 12:31:02 AM8/1/06
to
"SLK Parsers" <parse...@earthlink.net> wrote:
# >As already mentioned with regards to the Java grammar, I think that a
# >grammar with unrolled implied rules is a maintenance nightmare. How
# >should any new kind of statement be added correctly, if the reason and
# >the criteria for the splitting into open and closed statements is not
# >documented explicitly, but instead is only built into an existing grammar?

# This grammar solution is just an exercise of theoretical
# interest. Yacc and most other tools can handle the ambiguous grammar
# correctly.

Of course. Throw garbage at the parser generator and hope you
get an edible meal out of it.

# the reduce. SLK does this by using the first production of the two
# alternates as the parse table entry.

How would I guess which one that is?


Hans-Peter Diettrich

unread,
Aug 3, 2006, 11:02:17 AM8/3/06
to
Quinn Tyler Jackson schrieb:

> DoDi:
>> I've participated in the beta test of Meta-S, and this may have
>> influenced my opinion about useful parser generators. It's really a
>> great tool, with only few limitations :-)

> I consider a grammar


> broken (until proven otherwise) the moment it *must* call code in a
> reduction to know whether or not it should accept something as legal.

I'm just curious: could you make the C++ grammar work without external
disambiguation between typenames and other identifiers?

DoDi

Hans-Peter Diettrich

unread,
Aug 3, 2006, 11:02:26 AM8/3/06
to
SLK Parsers schrieb:

> There is no need to modify the ambiguous grammar in any way to get a
> correct parser. All that is needed is to choose one or the other of
> the two possible parses. Yacc does this by preferring the shift over
> the reduce. SLK does this by using the first production of the two
> alternates as the parse table entry.

Hmmm, how can you ever be sure that the solution for the dangling else
will always be correct, also in other ambiguous situations?

DoDi

SLK Parsers

unread,
Aug 3, 2006, 11:04:31 AM8/3/06
to
>How would I guess which one is that?

The technique is called "controlled ambiguity" in the Dragon book. You can
read about it there. I leave the LL solution as an exercise for the
interested reader, as if there are any.

Hint 1: We want to associate the ELSE with the closest IF, so use the
production that consumes the ELSE sooner.

Hint 2: There are only 2 possibilities, try them both and see which one
works.

The grammar solution is quite clever. If it were not so redundant, and
solved a problem that actually exists, it would be useful.

Forget about yacc for a minute. Assume you were creating the parse table by
hand. When you got to the shift-reduce conflict, you would say "Oh, I
remember this one from my first year compiler course. I will use the shift
as the entry in the parse table". This is what yacc does automatically.

The SLK parser generator: http://home.earthlink.net/~slkpg/

SLK Parsers

unread,
Aug 4, 2006, 4:39:36 PM8/4/06
to
>Hmmm, how can you ever be sure, that the solution for the dangling else
>always will be correct, also in other ambiguous situations?

It is well-known to be correct, and is explained in compiler texts. It
is a very special case. In general, ambiguity is to be avoided unless
you know what you are doing, and probably even then. That said, I used
the technique seven times in a recent translator, and am confident
that it is correct. This was verified by use on about 1000 real
programs.

The SLK parser generator: http://home.earthlink.net/~slkpg/

[My rule of thumb has been that it's OK to use disambiguation for
if/then/else and operator precedence in expressions. Anywhere else
you're likely to get into trouble. -John]

Quinn Tyler Jackson

unread,
Aug 6, 2006, 12:40:14 PM8/6/06
to
I said:

> > I consider a grammar
> > broken (until proven otherwise) the moment it *must* call code in a
> > reduction to know whether or not it should accept something as legal.


DoDi asked:

> I'm just curious: could you make the C++ grammar work, so that it works
> without an external disambiguation between typenames and other
> identifiers?

The addition of named-index trie scoping provided a means to do that.
There were also some new constructions, described in Adapting to Babel,
that allow for delayed and controlled reparsing of substrings when a
scope is left, which make some of that more possible.

The grammar-only-C++ project was a proof-of-concept more than anything
else: it showed that ad hockery could be avoided while still correctly
parsing C++ at the grammar level within near linear time in the
average case. While ideologically possible to do, and possible to do
without becoming *too* opaque, with a language such as C++, there is a
point where the exercise has proven itself and practicalities override
ideology. (Anyone *need* a new C++ parser?)

That said, the two grammars I find most graceful* are for the languages:

L = {a^n b^C(n) | C(n) is the Catalan number of n}
L = {a^n b^phi(n) | phi(n) is Euler's totient of n}

That these parse in less than O(n^2), and that the Euler's totient
$-grammar has only 4 productions, have stolen my attention from the
C++ $-grammar. ;-)
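
For readers unfamiliar with the counting functions, a small Python
sketch (the membership checker and its name are mine; it only computes
the functions and says nothing about the $-grammars themselves):

```python
from math import comb, gcd

def catalan(n):
    # C(n) = (2n choose n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

def totient(n):
    # Euler's phi(n): count of 1..n coprime to n
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

def in_catalan_lang(s):
    """Membership in L = { a^n b^C(n) }."""
    n = len(s) - len(s.lstrip("a"))
    bs = s[n:]
    return set(bs) <= {"b"} and len(bs) == catalan(n)

print(catalan(3), totient(10))       # 5 4
print(in_catalan_lang("aaabbbbb"))   # True: C(3) = 5
```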

--
Quinn Tyler Jackson

http://www.thothic.com

* Well, these two and the linear-time p-knot-finding $-grammar.
