Possibility of an LALR(k) GOLD?


Nick Sabalausky

Apr 16, 2011, 10:18:12 PM
to GOLD Parsing System
While *technically* LR(k) grammars can be converted to LR(1) grammars,
not only is the algorithm to do so difficult to find, but from what I
gather, the resulting grammar is very complex and bears little to no
resemblance to the structure of the original LR(k) version. And thus,
doing so is of little practical real-world use (unless all you want to
do is check whether a source is *valid* for a given language, rather
than actually doing anything meaningful with it.)

Are there any plans for GOLD to gain any LALR(k) ability? (Possibly by
adding a form of backtracking). Any possibility of it happening at
some point?

A little bit more info is here (granted, the "actions" argument isn't
relevant to GOLD, but the argument that an LR(k)->LR(1) conversion
obfuscates the grammar definitely is):

http://www.antlr.org/article/needlook.html

Nick Sabalausky

Apr 16, 2011, 10:22:50 PM
to GOLD Parsing System
On Apr 16, 10:18 pm, Nick Sabalausky <nsetbtg...@mailinator.com>
wrote:
I also meant to point out that this matters for a lot of important
grammars, such as C. For instance, trying to adjust the ANSI C grammar
to accept variable declarations like "MyCustomType var;" (which is
necessary because of typedef) leads to endless ambiguity problems
under LALR(1).

Dave Dolan

Apr 16, 2011, 10:35:51 PM
to gold-pars...@googlegroups.com
The parser need not distinguish between a typedef and a vanilla
identifier. That's a semantic idea... I know that Mr. Parr et al.'s
paper focuses on that, but really the parsing and the actions,
especially in GOLD's case, are distinct topics... The authors (Mr.
Parr presumably being one such) obviously are a bit hung up on
combining the whole thing, as he spent 5 years producing ANTLR...

There is no reason that a parser should have to make or check a
symbol table until after a recognition has been confirmed... At least
in my humble opinion... I know there are simple ways to link the
processes together, as has been so recently demonstrated by Mr. von
Wyss's elegant implementation of the BSN engine, but even though the
declarative process he enables allows us to easily merge the two
phases, the recognition and the generation of the AST are separate
things. I think one of the goals of GOLD was to be portable at the
CGT level, and messing around with semantic details so early may start
to muddy those waters.

Please feel free to correct me if I'm wrong or stupid, I wouldn't doubt either.


--
---------------------------------------------------------------
Dave Dolan
http://davedolan.com/blog

Nick Sabalausky

Apr 17, 2011, 4:28:47 PM
to GOLD Parsing System
On Apr 16, 10:35 pm, Dave Dolan <dave.do...@gmail.com> wrote:
> The parser need not distinguish between a typedef and a vanilla
> identifier. That's a semantic idea... I know that Mr. Parr et al.'s
> paper focuses on that, but really the parsing and the actions,
> especially in GOLD's case, are distinct topics... The authors (Mr.
> Parr presumably being one such) obviously are a bit hung up on
> combining the whole thing, as he spent 5 years producing ANTLR...
>
> There is no reason that a parser should have to make or check a
> symbol table until after a recognition has been confirmed... At least
> in my humble opinion... I know there are simple ways to link the
> processes together, as has been so recently demonstrated by Mr. von
> Wyss's elegant implementation of the BSN engine, but even though the
> declarative process he enables allows us to easily merge the two
> phases, the recognition and the generation of the AST are separate
> things. I think one of the goals of GOLD was to be portable at the
> CGT level, and messing around with semantic details so early may start
> to muddy those waters.
>
> Please feel free to correct me if I'm wrong or stupid, I wouldn't doubt either.
>

(Disclaimer: I'm not exactly a parsing theory expert either ;) And
sorry for this being such an all-out essay...)

First of all, it turns out my example was a bad one. Suppose you come
across this input in a C-like language:

a*b

That could be either a multiplication of a and b or a declaration of b
as a pointer to type a, and the symbol table is (usually) the only way
to tell which. No amount of parser lookahead will get around that.
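
FWIW, the usual workaround in traditional C front ends (often called
the "lexer hack") is to let the tokenizer consult the symbol table
built up so far and hand the parser a different token kind for
typedef'd names - which is exactly the kind of semantic feedback we're
trying to keep out of the grammar. A made-up sketch in Python, just to
illustrate:

# The classic "lexer hack" (sketch with made-up names): the tokenizer
# consults the symbol table built up so far and emits a distinct
# TYPE_NAME token for typedef'd identifiers, so by the time the parser
# sees "a * b ;" the ambiguity is already gone.

typedef_names = set()        # filled in as typedef declarations are seen

def classify(identifier):
    return "TYPE_NAME" if identifier in typedef_names else "IDENTIFIER"

typedef_names.add("a")
print(classify("a"), classify("b"))
# TYPE_NAME IDENTIFIER  ->  "a * b ;" reads as "declare b as pointer to a";
# without the typedef, the same tokens read as the expression "a times b".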

I definitely agree with keeping semantics and actions out of the
grammar (the benefits that allows are what attracted me to GOLD in the
first place). And you're right: the brilliant BSN engine proves that
GOLD's approach can still keep most of the benefits usually associated
with embedded actions.

But, on the other hand, making a grammar a little bit more specific
can produce a much larger decrease in the amount of semantic processing
the user code needs to do. And I think GLR can allow for more specific
grammars (and more easily/naturally written grammars), *without*
requiring the grammar to be infected with semantics/actions [1].

For instance, suppose you have something like this:

! <A> is a string of zero or more X's and A's followed by '1'
! <B> is a string of zero or more X's and B's followed by '2'
"Start Symbol" = <S>

<S> ::= <A> | <B>

<A> ::= <XA> '1'
<B> ::= <XB> '2'

<XA> ::= <XA> 'X' | <XA> 'A' |
<XB> ::= <XB> 'X' | <XB> 'B' |

With LALR(1), that gives a reduce-reduce conflict (it doesn't know
whether to reduce "<XA> ::= {dot}" or "<XB> ::= {dot}"), even though
the language itself isn't ambiguous - the right choice depends on
whether the string ends in '1' or '2', which can be arbitrarily far
away, so no fixed amount of lookahead helps and not even LALR(k) can
do it [1]. But GLR can handle it fine by first reducing to <XA>, and
then if that leads to an error, backtracking and reducing to <XB>
instead [2].
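
To make that concrete, take the input "X 2". A backtracking parser
would guess "reduce the empty string to <XA>", shift the 'X', reduce
"<XA> ::= <XA> 'X'", and then hit an error because only 'X', 'A' or
'1' can follow at that point. So it rewinds to the bookmark, guesses
"reduce the empty string to <XB>" instead, shifts the 'X', reduces
"<XB> ::= <XB> 'X'", shifts the '2', reduces "<B> ::= <XB> '2'", and
accepts.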

Now yes, that simple example *can* be converted to LALR(1). But that
means more work for the grammar-writer, *and* it means losing
grammatical information that will just have to be shifted into the
user-code:

"Start Symbol" = <S>

<S> ::= <AB>

<AB> ::= <XAB> '1' | <XAB> '2'

<XAB> ::= <XAB> 'X' | <XAB> 'A' | <XAB> 'B' |

So now anyone who needs to use this grammar will have to go through
every <AB> produced and make sure that (as sketched below):

1. A's and B's aren't mixed.
2. A's and '2' aren't mixed.
3. B's and '1' aren't mixed.
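
Concretely, the user code ends up with a re-validation pass something
like this (a sketch in Python with a made-up node layout - assume the
terminals under each <AB> reduction can be flattened into a list like
['X', 'A', 'X', '1']):

def check_ab(terminals):
    """Re-validate what the merged grammar no longer guarantees."""
    letters = set(terminals[:-1]) - {"X"}    # everything before the final '1'/'2'
    ending = terminals[-1]
    if len(letters) > 1:
        return False                         # 1. A's and B's are mixed
    if "A" in letters and ending != "1":
        return False                         # 2. A's mixed with '2'
    if "B" in letters and ending != "2":
        return False                         # 3. B's mixed with '1'
    return True

print(check_ab(["X", "A", "X", "1"]))   # True
print(check_ab(["X", "A", "2"]))        # False (rule 2)
print(check_ab(["A", "B", "1"]))        # False (rule 1)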

And all of this is only because empty strings and X's happen to be
accepted by both <XA> and <XB>.

Suppose further, that <XA> and <XB> can be used in different, possibly
overlapping, ways. Not only is that even more to verify with the <XAB>
approach, but it can even *create* ambiguities that wouldn't have been
ambiguous if <XA> and <XB> were separate in the first place.

I've actually come across problems like this just trying to adjust the
ANSI C grammar listed on the GOLD site to accept "MyType var;" (which
I've just about given up on - I ended up *completely* conflating the
notions/syntaxes of types and expressions and I'm still ending up
with new ambiguities). Granted, GLR won't be able to solve
this completely, due to the "a*b" problem. But it could help, and it
would at least make trying to work around the "a*b" problem less of a
hair-pulling matter.

Of course, another thing that would help, and *would* provide a clean
solution to the "a*b" problem, would be to have some way to manually
specify some sort of disambiguation rule. The D language, for
instance, has a few things that *would* be ambiguous, except that the
spec dictates "if a section of code matches both pattern X and pattern
Y, then *always* consider it to be pattern X". That makes it
unambiguous, but GOLD currently has no way to specify such
disambiguation rules. Note that this would probably be difficult (if
even possible?) to do in LALR(1), but it could be done in GLR by
simply allowing the grammar writer to prioritize the choices at the
decision-point. (Although I'm not sure offhand *how* the user would
specify that. We certainly wouldn't want them hardcoding any LALR
state numbers into the grammar.) This would also provide a much easier
and more readable solution to the dangling-else problem than the usual
<Then Clause> awkwardness.
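
Just to illustrate the idea (a made-up sketch in Python - this is not
GOLD's table format, and all the names are invented): the resolution
could happen entirely while the tables are being built, so the runtime
algorithm never even sees the conflict:

# Hypothetical sketch: resolve a conflicted LALR cell from a
# user-declared preference at table-construction time, so the runtime
# parser always finds exactly one action per cell.

def resolve_cell(candidate_actions, preference):
    """candidate_actions: the conflicting actions for one (state, lookahead)
    cell; preference: user-declared ranking, lower number = higher priority."""
    return min(candidate_actions,
               key=lambda a: preference.get(a, float("inf")))

# Dangling else: prefer shifting 'else' over reducing the if-without-else
# production, which binds each 'else' to the nearest unmatched 'if'.
cell = ["reduce <If Stmt> ::= 'if' <Expr> 'then' <Stmt>", "shift 'else'"]
print(resolve_cell(cell, {"shift 'else'": 0}))     # -> "shift 'else'"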

One might be able to argue it's better to force the grammar writer to
keep the grammar LALR(1) to prevent time-consuming backtracking. But I
think that would often be a false economy, since the user's semantic
code would just have to take the extra time to perform the extra
validation itself.

[1] I know I said LALR(k) in my previous post, but that and GLR are
distinct. GLR is the more general of the two (it can handle any
context-free grammar), and it could still sit on top of LALR(1) - or
LALR(k) - tables, so the simpler, conflict-free situations would run
at full LALR speed.

[2] In general, the idea (AIUI) is to *allow* all shift/reduce and
reduce/reduce conflicts in the tables, and then, whenever an LALR
state with such a conflict is reached, set a bookmark, make a guess,
and if that guess later fails, backtrack to the bookmark and try
another of the choices (with some sort of cycle-detection to prevent
infinite loops). Strictly speaking that's backtracking LR rather than
true GLR, which carries all of the choices forward in parallel on a
graph-structured stack instead of rewinding - but either way, every
alternative eventually gets tried.
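
Just to make that bookmark-and-retry control flow concrete, here's a
tiny sketch (Python, with made-up names - it elides the real LALR
stack and tables, so it's nothing like GOLD's actual engine, just the
shape of the idea): the parser configuration is an immutable value, so
the configuration at the conflict point *is* the bookmark.

def recognize(config, tokens, step, apply, accepted):
    """Depth-first backtracking over conflicting parse actions.
    step() returns every action the (possibly conflicted) cell allows;
    apply() returns the next configuration, or None on a syntax error."""
    if accepted(config, tokens):
        return True
    for action in step(config, tokens):       # candidate actions, in priority order
        nxt = apply(config, action, tokens)   # `config` itself is the bookmark
        if nxt is not None and recognize(nxt, tokens, step, apply, accepted):
            return True
    return False                              # every guess failed: syntax error

# Toy demo on the <A>/<B> grammar from earlier in the thread: the only
# genuine decision is whether the leading X's belong to the <A> branch
# or the <B> branch (the reduce-reduce conflict).
def step(config, tokens):
    branch, pos = config
    if branch is None:
        return ["guess A", "guess B"]         # the conflicted cell: two choices
    return ["shift"] if pos < len(tokens) else []

def apply(config, action, tokens):
    branch, pos = config
    if action == "guess A":
        return ("A", pos)
    if action == "guess B":
        return ("B", pos)
    tok = tokens[pos]                         # action == "shift"
    letter, end = ("A", "1") if branch == "A" else ("B", "2")
    ok = tok in ("X", letter) or (tok == end and pos == len(tokens) - 1)
    return (branch, pos + 1) if ok else None  # None forces backtracking

def accepted(config, tokens):
    branch, pos = config
    return (branch is not None and pos == len(tokens) and pos > 0
            and tokens[-1] == ("1" if branch == "A" else "2"))

print(recognize((None, 0), "X X B 2".split(), step, apply, accepted))  # True
print(recognize((None, 0), "X A 2".split(), step, apply, accepted))    # False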


Arsène von Wyss

Apr 18, 2011, 4:53:03 AM
to GOLD Parsing System
On Apr 17, 10:28 pm, Nick Sabalausky <nsetbtg...@mailinator.com>
wrote:
The following grammar works just fine as far as I can tell:

"Start Symbol" = <S>

<S> ::= <A>
| <B>

<A> ::= <X> 'A' <A>
| 'A' <A>
| '1'

<B> ::= <X> 'B' <B>
| 'B' <B>
| '2'

<X> ::= 'X' <X>
| 'X'

Empty rules are often problematic regarding ambiguities, so I try to
avoid them whenever possible. Of course, the grammar I have written
here is not really as self-describing as the original one, but I think
it is still good enough to get the job done without creating a big
mess.

I had many similar issues to solve with the T-SQL grammar and I was
able to solve them all so far (except for the hanging else where I
chose to require a BEGIN END to be used on the first block since that
does make it unambiguous also for the reader).

While I agree that some backtracking functionality may be nice, I
think that those are compromises which can also lead to unthoughtfully
written grammars. I quite like the fact that one has to respect the
rules so that there is no backtracking required, which allows me to
make assumptions about the performance of the parsing process. That's
one of the problems which Perl-style regular expressions have: they
may backtrack a lot and you can easily create regexes which will
basically loop forever (not mathematically, but performing so many
iterations that they'll never end before the universe dies).
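
For example, a nested quantifier like the one below makes a
backtracking engine (Python's built-in re module here, just for
illustration) try exponentially many ways of splitting the input
before it can report that there is no match:

import re
import time

# Pathological pattern: the inner and outer '+' can split a run of 'a's
# in exponentially many ways, and a backtracking engine tries them all
# before concluding that the required trailing 'b' is missing.
pattern = re.compile(r"(a+)+b")

for n in (18, 20, 22, 24):
    text = "a" * n                      # no 'b', so the match must fail
    t0 = time.perf_counter()
    pattern.match(text)
    print(n, round(time.perf_counter() - t0, 3), "seconds")
# The time roughly doubles with each extra 'a'; a few dozen more
# characters and the loop would effectively never finish.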

I'd love to see the possibility to define which rule shall be chosen
in conflict situations though, since that's something which just needs
to be respected while analyzing the grammar and creating the tables;
it doesn't change anything in the algorithm after the table creation
is done.

- Arsène

Dave Dolan

Apr 18, 2011, 12:10:02 PM
to gold-pars...@googlegroups.com
On Mon, Apr 18, 2011 at 4:53 AM, Arsène von Wyss <avon...@gmail.com> wrote:
> I'd love to see the possibility to define which rule shall be chosen
> in conflict situations though, since that's something which just needs
> to be respected while analyzing the grammar and creating the tables;
> it doesn't change anything in the algorithm after the table creation
> is done.

Good point about the resolution specification. I think that any
improvements that are encapsulated in the tables themselves can be
fair game without upsetting the rest of us following along at home :)

Nick Sabalausky

Apr 18, 2011, 10:40:44 PM
to GOLD Parsing System
On Apr 18, 4:53 am, Arsène von Wyss <avonw...@gmail.com> wrote:
>
> The following grammar works just fine as far as I can tell:
>
> "Start Symbol" = <S>
>
> <S> ::= <A>
> | <B>
>
> <A> ::= <X> 'A' <A>
> | 'A' <A>
> | '1'
>
> <B> ::= <X> 'B' <B>
> | 'B' <B>
> | '2'
>
> <X> ::= 'X' <X>
> | 'X'
>

That fails on inputs that have X's but no 'A' or 'B', and more
generally on any input where an 'X' comes immediately before the '1'
or '2'. But this works:

"Start Symbol" = <S>

<S> ::= <A>
| <B>

<A> ::= 'X' <A>
| 'A' <A>
| '1'

<B> ::= 'X' <B>
| 'B' <B>
| '2'

And yeah, while this does work out in this simple case, in larger
real-world grammars these contortions can 1. result in
less-meaningfully-defined grammars, and 2. be *very* difficult to
figure out in the first place, because unlike LL, the effects of
changes in an LR grammar can be highly non-localized. So often the
*entire* grammar needs to be considered when changing one piece, which
can be very difficult to do, and also works against modularity.

Note too, that your solution involves a lot of right-recursion which,
in LR, leads to a much deeper parse stack: nothing can be reduced until
the rightmost element is reached, so the stack grows with the length of
the input instead of staying shallow the way it does with
left-recursion. Not that that's a deal-breaker, but it does mean you're
just trading one potential inefficiency (backtracking) for another
potential inefficiency (deeper recursion/parse-stack). I can't really
say if one is worse than the other (I suspect it would depend on both
the language and the source being parsed), but my point is I don't
think it's quite as simple as "keeping it LALR(1) is more efficient".

Plus, like I said before, contorting the grammar to keep it LALR(1)
can end up creating more processing work for the semantic side, so
again, the performance savings of avoiding backtracking can end up
being a wash, or even a de-optimization. Keeping it LALR(1) *can* be a
performance savings, but not necessarily. So that's not as simple as
it would seem.

> Empty rules are often problematic regarding ambiguities, so I try to
> avoid them whenever possible. Of course, the grammar I have written
> here is not really as self-describing as the original one, but I think
> it is still good enough to get the job done without creating a big
> mess.
>

Well, again, that was just a trivial example. LR grammars don't lend
themselves particularly well to modularization, so fixing a similar
issue in a bigger grammar can be much harder and lead to much greater
divergence from the "natural" or "intuitive" grammar.

> I had many similar issues to solve with the T-SQL grammar and I was
> able to solve them all so far (except for the hanging else where I
> chose to require a BEGIN END to be used on the first block since that
> does make it unambiguous also for the reader).
>

When designing a new language, adjusting the language itself to fit
the parsing engine can certainly be a viable solution. But when trying
to implement an existing language, that's not usually a realistic
option and, at best, can result in kludgy grammars that are tailored
to a specific use-case rather than being more generally useful. Which,
of course, works against GOLD's whole philosophy.

> While I agree that some backtracking functionality may be nice, I
> think that those are compromises which can also lead to unthoughtfully
> written grammars.

I've mostly addressed this above, but one additional note: I *do*
believe that the ideal thing to do, if GLR or LALR(k) were allowed,
would be to require a special parameter to enable it. For instance:

"Parse Type" = GLR ! Default would be LALR(1)
"Start Symbol" = <Foo>

That way it can still help you to keep a grammar LALR(1), but if you
decide there are other more important matters, it doesn't force it on
you.

In fact, I would recommend *against* allowing GLR without it being
disableable on a per-grammar basis.

> I quite like the fact that one has to respect the
> rules so that there is no backtracking required, which allows me to
> make assumptions about the performance of the parsing process. That's
> one of the problems which Perl-style regular expressions have: they
> may backtrack a lot and you can easily create regexes which will
> basically loop forever (not mathematically, but performing so many
> iterations that they'll never end before the universe dies).
>

FWIW, the main reason Perl's regex engine has a tendency to scale very
poorly on certain regex patterns is that it explores the NFA one path
at a time (backtracking) instead of tracking all of the NFA's possible
states at once, the way a Thompson-style simulation does (or
pre-converting it to a true DFA). More info here:

http://swtch.com/~rsc/regexp/regexp1.html

(OTOH, I've heard that Thompson NFAs can be slower on certain more
common regex patterns, even if they always scale well. But hybrid
approaches can certainly be used to mitigate that.)
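
To make the difference concrete, here's a minimal sketch of that
state-set idea (Python; the NFA is hand-built for the pattern
(a|b)*abb and has no epsilon moves, so the usual epsilon-closure step
is omitted):

# Thompson-style simulation: instead of exploring one NFA path at a time
# and backtracking, track the *set* of states the NFA could be in after
# each character. Runtime is O(len(text) * number_of_states), no matter
# how "ambiguous" the pattern is.

# Hand-built NFA for (a|b)*abb: state -> list of (symbol, next_state).
nfa = {
    0: [("a", 0), ("b", 0), ("a", 1)],
    1: [("b", 2)],
    2: [("b", 3)],
    3: [],
}
start, accept = 0, 3

def matches(text):
    states = {start}
    for ch in text:
        states = {t for s in states for (sym, t) in nfa[s] if sym == ch}
    return accept in states

print(matches("ababb"))   # True
print(matches("aabba"))   # False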

I don't know if a similar problem is likely under GLR or not.

> I'd love to see the possibility to define which rule shall be chosen
> in conflict situations though, since that's something which just needs
> to be respected while analyzing the grammar and creating the tables;
> it doesn't change anything in the algorithm after the table creation
> is done.
>

That would indeed be nice, and I'd welcome it. But note that GLR still
provides an improvement over that, because with LALR(1) the
higher-priority branch is all-or-nothing: either it matches or the
input fails to parse. With GLR, if the higher-priority branch fails,
then instead of a total failure the parser can fall back on the
lower-priority branch, which might still turn out to succeed. The D
language, for instance, has situations where that "fallback" ability
is actually required (as one example, that's how D solves the "a*b"
problem I mentioned in an earlier post).
