On Apr 16, 10:35 pm, Dave Dolan <
dave.do...@gmail.com> wrote:
> The parser need not distinguish between a typedef and a vanilla
> identifier. That's a semantic idea... I know that Mr. Parr's et. al's
> paper focuses on that, but really the parsing, and the actions, and
> especially so in gold's case, are distinct topics... The authors (Mr.
> Parr, presumably being one such,) obviously are a bit hung up on
> combining the whole thing, as he spent 5 years producing ANTLR...
>
> There is not a reason that a parser should have to make or check a
> symbol table until after a recognition has been confirmed... At least
> in my humble opinion.. I know there are simple ways to link the
> processes together, as has been so recently demonstrated by Mr. von
> Wyss's elegant implementation of the BSN engine, but even though the
> declarative process he enables allows us to easily merge the two
> phases, the recognition and the generation of the AST are separate
> things. I think one of the goals of GOLD was to be portable at the
> CGT level, and messing around with semantic details so early may start
> to muddy those waters.
>
> Please feel free to correct me if I'm wrong or stupid, I wouldn't doubt either.
>
(Disclaimer: I'm not exactly a parsing theory expert either ;) And
sorry for this being such an all-out essay...)
First of all, it turns out my example was a bad one. Suppose you come
across this input in a C-like language:
a*b
That could be either a multiplication or a pointer declaration and the
symbol table is (usually) the only way to distinguish. No amount of
parser lookahead will get around that.
I definitely agree with keeping semantics and actions out of the
grammar (the benefits that allows are what attracted me to GOLD in the
first place). And you're right: the brilliant BSN engine proves that
GOLD's approach can still keep most of the benefits usually associated
with embedded actions.
But, on the other hand, making a grammar a little bit more specific
creates a much larger decrease in the amount of semantics-processing
the user-code needs to do. And I think GLR can allow for more specific
grammars (and more easily/naturally-written grammars), *without*
requiring the grammar to be infected with semantics/actions [1].
For instance, suppose you have something like this:
! <A> is a string of zero or more X's and A's followed by '1'
! <B> is a string of zero or more X's and B's followed by '2'
"Start Symbol" = <S>
<S> ::= <A> | <B>
<A> ::= <XA> '1'
<B> ::= <XB> '2'
<XA> ::= <XA> 'X' | <XA> 'A' |
<XB> ::= <XB> 'X' | <XB> 'B' |
With LALR(1), that's ambiguous and gives a reduce-reduce conflict (It
doesn't know whether to reduce "<XA> ::= {dot}" or "<XB> ::= {dot}").
In fact, I don't think even LALR(k) can do it, AIUI [1]. But GLR can
handle it fine by first reducing to <XA>, and then if that leads to an
error, backtracking and reducing to <XB> instead [2].
Now yes, that simple example *can* be converted to LALR(1). But that
means more work for the grammar-writer, *and* it means loosing
grammatical information that will just have to be shifted into the
user-code:
"Start Symbol" = <S>
<S> ::= <AB>
<AB> ::= <XAB> '1' | <XAB> '2'
<XAB> ::= <XAB> 'X' | <XAB> 'A' | <XAB> 'B' |
So now anyone who needs to use this grammar will have to go through
every <AB> produced and make sure that:
1. A's and B's aren't mixed.
2. A's and '2' aren't mixed.
3. B's and '1' aren't mixed.
And all of this is just because empty strings and X's just happen to
be accepted by both <XA> and <XB>.
Suppose further, that <XA> and <XB> can be used in different, possibly
overlapping, ways. Not only is that even more to verify with the <XAB>
approach, but it can even *create* ambiguities that wouldn't have been
ambiguous if <XA> and <XB> were separate in the first place.
I've actually come across problems like this just trying to adjust the
ANSI C grammar listed on the GOLD site to accept "MyType var;" (which
I've just about given up on - I ended up *completely* conflating the
notions/syntaxes of types and expressions and I'm still just getting
ending up with new ambiguities). Granted, GLR won't be able to solve
this completely, due to the "a*b" problem. But it could help, and it
would at least make trying to work around the "a*b" problem less of a
hair-pulling matter.
Of course, another thing that would help, and *would* provide a clean
solution to the "a*b" problem, would be to have some way to manually
specify some sort of disambiguation rule. The D language, for
instance, has a few things that *would* be ambiguous, except that the
spec dictates "if a section of code matches both pattern X and pattern
Y, than *always* consider it to be pattern X". That makes it
unambiguous, but GOLD currently has no way to specify such
disambiguation rules. Note that this would probably be difficult (if
even possible?) to do in LALR(1), but it could be done in GLR by
simply allowing the grammar writer to prioritize the choices at the
decision-point. (Although I'm not sure offhand *how* the user would
specify that. We certainly wouldn't want them hardcoding any LALR
state numbers into the grammar.) This would also provide a much easier
and more readable solution to the dangling-else problem than the usual
<Then Clause> awkwardness.
One might be able to argue it's better to force the grammar writer to
keep the grammar LALR(1) to prevent time-consuming backtracking. But I
think that would be a false notion since the user's semantics code
would often just have to take the extra time to perform the extra
validation itself.
[1] I know I said LALR(k) in my previous post, but I *think* that and
GLR are distinct. If so, then GLR is the more general of the two and
can probably still make optional use of LALR(k) to speed up some of
the simpler situations.
[2] In general, GLR works (AIUI) by *allowing* all shift/reduce and
reduce/reduce conflicts, and then whenever an LALR state with such a
conflict is reached, it sets a bookmark, makes a guess, then if it
fails, it backtracks to the bookmark and tries another of the choices.
Some sort of cycle-detection might be needed too, to prevent infinite
loops.