"Ambiguous grammar" with @allow whitespace

19 views
Skip to first unread message

Matt Brubeck

unread,
Feb 3, 2009, 12:20:41 PM2/3/09
to gazelle-users
When I try to compile the following grammar with Gazelle 0.4:

@start start;
start -> next;
next -> end;
end -> "X";
whitespace -> .whitespace=/x+/;
@allow whitespace start ... end;

why do I get this error?
gzlc: compiler/ll.lua:726: Ambiguous grammar for state starting in
rule next, paths={{"enter", "whitespace"}, {"term", "whitespace"},
{"return"}, {"return", "start"}, {"return", "eof"}, {"term", {class =
Class: "SingletonEdgeValue"}}} AND {{"return", "start"}, {"enter",
"whitespace"}, {"term", "whitespace"}, {"return"}, {"return", "eof"},
{"term", {class = Class: "SingletonEdgeValue"}}}

The above grammar is a minimal test case; I originally encountered
this error while attempting to update sketches/rtn.gzl to work with
the new @allow syntax in Gazelle 0.4. I ran into it again when
modifying other files in sketches/.
http://github.com/mbrubeck/gazelle/commit/41bdb8e64f8daf3f8c38b309aac7b4990b7de52b

Joshua Haberman

unread,
Feb 3, 2009, 2:46:43 PM2/3/09
to gazell...@googlegroups.com
This is excellent Matt, thanks for the catch and even more thanks for reducing it to such a small test case.  Pull the latest from GitHub for the bugfix.

This bug has to do with the fact that the @allow syntax inherently causes grammar ambiguities.  Since the "whitespace" is allowed at any state in the grammar, it is frequently the case that you could, for a given set of input that includes whitespace, take the "whitespace" transition in several different places.

This ambiguity isn't important, since the place that we attach whitespace to the tree is not significant.  So my general strategy for dealing with this is to just arbitrarily pick the path that consumes the whitespace first:

http://github.com/haberman/gazelle/blob/d2393dbe1ea61c2281bf273761e083970f8a84c1/compiler/ll.lua#L579

But there was a bug in this code that didn't properly notice that paths were whitespace-redundant (redundant *only* by the placement of whitespace in the tree) when they contained EOF.

Josh

Matt Brubeck

unread,
Feb 3, 2009, 6:28:59 PM2/3/09
to gazelle-users
On Feb 3, 11:46 am, Joshua Haberman <jhaber...@gmail.com> wrote:
> This is excellent Matt, thanks for the catch and even more thanks for
> reducing it to such a small test case.  Pull the latest from GitHub for the
> bugfix.

Thanks, that works. I found and fixed another bug, which was
triggered when my grammar had a syntax error on the first line (for
example, the grammar "a: b"). Fix is here:
http://github.com/mbrubeck/gazelle/commit/d94b86d9ef2fb6d6431543d45066c1215d69723d

Now gzlc is hanging at "Doing LL(*) lookahead calculations." You can
see this on my latest attempt:
http://github.com/mbrubeck/gazelle/blob/e17978c21f7513a7b5356c299f5f02118cfc29b1/sketches/rtn.gzl

I can avoid the hang by removing parts of the grammar. Then I get
"Language is probably not LL(k) or LL(*)" or "Ambiguous grammar."
These could probably be fixed with the right ambiguity resolution, but
I haven't tried yet.

And one more minor bug: With "@allow whitespace", blank lines mess up
the line numbers in gzlparse error messages.

Joshua Haberman

unread,
Feb 4, 2009, 4:47:44 PM2/4/09
to gazell...@googlegroups.com
On Tue, Feb 3, 2009 at 3:28 PM, Matt Brubeck <mbru...@limpet.net> wrote:
On Feb 3, 11:46 am, Joshua Haberman <jhaber...@gmail.com> wrote:
> This is excellent Matt, thanks for the catch and even more thanks for
> reducing it to such a small test case.  Pull the latest from GitHub for the
> bugfix.

Thanks, that works.  I found and fixed another bug, which was
triggered when my grammar had a syntax error on the first line (for
example, the grammar "a: b").  Fix is here:
http://github.com/mbrubeck/gazelle/commit/d94b86d9ef2fb6d6431543d45066c1215d69723d

This and the other bug you discovered highlight the benefit of making Gazelle self-hosting sooner rather than later.  :)  Which I see you are attempting to do.  Excellent.
 
Now gzlc is hanging at "Doing LL(*) lookahead calculations."  You can
see this on my latest attempt:
http://github.com/mbrubeck/gazelle/blob/e17978c21f7513a7b5356c299f5f02118cfc29b1/sketches/rtn.gzl

I have a fix for this, but I'm not super satisfied with it.  I think I need to give the "@allow" feature one more rewrite.  I'm not a huge fan of all those explicit self-transitions, and I think I can stay just as simple without them.

I can avoid the hang by removing parts of the grammar. Then I get
"Language is probably not LL(k) or LL(*)" or "Ambiguous grammar."
These could probably be fixed with the right ambiguity resolution, but
I haven't tried yet.

I ran into the same thing, and I think the root of it is this problem:

a -> "X" + ("Y");

This is inherently ambiguous -- is it:

a -> "X" (+("Y"));

or:

a -> ("X"+) "Y";

The current parser chooses the first if there's no space between the + and the (, otherwise it picks the second.  Clearly this is less than ideal.  I'm not willing to give up "X"+, and I'm not willing to give up the inherent *feature* of repetition with separator.  So I'm open to suggestions about a non-ambiguous syntax for repetition with separator.

My best idea at the moment is:

a -> "X" +{"Y"};

Only reluctance about that is that it looks similar to:

a -> "X"{3};

...but the thing in curlies means something very different.  On the other hand, if you get the backwards your grammar probably won't compile, because:

a -> "X"{"Y"};

...makes no sense.

And one more minor bug:  With "@allow whitespace", blank lines mess up
the line numbers in gzlparse error messages.

Yep, Gazelle needs to be self-hosting, pronto.  :)

Josh

Joshua Haberman

unread,
Feb 5, 2009, 2:40:17 AM2/5/09
to gazell...@googlegroups.com
On Wed, Feb 4, 2009 at 1:47 PM, Joshua Haberman <jhab...@gmail.com> wrote:
Now gzlc is hanging at "Doing LL(*) lookahead calculations."  You can
see this on my latest attempt:
http://github.com/mbrubeck/gazelle/blob/e17978c21f7513a7b5356c299f5f02118cfc29b1/sketches/rtn.gzl

I have a fix for this, but I'm not super satisfied with it.  I think I need to give the "@allow" feature one more rewrite.  I'm not a huge fan of all those explicit self-transitions, and I think I can stay just as simple without them.

Scratch that, I thought about it some more, and my other design didn't quite work.  I committed the bugfix, which should let your grammar compile without hanging.  The bugfix actually affected more than just whitespace.

It will still complain about the ambiguity, but if you change the parentheses to be curlies instead the grammar compiles!  And I even got it to parse itself!

By the way, what do you think of my syntax modification for +{"foo"}  ?

Josh

Matt Brubeck

unread,
Feb 5, 2009, 12:46:26 PM2/5/09
to gazelle-users
On Feb 4, 11:40 pm, Joshua Haberman <jhaber...@gmail.com> wrote:
> On Wed, Feb 4, 2009 at 1:47 PM, Joshua Haberman <jhaber...@gmail.com> wrote:
> It will still complain about the ambiguity, but if you change the
> parentheses to be curlies instead the grammar compiles!  And I even got it
> to parse itself!
>
> By the way, what do you think of my syntax modification for +{"foo"}  ?

Looks good to me. None of the alternatives I've considered (+["foo"],
+|"foo"|...) seem particularly better.

I was also able to get gazelle.gzl to parse itself after changing to
the curly-brace syntax and adding a rule for comments. I think the
only thing still missing in my local copy is prioritized choice.

Random thought: What about "||" for prioritized choice? Think of the
analogy to the short-circuiting "or" operator...

Joshua Haberman

unread,
Feb 6, 2009, 2:33:28 AM2/6/09
to gazell...@googlegroups.com
Hmm.  Not bad.  I would rather like to get rid of my current resolution to the ambiguity -- though it's true that every regex needs to be named, the error message will be quite unintuitive if someone writes:

foo -> bar /baz/;

It's true that "/" is really well established in the PEG literature as prioritized choice.  But "/" is even more established as a regular expression delimiter in text processing culture, and I think the number of people familiar with that convention is far far greater than the number of people who read PEG papers.

x -> a || b;

I kinda like it!  And I also really like the analogy with short-circuiting "or".  Let me sleep on it.

One other language issue I noticed as I was trying to update gazelle.gzl just now.  I would kind of like to be able to say:

derivations -> derivations +{"|" | "||"};

In other words, let the thing inside the modifier be any expression, not just a single component.  Before I think I rejected this because I thought the slot numbering could get confusing, but I think the same rule can apply: you just count from the left.  So the slots in this case would be:

slotnum=0
slotname=derivations

slotnum=1
slotname="|"

slotnum=2
slotname="||"

Also I just realized that my earlier worry that the curlies would get confused was mistaken: the syntax:

a -> "X"{3};

...is not even supported -- counted repetition is only available *inside* regular expressions.

One last thing -- in your update to rtn.gzl (gazelle.gzl), I see you using "term" to refer to things that can appear on the right hand side of a rule.  I'm trying to standardize the terminology to call these "components", and to call terminals "terminals", to remove the ambiguity you pointed out before.

Josh
Reply all
Reply to author
Forward
0 new messages