test -> "X" "YA" | "XY" "B" ; parses "XYB" but not "XYA"

Catherine Rodgers

unread,

Jul 9, 2008, 3:47:15 PM7/9/08

to gazelle-users

Say I have test.gzl:

test -> "X" "YA" | "XY" "B" ;

If I try to parse "XYB", it works correctly. However, "XYA" produces
this output:

gzlparse: interpreter.c:503: do_intfa_transition: Assertion `terminal'
failed.
{"parse_tree":
{"rule":"test", "start": 0, "children": [
{"terminal": "XY", "slotname": "XY", "slotnum": 2, "offset": 0,
"len": 2}

Is Gazelle *supposed* to be able to handle this example grammar, or is
this a case like in the "Caveats" section of the manual that isn't
actually parseable with the kind of parser Gazelle is?

--
Catherine

Joshua Haberman

unread,

Jul 10, 2008, 1:12:13 AM7/10/08

to gazell...@googlegroups.com

This is a great question. The answer is that this grammar has a property that Gazelle does not currently attempt to support: namely that you have one token ("X") that is a substring of another token ("XY"). Lexers traditionally use longest-match semantics, where they'll always attempt to keep adding characters to the current token until it can't add any more.

Most lexer software does things this way:

"When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text [...]"
http://www.gnu.org/software/flex/manual/html_chapter/flex_8.html

"If one or more state machines reache the ACCEPTANCE state, then the pattern that has eaten the most characters, i.e. lived the longest is considered the winner (the matcher)."
http://quex.sourceforge.net/quex-doc-draft.html

"The lexer resolves conflicts among rules by choosing the rule with the longest match, and in the case two rules match the same string, choosing the rule listed first in the specification."
http://www.smlnj.org/doc/ML-Lex/manual.html

It's theoretically possible to support grammars like the one you posted. I don't *think* it will pay off, but I'm leaving my options open.

Josh

Catherine Rodgers

unread,

Jul 10, 2008, 3:59:42 PM7/10/08

to gazell...@googlegroups.com

On Wed, Jul 9, 2008 at 10:12 PM, Joshua Haberman <jos...@reverberate.org> wrote:
> This is a great question. The answer is that this grammar has a property
> that Gazelle does not currently attempt to support: namely that you have one
> token ("X") that is a substring of another token ("XY"). Lexers
> traditionally use longest-match semantics, where they'll always attempt to
> keep adding characters to the current token until it can't add any more.

That's what I suspected. So the only way to get such rules to parse is
to rewrite them? For the particular example I have in this thread,
that's not too hard:

test -> "XY" ("A" | "B") ;

Thanks for the detailed response!

--
Catherine

Joshua Haberman

unread,

Jul 10, 2008, 6:24:57 PM7/10/08

to gazelle-users

Yes, any of the following would work also:

(1) test -> "XY" "A" | "XY" "B";
(2) test -> "X" "Y" "A" | "X" "Y" "B";
(3) test -> "X" "Y" ("A" | "B");

It all depends on what your tokens are, and whether the differences
are significant between "XY" prior to an "A" and "XY" prior to a "B".

For example, your proposed alternate grammar and my grammar (3)
consider "XY" the same whether it's followed by an "A" or a "B". So
if you feed it just "XY", it will fully parse those two characters,
yield them as tokens, and assign them slot number (which I don't
describe in the manual very well yet).

On the other hand, my alternate grammars (1) and (2) consider "XY"
preceding an "A" to be something totally different from an "XY"
preceding a "B". So if you just feed it "XY" it won't know which of
the two alternatives to choose yet since it hasn't seen a character or
token that would distinguish them. So not until it sees "A" or "B"
will it yield anything.