Questions about error handling in Earley parsers

26 views
Skip to first unread message

Maarten

unread,
Mar 10, 2017, 3:17:23 PM3/10/17
to marpa-...@googlegroups.com
With great interest I have read the paper on the Marpa parser. This summer I have created my own Earley parser, a probabilistic one, and I am looking to implement some of Marpa's innovations in my own.

Right now I am working on error handling. One person suggested I implement synchronizing rules like in shift-reduce parsers. I figured panic mode should behave a bit differently for probabilistic ambiguous grammars, so I settled on the following (text taken from issue on Github):

Consider the following grammar:

S -> AB C; (0.5)
S -> A B ; (0.5)

A -> a a ; (0.5)
B -> b b ; (0.5)
C -> c c ; (0.5)

AB -> a a ; b b ; (0.5)

A -> a <error> ; (0.5)
B -> b <error> ; (0.5)
AB -> a <error> ; (0.5)

On the input a a ; b z ; c c ; with goal S. When encountering z, the parser could apply both B -> b <error> ; and AB -> a <error> ;. We would like to consider AB because that's the only way to make S. So we need to consider both error rules.

Now consider the same grammar, except the first rule is S -> AB; on the same input. We have established that we want to consider both error rules, and in this case both make a valid parse. However, we want the probability of AB -> a <error> ; being applied to be lower than that of B -> b <error> ;, because we want to keep the number of ignored tokens minimal.

That's why I have decided to make the error rule probability multiply every time a token is ignored. The inner score of applying B -> <error> ; is then be 0.5^2, and the inner score of AB -> <error> ; is 0.5^4.


I suspect that you have more sophisticated knowledge than I do about Earley parsing, so I seek out your opinion: do you think this is a good idea? It also makes me a bit uncomfortable that above grammar fails terminally on a a ; b z ; c c ; cc ; , because the parser gets out of panic mode after encountering a semicolon. But maybe that should be a problem for another error handling strategy.

Which bring me to the Ruby slippers error handling. I don't know if I have a good idea of how this works. The way I interpret is as follows: the parser might stall on input which prohibits it from succeeding (no states follow from predict, no states follow from scan or there are no states to complete). Because the chart will always contain some active states, you know what the parser expects to succeed and you could let the user decide whether to manipulate the input stream so that the parser may proceed given this information (eg, drop token, insert a token, etc). Is that correct?

Thanks for your help!

Maarten Trompper


Jeffrey Kegler

unread,
Mar 10, 2017, 3:24:28 PM3/10/17
to Marpa Parser Mailing LIst
Thanks for your interest.  One approach might be to do a raw, non-probabilistic parse in Marpa, and then use its Abstract Syntax Forests to do the probabilistic analysis.

I'm about to run errands, but I've printed your email out to take with me.  I'll let you know if I have other thoughts.

-- jeffrey

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeffrey Kegler

unread,
Mar 10, 2017, 8:19:30 PM3/10/17
to Marpa Parser Mailing LIst
I'm now looking at the Stolcke article.  I am sorry to confess that I was not aware of it before.

It's not relevant to a probabilistic approach, but I think you might find your error detection grammar fits Ruby Slippers parsing rather well.  Define an <error> token, one which is never found in actual input.  When the parser is unable to accept the current input, on a Ruby Slippers basis, feed it an <error> token.  This will detect errors, and minimize them if proceeding left-to-right minimizes.

It may happen that a strict left-to-right minimization does not produce the fewest <error>'s.  This would be in cases were the parser can accept the current input, but it's a "garden path", and the parser would be better off recognizing an error instead -- one <error> now will save 2 or more later.  Here, with Marpa, going to an ASF-based strategy might be better.

After looking at Stolcke, I may have more to say.
Reply all
Reply to author
Forward
0 new messages