Thanks, Jeffrey, for answering my question about empty rules. I haven't decided if I'll eliminate them the same way Marpa does, but now I feel much more confident that I can do it if necessary.
Right now, I'm studying Leo reduction, via
Loup Vaillant's tutorial and also
a PEG-and-Earley tutorial somebody pointed me to. I haven't gotten to actual
parsing yet, I'm still trying to nail down reduction, but I'm sure I'll get there eventually.
Consider the following grammar, whose starting symbol is <A>:
<A> ::= "x" | <B>
<B> ::= <A>
When fed the input "x", an Earley recogniser without right-reduction would wind up with these items in the final set:
<A> ::= "x" · (0)
<A> ::= <B> · (0)
<B> ::= <A> · (0)
...and sure enough, if I don't attempt Leo reduction, that's what I get.
If I do attempt Leo reduction, I have a problem. I'm looking for the "topmost" reduction, but because the grammar has a cycle, there's no actual top. This cycle isn't fatal; since I'm memoising this reduction, I can drop a sentinel in the cache before I recurse, and if I find a sentinel I know I'm done. However, this means that the "topmost" item will be determined by the order in which I iterate over Earley items to perform completion steps; whichever one I grab first will become the topmost.
If I happen to pick <A> ::= <B> · (0) as my topmost, it will survive into the final Earley set, and I will have two items for the start symbol spanning the entire input, and therefore I can build two parse trees: one for 0 iterations of the cycle, and one for 1 iteration. That seems tidy.
On the other hand, if I happen to pick <B> ::= <A> · (0) as my topmost, the final Earley set will only have one item for the start symbol spanning the entire input, and therefore I can build only one parse tree: the one for 0 iterations of the cycle.
I'd rather not have a recogniser that behaves inconsistently; Earley parsing is supposed to be the "Just Works™" algorithm, after all. I can think of a few alternatives:
- I could declare that cyclic grammars are Bad, and automatically reject them. Seems a bit heavy-handed.
- I could declare that cyclic grammars are Unsupported, and allow them but make no guarantees about finding all the parse trees. I notice that libmarpa detects cyclic grammars and warns about them, but I can't find any documentation saying cyclic grammars are any less reliable than proper ones, just that they're less efficient.
- If there were some principled way to choose a "topmost" item, I could ensure I always picked the "right" one, and then I would be able to make guarantees like "only the simplest parse tree is found" or "only the simplest parse tree plus one iteration of the cycle".
What should I do?