I have just been debugging Lepl's support for left-recursion (which,
unfortunately, appears to have been broken in Lepl 4).
I have also tried to understand the paper by Mark Johnson called "Memoization
in Top Down Parsing" which was kindly recommended to me by Benedict Kavana -
http://citeseer.ist.psu.edu/580468.html
I do not fully understand how that paper matches Lepl. My confusion comes, I
think, from the fact that Lepl is not completely lazy: within a particular
parse, results are not lazy streams, and although co-routines are used, the
evaluation order cannot change from what would be expected in a normal
recursive implementation.
Instead, I believe that Lepl implements a less efficient approach based on
Frost and Hafiz 2006. But given the complications of Lepl's implementation, I
am not sure that even that is correct.
So for Lepl 5 I am going to modify claims about support for left recursion.
Lepl appears to handle test cases correctly, but the support is based on the
algorithm described here and is not claimed to be correct (not even by me).
I will also modify Lepl so that by default left-recursive grammars are flagged
as an error.
To understand how Lepl handles left-recursion it is necessary to understand
how Lepl is implemented. It is a recursive descent parser, with some small
(and generally Python-specific) modifications:
- The recursive functions are objects and form a DAG that can be inspected
and modified before the parser is invoked.
- Python does not optimise tail-recursive calls. So instead of recursive
function calls, matchers are converted to coroutines. A stack of
coroutines replaces Python's call stack.
- Backtracking arises naturally from the above, as coroutines return a
sequence of alternative results (irrelevant to the discussion here, but
via further indirection and weak references it is possible to discard
state in "deep, unused" coroutines and so reduce memory use at the expense
of incomplete backtracking).
Give the background above, the implementation of memoisation for
left-recursive grammars works as follows:
- Wrappers (LMemo class) are added to critical points in the matcher DAG.
By default only the loops that pass through the leftmost matchers are
adjusted (other paths will presumably consume input in other matchers on
each loop), but specifying auto_memoisation(conservative=True) in the
configuration will instrument all loops that can left-recurse.
- When the parser is invoked, LMemo wrappers create "per-stream caches"
(PerStreamCache class) for each input stream. Repeated calls to a
particular wrapper with the same input will be delegated to the same
per-stream cache (Note that "per-stream cache" is a misnomer as there is
another layer of indirection to come).
- The per-stream cache has two states. In the initial state, on each call,
it generates a new "per-call cache" that delegates to the underlying
matcher and caches the results. At this point results are cached but the
cache is not used to restrict calls.
- In this state it is possible for a left-recursive call to repeatedly
generate per-call caches as it loops without consuming input. The
per-stream cache detects this and restricts the number of possible loops
to the length of the available input stream. This is based on the
approach described in Frost and Hafiz 2006. Calls after this limit
immediately fail to match.
- When the *first* per-call cache completes (ie the cache contains all
available results) the per-stream cache transitions to the second state.
In this state it returns values from the completed per-call cache. No
more calls are made to the underlying matcher.
- In the second state, the limiting on stream length is no longer necessary.
I hope this statement, and the changed handling of left-recursion in Lepl 5
(treating it as an error in the default configuration) help clarify what is
happening. And I apologise for the buggy implementation in Lepl 4.
Andrew
Hi,
- When the *first* per-call cache returns a result (ie the cache contains a
single result) the per-stream cache transitions to the second state. In
this state it returns values from the completed per-call cache. No more
calls are made to the underlying matcher.
- In the second state, the limiting on stream length is no longer necessary.
- In the second state, data may be read from the cache (by other calls)
before all results are available (from the first call). A warning is
printed to the log if this results in a secondary call missing data
(because it was not yet available to the first call).