Now that I'm pretty sure that it will work, let me report on what I'm up to. I'm working on (and have completed phase one of) a major rewrite of Marpa's parse engine, one that will result in a new theory paper.
Before I get into the changes, let me describe, very briefly and at a very high level, the construction of Marpa's parse engine. It is my recombination of ideas from four sets.
Idea set 1.) The basic Earley parse engine (from Jay Earley);
Idea set 2.) An improvement that makes Earley's algorithm linear for right recursion (Joop Leo);
Idea set 3.) A reworking of the algorithm by Aycock & Horspool;
Idea set 4.) My own reworking of the algorithm, which enables you to "pause" the parse engine. You can stop the engine and resume it efficiently, at a point where you have full knowledge of what you've done so far and of what is possible going forward. At that point you can actually change things in response to what you see, before resuming. (This particular property is totally new with Marpa, and is what makes many of its special features possible.)
The current reworking keeps the ideas from the other three sources, but drops the most visible idea from Aycock & Horspool: using LR(0) states to group Earley items together, in order to make Earley sets smaller.
Earley parsing is table-driven, and traditionally each "Earley set" in the table had one entry per dotted rule. (A dotted rule is a rule with a location marked by a dot. The dot may be before the rule, after the rule, or between two of its symbols.) Aycock & Horspool noted that the same dotted rules often occur together (a fact which yacc has leveraged in the past), and that Earley's algorithm could be modified to use this, making Earley tables smaller.
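As a concrete illustration, here is a minimal sketch of a dotted rule in Python. The names (`Rule`, `DottedRule`, `advance`) are illustrative only, not Marpa's internals:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    lhs: str
    rhs: tuple  # right-hand side symbols

@dataclass(frozen=True)
class DottedRule:
    rule: Rule
    dot: int  # 0 = before the rule, len(rhs) = after the rule

    def __str__(self):
        symbols = list(self.rule.rhs)
        symbols.insert(self.dot, ".")
        return self.rule.lhs + " -> " + " ".join(symbols)

    def advance(self):
        # Moving the dot forward one symbol -- the simple transition
        # that grouping into LR(0) states complicates.
        return DottedRule(self.rule, self.dot + 1)

expr = Rule("E", ("E", "+", "T"))
print(DottedRule(expr, 0))            # E -> . E + T
print(DottedRule(expr, 0).advance())  # E -> E . + T
print(DottedRule(expr, 3))            # E -> E + T .
```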
Getting to the point: I am dropping the use of the LR(0) states, in part going back to traditional Earley dotted rules, and in part going on to a new and different idea.
As I worked with Marpa, I noted that while the Aycock & Horspool LR(0)-based items did have advantages, they came with a number of big disadvantages. When it came to evaluation, the LR(0) states had to be pulled apart, back into dotted-rule form. Transitions from one LR(0) state to another were not nearly as simple as moving the dot forward in a dotted rule. Worse, they could result in duplication of dotted rules in different states, necessitating special code to deal with it.
And, finally, the LR(0) states made it very difficult to take advantage of the fourth idea set: examining and altering the parse tables during "pauses". This was a major obstacle to innovation. Events, the "Ruby Slippers", and the ability to mix procedural parsing with Earley parsing are all current results of this fourth idea set. And there is more to come, but the LR(0) states would have been a continual obstacle.
States in the parser are of two types. One type is predicted, meaning you've determined that a rule could start at the current position. The other type means that you've actually recognized some symbols that could be part of the rule. I call those states "discovered".
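In those terms, an item's type can be read off from its dot position and its starting location. A hedged sketch (the function and parameter names are mine, for illustration, not Marpa's code):

```python
def item_type(dot, origin, current):
    """Classify an Earley item by dot position and start location.

    dot     -- position of the dot in the dotted rule (0 = at the start)
    origin  -- index of the Earley set where this rule instance began
    current -- index of the Earley set now being built
    """
    if dot == 0 and origin == current:
        # The rule could start here; no symbols recognized yet.
        return "predicted"
    # At least one symbol that could be part of the rule was recognized.
    return "discovered"

print(item_type(dot=0, origin=7, current=7))  # predicted
print(item_type(dot=2, origin=5, current=7))  # discovered
```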
Experience with Marpa has given me evidence on how the Aycock and Horspool states work in practice. I'll give numbers from parsers for Perl, HTML and C. The average number of dotted rules in "predicted" LR(0) states is 83.59 (Perl), 14.60 (HTML), and 54.81 (C). The average number of dotted rules in "discovered" LR(0) states is 1.5235 (Perl), 1.1904 (HTML), and 1.52 (C). LR(0) states turned out to be a huge implementation nuisance, but for prediction states they had clear advantages. On the other hand, for practical grammars, discovered states turned out not to be worth the trouble.
I will be eliminating the use of LR(0) states for *both* predicted and discovered states, but in different ways. For discovered states, I will simply go back to the traditional "one dotted rule per item" approach. Every other consideration was already against LR(0) states; when discovered LR(0) states turned out to be of little use for practical grammars, no justification was left for keeping them.
For predicted states, I will take another approach. Predicted Earley items are quite special. In addition to a dotted rule, Earley items have to track the starting location of their rule and, in order to assist evaluation, they have to track history. Predictions are special in that they have no history, and the starting location of their rule is always the same as the current location. Finally, the dot in their dotted rules is always at the beginning of the rule.
The number of rules is a constant determined by the grammar, and this means that predictions in each Earley set can be tracked as a set of integers, where the integers have a very reasonable maximum. A bit vector would work well, though currently I plan an alternative representation.
OK, so where am I in all this? I'm working in phases. Phase one is changing discovered Earley items so that they always contain exactly one dotted rule, in effect eliminating the use of LR(0) states for them. This part is coded and tested.
In the next phases, I will implement my new ideas for handling
predictions. I expect that these phases will allow me to
introduce new
Marpa features. Results visible to the user may appear as soon
as the
second phase.
I plan to bring out a developer's version based on Phase 1 in a
few days.
Phase 1 is complete and tested, but in the process of
development, I
rearranged my test suite. This rearrangement worked fine for
development,
but it needs to be put back in proper order for CPANtesters.
Many thanks for your support and patience, Jeffrey