Now that I'm pretty sure that it will work, let me report on what I'm up to. I'm working on (and have completed phase one of) a major rewrite of Marpa's parse engine, one that will result in a new theory paper.
Before I get into the changes, let me describe, very briefly and at a very high level, the construction of Marpa's parse engine. It is my recombination of ideas from four sets.
Idea set 1.) The basic Earley parse engine (from Jay Earley);
Idea set 2.) An improvement that makes Earley's algorithm linear for right recursion (Joop Leo);
Idea set 3.) A reworking of the algorithm by Aycock & Horspool;
Idea set 4.) My own reworking of the algorithm, which enables you to "pause" the parse engine. You can stop the engine and resume it efficiently, at a point where you have full knowledge of what you've done so far and of what is possible going forward. At that point you can actually change things in response to what you see, before resuming. (This particular property is totally new with Marpa, and is what makes many of its special features possible.)
The current reworking keeps the ideas from the other three sources, but drops the most visible idea from Aycock & Horspool: using LR(0) states to group Earley items together, in order to make Earley sets smaller.
Earley parsing is table-driven, and traditionally each "Earley set" in the table had one entry per dotted rule. (A dotted rule is a rule with a location marked by a dot. The dot may be before the rule, after the rule, or between two of its symbols.) Aycock & Horspool noted that the same dotted rules often occur together (a fact which yacc has leveraged in the past), and that Earley's algorithm could be modified to use this, making Earley tables smaller.
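As a concrete illustration, here is a minimal sketch of a dotted rule in Python. The names (`Rule`, `DottedRule`, `advance`) are illustrative only, not Marpa's internals:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    lhs: str
    rhs: tuple  # right-hand side symbols

@dataclass(frozen=True)
class DottedRule:
    rule: Rule
    dot: int  # 0 = before the rule, len(rhs) = after the rule

    def __str__(self):
        symbols = list(self.rule.rhs)
        symbols.insert(self.dot, ".")
        return self.rule.lhs + " -> " + " ".join(symbols)

    def advance(self):
        # Moving the dot forward one symbol -- the simple transition
        # that grouping into LR(0) states complicates.
        return DottedRule(self.rule, self.dot + 1)

expr = Rule("E", ("E", "+", "T"))
print(DottedRule(expr, 0))            # E -> . E + T
print(DottedRule(expr, 0).advance())  # E -> E . + T
print(DottedRule(expr, 3))            # E -> E + T .
```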
Getting to the point: I am dropping the use of the LR(0) states, in part going back to traditional Earley dotted rules, and in part going on to a new and different idea.
As I worked with Marpa, I noted that while the Aycock & Horspool LR(0)-based items did have advantages, they came with a number of big disadvantages. When it came to evaluation, the LR(0) states had to be pulled apart, back into dotted-rule form. Transitions from one LR(0) state to another were not nearly as simple as moving the dot forward in a dotted rule. Worse, they could result in duplication of dotted rules in different states, necessitating special code to deal with it.
And, finally, the LR(0) states made it very difficult to take advantage of the fourth idea set: examining and altering the parse tables during "pauses". This was a major obstacle to innovation. Events, the "Ruby Slippers", and the ability to mix procedural parsing with Earley parsing are all current results of this fourth idea set. And there is more to come, but the LR(0) states would have been a continual obstacle.
States in the parser are of two types. One type is predicted, meaning you've determined that a rule could start at the current position. The other type means that you've actually recognized some symbols that could be part of the rule. I call those states "discovered".
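In those terms, an item's type can be read off from its dot position and its starting location. A hedged sketch (the function and parameter names are mine, for illustration, not Marpa's code):

```python
def item_type(dot, origin, current):
    """Classify an Earley item by dot position and start location.

    dot     -- position of the dot in the dotted rule (0 = at the start)
    origin  -- index of the Earley set where this rule instance began
    current -- index of the Earley set now being built
    """
    if dot == 0 and origin == current:
        # The rule could start here; no symbols recognized yet.
        return "predicted"
    # At least one symbol that could be part of the rule was recognized.
    return "discovered"

print(item_type(dot=0, origin=7, current=7))  # predicted
print(item_type(dot=2, origin=5, current=7))  # discovered
```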
Experience with Marpa has given me evidence on how the Aycock and Horspool states work in practice. I'll give numbers from parsers for Perl, HTML and C. The average number of dotted rules in "predicted" LR(0) states is 83.59 (Perl), 14.60 (HTML), and 54.81 (C). The average number of dotted rules in "discovered" LR(0) states is 1.5235 (Perl), 1.1904 (HTML), and 1.52 (C). LR(0) states turned out to be a huge implementation nuisance, but for prediction states they had clear advantages. On the other hand, for practical grammars, discovered states turned out not to be worth the trouble.
I will be eliminating the use of LR(0) states for *both* predicted and discovered states, but in different ways. For discovered states, I will simply go back to the traditional "one dotted rule per item" approach. Every other consideration was already against LR(0) states; when discovered LR(0) states turned out to be of little use for practical grammars, no justification was left for keeping them.
For predicted states, I will take another approach. Predicted Earley items are quite special. In addition to a dotted rule, Earley items have to track the starting location of their rule and, in order to assist evaluation, they have to track history. Predictions are special in that they have no history, and the starting location of their rule is always the same as the current location. Finally, the dot in their dotted rules is always at the beginning of the rule.
The number of rules is a constant determined by the grammar, and this means that predictions in each Earley set can be tracked as a set of integers, where the integers have a very reasonable maximum. A bit vector would work well, though currently I plan an alternative representation.
OK, so where am I in all this? I'm working in phases. Phase one is changing discovered Earley items so that they always contain exactly one dotted rule, in effect eliminating the use of LR(0) states for them. This part is coded and tested.
In the next phases, I will implement my new ideas for handling
predictions. I expect that these phases will allow me to
introduce new
Marpa features. Results visible to the user may appear as soon
as the
second phase.
I plan to bring out a developer's version based on Phase 1 in a
few days.
Phase 1 is complete and tested, but in the process of
development, I
rearranged my test suite. This rearrangement worked fine for
development,
but it needs to be put back in proper order for CPANtesters.
Many thanks for your support and patience, Jeffrey