
Packrat parsers and memoization


Kay Schluehr

Jun 7, 2009, 9:41:28 AM
Hello compiler list,

I have a question concerning packrat parsing and memoization. Suppose
one has the following grammar

G -> A1 b / A2 c
A1 -> a
A2 -> a

and an input string 'a c'. Since the alternative on the left of the
ordered choice fails to parse the string, the second one is tried. Now
the major claim is that packrat parsing avoids exponential complexity
in backtracking by means of memoizing intermediate results. However, I
fail to see how this helps here, because reusing A1's result would
amount to parsing 'A1 c', which is not a rule of the grammar, and
would create the wrong parse tree.
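
To make the mechanics concrete, here is a minimal packrat-style
matcher in Python (a hypothetical sketch, not from any of the posters'
tools). The memo table is keyed by (rule, position), so A1 and A2 at
position 0 occupy separate entries, and the match of 'a' under A1 is
never reused for A2:

memo = {}

def match(rule, rules, text, pos):
    key = (rule, pos)                 # memo key: (rule, position)
    if key in memo:
        return memo[key]              # memo hit: reuse the earlier result
    result = None
    for alt in rules[rule]:           # ordered choice: first success wins
        p, ok = pos, True
        for sym in alt:
            if sym in rules:          # nonterminal: descend
                p = match(sym, rules, text, p)
                ok = p is not None
            elif text.startswith(sym, p):
                p += len(sym)         # terminal matched
            else:
                ok = False
            if not ok:
                break
        if ok:
            result = p
            break
    memo[key] = result                # memoize success *and* failure
    return result

rules = {'G': [['A1', 'b'], ['A2', 'c']],
         'A1': [['a']],
         'A2': [['a']]}
print(match('G', rules, 'ac', 0))     # 2: parsed via the A2 c alternative
print(sorted(memo))                   # ('A1', 0) and ('A2', 0) are distinct

Memoization only pays off when the same rule is retried at the same
position; naming the rules A1 and A2 apart defeats that, which is
exactly the point of the question.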

Thanks

Quinn Tyler Jackson

Jun 8, 2009, 10:36:45 AM
Kay Schluehr said:

> I have a question concerning packrat parsing and memoization. Suppose
> one has following grammar
>
> G -> A1 b / A2 c
> A1 -> a
> A2 -> a
>
> and an input string 'a c'. Since the rule on the left of the ordered
> choice fails to parse the string the second one is tried. Now the
> major claim is that packrat parsing avoids exponential complexity in
> backtracking by means of memoization of intermediate results. However
> I fail to see how this helps here

Human beings are amazing producers of pathological counter-examples and
worst-cases. In the above example, "a" will indeed be examined in the input
twice by a top-down backtracking parser, whether or not memoization is used,
assuming no other optimizations.*

Now, consider this variant (which may or may not be in Packrat format):

S -> A3 d / A4 e
G -> A1 b / A2 c
A1 -> a
A2 -> a
A3 -> H
A4 -> H
H -> G+

Given the input:

acacacacacacacacace

If you double the length of the string, making it:

acacacacacacacacacacacacacacacacacace

will the time taken to parse more than double?
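
One way to check this on a toy scale (a hypothetical sketch, not the
Packrat algorithm proper; H -> G+ is encoded as the ordered choice
H -> G H / G) is to count actual rule descents with memoization
switched on and off:

def run(text, memoize):
    rules = {
        'S':  [['A3', 'd'], ['A4', 'e']],
        'G':  [['A1', 'b'], ['A2', 'c']],
        'A1': [['a']], 'A2': [['a']],
        'A3': [['H']], 'A4': [['H']],
        'H':  [['G', 'H'], ['G']],    # PEG encoding of H -> G+
    }
    memo, steps = {}, [0]

    def match(rule, pos):
        if memoize and (rule, pos) in memo:
            return memo[(rule, pos)]  # memo fetch, not a descent
        steps[0] += 1                 # count an actual descent
        result = None
        for alt in rules[rule]:
            p, ok = pos, True
            for sym in alt:
                if sym in rules:
                    p = match(sym, p)
                    ok = p is not None
                elif text.startswith(sym, p):
                    p += len(sym)
                else:
                    ok = False
                if not ok:
                    break
            if ok:
                result = p
                break
        memo[(rule, pos)] = result
        return result

    match('S', 0)
    return steps[0]

for n in (9, 18):                     # 9 and then 18 repetitions of "ac"
    s = 'ac' * n + 'e'
    print(len(s), run(s, False), run(s, True))

In this sketch the failing A3 d attempt and the succeeding A4 e
attempt each walk the whole G+ run when memoization is off, so the
descent count grows with the input; with memoization on, the A4 path
finds H at position 0 already in the table.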

It has been my experience with memoization that grammars written by
humans for top-down parsing tend to be filled with constructions that,
memoized, run in linear or near-linear time. In many cases with the
more complex grammars (such as those for NLP), turning memoization off
results in egregiously non-linear performance.


--
Quinn Tyler Jackson
Chief Scientist,
Thothic Technology Partners, LLC


* Memoization can be optimized beyond the trivial.

Armel

Jun 8, 2009, 11:18:30 AM
>[...]

> G -> A1 b / A2 c
> A1 -> a
> A2 -> a
>[...]

> I fail to see how this helps here because 'A1 c' would simply be an
> unknown rule and create the wrong parse tree.

A packrat/memoizing parser would probably help with rules such as:

G -> A b | A c
A -> a

by recognizing the "A" of the second alternative immediately on input
"ac", because it was already matched while trying the first alternative.

Rules such as:

G -> A b | A c
A -> a | a A

could likewise be handled much faster on "aaaaaaaaaaaaaac", for the
same reason: the A of "G -> A c" will be skipped, and the "c" tested
immediately.

I do not think that memoization will do any good in your original case,
because the parser would have to "deduce" that A1 and A2 are indeed the
same rule. A grammar-preparation pass could perform this computation and
substitute one for the other, then disambiguate when the G rule is
accepted, but I think that is beyond simple packrat/memoization.

Regards
Armel

Kay Schluehr

Jun 10, 2009, 2:07:16 AM
I did a little more research, and Bryan Ford's O(n) claim about packrat
parsing refers to work by Birman and Ullman on backtracking parsers
from the 1970s. The paper is behind the IEEE paywall, so I haven't
read it.

http://www2.computer.org/portal/web/csdl/doi/10.1109/SWAT.1970.18

So I now assume it is a true claim: each (rule, position) pair is
parsed at most once, which bounds the total work linearly in the input
length. But one has to expect constant factors that depend on the
structure of the grammar, which makes predicting the actual effort
quite difficult.

I'm interested in memoization techniques myself, for the CFG top-down
parsers I build. Those have quite good left-factorization capabilities
and use techniques analogous to K. Thompson's for parsing regular
expressions, but there are occasions where they fall back into
backtracking mode as well.

Quinn Tyler Jackson

Jun 11, 2009, 10:06:59 AM
Kay Schluehr said:

> I'm interested myself in memoization techniques for the CFG top down
> parsers I build. Those have quite good left factorization capabilities
> and use techniques analog to those of K.Thompson for parsing regular
> expressions but there are occasions where they run into backtracking
> mode as well.

The following Meta-S grammar was constructed to test the Meta-S engine's
memoization system.

grammar X
{
    S ::= #grammar(memoize) // setting this to nomemoize would turn off memoization
          #memo_enter
          (S1 "^") | (S1 "#") | (S1 "@") | (S1 "!") | (S1 "$")
          #memo_leave ; // S

    S1 ::= S2<+>; // S1
    S2 ::= K; // S2
    S3 ::= K; // S3
    Z  ::= ((A [G] B [G] C) | (A [G] B [G] D [G] F) | (A [G] B [G] D [G] E)) "."; // Z
    Y  ::= ((G | A | B | C | D | E | F)<+>) "!"; // Y
    X  ::= ((A [G] B [G] C) | (A [G] B [G] D [G] F) | (A [G] B [G] D [G] E)) "@"; // X
    K  ::= (Y | X | Z) "."; // K
    A  ::= aa; // A
    aa ::= pp pp; // aa
    pp ::= "a"; // pp
    B  ::= "b"; // B
    C  ::= "c"; // C
    D  ::= "d"; // D
    E  ::= "e"; // E
    F  ::= "f"; // F
    G  ::= H | "G"; // G
    H  ::= A; // H
};

Given the input:

aabde..aabdf..aabde..aabde..aabdf..aabde..aabde..aabdf..$

a worst-case backtracking situation is created, since "$", the symbol
terminating the input, is the last alternative of S to be checked.

For the purposes of the following, a "parse step" is an actual
descent/match rather than a memo-table fetch. Obviously a memo fetch is
not computationally free, but one can assume that, with hashing, a memo
fetch takes nearly constant time.
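
For instance (a hypothetical sketch), the fetch path can be nothing
more than a hash-map lookup keyed by (rule, position):

memo = {}

def fetch_or_parse(rule, pos, parse_fn):
    key = (rule, pos)
    if key in memo:
        return memo[key]              # memo fetch: expected O(1) hash lookup
    result = parse_fn(rule, pos)      # an "actual parse step" in the sense above
    memo[key] = result
    return result

print(fetch_or_parse('A', 0, lambda r, p: p + 1))  # parses
print(fetch_or_parse('A', 0, lambda r, p: p + 1))  # fetches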

With memoization turned off, the parse was effected in 1745 actual parse
steps. With memoization turned on, the parse was effected in 17 actual parse
steps. The input was then doubled to:

aabde..aabdf..aabde..aabde..aabdf..aabde..aabde..aabdf..aabde..aabdf..aabde..aabde..aabde..aabdf..aabde..aabde..aabdf..$

With memoization turned off, the number of actual parse steps was 3485. With
memoization turned on, the number of actual parse steps was 17. (As it was
with the shorter string.)

The input was then doubled yet again to:

aabde..aabdf..aabde..aabde..aabdf..aabde..aabde..aabdf..aabde..aabdf..aabde..aabde..aabdf..aabde..aabde..aabdf..aabde..aabdf..aabde..aabde..aabdf..aabde..aabde..aabdf..aabde..aabdf..aabde..aabde..aabdf..aabde..aabde..aabdf..$

Non-memoized parse steps: 6965
Memoized parse steps: 17

To give an idea of times, the last case takes 12 milliseconds
unmemoized and 1 millisecond memoized; the speedup in wall time is
smaller than the reduction in parse steps because memoization is not
computationally free, even though it causes the descent/match logic to
be bypassed.

Wait a second -- the number of parse steps in the memoized version is
constant -- how is that possible? ;-)

Yes, indeed -- the non-memoized version of the parse displays the
expected complexity behavior, but the memoized version displays
constant complexity with this grammar and these inputs.

The enhancements that allow for this behavior are collectively code
named Meta-Janus and the implementation details/algorithms are a
company trade secret, but the point is that memoization indeed can be
a very useful parsing optimization.

--
Quinn Tyler Jackson
Chief Scientist,
Thothic Technology Partners, LLC
