Earley Algorithm things and Marpa things

Rocky

Oct 28, 2017, 5:26:56 PM
to marpa parser
Fairly recently I've been interested in using decompilation as a means of getting more exact location information about where a program is when it is stopped or hits an error. See http://rocky.github.io/NYC-Hackntell/#/ for a 5-minute talk I gave to a very general audience. (I bet you could get through it in 3 minutes though.)

Somewhat mirroring your experience with Perl (vs Python), I figured this out with the help of Perl Monks, and did this first for Perl, via an AST that the interpreter creates.

Having mastered this aspect in Perl, I went on to Python, and that's where I discovered the AH (Aycock/Horspool) Earley algorithm parser from John Aycock. As that was a bit old and unmaintained, I even packaged it for Python.

To get broader interest in this, I am in the process of submitting papers on this aspect to conferences.

Some questions

First, there is the issue of whether I could use Marpa instead of Aycock/Horspool SPARK. Well, one handicap for me here is that I know the Python parser and have mucked with it to give me the debug and trace information that I so sorely need when debugging grammars.

What I'll probably do here is use Marpa in my Perl debugger to parse the various command arguments of the "list", "disassemble", and various "break" commands. I currently have this done as custom code, but I had promised users that I would follow gdb conventions unless there was good reason not to. (Complicated code is no longer a good reason, since this can be easily expressed in a grammar.)

So back to the broader question of Marpa for Python. When I installed Marpa, it looked like it uses Bison/YACC and does a lot of work in C. This kind of thing seems like it is not going to be easy going.

Then there is the broader question of how the Python decompiler currently works. At runtime, for each method, custom rules are created for instructions that take a variable number of arguments.

That is, instead of

   call_expr ::= call | expr CALL | expr expr CALL | expr expr expr CALL | expr expr expr expr CALL

The operand of the CALL opcode gives how many parameters it takes. So  a rule like:


  call_expr ::= call_4
  call_4 ::= expr expr expr expr CALL

is created when we see there is a call with 4 arguments. This cute idea goes back to the earliest versions of the decompiler by John Aycock. I can't find it mentioned in the literature, though.
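
Sketched in code, the idea looks something like this (the Instruction record, the add_rule method, and StubParser are illustrative stand-ins, not SPARK's or uncompyle6's actual API):

    from collections import namedtuple

    # Illustrative instruction record; real bytecode would come from dis.
    Instruction = namedtuple("Instruction", "opname arg")

    def add_custom_call_rules(parser, instructions):
        """Add one call_N rule per CALL arity seen in this method."""
        seen = set()
        for inst in instructions:
            if inst.opname == "CALL" and inst.arg not in seen:
                seen.add(inst.arg)
                n = inst.arg
                parser.add_rule("call_expr ::= call_%d" % n)
                parser.add_rule("call_%d ::= %s CALL" % (n, " ".join(["expr"] * n)))

    class StubParser:
        def __init__(self):
            self.rules = []
        def add_rule(self, rule):
            self.rules.append(rule)

    p = StubParser()
    add_custom_call_rules(p, [Instruction("CALL", 4), Instruction("LOAD_CONST", 0)])
    print(p.rules)
    # ['call_expr ::= call_4', 'call_4 ::= expr expr expr expr CALL']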

And I think somewhere I read that Marpa prefers not to have rules added at runtime, per method. I suspect that if the precompile phase of Marpa is fast enough, this is moot.

In SPARK, I also added the ability to remove grammar rules, since there is a lot of sharing of grammars going on between different Python versions.

Somewhere I read that the Earley parser is cubic. However, ambiguous grammars can have an exponential number of derivations.

In particular, consider this:

  E ::= E E
  E ::= a

Doesn't this amount to the number of ways you can parenthesize an expression of length n, which is the Catalan number C(n-1) and so grows exponentially? Perhaps what they mean by cubic is that the time is cubic in the number of possible derivations, even though the number of derivations for a string of length n may be exponential?
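
For reference, that count has a standard closed form and growth rate (a textbook fact, added here for context):

    C_{n-1} = \frac{1}{n}\binom{2n-2}{n-1} \sim \frac{4^{n-1}}{(n-1)^{3/2}\sqrt{\pi}}

so the growth is indeed exponential in n.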

One other technique of late in SPARK that I've been using is doing additional checks at grammar-reduction time. For example, two expressions might form some special kind of tuple if they are both expressions of constants of some kind. That kind of check might be too cumbersome to encode in the grammar by changing opcodes.
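
The kind of check I mean, as a sketch (the Node record and the callback hookup are illustrative, not SPARK's actual API; a real parser would call the check before performing each reduction):

    from collections import namedtuple

    # Illustrative parse-tree node; a real parser supplies its own type.
    Node = namedtuple("Node", "kind children")

    def check_const_tuple(lhs, children):
        """Return True iff the reduction to `lhs` should be allowed."""
        if lhs == "tuple_expr":
            # Only reduce two expressions to a tuple if both are constants.
            return all(c.kind == "const_expr" for c in children[:2])
        return True  # allow every other reduction

    exprs = [Node("const_expr", ()), Node("name_expr", ())]
    print(check_const_tuple("tuple_expr", exprs))  # False: one operand is not constant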

Does Marpa allow for a programmable check to be made before performing a reduction? 

Going the other way may be beneficial too: adding some Marpa goodness into the SPARK parser. Thoughts about that?

Jeffrey Kegler

Oct 28, 2017, 6:30:38 PM
to Marpa Parser Mailing List
Re Marpa and AH, you might want to look at this earlier reply: https://groups.google.com/d/msg/marpa-parser/eKSSioagswU/uKm_AgxGAQAJ

In creating Marpa, I may have acquired as much experience with the AH algorithm as anybody, possibly even Aycock/Horspool themselves.  I found the special LR(0) items they create to be counter-productive, and ripped them out of the Marpa algorithm, converting it back to a more conventional Earley parser.  But I did keep some other ideas, and I still consider the AH article to be one of the sources of Marpa.  For more, see that link above.

I did not work from SPARK, and I've heard it was buggy, though I don't know.  I do know that the algorithm in the original article has a problem: it can produce two sets of AH items for the same parse, so that unless you go to some trouble to look for this situation, you get duplicate parses for ambiguous grammars.  I successfully worked around this (I don't think SPARK does), but by that time I had enough experience to get numbers on the benefit I was getting for all this additional complexity, and I realized the AH LR(0) states, in practice, just don't produce enough of a payoff.

Marpa does *NOT* use Bison or Yacc or LALR-parsing.  You might have gotten that impression if, when installing Marpa, you also had to compile Perl from source.  Perl does use Bison for its parsing.

Marpa is cubic in the general case for BNF, but so is every other general BNF parser ever implemented.  (Valiant's algorithm is O(n**2.4), but IIRC it has never had even a toy implementation.)  Comparing apples to apples: if another parser handles a well-defined grammar class in linear time, Marpa also parses that grammar class in linear time.

Wrt "checks at reduction time".  Marpa::R2's ASF facility might allow you to do what you have in mind.

I don't think this addressed everything, but I hope it's a helpful start -- jeffrey


Rocky

Oct 28, 2017, 8:37:37 PM
to marpa parser
I had already looked at the link in that earlier reply before my first post, but I don't see how it addresses any of the questions.

With respect to incomplete parses or duplicate ambiguous parses and so on... in my specific situation, I am not sure that matters. In decompilation, ambiguity is pretty much a given. Perhaps in this situation it helps to think of this as akin to human translation: the translation doesn't have to be exact or elegant to be useful.

With respect to time complexity, the question is: the parse time is O(n**2.4) or cubic with respect to exactly what? I am pretty sure that there can be an exponential number of valid derivations or parse trees for E ::= E E | a with respect to the number of tokens in the input.

Although this isn't rigorous, here is the general idea, by induction. Suppose I have a list of all the parses for a**n. When adding another "a", to get new derivations I can tack the "a" onto the beginning as (a (i)) or onto the end as ((i) a). There may be duplicate parenthesizations or parse trees between the two sets, but I suspect that the list of distinct derivations still at least doubles at each step. And when I run some examples by hand, indeed the number of valid parses seems to grow exponentially:

-------------
input aa
(E E)
-------------
input aaa
(E (E E))
((E E) E)
-------------
input aaaa
(E (E (E E)))
(E ((E E) E))
((E E) (E E))
((E (E E)) E)
(((E E) E) E)
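
A quick check of that growth, counting parses bottom-up with dynamic programming (my sketch, not code from the thread):

    def count_parses(n):
        """Number of parse trees for a**n under E ::= E E | a."""
        ways = [0] * (n + 1)
        ways[1] = 1  # the rule E ::= a
        for k in range(2, n + 1):
            # Via E ::= E E, a parse of a**k splits into a left part
            # of length i and a right part of length k - i.
            ways[k] = sum(ways[i] * ways[k - i] for i in range(1, k))
        return ways[n]

    print([count_parses(n) for n in range(1, 8)])
    # [1, 1, 2, 5, 14, 42, 132] -- the Catalan numbers, which grow ~4**n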


I looked again at libmarpa_build or engine, and although there's no Bison used (the parsing of Marpa's grammar rules seems to be described in lib/Marpa/R2/meta/metag.bnf), there does seem to be a lot of C code one way or another. Not necessarily bad, but it is something to be aware of when porting to another programming language.

So in the back of my mind I am still thinking about whether there is good stuff in Marpa that I could add to the Python SPARK, perhaps the Ruby Slippers error handling. And what would it be like to change the decompiler uncompyle6 to use Marpa?

I'll get it sorted out somehow.

Thanks for the information.



Jeffrey Kegler

Nov 12, 2017, 8:45:25 PM
to Marpa Parser Mailing List
> With respect to time complexity, the question is: the parse time is O(n**2.4) or cubic with respect to exactly what?

The tradition is to measure time with respect to n, where n is the length of the input in characters of some alphabet, and the time only includes parsing time, not evaluation time.  This is how Earley's algorithm can be cubic even when the number of parses is super-exponential: the parser builds a table of Earley items whose size is polynomial in n, and never lists the parses themselves -- simply listing every parse would be far worse than cubic.  The idea here is that you don't know how complex or simple an evaluation the application wants, so it would confuse things to include evaluation time.  It's possible, for example, that parsing is simply recognition -- all you want is a "yes" or "no" as to whether the input matches the grammar.
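
To make the recognition-only reading concrete, here is a minimal Earley recognizer for the E ::= E E | a grammar discussed above (an illustrative sketch, not Marpa's code; the naive fixed-point loop makes it merely polynomial, not a tuned cubic implementation):

    def earley_recognize(grammar, start, tokens):
        """True iff `tokens` derives from `start`.  `grammar` maps each
        nonterminal to a list of right-hand sides (tuples); symbols that
        are not keys of `grammar` are terminals."""
        n = len(tokens)
        # An Earley item is (lhs, rhs, dot, origin).
        chart = [set() for _ in range(n + 1)]
        for rhs in grammar[start]:
            chart[0].add((start, rhs, 0, 0))
        for i in range(n + 1):
            changed = True
            while changed:  # run predictor/completer to a fixed point
                changed = False
                for lhs, rhs, dot, origin in list(chart[i]):
                    if dot < len(rhs) and rhs[dot] in grammar:
                        # Predictor: expand the nonterminal after the dot.
                        for alt in grammar[rhs[dot]]:
                            item = (rhs[dot], alt, 0, i)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
                    elif dot == len(rhs):
                        # Completer: advance items waiting on this lhs.
                        for plhs, prhs, pdot, porig in list(chart[origin]):
                            if pdot < len(prhs) and prhs[pdot] == lhs:
                                item = (plhs, prhs, pdot + 1, porig)
                                if item not in chart[i]:
                                    chart[i].add(item)
                                    changed = True
            if i < n:
                # Scanner: consume tokens[i].
                for lhs, rhs, dot, origin in chart[i]:
                    if dot < len(rhs) and rhs[dot] not in grammar and rhs[dot] == tokens[i]:
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[n])

    G = {"E": [("E", "E"), ("a",)]}
    print(earley_recognize(G, "E", "aaaa"))  # True, without listing all 5 parses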

Again, best of luck, jeffrey

Rocky Bernstein

Nov 12, 2017, 9:25:12 PM
to marpa-...@googlegroups.com
Thanks for the explanation. Yes, it makes sense from the perspective of analysis.

But as someone using context-free grammar parsers for practical problems, I need to be concerned with what the guy on the street sees, independent of how messy it might make things for the theorist or analyst.

Recall that I am adapting conventional compiler technology for a slightly different use. By doing this I hope to capture more rigor than I have seen in some (perhaps most) decompilers. One of the weird things in this Alice-through-the-looking-glass world is that instead of designing a language and checking that input strings are valid, we start from a set of strings that are known to be valid and then need to design a grammar that covers them.

And it is okay if the designed language covers too much: it is okay to allow the grammar to recognize or accept strings that will never occur as input.


So in this upside-down world, ambiguous context-free grammars simplify finding a covering set. But we need to pay attention to exponential derivations in designing the grammar. To a large extent, I think various grammar-design rules go a long way toward reducing the possibility of exponential run time (and space).
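
As one illustration of such a design rule (my example): the doubly recursive E ::= E E from earlier in the thread can be replaced by a left-recursive list rule that covers exactly the same strings, a+, but admits only one derivation per input:

  E ::= E a
  E ::= a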


Jeffrey Kegler

Nov 12, 2017, 9:37:46 PM
to Marpa Parser Mailing List
I did look at some of the articles you mentioned.  That idea -- starting with a space of possible parses and narrowing it down -- is what Marpa's ASFs do.  Perhaps you'll find them useful.

Hope this helps, jeffrey


Rocky Bernstein

Nov 12, 2017, 9:48:13 PM
to marpa-...@googlegroups.com
Again thanks. I have a backlog of things to do, but I will look into Marpa ASFs. I still have a lot to learn.