Parsing Timeline Suggestions

117 views
Skip to first unread message

nor...@gmail.com

unread,
Feb 19, 2019, 5:47:01 AM2/19/19
to marpa parser
(Following up from https://twitter.com/jeffreykegler/status/1097628806091816960)

Dear Jeffrey,

I was delighted when I found about your parsing timeline a couple of months ago.
I'm a PhD student researching parsing myself and it's great to see resources
like this bringing awareness to what happened in the field, and it's a great
reference for people in the field as well. It also got me more interested in
Earley and your own work of it, which is quite cool.

That being said, I do have some insatisfactions about the timeline, especially
the end of it. It's always easy to criticize, so I would like to offer my help
instead, to see if I can contribute something useful to the timeline.

According to me, there are two points that could use improvement. First, I feel
top-down recursive-descent formalizations, especially PEG, are being unfairly
dismissed. Second, it seems to me like there are quite a few new developments
that would deserve inclusion at the end of the timeline (especially by
comparison to the level of details towards the start of the timeline).

I don't pretend to have the objective truth, but the goal is to have a
productive conversation that can hopefully lead to improved accuracy.

---

Regarding PEG, I find the whole description minus the first paragraph to be
rather contestable and (I feel) uncharitable.

It's entirely correct that PEG does not cover BNF. But BNF is essentially a
notation for the CFG formalism, which PEG is not.

It seems your main beef with PEG is the problem that has been described as
"language hiding" or "prefix capture": the fact that a rule like `A ::= a | aa`
will not match the input "aa".

This is indeed an important problem of PEGs. But the way I see it, it's a dual
to the problem of ambiguity in CFGs. Both preclude one another: PEGs are
unambiguous but have prefix capture, CFGs do not have prefix capture but are
ambiguous. The detection of both issue is provably intractable (statically) and
may be more related than we know, as a relatively similar machinery can be put
to work to do partial detection of both ambiguity in CFGs and prefix capture in
PEGs, cf. "Modular Syntax Demands Verification" by Sylvain Schmitz (*).

(*) http://www.i3s.unice.fr/~mh/RR/2006/RR-06.32-S.SCHMITZ.pdf

I have written a fair deal of both PEGs and CFGs and both seem about equally
difficult to write. Non-general CFG formalism (LALR, LL) are much harder, and so
are PEG tools that have no support to build left-associative parse trees.

I also would never advise anyone not to test a grammar, whatever the formalism.
I've certainly never written a big chunk of grammar without making a mistake
whatever the formalism. And in practice, the overwhelming majority of these
mistakes weren't about either prefix capture nor ambiguity.

If your practical experience differs, I'd be interested to hear about it.

---

Regarding new developments, here are a couple of ideas of the top of my head.
The first two items of the list are not random, since those were things I worked
on in my research (http://norswap.com/publications/)

- Left-recursion and left-associativity handling in PEG, since it fixes the
  foremost problem (in my opinion) with PEGs

- Context-sensitive parsing

  You already already mention monadic parsing (but do not emphasize its
  context-sensitive properties). Another oldie that deserves to be mentionned
  are Prolog-style definite clause grammars (DCGs).

  There are some other works worth mentioning if this is a topic of interest for
  you.

  My own work in the area is essentially pushing an idea which expressed in a
  lovely manner:

  > Recursive descent does have a huge advantage, one which, despite its severe
  > limitations, will save it from obsolescence time and again. Hand-written
  > recursive descent is essentially calling subroutines. Adding custom
  > modification to recursive descent is very straight-forward.

  But making it so that these subroutines may manipulate state in a way that is
  safe in the presence of backtracking, enabling context-sensitive parsing.

- Packrat parsing, if only because it enables O(1) parsing of PEG, a class of
  languages that may still strictly include the whole of CFG — as far as I know,
  we do not know of a simple language that can be expressed in CFG but not in
  PEG (we do know language expressible in PEG but not in CFGs).

  Note that, even if PEG does indeed include CFG, that does not imply the
  existance of an algorithm to convert from one to the other.

- Tackling the problem of ambiguity formally,
  "Disambiguating Grammars with Tree Automata"
  https://michaeldadams.org/papers/disambiguating-grammars/

- GLL parsing probable deserves a mention for mention for taking a LL-style
  machinery to the point of finally parsing fully general CFGs through the use
  of Tomita-style graph structured stacks
  http://www.cs.rhul.ac.uk/research/languages/csle/GLLparsers.html

- Similarly, and while I'm not a fan of the way this work is marketed, parsing
  using the Brzozowski derivative is certainly an interesting direction in
  parsing: http://matt.might.net/articles/parsing-with-derivatives/

- This might be outside the scope of your timeline, but there is also a whole
  lot of work (some quite old) on "semi-parsing", i.e. fuzzy or partial parsing:
  http://grammarware.net/text/2014/semiparsing.pdf

---

There you have it. Let me know what you think, and I'm looking forward to
talking with you! I hope I can help the project in any capacity that I'm able.

Cheers,

Nicolas LAURENT

les...@dubiel.pl

unread,
Feb 19, 2019, 9:08:00 AM2/19/19
to marpa parser

If PEGs are not safe to use, because one not always know what language it is parsing, then why:

-- Larry Wall went for PEGs in Perl6
-- why Roman Radziejowski created Mouse parser that is said to:


"In view of these facts, it is useful to construct PEG parsers where the complexity of packrat technology is abandoned in favor of simple and transparent design. This is the idea of Mouse: a parser generator that transcribes PEG into a set of recursive procedures that closely follow the grammar. "




Nicolas Laurent

unread,
Feb 19, 2019, 12:28:04 PM2/19/19
to marpa-...@googlegroups.com
Hi Leszek,

While I do appreciate the support for PEGs, I feel obligated to point out that appealing to Larry Wall's authority is perhaps not the best way to show that they are very usable in practice.

And I'm not sure what Redziejowski's quote shows with that regard.

That being said, PEGs are definitely "safe to use" and "well-defined". I expect the dispute is purely about how "intuitive" the behaviour is.

The big difference with CFGs is that the semantics of PEGsis recognition-based (i.e., PEG grammars formalizes a recognizer) instead of generative (CFG grammars can generate by sentences by recursive substitution of non-terminal by their right-hand side).

I don't feel like being overly harsh on people who say CFGs are more intutive, because I used to say that PEGs were more intuitive and proceeded to annoy a few people that way. And I understand them! Honestly, after working with both formalism, they're really about identical in terms of difficulty and pitfalls.

What *I expect* is that it mostly depends on what one is used to. I dabbled in hand-written recursive-descent parsers, so PEGs felt more intuitive, while people who are used to writing CFGs would probably say they are more intuitive.

In any case, it shouldn't be the place of a history to make this subjective judgement, especially when the success of PEG libraries show that it isn't a consensus opinion, nor maybe a majority opinion — I reckon most people are pretty neutral on the whole matter and don't care about parsing quite as much as we weird people do.

Cheers,

Nicolas LAURENT


--
You received this message because you are subscribed to a topic in the Google Groups "marpa parser" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/marpa-parser/8EEq92TjR4E/unsubscribe.
To unsubscribe from this group and all its topics, send an email to marpa-parser...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeffrey Kegler

unread,
Feb 19, 2019, 7:53:31 PM2/19/19
to Marpa Parser Mailing LIst
Thanks for your kind words.  In this, I'll just respond to the PEG points.

That part of my timeline has caused more blowback than any other part, not unexpectedly.  The quips are aimed, not at people like yourself who, while PEG optimists, know its difficulties and do not mislead others.  Unfortunately, at the time I wrote that version of the timeline (and still I suspect) you are the minority of pro-PEG voices out on the Web.

Many presentations of PEG present it as a "throw anything at me" solution.  These less careful voices were the loudest and cause a lot of folks to waste time pushing a rope.  My quips were aimed to catch the attention of these folks and they did their job.

I've done this before, with yacc, for the same reason, and caught the same kind of heat.  People routinely were recommending yacc for "serious parsing" either in ignorance or reckless disregard of the actual experience with it.  So I did a "bovicide" series of blog posts.  Nowadays I don't diss yacc, because it's no longer the danger to parsing newbies that it was.

If you read my entry carefully, and get beyond the quips (as I am sure you did), you see that I *do* say there are safe uses of PEG, and in the footnote concede that my description in the text is "negative".  I go on to point the reader toward the pro-PEG literature -- a literature of which the less careful PEG advocates are usually unaware.

So to my careless reader, a warning about PEG gets across, and to my careful reader, I point out that, while I am more optimistic about Earley/Leo, there is safe use of PEG and a literature worth consulting.

In fact, I think there are 5 "forever" algorithms, parsing algorihms that are of permanent significance, and PEG is one of them.  (The other 4 are regular expressions, recursive descent, Earley/Leo and Sakai's algorihm, aka CYK).

Getting to the merits, I've heard before that some find the PEG syntax more intuitive than BNF.  Everyone is the ultimate authority on their own intuitions, so I won't dispute this.  And you are more optimistic about research into PEG than I am, clearly.  But as Yogi Berra says, "predictions are difficult, especially about the future".  My timeline does not engage in predictions, although it certainly reflects my optimism wrt Earley/Leo.  I do claim to be entitled to greater confidence in my predictions on the grounds, at 65, I am much less likely to be around if they fall flat. :-)

I'll tackle the points about why the timeline includes/excludes what it does separately.

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.

nor...@gmail.com

unread,
Feb 27, 2019, 5:12:36 PM2/27/19
to marpa parser
Thanks for your answer!

I do mostly agree, PEG is not a cure-all. But to make my opinion clear (and I do
like to think it's an informed opinion rooted in experience) neither is it
significantly worse in usability than CFGs.

I must admit I missed that note in the footnotes. It doesn't do much for me in
terms of nuance, honestly. It's in a reference after a phrase that reads "PEG,
in safe use, would essentially be LL(1)". If that's the part of saying that "PEG
has safe uses" then it isn't nuance, but rather a judgement that says that PEG
is either dangerous or crippled (as LL(1) is).

It all comes back to this idea that PEG is "dangerous". I'm open to discussing
that — what I currently think is that it really means that it is **unintuitive**
to some people (as I do explain in my answer to Leszek above).

One of the (multiple) things that make me believe this is precisely this
reference you cited in support of "PEG, in safe use, would essentially be
LL(1)". I think this is a gross misrepresentation of the claims of the paper.
I'd stake 99% confidence that if you'd ask the authors if the paper says or
shows that, they'd reply with an emphatic "no".

Consider the abstract:

> Context-Free Grammars (CFGs) and Parsing Expression Grammars (PEGs)have
> several similarities and a few differences in both their syntax and semantics,
> but they are usually presented through formalisms that hinder a proper
> comparison. In this paper we present a new formalism for CFGs that highlights
> the similarities and differences between them. The new formalism borrows from
> PEGs the use of parsing expressions and the recognition-based semantics. We
> show how one way of removing non-determinism from this formalism yields a
> formalism with the semantics of PEGs. We also prove, based on these new
> formalisms, how LL(1) grammars define the same language whether interpreted as
> CFGs or as PEGs, and also show how strong-LL(k), right-linear, and LL-regular
> grammars have simple language-preserving translations from CFGs to PEGs. Once
> these classes of CFGs can be automatically translated to equivalent PEGs, we
> can reuse classic top-down grammars in PEG-based tools.

This essentially says that LL(1) grammars are the subset of both PEG and CFG
that share the same behaviour.

Hence your definition of "safe" seems to be "behaving similarly to CFGs".

But like I said, and you said in turn, everything is entitled to their
own opinions.

I do however think that if a sizeable portion of the parsing community would
agree that PEG are valuable, it might be worthwhile to point out that your
opinion is in fact, an opinion that does is not consensual by any stretch.

Similarly, why not mention some work building on PEG? In particular
left-recursion and packrat parsing. Even if you do think they build on a flawed
base, certainly we can agree these results are very significant for those of us
who do use PEG?

About quips, while I don't mind them in general, they seem to me here to be
precisely an example of the "rope-pulling" that you decry. Wouldn't it be more
high and mighty to simply list facts and publications and let people form their
own opinion?

Now, why do I care? I don't mind much about people being "wrong on the internet"
(™) — that ship sailed long ago — but the idea of a parsing timeline is
exciting to me. I'd love to recommend it and advertise it, but I find the bias
towards the end prevents me to do so. I think many people might share this
opinion.

The way I tend to perceive "timelines" and other "histories" is as somewhat
authoritative and unbiased — at least, that's the platonic ideal. Think about
the standard that Wikipedia tries to enforce for instance. Wikipedia does not
only list consensus — opinions and disputes are relevant, after all — but
clearly does mark them as such.

Do you think there is a way for the timeline to move in that direction?

Jeffrey Kegler

unread,
Feb 27, 2019, 10:12:52 PM2/27/19
to Marpa Parser Mailing LIst
Your PEG work should be interesting and I plan to follow it.  You made quite a number of other significant (non-PEG) points and I am aware that I still owe you responses to them.  That's underway.  It's taking some time because I am doing some reading for it.

best, jeffrey

Jeffrey Kegler

unread,
Mar 29, 2019, 7:11:23 PM3/29/19
to marpa parser
@Nicholas: Just to confirm that I have not forgotten and that I *do* intend to reply to the other points in your post.  But since you're not the only one who raises some of these questions, I thought it worthwhile to do a careful, well-researched job of it, so I've been doing a bit of reading.  Not too much longer, I hope.

Thanks, jeffrey
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser+unsubscribe@googlegroups.com.

Jeffrey Kegler

unread,
Apr 1, 2019, 10:04:04 AM4/1/19
to Marpa Parser Mailing LIst
@Nicholas: I've posted the 2nd part of my reply to my blog: "Sherlock Holmes and the Case of the Missing Parsing Solution".

Many of your questions involved the timeline's methodology and I felt I had to clarify myself on that issue first.  Hopefully this post does that.

I still owe you answers to some of the specific questions.



On Wed, Feb 27, 2019 at 5:12 PM <nor...@gmail.com> wrote:
Reply all
Reply to author
Forward
0 new messages