implementation for Parsing Expression Grammar?

xah...@gmail.com

unread,

May 10, 2008, 1:52:30 AM5/10/08

to

In the past weeks i've been thinking over the problem on the practical
problems of regex in its matching power. For example, often it can't
be used to match anything of nested nature, even the most simple
nesting. It can't be used to match any simple grammar expressed by
BNF. Some rather very regular and simple languages such as XML, or
even url, email address, are not specified as a regex. (there exist
regex that are pages long that tried to match email address though)

I wrote out a more elaborate account of my thoughts here:
http://xahlee.org/cmaci/notation/pattern_matching_vs_pattern_spec.html

----------------

After days of researching this problem, looking into parsers and its
theories etc, today i found the answer!!

What i was looking for is called Parsing Expression Grammar (PEG).

See
http://en.wikipedia.org/wiki/Parsing_expression_grammar

It seems to me it's already in Perl6, and there's also a
implementation in Haskell. Is the perl6 PEG is in a usable state?

Thanks.

Xah
x...@xahlee.org
∑ http://xahlee.org/

☄

Lars Rune Nøstdal

unread,

May 10, 2008, 6:36:38 AM5/10/08

to

Hi,
Finite automata works for "nested things".

http://en.wikipedia.org/wiki/Automata_theory

--
Lars Rune Nøstdal
http://nostdal.org/

Kay Schluehr

unread,

May 10, 2008, 9:41:36 AM5/10/08

to

On 10 Mai, 07:52, "xah...@gmail.com" <xah...@gmail.com> wrote:
> In the past weeks i've been thinking over the problem on the practical
> problems of regex in its matching power. For example, often it can't
> be used to match anything of nested nature, even the most simple
> nesting. It can't be used to match any simple grammar expressed by
> BNF. Some rather very regular and simple languages such as XML, or
> even url, email address, are not specified as a regex.

Well formed XML cannot be fully specified within BNF as well because
it is context sensitive: in order to recognize a tag/endtag pair one
has to maintain a stack. That's not a big deal in practice if one
wants to write an XML parser but one can't use an arbitrary LL or LR
parser generator to produce a parse tree representing the XML.

Barb Knox

unread,

May 10, 2008, 7:25:43 PM5/10/08

to

In article <48257ab7$0$29576$c83e...@nn1-read.tele2.net>,

Lars Rune Nøstdal <larsn...@gmail.com> wrote:

> Hi,
> Finite automata works for "nested things".

Only in the special case when the depth of nesting is bounded ahead of
time. If it's unbounded then there is an unbounded amount of "stack"
information that the automaton needs to remember, therefore a finite
automaton cannot do it.

<pedantry>
That should be "Finite automata WORK ...", since "automata" is plural.
</pedantry>

> http://en.wikipedia.org/wiki/Automata_theory

George Neuner

unread,

May 10, 2008, 9:26:54 PM5/10/08

to

On Fri, 9 May 2008 22:52:30 -0700 (PDT), "xah...@gmail.com"
<xah...@gmail.com> wrote:

>In the past weeks i've been thinking over the problem on the practical
>problems of regex in its matching power. For example, often it can't
>be used to match anything of nested nature, even the most simple
>nesting. It can't be used to match any simple grammar expressed by
>BNF. Some rather very regular and simple languages such as XML, or
>even url, email address, are not specified as a regex. (there exist
>regex that are pages long that tried to match email address though)

What's your point? The limitations of regular expressions are well
known.

>After days of researching this problem, looking into parsers and its
>theories etc, today i found the answer!!
>
>What i was looking for is called Parsing Expression Grammar (PEG).

PEG has its own problems - it's very easy with PEG to create subtly
ambiguous grammars for which quite legal looking input is rejected.
And there are no good tools to analyze a PEG and warn you of subtle
problems.

Chris Clark (YACC++) has posted at length about the merits, problems
and limitations of various parse techniques - including PEG - in
comp.compilers. Before you consider doing anything with PEG I suggest
you look up his posts and read the related threads.

>It seems to me it's already in Perl6, and there's also a
>implementation in Haskell. Is the perl6 PEG is in a usable state?
>
>Thanks.
>
> Xah
> x...@xahlee.org

>? http://xahlee.org/

George
--
for email reply remove "/" from address

Robert Maas, http://tinyurl.com/uh3t

unread,

May 11, 2008, 7:00:39 PM5/11/08

to

(You posted to too many newsgroups, trimmed.)

18TH MODEM DROPPAGE TODAY AT 15:15, SINCE FIRST LOGIN TODAY AT 10:49.
(During one of my earlier messages I lost count by one, sorry.)
MEAN TIME BETWEEN MODEM FAILURES <15 MINUTES.

> From: "xah...@gmail.com" <xah...@gmail.com>
> In the past weeks i've been thinking over the problem on the
> practical problems of regex in its matching power. For example,
> often it can't be used to match anything of nested nature, even the
> most simple nesting. It can't be used to match any simple grammar
> expressed by BNF. Some rather very regular and simple languages
> such as XML, or even url, email address, are not specified as a
> regex. (there exist regex that are pages long that tried to match
> email address though)

A couple years ago I wrote a table-driven parser in Lisp that could
deal with nested stuff to some degree. The result of the parse was
a structure that matched the parse tree, so that further processing
could be done on that result. My test case was Received lines in spam.
The syntax was essentially the properly-structured version of
something like a regex crossed with BNF.

> I wrote out a more elaborate account of my thoughts here:
> http://xahlee.org/cmaci/notation/pattern_matching_vs_pattern_spec.html

Looking there now ...

For example, in computing, one would like to say that email address has
the form xyz, where xyz is a perl regex. (e.g. "[A-z0-9]+@[A-z]+\.com")

Per my system, that would be something more like
(:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z) (:RANGE 0 9)))
(:EXACTLY "@")
(:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z)))
(:EXACTLY ".com")
I forget how I specified characters, such as in ranges, so I left
that unspecified there. But I'm pretty sure I used strings when
specifying exact text, so I put quote marks in there. No, I don't
feel like looking up the details right now. My general idea above
should be enough for your immediate interest.

19TH MODEM DROPPAGE TODAY AT 15:34, SINCE FIRST LOGIN TODAY AT 10:49.
(During one of my earlier messages I lost count by one, sorry.)
MEAN TIME BETWEEN MODEM FAILURES <15 MINUTES.

It is quite desirable to have a general grammar language, designed in
a human-readible way, and concise. With such a language, we could use
it to verify if a text is a valid form. We could use it for human
communication.

I think my nested notation is sufficiently clear. Also, typically I
break my pattern into multiple production rules, with a meaningful
name for each. Thus the above might be instead:

Address = (User Atsign Host)
User = (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z) (:RANGE 0 9)))
Atsign = (:EXACTLY "@")
Host = (Hostname Dotcom)
Hostname = (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z)))
Dotcom = (:EXACTLY ".com")

20TH MODEM DROPPAGE TODAY AT 15:39, SINCE FIRST LOGIN TODAY AT 10:49.

> After days of researching this problem, looking into parsers and its
> theories etc, today i found the answer!!
> What i was looking for is called Parsing Expression Grammar (PEG).
> See
> http://en.wikipedia.org/wiki/Parsing_expression_grammar

Looking at that now ...

Unlike CFGs, PEGs are not ambiguous; if a string parses, it has
exactly one valid parse tree.

When using BNF or other formal specification of a grammar (syntax),
you have to be careful not to present overlapping rules, or you
need a precedence that one rule applies before another if both
match. Is the claim that with PEGs there is no such problem in the
first place??

+ Ordered choice: e[1] / e[2]

What if both e[1] and e[2] match exactly the same sub-string?
That would seem to me to create an ambiguity as to how to parse
that sub-string. How does a PEG resolve this ambiguity??

The choice operator e[1] / e[2] first invokes e[1], and if e[1]
succeeds, returns its result immediately. Otherwise, if e[1] fails,
then the choice operator backtracks to the original input position at
which it invoked e[1], but then calls e[2] instead, returning e[2]'s
result.

OK, there's the answer, linear precedence. Basically they've
re-invented BNF with the caveat that the production rules are in
linear sequence and earlier rules override later rules, using the
choice operator to syntactically condense multiple sequential
production rules into a single mega-rule with multiple sequential
clauses. (Actually BNF already had that condensation IIRC, it just
didn't have the linear precedence on clauses.)

I'd need to look up whether my own structured parse-grammar used
precedence too. Maybe some other day if anybody asks.

21ST MODEM DROPPAGE TODAY AT 16:01, SINCE FIRST LOGIN TODAY AT 10:49.

22ND MODEM DROPPAGE TODAY AT 16:03.

George Neuner

unread,

May 12, 2008, 3:00:48 AM5/12/08

to

On Sun, 11 May 2008 16:00:39 -0700,
usenet2.3...@SpamGourmet.Com (Robert Maas,
http://tinyurl.com/uh3t) wrote:

PEGs ca have arbitrary length ambiguous prefixes so they can have
complexity approaching CFG. The reason for staying within LR(1) (or
the overlapping LL(k)) whenever possible is to keep ambiguity to
reasonable levels - even people start to have problems with more
complex languages.

--