
Is pyparsing really a recursive descent parser?


Just Another Victim of the Ambient Morality

Nov 2, 2007, 2:05:13 AM
Is pyparsing really a recursive descent parser? I ask this because
there are grammars it can't parse that my recursive descent parser would
parse, should I have written one. For instance:


from pyparsing import *

grammar = OneOrMore(Word(alphas)) + Literal('end')
grammar.parseString('First Second Third end')


Amazingly enough, this will fail to parse!
Now, maybe I have no idea what I'm talking about but shouldn't a
recursive descent parser recursively descend through all possible rule
combinations to discover one that works? Why does pyparsing fail to parse
this? Is there a way to get pyparsing to parse a grammar like this?
Thank you...


Marc 'BlackJack' Rintsch

Nov 2, 2007, 6:47:13 AM
On Fri, 02 Nov 2007 06:05:13 +0000, Just Another Victim of the Ambient
Morality wrote:

> Is pyparsing really a recursive descent parser? I ask this because
> there are grammars it can't parse that my recursive descent parser would
> parse, should I have written one. For instance:
>
>
> from pyparsing import *
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')
> grammar.parseString('First Second Third end')
>
>
> Amazingly enough, this will fail to parse!
> Now, maybe I have no idea what I'm talking about but shouldn't a
> recursive descent parser recursively descend through all possible rule
> combinations to discover one that works? Why does pyparsing fail to parse
> this?


Pyparsing is not a recursive descent parser. It doesn't go back in the
input stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end', and
when it can't get more, the parser moves on to the ``Literal('end')`` part,
which fails because the 'end' is already gone.
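In plain Python, that greedy behaviour looks something like this (an illustration only, not pyparsing's actual code):

```python
# Illustration only, not pyparsing's real implementation: a greedy
# repetition with no backup "eats" the final 'end' before the literal
# is ever tried.
def parse_greedy(tokens):
    i = 0
    # OneOrMore(Word(alphas)): greedily consume every alphabetic token,
    # including 'end', which is itself a perfectly good Word(alphas).
    while i < len(tokens) and tokens[i].isalpha():
        i += 1
    # Literal('end'): too late, the 'end' token is already consumed.
    return i < len(tokens) and tokens[i] == 'end'

print(parse_greedy('First Second Third end'.split()))  # False
```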

> Is there a way to get pyparsing to parse a grammar like this?

Negative lookahead maybe:

grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
           + Literal('end'))

Ciao,
Marc 'BlackJack' Rintsch

Paul McGuire

Nov 2, 2007, 8:47:38 AM
On Nov 2, 5:47 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
>
> Pyparsing is no recursive descent parser. It doesn't go back in the input
> stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when it
> can't get more, the parser moves to the ``Literal('end')`` part which
> fails because the 'end' is already gone.
>
> > Is there a way to get pyparsing to parse a grammar like this?
>
> Negative lookahead maybe:
>
> grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
> + Literal('end'))
>
> Ciao,
> Marc 'BlackJack' Rintsch

Well I'll be darned! All this time, I thought "recursive descent"
described the recursive behavior of the parser, which pyparsing
definitely has. I never knew that backing up in the face of parse
mismatches was a required part of the picture.

In pyparsing, each construction gets composed from more fine-grained
constructions, and they are called recursively to match or not match
against the input string.

For example, taking your working parser example:

grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
           + Literal('end'))

This creates the following data structure:

- And
  - OneOrMore
    - And
      - NotAny
        - Literal('end')
      - Word(alphas)
  - Literal('end')

Every instance in this structure derives from the base class
ParserElement, which defines a method parse(). parse() in turn calls
preParse(), parseImpl(), and postParse(). If parseImpl succeeds, it
returns a ParseResults object and the next location within the input
string to continue parsing.

The parseImpl methods are most often overridden in the subclasses (a
few override postParse as well), such as:
- And.parseImpl invokes parse() (note the recursion) on each of the
expressions in its list. All must succeed or And.parseImpl throws an
exception.
- OneOrMore.parseImpl invokes parse() on its contained expression,
which must succeed at least once; if not, the exception is rethrown.
If the contained expression succeeds once, then its parse() method is
called again and again until it fails, but no exceptions are rethrown,
since only one match was actually required.
- NotAny inverts the success/failure of its contained expression. If
the expression's parse() method succeeds, NotAny.parseImpl throws an
exception. If the contained expression's parse() method throws an
exception, NotAny returns successfully. (NotAny does not advance the
parse location, nor does it return any tokens.)
- Literal and Word are terminals, in that they do not invoke any other
expression's parse() method. Literal.parseImpl tests whether its
string exists at the current parse location, and throws an exception
if it doesn't. Word.parseImpl tests whether the current parse
location contains a letter in the Word instance's set of valid initial
characters - if so success; if not, throws an exception. It then
advances through the input string, matching characters in the Word
instance's set of valid body characters. The entire matched string is
then returned, along with an updated string index at which to continue
parsing.
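A toy model of this structure (not pyparsing's actual classes, which also handle whitespace skipping, ParseResults, and much else) makes the recursion concrete:

```python
# Toy model of the recursive parse() structure described above; not
# pyparsing's real classes.  Failure is signalled by exception, and
# parse() recursively calls parse() on contained expressions.
class ParseException(Exception):
    pass

class Literal:
    def __init__(self, s):
        self.s = s
    def parse(self, text, loc):
        if text.startswith(self.s, loc):
            return loc + len(self.s), [self.s]
        raise ParseException('expected %r at %d' % (self.s, loc))

class And:
    def __init__(self, exprs):
        self.exprs = exprs
    def parse(self, text, loc):
        tokens = []
        for e in self.exprs:              # all must succeed, in sequence
            loc, t = e.parse(text, loc)   # note the recursion
            tokens += t
        return loc, tokens

class OneOrMore:
    def __init__(self, expr):
        self.expr = expr
    def parse(self, text, loc):
        loc, tokens = self.expr.parse(text, loc)  # must match at least once
        while True:
            try:
                loc, t = self.expr.parse(text, loc)
                tokens += t
            except ParseException:
                break        # later failures are fine: one match sufficed
        return loc, tokens

# The greedy repetition eats the trailing 'a', so the And fails --
# the same behaviour as the grammar in the original post.
g = And([OneOrMore(Literal('a')), Literal('a')])
try:
    g.parse('aaa', 0)
except ParseException as e:
    print('parse failed:', e)
```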

In my concept of "recursive descent" parsing, I was under the
impression that pyparsing's use of this data structure, and
*recursive* calls of parse() as it *descends* through the data
structure, constituted a recursive descent parser. What the OP
requested was a more regular expression-ish matcher, with backup (or
backtracking).

Your example did not include either of the alternation classes,
MatchFirst or Or. These classes implement a form of backtracking on
the stack: if we descend into an expression on the list of
alternatives, and that expression fails, then MatchFirst or Or will
back up to the same starting location and try the next alternative.
(MatchFirst is an "eager" matcher, in that it will pick the first
matching expression in its list - Or is an "exhaustive" matcher, in
that it will evaluate all of the alternatives, and select the longest
match.)
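The eager/exhaustive distinction can be sketched over plain string alternatives (a toy illustration, not pyparsing code):

```python
# Toy sketch of eager (MatchFirst-style) vs. exhaustive (Or-style)
# alternation over literal string alternatives; not pyparsing code.
def match_first(alternatives, s):
    # Eager: return the first alternative that matches a prefix of s.
    for alt in alternatives:
        if s.startswith(alt):
            return alt
    return None

def match_longest(alternatives, s):
    # Exhaustive: try every alternative and keep the longest match.
    matches = [alt for alt in alternatives if s.startswith(alt)]
    return max(matches, key=len) if matches else None

alts = ['in', 'int', 'integer']
print(match_first(alts, 'integer x'))    # in
print(match_longest(alts, 'integer x'))  # integer
```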

The Wikipedia entry for "Recursive Descent Parser" describes this
parser model as a "predictive parser", and later goes on to say that
some (uncited) authors equate "predictive parser" with "recursive
descent parsers". The article makes a special distinction for a
"recursive parser with backup", which is what I believe the OP was
asking for. Yes, pyparsing is definitely *not* a "recursive descent
parser with backup." The solution, as you correctly posted, is to add
a negative lookahead. NotAny is also represented using the '~'
operator.

By the way, there are a couple of other places where pyparsing
diverges from the standard model:
- implicit whitespace skipping
- user-definable comment skipping
- attachment of parse actions to expressions
These features, while atypical, provide some added capability and
ease-of-use in creating quick and simple parsers.

So I will take my stance with the uncited authors who lump predictive
parsers in with recursive descent parsers.

-- Paul

Just Another Victim of the Ambient Morality

Nov 3, 2007, 1:33:41 AM

"Paul McGuire" <pt...@austin.rr.com> wrote in message
news:1194007658.2...@57g2000hsv.googlegroups.com...

> On Nov 2, 5:47 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
>>
>> Pyparsing is no recursive descent parser. It doesn't go back in the
>> input
>> stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when
>> it
>> can't get more, the parser moves to the ``Literal('end')`` part which
>> fails because the 'end' is already gone.
>>
>> > Is there a way to get pyparsing to parse a grammar like this?
>>
>> Negative lookahead maybe:
>>
>> grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
>> + Literal('end'))
>>
>> Ciao,
>> Marc 'BlackJack' Rintsch
>
> Well I'll be darned! All this time, I thought "recursive descent"
> described the recursive behavior of the parser, which pyparsing
> definitely has. I never knew that backing up in the face of parse
> mismatches was a required part of the picture.

It has recursion in it but that's not sufficient to call it a recursive
descent parser any more than having a recursive implementation of the
factorial function. The important part is what it recurses through...

Thank you for the detailed description of pyparsing's implementation.


> In my concept of "recursive descent" parsing, I was under the
> impression that pyparsing's use of this data structure, and
> *recursive* calls of parse() as it *descends* through the data
> structure, constituted a recursive descent parser. What the OP
> requested was a more regular expression-ish matcher, with backup (or
> backtracking).

In my concept of "recursive descent" parsing, I was under the impression
that one should recurse through all rule combinations to ensure that the
grammar is fully applied. As I have mentioned before, merely having
recursion in your algorithm is insufficient. What you recurse through is
key. pyparsing recurses through rules but what's important is to recurse
through rule combinations. Otherwise, the parser won't be correct. That is
to say, there will be strings that are grammatically correct for which the
parser will fail to parse. Honestly, what is the point of a recursive
descent parser if it doesn't parse correct strings? If you're willing to
have correct strings unparsed, you might as well go for LALR parsing, or
some such...
In my opinion, the rule set I mentioned in my original post:


grammar = OneOrMore(Word(alphas)) + Literal('end')


...should be translated (internally) to something like this:


word = Word(alphas)
grammar = Forward()
grammar << ((word + grammar) | (word + Literal('end')))


This allows a recursive search to find that the string is correct without
any need for "backtracking," if I understand what you mean by that.
Couldn't pyparsing have done something like this?
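Hand-rolled over a token list, the rewritten rule behaves like this (a sketch; note that trying the second alternative when the recursive one fails is itself the same "try the next alternative from the same spot" behaviour that MatchFirst provides):

```python
# A hand-rolled sketch of the rewritten rule
#     grammar <- (word grammar) | (word 'end')
# over a token list.
def word(tokens, i):
    """Match one alphabetic token; return the next index, or None."""
    return i + 1 if i < len(tokens) and tokens[i].isalpha() else None

def grammar(tokens, i=0):
    j = word(tokens, i)
    if j is None:
        return False
    # Alternative 1: word followed by grammar (the recursive case).
    if grammar(tokens, j):
        return True
    # Alternative 2: word followed by the literal 'end' (the terminal case).
    return j < len(tokens) and tokens[j] == 'end'

print(grammar('First Second Third end'.split()))  # True
print(grammar('First Second Third'.split()))      # False
```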


> Your example did not include either of the alternation classes,
> MatchFirst or Or. These classes implement a form of backtracking on
> the stack, that if we descend into an expression on the list of
> alternatives, and that expression fails, then MatchFirst or Or will
> back up to the same starting location and try the next alternative.
> (MatchFirst is an "eager" matcher, in that it will pick the first
> matching expression in its list - Or is an "exhaustive" matcher, in
> that it will evaluate all of the alternatives, and select the longest
> match.)

I guess I was hoping that I could simply specify, mathematically, a
grammar and have the software Do The Right Thing(tm)...


> The Wikipedia entry for "Recursive Descent Parser" describes this
> parser model as a "predictive parser", and later goes on to say that
> some (uncited) authors equate "predictive parser" with "recursive
> descent parsers". The article makes a special distinction for a
> "recursive parser with backup", which is what I believe the OP was
> asking for. Yes, pyparsing is definitely *not* a "recursive descent
> parser with backup." The solution, as you correctly posted, is to add
> a negative lookahead. NotAny is also represented using the '~'
> operator.
>

> So I will take my stance with the uncited authors who lump predictive
> parsers in with recursive descent parsers.

Well, the good thing about Wikipedia is that, if you don't like the
definition on the page, you're welcome to change it! Although, I'd
recommend discussing it on the talk page before you do, just to make sure
there isn't a good reason for the page to be as it is...
Thank you...


Kay Schluehr

Nov 3, 2007, 4:32:23 AM
On Nov 3, 6:33 am, "Just Another Victim of the Ambient Morality"
<ihates...@hotmail.com> wrote:
I think the confusing aspect of pyparsing, for someone like me coming
from an (E)BNF and formal language background, is that scanning and
parsing are merged into one. pyparsing is scannerless: where a
tokenizer such as Python's performs a longest-match tokenization and
the parser then interprets CFGs correctly (within the bounds of the
parser's power), in pyparsing one has to disambiguate grammars by hand
where you would expect a CFG to be established.

Note that I don't like pyparsing's approach, and it is clearly not for
everyone. I rather tend to find that token definitions can be reused
across different languages. I count 65 tokens for Python, and I added
10 tokens for an experimental C++ parser and deactivated some others.
The distance between both languages on the lexical level is not that
huge.

What pyparsing has established in the Python community is something
different IMO: a readable and verbose pattern-matching syntax which is
not regexp line noise. Around 8 years ago, when I started using Python,
I found a little module called reverb.py. It was actually nothing but a
front end for regular expressions. It never made a career, but I think
it is the only module that survived the anagrammatical release hopping
from 1.5.2 to 2.5.1 in my own case:

>>> from reverb import *
>>> p = required(required(alphanum)+whitespace)+text("end")
>>> p
<reverb.raw instance at 0x00C0D260>
>>> RE(p).pattern
'(?:\\w+\\s)+end'
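For comparison, the generated pattern really does accept the string that tripped up the pyparsing grammar, because the regex engine backtracks (stdlib `re` only):

```python
# The pattern reverb produced above, fed straight to the stdlib re
# module: the regex engine's backtracking gives up the last \w+ match
# so that the trailing literal 'end' can succeed.
import re

pattern = re.compile(r'(?:\w+\s)+end')
m = pattern.fullmatch('First Second Third end')
print(m.group(0) if m else None)  # First Second Third end
```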


Paul McGuire

Nov 3, 2007, 11:49:31 AM
On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
<ihates...@hotmail.com> wrote:
> It has recursion in it but that's not sufficient to call it a recursive
> descent parser any more than having a recursive implementation of the
> factorial function. The important part is what it recurses through...

<snip>

> In my opinion, the rule set I mentioned in my original post:
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')
>
> ...should be translated (internally) to something like this:
>
> word = Word(alphas)
> grammar = Forward()
> grammar << ((word + grammar) | (word + Literal('end')))
>
> This allows a recursive search to find the string correct without any
> need for "backtracking," if I understand what you mean by that. Couldn't
> pyparsing have done something like this?
>

Dear JAVotAM -

This was a great post! (Although I'm not sure the comment in the
first paragraph was entirely fair - I know the difference between
recursive string parsing and recursive multiplication.)

You really got me thinking more about how this recursion actually
behaves, especially with respect to elements such as OneOrMore. Your
original question is really quite common, and up until now, my stock
answer has been to use negative lookahead. The expression you posted
is the first time I've seen this solution, and it *does* work.

I was all set to write a cocky response on why your expression
wouldn't work. I've seen it many times before, where people (usually
coming from EBNF experience) implement listOfXs = OneOrMore(x) as:

listOfXs = Forward()
listOfXs << ( x + listOfXs | x )

Actually, what they usually write is:

listOfXs << ( listOfXs + x )

but this sends pyparsing into a recursive tailspin.

So I fired up SciTE and copy/pasted your code into the editor and ran
it, and it worked just fine - this was a shock! I studied this for a
few minutes, and then realized what was happening. First of all, I
misread what you posted. You posted this:

grammar << ((word + grammar) | (word + Literal('end')))

which works. I *thought* you posted this:

grammar << ((word + grammar) | word) + Literal('end')

which doesn't work. In fact, this behaves the same as your original
post, except it iterates over the input string's words recursively,
vs. repetition in a for-loop, as is done by OneOrMore.

So going back to your working version, I had to see why it works.
Initially, the first term in the MatchFirst (the '|' operator creates
MatchFirst instances) is just the same, and by grammar referencing
itself, it just goes word by word through the input trying to find a
match. I'll try to visualize the steps:

level   "First Second Third end"
1        word   grammar
2               word    grammar
3                       word   grammar
4                              word   grammar   <- fails!
4                              word   end       <- fails!
        (therefore level 3 grammar fails!)
3                       word   end   <-- success!!!

grammar has 2 options: to match a word followed by a grammar, or to
match a word followed by 'end'. At 4 levels deep into the Forward's
recursion, the first option fails, because after parsing "end" as the
initial word, there is no more text to try to match against grammar.
Level 4's Forward then also tries to match a word followed by 'end',
but this fails for the same reason. So at this point, the 4th level
Forward fails to match either of its options, so it throws its
exception back up to level 3, indicating that the first alternative,
word followed by grammar, failed. Level 3 then moves on to see if
word followed by the literal 'end' matches, and it does - success!

Here's where I am stuck now. In the original grammar that you posted,
which you want to render into this behavior, grammar is defined as:

grammar = OneOrMore(Word(alphas)) + Literal('end')

Unfortunately, when the OneOrMore gets constructed, it does not have
any visibility beyond knowing what is to be repeated. Again, here is
the data structure that is being built:

- And
  - OneOrMore
    - Word(alphas)
  - Literal('end')

Only at the level of the And is there any awareness that the OneOrMore
is followed by anything, let alone by something which could be
misinterpreted as something matching the OneOrMore's repetition
expression.

Can you suggest a way I could generalize this, so that OneOrMore stops
matching before it gets to 'end'?
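One possible shape for such a generalization, sketched in plain Python over a token list (not pyparsing code): fold the NotAny lookahead into the repetition itself as a stop expression that is tried before each iteration. Later pyparsing releases added a stopOn argument to OneOrMore in this spirit, though that detail should be verified against the version at hand.

```python
# Sketch of a repetition that takes a stop expression: before each
# iteration the stop expression is tried first, and if it matches,
# the repetition ends without consuming the stop token.
def one_or_more(tokens, i, match, stop=None):
    results = []
    while i < len(tokens):
        if stop is not None and stop(tokens[i]):
            break  # leave the stop token for the following rule
        if not match(tokens[i]):
            break
        results.append(tokens[i])
        i += 1
    if not results:
        raise ValueError('expected at least one match')
    return results, i

tokens = 'First Second Third end'.split()
words, i = one_or_more(tokens, 0, str.isalpha, stop=lambda t: t == 'end')
print(words, tokens[i])  # ['First', 'Second', 'Third'] end
```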

-- Paul

Neil Cerutti

Nov 3, 2007, 3:00:36 PM

Is there not an ambiguity in the grammar?

In EBNF:

goal --> WORD { WORD } END

WORD is '[a-zA-Z]+'
END is 'end'

I think it is fine that PyParsing can't guess what the composer
of that grammar meant.

--
Neil Cerutti

Just Another Victim of the Ambient Morality

Nov 3, 2007, 6:40:39 PM

"Paul McGuire" <pt...@austin.rr.com> wrote in message
news:1194104971....@d55g2000hsg.googlegroups.com...

> On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
> <ihates...@hotmail.com> wrote:
>> It has recursion in it but that's not sufficient to call it a
>> recursive
>> descent parser any more than having a recursive implementation of the
>> factorial function. The important part is what it recurses through...
>
> <snip>
>
>> In my opinion, the rule set I mentioned in my original post:
>>
>> grammar = OneOrMore(Word(alphas)) + Literal('end')
>>
>> ...should be translated (internally) to something like this:
>>
>> word = Word(alphas)
>> grammar = Forward()
>> grammar << ((word + grammar) | (word + Literal('end')))
>>
>> This allows a recursive search to find the string correct without any
>> need for "backtracking," if I understand what you mean by that. Couldn't
>> pyparsing have done something like this?
>>
>
> Dear JAVotAM -
>
> This was a great post! (Although I'm not sure the comment in the
> first paragraph was entirely fair - I know the difference between
> recursive string parsing and recursive multiplication.)

I often use hyperbole to emphasize a point. I honestly mean no offense.
That comment wasn't even meant to be fair, I was hoping to be provocative...


> You really got me thinking more about how this recursion actually
> behaves, especially with respect to elements such as OneOrMore. Your
> original question is really quite common, and up until now, my stock
> answer has been to use negative lookahead. The expression you posted
> is the first time I've seen this solution, and it *does* work.

I'm glad I got you thinking! I'd hate to be another newbie with another
of a thousand questions answered in the FAQ....
Hey, are you the author of pyparsing?


> I was all set to write a cocky response on why your expression
> wouldn't work. I've seen it many times before, where people (usually
> coming from EBNF experience) implement listOfXs = OneOrMore(x) as:
>
> listOfXs = Forward()
> listOfXs << ( x + listOfXs | x )
>
> Actually, what they usually write is:
>
> listOfXs << ( listOfXs + x )
>
> but this sends pyparsing into a recursive tailspin.
>
> So I fired up SciTE and copy/pasted your code into the editor and ran
> it, and it worked just fine - this was a shock! I studied this for a
> few minutes, and then realized what was happening. First of all, I
> misread what you posted. You posted this:
>
> grammar << ((word + grammar) | (word + Literal('end')))
>
> which works. I *thought* you posted this:
>
> grammar << ((word + grammar) | word) + Literal('end')
>
> which doesn't work. In fact, this behaves the same as your original
> post, except it iterates over the input string's words recursively,
> vs. repetition in a for-loop, as is done by OneOrMore.

I'm grateful that you actually tested my code before posting your cocky
response!


> So going back to your working version, I had to see why it works.
> Initially, the first term in the MatchFirst (the '|' operator creates
> MatchFirst instances) is just the same, and by grammar referencing
> itself, it just goes word by word through the input trying to find a
> match. I'll try to visualize the steps:
>
> level   "First Second Third end"
> 1        word   grammar
> 2               word    grammar
> 3                       word   grammar
> 4                              word   grammar   <- fails!
> 4                              word   end       <- fails!
>         (therefore level 3 grammar fails!)
> 3                       word   end   <-- success!!!
>
> grammar has 2 options: to match a word followed by a grammar, or to
> match a word followed by 'end'. At 4 levels deep into the Forward's
> recursion, the first option fails, because after parsing "end" as the
> initial word, there is no more text to try to match against grammar.
> Level 4's Forward then also tries to match a word followed by 'end',
> but this fails for the same reason. So at this point, the 4th level
> Forward fails to match either of its options, so it throws its
> exception back up to level 3, indicating that the first alternative,
> word followed by grammar, failed. Level 3 then moves on to see if
> word followed by the literal 'end' matches, and it does - success!

This is, literally, what it's doing. I'm not exactly a programming whiz
so I think of it a little more abstractly. In pyparsing's implementation,
it recurses through the first rule, OneOrMore, then iteratively moves to the
next rule, only to fail. So, obviously that rule must be part of the
recursion if we're not to use backtracking or lookaheads. If it's part of
the recursion, it has to be the terminal case. We know that the OneOrMore
rule is necessarily followed by the literal, so we can safely conclude that
the terminal case will necessarily be followed by the literal, hence the
production mentioned...


> Here's where I am stuck now. In the original grammar that you posted,
> which you want to render into this behavior, grammar is defined as:
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')
>
> Unfortunately, when the OneOrMore gets constructed, it does not have
> any visibility beyond knowing what is to be repeated. Again, here is
> the data structure that is being built:
>
> - And
>   - OneOrMore
>     - Word(alphas)
>   - Literal('end')
>
> Only at the level of the And is there any awareness that the OneOrMore
> is followed by anything, let alone by something which could be
> misinterpreted as something matching the OneOrMore's repetition
> expression.
>
> Can you suggest a way I could generalize this, so that OneOrMore stops
> matching before it gets to 'end'?

Again, I'm not an elite hacker, so it's hard to give a good suggestion.
I might have implemented the parser by having its input be a string
specifying the grammar but, then again, I'd need a parser to parse that
string! Oh, the irony...
Probably, I would have constructed the rule tree differently. I like
how pyparsing uses the Python interpreter to parse the input string, the
Python code, itself, as pyparsing's input. That's worth preserving...
So, instead of organizing the tree by rule categories, as pyparsing
does:


- And
  - OneOrMore
    - Word(alphas)
  - Literal('end')


...I would have organized it by what the grammar actually expects. It's
hard for me to describe what I'm thinking as a simple hierarchy. I could
use a directed graph but that's hard to type, so I'll just use "labels":


start:
    call OneOrMore:

OneOrMore:
    Or(
        And(
            Word(alphas),
            call OneOrMore:),
        And(
            Word(alphas),
            Literal('end')))


...hopefully, that indentation makes some sort of sense. I would
imagine that it would be constructed like this:


# This translates to temp << ((word + temp) | word)
temp = OneOrMore(word)

# Instead of wrapping this in an And structure,
# tell temp to And this with its terminal case...
grammar = temp + literal


In my parser, everything would be recursive, hence the term "recursive
descent" parser. So, a list of Ands like:


grammar = Literal('a') + Literal('b') + Literal('c')


...would look like this:


start:
    call And_a:

And_a:
    Literal('a') + call And_b:

And_b:
    Literal('b') + call And_c:

And_c:
    Literal('c')


...notice the utter lack of iteration.
Now, this example isn't, strictly speaking, recursive, but it descends
through a set of function calls that are, potentially, recursive...
Now, obviously things like Or will have to iterate through their cases,
but that's unavoidable and should be okay. The other constructions will
need some thought, too. However, hopefully, this technique will render
some constructions unnecessary...
So, in my parser, the grammar is represented by a directed graph (the
temporary variable flowing through operations like a + b + c + d) and all
operations modify this graph. Obviously, some research needs to be done on
how feasible this implementation is, especially with the rich set of
operations available in pyparsing, but I'm hopeful...
Hopefully this all makes sense. Thanks for reading...


Just Another Victim of the Ambient Morality

Nov 3, 2007, 8:41:28 PM

"Neil Cerutti" <hor...@yahoo.com> wrote in message
news:oz3Xi.39812$G23....@newsreading01.news.tds.net...
> Is there not an ambiguity in the grammar?
>
> In EBNF:
>
> goal --> WORD { WORD } END
>
> WORD is '[a-zA-Z]+'
> END is 'end'
>
> I think it is fine that PyParsing can't guess what the composer
> of that grammar meant.

First, I don't know if that constitutes an ambiguity in the grammar.
'end' is a word but it's unambiguous that this grammar must end in a literal
'end'. You could interpret the input as just a sequence of words or you
could interpret it as a sequence of words terminated by the word 'end'. One
interpretation conforms to the grammar while the other doesn't. You would
assume that the interpretation that agrees with the grammar would be the
preferable choice and so should the program...
Secondly, even if it is an ambiguity... so what? pyparsing's current
behaviour is to return a parse error, pretending that the string can't be
parsed. Ideally, perhaps it should alert you to the ambiguity but, surely,
it's better to return _a_ valid parsing than to pretend that the string
can't be parsed at all...

Neil Cerutti

Nov 3, 2007, 10:07:24 PM
On 2007-11-04, Just Another Victim of the Ambient Morality
<ihates...@hotmail.com> wrote:
Yeah. If it were a regex, it would be: '[ab]+b'. That is a fine
regex, because a regex is generally just a recognizer; the
ambiguity doesn't have to do with recognizing the language. But
PyParsing is a parser. The ambiguity is in deciding what's a
Word(alphas) and what's an 'end'.

> One interpretation conforms to the grammar while the other
> doesn't. You would assume that the interpretation that agrees
> with the grammar would be the preferable choice and so should
> the program. Secondly, even if it is an ambiguity... so what?
> pyparsing's current behaviour is to return a parse error,
> pretending that the string can't be parsed. Ideally, perhaps
> it should alert you to the ambiguity but, surely, it's better
> to return _a_ valid parsing than to pretend that the string
> can't be parsed at all...

I wouldn't characterize it as pretending. How would you parse:

hello end hello end

"WORD END WORD END" and "WORD WORD WORD END" are both valid
interpretations, according to the grammar.

As soon as you remove the ambiguity from the grammar, PyParsing
starts to work correctly.

Consider writing a recursive descent parser by hand to parse the
language '[ab]+b'.

goal --> ab_list 'b'
ab_list --> 'a' list_tail
ab_list --> 'b' list_tail
list_tail --> 'a' list_tail
list_tail --> 'b' list_tail
list_tail --> null


The above has the exact same bug (and probably some others -- I'm
unable to test it just now) as the PyParsing solution.

The error is in the grammar. It might be fixed by specifying that
'b' must be followed by EOF, and then it could be coded by using
more than one character of lookahead.

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, s):
        self.s = s
        self.i = 0
        self.c = self.s[self.i]

    def match(self, c):
        if self.c != c:
            raise ParseError('expected %s got %s' % (c, self.c))
        self.i += 1
        if self.i == len(self.s):
            raise ParseError('unexpected EOF')
        self.c = self.s[self.i]

    def goal(self):
        self.ab_list()
        self.match('b')

    def ab_list(self):
        if self.c in 'ab':
            self.match(self.c)
            self.list_tail()
        else:
            raise ParseError('expected list of a or b, got %s' % self.c)

    def list_tail(self):
        if self.c in 'ab':
            self.match(self.c)
            self.list_tail()
        else:
            raise ParseError('expected a list of a or b, got %s' % self.c)

>>> Parser('aaab').goal()
Traceback (most recent call last):
  File "py.py", line 37, in ?
    Parser('aaab').goal()
  File "py.py", line 20, in goal
    self.ab_list()
  File "py.py", line 26, in ab_list
    self.list_tail()
  File "py.py", line 33, in list_tail
    self.list_tail()
  File "py.py", line 33, in list_tail
    self.list_tail()
  File "py.py", line 32, in list_tail
    self.match(self.c)
  File "py.py", line 16, in match
    raise ParseError('unexpected EOF')
__main__.ParseError: unexpected EOF

It's not a coincidence that it has the same bug as the PyParsing
solution. You can remove the ambiguity in several ways, perhaps
by specifying where EOF should be (you seem to be assuming this
in your interpretation of the grammar, but PyParsing does not).

goal --> ab_list 'b' EOF
ab_list --> 'a' list_tail
ab_list --> 'b' list_tail
list_tail --> 'a' list_tail
list_tail --> 'b' list_tail
list_tail --> null

I need to change goal, match and list_tail for this new grammar,
and add an EOF object and a peek method.

...
EOF = object()
...

    def match(self, c):
        if self.c != c:
            raise ParseError('expected %s got %s' % (c, self.c))
        if self.i >= len(self.s)-1:
            self.c = EOF
            self.i = len(self.s)
        else:
            self.i += 1
            self.c = self.s[self.i]

    def peek(self, lookahead):
        if self.i+lookahead >= len(self.s):
            return EOF
        else:
            return self.s[self.i + lookahead]
...

    def list_tail(self):
        if self.c == 'a':
            self.match('a')
            self.list_tail()
        elif self.c == 'b':
            if self.peek(1) != EOF:
                self.match('b')
                self.list_tail()
        else:
            raise ParseError('expected list of a or b, got %s' % self.c)

The grammar now has the problem that it's LL(2) rather than
LL(1). PyParsing can handle that type of grammar with negative
lookahead assertions, I think.

--
Neil Cerutti

Kay Schluehr

Nov 4, 2007, 4:49:20 AM
On 4 Nov., 03:07, Neil Cerutti <horp...@yahoo.com> wrote:
> On 2007-11-04, Just Another Victim of the Ambient Morality
> <ihates...@hotmail.com> wrote:

I think you are correct about this. But I'm not sure how much it shall
matter. Just take a look at Pythons Grammar

http://svn.python.org/view/python/trunk/Grammar/Grammar?rev=55446&view=markup

Without special keyword treatment this grammar would be ambigous and
couldn't be parsed using an LL(1) parser. The "grammar compiler" which
builds the parser tables creates a special label for each keyword.
This label is filtered when a NAME token is feeded into the parser.
With the label that belongs to e.g. 'if' or 'while' the correct
statement can be selected in constant time. Same happens when I use
the parser generator with your EBNF grammar. With a little more
adaption also NUMBER token could be filtered. But this would be
overdesign.
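The label-filtering trick is easy to sketch without any of pgen's machinery: tokenize NAMEs generically, then relabel any NAME found in a keyword table so the parser can dispatch on the label in constant time. The token format and the tiny keyword set here are illustrative assumptions, not CPython's actual implementation:

```python
import re

KEYWORDS = {'if', 'while', 'for', 'return'}  # a tiny subset, for illustration

def tokenize(source):
    """Yield (label, value) pairs; NAME tokens that match a keyword
    get the keyword itself as their label, as pgen's tables do."""
    for match in re.finditer(r'[A-Za-z_]\w*|\d+|\S', source):
        value = match.group()
        if value in KEYWORDS:
            yield (value, value)           # special per-keyword label
        elif value[0].isalpha() or value[0] == '_':
            yield ('NAME', value)
        elif value.isdigit():
            yield ('NUMBER', value)
        else:
            yield ('OP', value)

print(list(tokenize('while x: y')))
# → [('while', 'while'), ('NAME', 'x'), ('OP', ':'), ('NAME', 'y')]
```

A parser branching on the first element of each pair never has to guess whether a NAME is really a keyword.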

Theoretical beauty is compromised here using reasonable default
assumptions for keeping the grammar simple ( "convention over
configuration" to borrow a Rails slogan ).

Tokenization is another issue in Python. It is indeed somewhat special
due to significant whitespace and line continuation but tokenization
is conceptually much simpler and one would likely throw all kinds of
options and special case handling in the lexical analysis phase.

Neil Cerutti

Nov 4, 2007, 7:54:34 AM
On 2007-11-04, Kay Schluehr <kay.sc...@gmx.net> wrote:
> On 4 Nov., 03:07, Neil Cerutti <horp...@yahoo.com> wrote:
>> I wouldn't characterize it as pretending. How would you parse:
>>
>> hello end hello end
>>
>> "WORD END WORD END" and "WORD WORD WORD END" are both valid
>> interpretations, according to the grammar.
>>
>> As soon as you remove the ambiguity from the grammar, PyParsing
>> starts to work correctly.
>
> I think you are correct about this. But I'm not sure how much
> it shall matter. Just take a look at Pythons Grammar
>
> http://svn.python.org/view/python/trunk/Grammar/Grammar?rev=55446&view=markup
>
> Without special keyword treatment this grammar would be
> ambiguous and couldn't be parsed using an LL(1) parser.

I agree. I don't know how easy it is to create keywords using
PyParsing, or if the grammar in question would still be
considered correct by the author.

> The "grammar compiler" which builds the parser tables creates a
> special label for each keyword. This label is filtered when a
> NAME token is fed into the parser. With the label that
> belongs to e.g. 'if' or 'while' the correct statement can be
> selected in constant time. Same happens when I use the parser
> generator with your EBNF grammar. With a little more adaption
> also NUMBER token could be filtered. But this would be
> overdesign.
>
> Theoretical beauty is compromised here using reasonable default
> assumptions for keeping the grammar simple ( "convention over
> configuration" to borrow a Rails slogan ).

Keywords are practically ubiquitous. I never thought of them as
unbeautiful before.

> Tokenization is another issue in Python. It is indeed somewhat
> special due to significant whitespace and line continuation but
> tokenization is conceptually much simpler and one would likely
> throw all kinds of options and special case handling in the
> lexical analysis phase.

It might be a quick fix in PyParsing, which includes a Keyword
type, but without the semantics that are needed in this case. You
have to use (as suggested earlier) negative lookahead in either a
Regex or with the NotAny type.

>>> goal = OneOrMore(NotAny(Literal('end')) + Word(alphas)) + Literal('end')
>>> goal.parseString('hello hello hello end')
(['hello', 'hello', 'hello', 'end'], {})
>>> goal.parseString('hello end hello end')
(['hello', 'end'], {})

No scanner/tokenizer needed! ;)

--
Neil Cerutti

Just Another Victim of the Ambient Morality

Nov 4, 2007, 9:55:46 AM

"Neil Cerutti" <hor...@yahoo.com> wrote in message
news:wP9Xi.39822$G23....@newsreading01.news.tds.net...

> On 2007-11-04, Just Another Victim of the Ambient Morality
> <ihat...@hotmail.com> wrote:
>>
>> "Neil Cerutti" <hor...@yahoo.com> wrote in message
>> news:oz3Xi.39812$G23....@newsreading01.news.tds.net...
>>>
>>> Is there not an ambiguity in the grammar?
>>>
>>> In EBNF:
>>>
>>> goal --> WORD { WORD } END
>>>
>>> WORD is '[a-zA-Z]+'
>>> END is 'end'
>>>
>>> I think it is fine that PyParsing can't guess what the composer
>>> of that grammar meant.
>>
>> One interpretation conforms to the grammar while the other
>> doesn't. You would assume that the interpretation that agrees
>> with the grammar would be the preferable choice and so should
>> the program. Secondly, even if it is an ambiguity... so what?
>> pyparsing's current behaviour is to return a parse error,
>> pretending that the string can't be parsed. Ideally, perhaps
>> it should alert you to the ambiguity but, surely, it's better
>> to return _a_ valid parsing than to pretend that the string
>> can't be parsed at all...
>
> I wouldn't characterize it as pretending. How would you parse:
>
> hello end hello end
>
> "WORD END WORD END" and "WORD WORD WORD END" are both valid
> interpretations, according to the grammar.

...and it would be nice if the parser were to parse one of them since
they are both right. Having more than one right answer is not the same as
having no answer, which is what pyparsing claims...


> As soon as you remove the ambiguity from the grammar, PyParsing
> starts to work correctly.

This is simply not true. Try this:


grammar = OneOrMore(Word(alphas)) + Literal('end') + Literal('.')
grammar.parseString('First Second Third end.')


...again, this will fail to parse. Where's the ambiguity?
Besides, parsing ambiguous grammars is a useful feature. Not all
grammars being parsed are designed by those doing the parsing...


> Consider writing a recursive descent parser by hand to parse the
> language '[ab]+b'.
>
> goal --> ab_list 'b'
> ab_list --> 'a' list_tail
> ab_list --> 'b' list_tail
> list_tail --> 'a' list_tail
> list_tail --> 'b' list_tail
> list_tail --> null
>
>
> The above has the exact same bug (and probably some others--I'm
> sorry unable to test it just now) as the PyParsing solution.
>
> The error is in the grammar. It might be fixed by specifying that
> 'b' must be followed by EOF, and then it could be coded by using
> more than one character of lookahead.

I don't exactly understand the syntax you used to describe the
productions of your recursive descent parser so not only did I not follow it
but I couldn't make out the rest of your post. Could you explain in a
little more detail? The last part that points to 'null' is especially
confusing...
As demonstrated earlier, it's not just the grammar. There are
situations that are unambiguous that pyparsing can't parse simply and
there's no reason for it.
Besides, ambiguous grammars are a fact of life and some of us need to
parse them. It's usually okay, too. Consider a previous example:


grammar = OneOrMore(Word(alphas)) + Literal('end')


While you may consider this inherently ambiguous, it's usually not.
That is to say, as long as it is rare that 'end' is used not at the end of
the string, this will simply parse and, yet, pyparsing will consistently
fail to parse it...

Neil Cerutti

Nov 4, 2007, 1:02:59 PM
On 2007-11-04, Just Another Victim of the Ambient Morality
<ihat...@hotmail.com> wrote:
>> Consider writing a recursive descent parser by hand to parse
>> the language '[ab]+b'.
>>
>> goal --> ab_list 'b'
>> ab_list --> 'a' list_tail
>> ab_list --> 'b' list_tail
>> list_tail --> 'a' list_tail
>> list_tail --> 'b' list_tail
>> list_tail --> null
>>
>>
>> The above has the exact same bug (and probably some others--I'm
>> sorry unable to test it just now) as the PyParsing solution.
>>
>> The error is in the grammar. It might be fixed by specifying that
>> 'b' must be followed by EOF, and then it could be coded by using
>> more than one character of lookahead.
>
> I don't exactly understand the syntax you used to describe the
> productions of your recursive descent parser so not only did I not follow it
> but I couldn't make out the rest of your post. Could you explain in a
> little more detail? The last part that points to 'null' is especially
> confusing...

It's the BNF spelling of

goal --> ab_list 'b'
ab_list --> ab { ab }
ab --> 'a' | 'b'

The null is to say that list_tail can match nothing, i.e, an
empty string.

Then, in the Parser class, every method (except for match, which
is used as a central place to consume characters) corresponds to
one of the productions in the BNF. Breaking things down into
BNF-based productions often makes implementation, debugging and
code generation easier.

PyParsing saves me that step, since I can often directly
implement the EBNF using PyParsing.

> As demonstrated earlier, it's not just the grammar. There are
> situations that are unambiguous that pyparsing can't parse
> simply and there's no reason for it.

Yes, many parser generators have many more limitations than just
the requirement of an unambiguous grammar.

> Besides, ambiguous grammars are a fact of life and some of us
> need to parse them. It's usually okay, too. Consider a
> previous example:
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')
>
> While you may consider this inherently ambiguous, it's usually
> not. That is to say, as long as it is rare that 'end' is used
> not at the end of the string, this will simply parse and, yet,
> pyparsing will consistently fail to parse it...

I believe there's no cure for the confusion you're having except
for implementing a parser for your proposed grammar.
Alternatively, try implementing your grammar in one of your other
favorite parser generators.

--
Neil Cerutti

Just Another Victim of the Ambient Morality

Nov 4, 2007, 4:44:32 PM

"Neil Cerutti" <hor...@yahoo.com> wrote in message
news:nPnXi.39845$G23....@newsreading01.news.tds.net...

Okay, I see that now, thank you.
Your statement from the previous post:


>> Consider writing a recursive descent parser by hand to parse
>> the language '[ab]+b'.
>>
>> goal --> ab_list 'b'
>> ab_list --> 'a' list_tail
>> ab_list --> 'b' list_tail
>> list_tail --> 'a' list_tail
>> list_tail --> 'b' list_tail
>> list_tail --> null
>>
>>
>> The above has the exact same bug (and probably some others--I'm
>> sorry unable to test it just now) as the PyParsing solution.


...merely demonstrates that this grammar is similarly ambiguous. There
are many ways to parse this correctly and pyparsing chooses none of these!
Instead, it returns the same error it does when the string has no
solutions...


>> As demonstrated earlier, it's not just the grammar. There are
>> situations that are unambiguous that pyparsing can't parse
>> simply and there's no reason for it.
>
> Yes, many parser generators have many more limitations than just
> the requirement of an unambiguous grammar.

Yes, but a recursive descent parser? I expect such things from LALR and
others, but not only do I expect a recursive descent parser to correctly
parse grammars but I expect it to even parse ambiguous ones, in that it is
the only technique prepared to find more than one solution...


>> Besides, ambiguous grammars are a fact of life and some of us
>> need to parse them. It's usually okay, too. Consider a
>> previous example:
>>
>> grammar = OneOrMore(Word(alphas)) + Literal('end')
>>
>> While you may consider this inherently ambiguous, it's usually
>> not. That is to say, as long as it is rare that 'end' is used
>> not at the end of the string, this will simply parse and, yet,
>> pyparsing will consistently fail to parse it...
>
> I believe there's no cure for the confusion you're having except
> for implementing a parser for your proposed grammar.
> Alternatively, try implementing your grammar in one of your other
> favorite parser generators.

I believe there is a cure and it's called recursive descent parsing.
It's slow, obviously, but it's correct and, sometimes (arguably, often),
that's more important than execution speed.

I spent this morning whipping up a proof of concept parser whose
interface greatly resembles pyparsing but, barring unknown bugs, works and
works as I'd expect a recursive descent parser to work. I don't know Python
very well so the parser is pretty simple. It only lexes single characters
as tokens. It only supports And, Or, Optional, OneOrMore and ZeroOrMore
rules but I already think this is a rich set of rules. I'm sure others can
be added. Finally, I'm not sure it's safely copying all its parameter input
the same way pyparsing does but surely those bugs can be worked out. It's
merely a proof of concept to demonstrate a point.
Everyone, please look it over and tell me what you think.
Unfortunately, my news client is kind of poor, so I can't simply cut and
paste the code into here. All the tabs get turned into single spacing, so I
will post this link, instead:


http://theorem.ca/~dlkong/new_pyparsing.zip


I hope you can all deal with .zip files. Let me know if this is a
problem.
Thank you...

Neil Cerutti

Nov 4, 2007, 6:27:18 PM
On 2007-11-04, Just Another Victim of the Ambient Morality
<ihat...@hotmail.com> wrote:
> "Neil Cerutti" <hor...@yahoo.com> wrote in message
> news:nPnXi.39845$G23....@newsreading01.news.tds.net...
>> I believe there's no cure for the confusion you're having except
>> for implementing a parser for your proposed grammar.
>> Alternatively, try implementing your grammar in one of your other
>> favorite parser generators.
>
> I believe there is a cure and it's called recursive descent
> parsing. It's slow, obviously, but it's correct and, sometimes
> (arguably, often), that's more important than execution speed.
>
> I spent this morning whipping up a proof of concept parser
> whose interface greatly resembles pyparsing but, barring unknown
> bugs, works and works as I'd expect a recursive descent parser
> to work. I don't know Python very well so the parser is pretty
> simple. It only lexes single characters as tokens. It only
> supports And, Or, Optional, OneOrMore and ZeroOrMore rules but
> I already think this is a rich set of rules. I'm sure others
> can be added. Finally, I'm not sure it's safely copying all
> its parameter input the same way pyparsing does but surely
> those bugs can be worked out. It's merely a proof of concept
> to demonstrate a point.
> Everyone, please look it over and tell me what you think.
> Unfortunately, my news client is kind of poor, so I can't
> simply cut and paste the code into here. All the tabs get
> turned into single spacing, so I will post this link, instead:
>
> http://theorem.ca/~dlkong/new_pyparsing.zip

Your program doesn't necessarily address the ambiguity in the
grammar in question, since right now it is only a recognizer.
Will it be hard to get it to return a parse tree?

The grammar in your implementation is:

>>> goal = OneOrMore(RuleAnd('a') | RuleAnd('b')) + RuleAnd('b')
>>> goal.parse(0, 'ab')
True
>>> goal.parse(0, 'ba')
False
>>> goal.parse(0, 'b')
False
>>> goal.parse(0, 'aaab')
True
>>> goal.parse(0, 'abc')
True

So far so good. :)

--
Neil Cerutti

Just Another Victim of the Ambient Morality

Nov 4, 2007, 7:38:25 PM

"Neil Cerutti" <hor...@yahoo.com> wrote in message
news:qzsXi.39852$G23....@newsreading01.news.tds.net...

Hey, it's only a proof of concept. If you can parse the tree, surely
you can record what you parsed, right?
Did you notice that the parse() functions have the rather serious bug of
not returning how much of the string they could parse? It just so happens
that the constructions that I made only ever had to increment the matches
by one, so they just happen to work. That's an easy bug to fix but a pretty
major one to have overlooked. Hence, my enthusiasm for input...


> The grammar in your implementation is:
>
>>>> goal = OneOrMore(RuleAnd('a') | RuleAnd('b')) + RuleAnd('b')
>>>> goal.parse(0, 'ab')
> True
>>>> goal.parse(0, 'ba')
> False
>>>> goal.parse(0, 'b')
> False
>>>> goal.parse(0, 'aaab')
> True
>>>> goal.parse(0, 'abc')
> True
>
> So far so good. :)

Good! Keep hammering at it!
More importantly, study it to understand the idea I'm trying to convey.
This is what I thought a recursive descent parser would do...

Kay Schluehr

Nov 4, 2007, 7:50:37 PM
On Nov 4, 10:44 pm, "Just Another Victim of the Ambient Morality"
<ihates...@hotmail.com>

> I believe there is a cure and it's called recursive descent parsing.
> It's slow, obviously, but it's correct and, sometimes (arguably, often),
> that's more important the execution speed.

Recursive descent parsing is not necessarily slow but from your
remarks above I infer you want a general RD parser with backtracking:
when one rule doesn't match, try another one to derive the current
symbol in the input stream.

I'm not sure one needs to start again with a naive approach just to
avoid any parser theory. For a user of a parser it is quite important
whether she has to wait 50 seconds for a parse to run or 50
milliseconds. I don't like to compromise speed for implementation
simplicity here.
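For concreteness, the kind of general RD parsing with backtracking being discussed can be sketched with a few combinators. This is a from-scratch toy, not pyparsing's API: each rule maps a start position to the *set* of positions it can reach, so a later rule (like the trailing 'end') can still succeed from a shorter earlier match instead of failing after a greedy one.

```python
def lit(token):
    """Rule matching one exact token."""
    def parse(tokens, i):
        return {i + 1} if i < len(tokens) and tokens[i] == token else set()
    return parse

def word(tokens, i):
    """Rule matching any alphabetic token."""
    return {i + 1} if i < len(tokens) and tokens[i].isalpha() else set()

def seq(a, b):
    """a followed by b: feed every end position of a into b."""
    def parse(tokens, i):
        return {k for j in a(tokens, i) for k in b(tokens, j)}
    return parse

def one_or_more(rule):
    """All positions reachable by 1..n repetitions of rule,
    not just the greedy longest match."""
    def parse(tokens, i):
        results, frontier = set(), rule(tokens, i)
        while frontier:
            results |= frontier
            frontier = {k for j in frontier for k in rule(tokens, j)} - results
        return results
    return parse

# The grammar from the start of the thread: OneOrMore(Word) + 'end'
grammar = seq(one_or_more(word), lit('end'))
tokens = 'First Second Third end'.split()
print(len(tokens) in grammar(tokens, 0))  # → True: a full parse exists
```

On the ambiguous input 'hello end hello end' this returns both end positions, {2, 4}, which is exactly the "more than one valid parsing" behaviour the thread is arguing about; the price is the combinatorial work Kay warns against.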


Just Another Victim of the Ambient Morality

Nov 4, 2007, 9:05:10 PM

"Kay Schluehr" <kay.sc...@gmx.net> wrote in message
news:1194223837.1...@o3g2000hsb.googlegroups.com...

> On Nov 4, 10:44 pm, "Just Another Victim of the Ambient Morality"
> <ihates...@hotmail.com>
>
>> I believe there is a cure and it's called recursive descent parsing.
>> It's slow, obviously, but it's correct and, sometimes (arguably, often),
>> that's more important than execution speed.
>
> Recursive descent parsing is not necessarily slow but from your
> remarks above I infer you want a general RD parser with backtracking:
> when one rule doesn't match, try another one to derive the current
> symbol in the input stream.

I think I've just discovered a major hurdle in my understand of the
problem.
You keep saying "with backtracking." Why? Isn't "backtracking"
inherent in recursion? So, why can't these alleged "recursive descent
parsers" find valid parsings? How are they not already backtracking? What
was the point of being recursive if not to take advantage of the inherent
backtracking in it?
Obviously, these parsers aren't recursing through what I think they
should be recursing. The question is "why not?"

Correct me if I'm wrong but I'm beginning to think that pyparsing
doesn't typically use recursion, at all. It only employs it if you create
one, using the Forward class. Otherwise, it does everything iteratively,
hence the lack of "backtracking."


> I'm not sure one needs to start again with a naive approach just to
> avoid any parser theory. For a user of a parser it is quite important
> whether she has to wait 50 seconds for a parse to run or 50
> milliseconds. I don't like to compromise speed for implementation
> simplicity here.

This attitude is all too prevalent among computer professionals... Of
course it's a useful thing to shield users from the intricacies of parser
theory! Just as much as it is useful to shield drivers from needing
automotive engineering or software users from programing. How many people
have come to this newsgroup asking about anomalous pyparsing behaviour,
despite their grammars being mathematically correct?
Think of it this way. You can force all the clients of pyparsing to
duplicate work on figuring out how to massage pyparsing to their grammars,
or you can do the work of getting pyparsing to solve people's problems,
once. That's what a library is supposed to do...
Finally, I can't believe you complain about potential speed problems.
First, depending on the size of the string, it's likely to be the difference
between 2ms and 200ms. Secondly, if speed were an issue, you wouldn't go
with a recursive descent parser. You'd go with LALR or the many other
parsing techniques available. Recursive descent parsing is for those
situations where you need correctness, regardless of execution time. These
situations happen...
I've said this before, albeit for a different language, but it applies
to Python just as well. I don't use Python to write fast code, I use it to
write code fast.
If _you_ "don't like to compromise speed for implementation simplicity"
then you have a plethora choices available to you. What about the guy who
needs to parse correctly and is unconcerned about speed?


Neil Cerutti

Nov 4, 2007, 9:09:08 PM

Unfortunately I haven't had much time to play with it today; just
barely enough to put it through a very few paces.

> It just so happens that the contstructions that I made only
> ever had to increment the matches by one, so they just happen
> to work. That's an easy bug to fix but a pretty major one to
> have overlooked. Hence, my enthusiasm for input...
>
>> The grammar in your implementation is:
>>
>>>>> goal = OneOrMore(RuleAnd('a') | RuleAnd('b')) + RuleAnd('b')
>>>>> goal.parse(0, 'ab')
>> True
>>>>> goal.parse(0, 'ba')
>> False
>>>>> goal.parse(0, 'b')
>> False
>>>>> goal.parse(0, 'aaab')
>> True
>>>>> goal.parse(0, 'abc')
>> True
>>
>> So far so good. :)
>
> Good! Keep hammering at it!
> More importantly, study it to understand the idea I'm
> trying to convey. This is what I thought a recursive descent
> parser would do...

Kay has pointed out how it works. Strangely enough, I've never
studied a backtracking RDP before (trying to teach yourself a
subject like parsing can be tricky--I've had to somehow avoid all
the texts that overuse Greek letters--those incomprehensible
symbols confuse the hell out of me). It does simplify the job of
the grammar designer, but Kay's message makes it sound like it
won't scale very well.

It might, perhaps, be an interesting feature for PyParsing to
entertain by setting a 'backtracking' option, for when you're
writing a quick script and don't want to fuss too much with a
non-conformant grammar.

I'll have more time to look at it tomorrow.

--
Neil Cerutti

Neil Cerutti

Nov 4, 2007, 9:49:38 PM
On 2007-11-05, Just Another Victim of the Ambient Morality <ihat...@hotmail.com> wrote:
>
> "Kay Schluehr" <kay.sc...@gmx.net> wrote in message
> news:1194223837.1...@o3g2000hsb.googlegroups.com...
>> On Nov 4, 10:44 pm, "Just Another Victim of the Ambient Morality"
>> <ihates...@hotmail.com>
>>
>>> I believe there is a cure and it's called recursive descent parsing.
>>> It's slow, obviously, but it's correct and, sometimes (arguably, often),
>>> that's more important than execution speed.
>>
>> Recursive descent parsing is not necessarily slow but from your
>> remarks above I infer you want a general RD parser with backtracking:
>> when one rule doesn't match, try another one to derive the current
>> symbol in the input stream.
>
> I think I've just discovered a major hurdle in my understand of the
> problem.
> You keep saying "with backtracking." Why? Isn't "backtracking"
> inherent in recursion? So, why can't these alleged "recursive descent
> parsers" find valid parsings? How are they not already backtracking? What
> was the point of being recursive if not to take advantage of the inherent
> backtracking in it?
> Obviously, these parsers aren't recursing through what I think they
> should be recursing. The question is "why not?"

There are different kinds of recursion. Compare:

def fac1(x, y=1):
    """ Compute factorials with a recursive function (it calls
    itself), but the stack is not actually used for storing
    anything important, i.e., it is tail-recursive. """
    if x < 0:
        raise ValueError('non-negative integer')
    elif x == 0:
        return y
    else:
        return fac1(x-1, y*x)

to

def fac2(x):
    """ Computes factorials with a recursive process, keeping
    the state of the calculation on the stack. """
    if x < 0:
        raise ValueError('non-negative integer')
    if x == 0:
        return 1
    else:
        return fac2(x-1) * x

to

def Ack(x, y):
    """ The Ackermann function. Creates a humongous mess even
    with quite tiny numbers. """
    if x < 0 or y < 0:
        raise ValueError('non-negative integer')
    elif x == 0:
        return y + 1
    elif y == 0:
        return foo3(x-1, 1)
    else:
        return foo3(x-1, foo3(x, y-1))

There's probably a word for the type of recursive process built
by fac2; the RDPs I'm most familiar with create a fac2 sort of
process, which stores valuable info on the stack.

And even though fac1 defines an iterative process, the code
itself is recursive, and you can call it a recursive function if
you wish (and in Python you might as well).

> Correct me if I'm wrong but I'm beginning to think that
> pyparsing doesn't typically use recursion, at all. It only
> employs it if you create one, using the Forward class.
> Otherwise, it does everything iteratively, hence the lack of
> "backtracking."

It's recursive because each production rule calls other
production rules to define itself. A rule regularly ends up
calling itself. Consider the Parser class I built earlier.
list_tail keeps calling itself to continue consuming characters
in an ab_list. The stack is used to keep track of where we are in
the grammar; at any time you can look up the stack and see how
you got where you are--you 'descend' down from the topmost
productions to the most primitive productions, and then back up
once everything has been sorted out. Take another look
at the exception raised in my Parsing class example for an
illustrative traceback.

>> I'm not sure one needs to start again with a naive approach just to
>> avoid any parser theory. For a user of a parser it is quite important
>> whether she has to wait 50 seconds for a parse to run or 50
>> milliseconds. I don't like to compromise speed for implementation
>> simplicity here.
>

> Finally, I can't believe you complain about potential speed
> problems. First, depending on the size of the string, it's
> likely to be the difference between 2ms and 200ms. Secondly,
> if speed were an issue, you wouldn't go with a recursive
> descent parser. You'd go with LALR or the many other parsing
> techniques available. Recursive descent parsing is for those
> situations where you need correctness, regardless of execution
> time. These situations happen...

RDP is plenty fast; speed has never been one of its
disadvantages, as far as I know. Today there are many
excellent parser generators and compiler builders that compose an
RDP under the hood, e.g., Antlr and Gentle.

> I've said this before, albeit for a different language, but
> it applies to Python just as well. I don't use Python to write
> fast code, I use it to write code fast.
> If _you_ "don't like to compromise speed for implementation
> simplicity" then you have a plethora choices available to you.
> What about the guy who needs to parse correctly and is
> unconcerned about speed?

You have to be concerned about speed when something runs so
slowly in common circumstances compared to other well-known
algorithms that you can't practically wait for an answer. Would
you consider bubble-sort a suitable general-purpose sorting
algorithm for Python?

--
Neil Cerutti

Neil Cerutti

Nov 4, 2007, 9:52:48 PM
On 2007-11-05, Neil Cerutti <hor...@yahoo.com> wrote:
> def Ack(x, y):
>     """ The Ackermann function. Creates a humongous mess even
>     with quite tiny numbers. """
>     if x < 0 or y < 0:
>         raise ValueError('non-negative integer')
>     elif x == 0:
>         return y + 1
>     elif y == 0:
>         return foo3(x-1, 1)
>     else:
>         return foo3(x-1, foo3(x, y-1))

Urk! Of course those foo3 calls should have been Ack calls.

--
Neil Cerutti

Just Another Victim of the Ambient Morality

Nov 4, 2007, 11:30:13 PM

"Neil Cerutti" <hor...@yahoo.com> wrote in message
news:6xvXi.39855$G23....@newsreading01.news.tds.net...

While interesting, none of this actually addresses the point I was
making. I wasn't saying that there was no recursion (at least, not in this
paragraph), I was saying that it wasn't recursing through what I thought it
should be recursing through. It recurses through a set of rules without any
regard to how these rules interact with each other. That's why it fails to
parse valid strings. In my opinion, it should recurse through appropriate
combinations of rules to determine validity, rather than by arbitrary
categorization...


>> Correct me if I'm wrong but I'm beginning to think that
>> pyparsing doesn't typically use recursion, at all. It only
>> employs it if you create one, using the Forward class.
>> Otherwise, it does everything iteratively, hence the lack of
>> "backtracking."
>
> It's recursive because each production rule calls other
> production rules to define itself. A rule regularly ends up
> calling itself. Consider the Parser class I built earlier.
> list_tail keeps calling itself to continue consuming characters
> in an ab_list. The stack is used to keep track of where we are in
> the grammar; at any time you can look up the stack and see how
> you got where you are--you 'descend' down from the topmost
> productions to the most primitive productions, and then back up
> once everything has been sorted out. Take another look
> at the exception raised in my Parsing class example for an
> illustrative traceback.

I guess that all the And and Or class in pyparsing call methods of each
other from each other, even if they are doing so from different
instantiations. I still say they're not recursing through the right
things...


>>> I'm not sure one needs to start again with a naive approach just to
>>> avoid any parser theory. For a user of a parser it is quite important
>>> whether she has to wait 50 seconds for a parse to run or 50
>>> milliseconds. I don't like to compromise speed for implementation
>>> simplicity here.
>>
>> Finally, I can't believe you complain about potential speed
>> problems. First, depending on the size of the string, it's
>> likely to be the difference between 2ms and 200ms. Secondly,
>> if speed were an issue, you wouldn't go with a recursive
>> descent parser. You'd go with LALR or the many other parsing
>> techniques available. Recursive descent parsing is for those
>> situations where you need correctness, regardless of execution
>> time. These situations happen...
>
> RDP is plenty fast; speed has never been one of it's
> disadvantages, as far as I know. Today there are many
> excellent parser generators and compiler builders that compose an
> RDP under the hood, e.g., Antlr and Gentle.

I think I read somewhere that LALR was O(n) while RDP was O(e^n). Most
people would consider that, at least, slower...
I think your examples may exemplify how little speed matters rather than
how fast RDPs are...


>> I've said this before, albeit for a different language, but
>> it applies to Python just as well. I don't use Python to write
>> fast code, I use it to write code fast.
>> If _you_ "don't like to compromise speed for implementation
>> simplicity" then you have a plethora choices available to you.
>> What about the guy who needs to parse correctly and is
>> unconcerned about speed?
>
> You have to be concerned about speed when something runs so
> slowly in common circumstances compared to other well-known
> algotithms that you can't practically wait for an answer. Would
> you consider bubble-sort a suitable general-purpose sorting
> algorithm for Python?

What you've stated is an hypothetical. You need to be concerned about
speed when "you can't practically wait for an answer." Encryption is based
on this, for example. I'm not sure what you're getting at with "compared to
other well known algorithms." Generally, the Python virtual machine is
pitifully slow when compared to a compiled C program, however, the
comparison is unnecessary. For most tasks, Python is plenty fast. Context
is key. Things don't have to be fast, they just need to be fast enough...
As I have said before, if you need speed, there are many options.
Often, speed is not important, especially with the power of computing
available today. Development time is increasingly valuable. So, for us who
are not overly concerned with speed, what are our options?
Now, if my RDP takes two minutes to parse four characters of text, I'd
sympathize with your point of view. However, it really isn't noticeably
slow. I may be nervous parsing an entire text file with it but I'd feel
safe parsing individual lines of a .conf file, for instance...
And I'll have you know that when I was still learning to program (that
is, before I knew sort() was part of the C standard), I would routinely
implement bubble sort 'cause it was so trivial to implement. I wouldn't
even short cut, necessarily. Just iterate it as many times as there are
elements in the list!
Also, it should probably be noted that bubble sort is very fast for
nearly sorted lists; much faster than quicksort. So, while it shouldn't be
the default sorting algorithm, you could have it somewhere in the library...
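As an aside, the nearly-sorted behaviour mentioned above assumes the early-exit variant of bubble sort, which stops as soon as a full pass makes no swaps; a minimal sketch:

```python
def bubble_sort(items):
    """Sort in place; a pass with no swaps means the list is sorted,
    so a nearly sorted list costs only a pass or two (O(n))."""
    n = len(items)
    for limit in range(n - 1, 0, -1):
        swapped = False
        for i in range(limit):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break
    return items

print(bubble_sort([1, 2, 4, 3, 5]))  # → [1, 2, 3, 4, 5] after one swap, two passes
```

Without the `swapped` check the algorithm does all n-1 passes regardless of input order and the nearly-sorted advantage disappears.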

Kay Schluehr

unread,
Nov 5, 2007, 2:47:44 AM11/5/07
to
On Nov 5, 3:05 am, "Just Another Victim of the Ambient Morality"
<ihates...@hotmail.com> wrote:
> "Kay Schluehr" <kay.schlu...@gmx.net> wrote in message

>
> news:1194223837.1...@o3g2000hsb.googlegroups.com...
>
> > On Nov 4, 10:44 pm, "Just Another Victim of the Ambient Morality"
> > <ihates...@hotmail.com>
>
> >> I believe there is a cure and it's called recursive descent parsing.
> >> It's slow, obviously, but it's correct and, sometimes (arguably, often),
> >> that's more important the execution speed.
>
> > Recursive decendent parsing is not necessarily slow but from your
> > remarks above I infer you want a general RD parser with backtracking:
> > when one rule doesn't match, try another one to derive the current
> > symbol in the input stream.
>
> I think I've just discovered a major hurdle in my understand of the
> problem.
> You keep saying "with backtracking." Why? Isn't "backtracking"
> inherent in recursion? So, why can't these alleged "recursive descent
> parsers" find valid parsings? How are they not already backtracking? What
> was the point of being recursive if not to take advantage of the inherent
> backtracking in it?
> Obviously, these parsers aren't recursing through what I think they
> should be recursing. The question is "why not?"

Backtracking and RD parsers are two different issues. An RD parser
keeps a grammar which is structured tree-like. So one feeds a start
symbol into the parser and the parser tries to "derive" a symbol of
the input stream by descending down along the tree of non-terminals.
For each non-terminal the parser calls itself because it is
essentially a recursive datastructure it iterates through.

Backtracking comes into play when the grammar is not properly "left
factored". This doesn't necessarily mean it is ambiguous. Take for
example the following grammar:

S: A A 'foo' | A A 'bar'

The parser cannot immediately decide whether to take the left or the
right branch of the RHS. So it will checkout the left branch and when
it fails to derive 'foo' in the input stream it tries to select the
right branch. But the grammar is *not* ambiguous: there is always a
unique parse tree that can be derived ( or none ).

Parser theory is mostly concerned with finding strategies to avoid
backtracking ( and resolving ambiguities ) because it slows parsers
down. One always wants to have parser effort that depends linearly on
the size of the input stream.
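(A toy backtracking recursive-descent parser for that grammar fits in a few lines; this is my own sketch, with A taken to derive the single terminal 'a' for illustration. Each alternative of S restarts from the saved position, which is all "backtracking" means here:)

```python
def parse_A(tokens, pos):
    # Illustration only: A derives the single terminal 'a'.
    if pos < len(tokens) and tokens[pos] == 'a':
        return pos + 1   # position after the match
    return None          # failure

def parse_S(tokens, pos=0):
    """S: A A 'foo' | A A 'bar' -- try each alternative in order,
    restarting from the saved position when one fails (backtracking)."""
    for terminal in ('foo', 'bar'):
        p = parse_A(tokens, pos)       # always restart at the saved pos
        if p is None:
            continue
        p = parse_A(tokens, p)
        if p is None:
            continue
        if p < len(tokens) and tokens[p] == terminal:
            return p + 1               # success: position after the parse
    return None
```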


> > I'm not sure one needs to start again with a naive approach just to
> > avoid any parser theory. For a user of a parser it is quite important
> > whether she has to wait 50 seconds for a parse to run or 50
> > milliseconds. I don't like to compromise speed for implementation
> > simplicity here.
>
> This attitude is all too prevalent among computer professionals... Of
> course it's a useful thing to shield users from the intricacies of parser
> theory!

Yes, but people who develop parser generators should know at least a bit
about it ;)

> Just as much as it is useful to shield drivers from needing
> automotive engineering or software users from programing. How many people
> have come to this newsgroup asking about anomalous pyparsing behaviour,
> despite their grammars being mathematically correct.
> Think of it this way. You can force all the clients of pyparsing to
> duplicate work on figuring out how to massage pyparsing to their grammars,
> or you can do the work of getting pyparsing to solve people's problems,
> once. That's what a library is supposed to do...
> Finally, I can't believe you complain about potential speed problems.
> First, depending on the size of the string, it's likely to be the difference
> between 2ms and 200ms. Secondly, if speed were an issue, you wouldn't go
> with a recursive descent parser.

ANTLR is recursive descent, so are packrat parsers. Python uses an
extremely optimized recursive descent, table-based parser for
parsing the language. The parser flies and it is not LALR. Grammars
are more accessible in EBNF style and simpler to write so many people
( like me ) prefer RD parsers and seek for ways to optimize them. As
always there is a separation of concerns. The software engineering
aspects of modularization, interface abstraction, library / framework
implementation and the algorithmic aspects. I'd like to see both being
addressed by a parser generator architecture.


Neil Cerutti

unread,
Nov 5, 2007, 6:04:18 AM11/5/07
to
On 2007-11-05, Just Another Victim of the Ambient Morality
<ihat...@hotmail.com> wrote:
> "Neil Cerutti" <hor...@yahoo.com> wrote in message
> news:6xvXi.39855$G23....@newsreading01.news.tds.net...
>> There are different kinds of recursion. Compare:
>
> While interesting, none of this actually addresses the
> point I was making. I wasn't saying that there was no
> recursion (at least, not in this paragraph), I was saying that
> it wasn't recursing through what I thought it should be
> recursing through. It recurses through a set of rules without
> any regard to how these rules interact with each other. That's
> why it fails to parse valid strings. In my opinion, it should
> recurse through appropriate combinations of rules to determine
> validity, rather than by arbitrary
> categorization...

OK. I thought you were saying that, without backtracking there's
no recursion. Never mind!

> I guess that all the And and Or class in pyparsing call
> methods of each other from each other, even if they are doing
> so from different instantiations. I still say they're not
> recursing through the right things...

Backtracking seems orthogonal to recursion, to me.

>>>> I'm not sure one needs to start again with a naive approach just to
>>>> avoid any parser theory. For a user of a parser it is quite important
>>>> whether she has to wait 50 seconds for a parse to run or 50
>>>> milliseconds. I don't like to compromise speed for implementation
>>>> simplicity here.
>>>
>>> Finally, I can't believe you complain about potential speed
>>> problems. First, depending on the size of the string, it's
>>> likely to be the difference between 2ms and 200ms. Secondly,
>>> if speed were an issue, you wouldn't go with a recursive
>>> descent parser. You'd go with LALR or the many other parsing
>>> techniques available. Recursive descent parsing is for those
>>> situations where you need correctness, regardless of execution
>>> time. These situations happen...
>>
>> RDP is plenty fast; speed has never been one of it's
>> disadvantages, as far as I know. Today there are many
>> excellent parser generators and compiler builders that compose an
>> RDP under the hood, e.g., Antlr and Gentle.
>
> I think I read somewhere that LALR was O(n) while RDP was
> O(e^n). Most people would consider that, at least, slower...

To my knowledge, most RDPs are LL(1), which is O(n). As you
increase the amount of lookahead (a backtracking RDP is, I
suppose, a brute-force way of getting infinite lookahead), the
complexity increases.

> I think your examples may exemplify how little speed
> matters rather than how fast RDPs are...
>

> Also, it should probably be noted that bubble sort is very
> fast for nearly sorted lists; much faster than quicksort. So,
> while it shouldn't be the default sorting algorithm, you could
> have it somewhere in the library...

Yes, bubble-sort runs fast in certain circumstances; you wouldn't
want to be bereft of it completely. Backtracking parsers do
probably have their place in the pantheon. I don't want
PyParsing to do backtracking by default, though it might be a
useful option.

--
Neil Cerutti

Neil Cerutti

unread,
Nov 6, 2007, 10:48:11 AM11/6/07
to
On 2007-11-05, Just Another Victim of the Ambient Morality

<ihat...@hotmail.com> wrote:
> "Kay Schluehr" <kay.sc...@gmx.net> wrote in message
> news:1194223837.1...@o3g2000hsb.googlegroups.com...
>> I'm not sure one needs to start again with a naive approach
>> just to avoid any parser theory. For a user of a parser it is
>> quite important whether she has to wait 50 seconds for a parse
>> to run or 50 milliseconds. I don't like to compromise speed
>> for implementation simplicity here.
>
> This attitude is all too prevalent among computer
> professionals... Of course it's a useful thing to shield users
> from the intricacies of parser theory! Just as much as it is
> useful to shield drivers from needing automotive engineering or
> software users from programing. How many people have come to
> this newsgroup asking about anomalous pyparsing behaviour,
> despite their grammars being mathematically correct.

You might be interested in the Early parsing algorithm. It is
more efficient than the naive approach used in your prototype,
and still handles ambiguous grammars.

There is a Python module SPARK that provides generic classes for
building small language compilers using an Early parser, and I
was able to get it to parse your ambiguous grammar without
trouble. It is not as convenient or as well documented as
PyParsing, but the parsing algorithm provides the power you're
looking for. It might serve as a backend for the library you're
currently working on.

http://www.cpsc.ucalgary.ca/~aycock/spark/

--
Neil Cerutti

asdfjehqwjerhqjwljekrh

unread,
Nov 6, 2007, 11:27:06 AM11/6/07
to
On Nov 2, 6:47 am, Paul McGuire <pt...@austin.rr.com> wrote:
> Well I'll be darned! All this time, I thought "recursive descent"
> described the recursive behavior of the parser, which pyparsing
> definitely has. I never knew that backing up in the face of parse
> mismatches was a required part of the picture.

I looked at pyparsing about a year ago for some project and realized
that it wouldn't quite do what I needed it to. Maddeningly enough, I
cannot remember exactly what the problem was, but I think that it was
some combination of lack of backtracking and insufficient control over
whitespace skipping.

Mostly off-topic, what I could *really* use is something like this
folded into the Python standard library. I cannot count the number of
times that this would have saved me 30 lines of code.

http://mail.python.org/pipermail/python-dev/2004-August/047042.html

Mike

Just Another Victim of the Ambient Morality

unread,
Nov 7, 2007, 2:51:34 PM11/7/07
to

"Neil Cerutti" <hor...@yahoo.com> wrote in message
news:slrnfj13e4....@FIAD06.norwich.edu...

> On 2007-11-05, Just Another Victim of the Ambient Morality
> <ihat...@hotmail.com> wrote:
>> "Kay Schluehr" <kay.sc...@gmx.net> wrote in message
>> news:1194223837.1...@o3g2000hsb.googlegroups.com...
>>> I'm not sure one needs to start again with a naive approach
>>> just to avoid any parser theory. For a user of a parser it is
>>> quite important whether she has to wait 50 seconds for a parse
>>> to run or 50 milliseconds. I don't like to compromise speed
>>> for implementation simplicity here.
>>
>> This attitude is all too prevalent among computer
>> professionals... Of course it's a useful thing to shield users
>> from the intricacies of parser theory! Just as much as it is
>> useful to shield drivers from needing automotive engineering or
>> software users from programing. How many people have come to
>> this newsgroup asking about anomalous pyparsing behaviour,
>> despite their grammars being mathematically correct.
>
> You might be interested in the Early parsing algorithm. It is
> more efficient than the naive approach used in your prototype,
> and still handles ambiguous grammars.

I think I might be interested in this algorithm, thank you!


> There is a Python module SPARK that provides generic classes for
> building small language compilers using an Early parser, and I
> was able to get it to parse your ambiguous grammar without
> trouble. It is not as convenient or as well documented as
> PyParsing, but the parsing algorithm provides the power you're
> looking for. It might serve as a backend for the library you're
> currently working on.
>
> http://www.cpsc.ucalgary.ca/~aycock/spark/

You know, I tried this thing but, for the life of me, I can't figure out
how to use it and the few tutorials out there are less than illuminating...

Neil Cerutti

unread,
Nov 7, 2007, 4:04:40 PM11/7/07
to
On 2007-11-07, Just Another Victim of the Ambient Morality

<ihat...@hotmail.com> wrote:
> "Neil Cerutti" <hor...@yahoo.com> wrote in message
> news:slrnfj13e4....@FIAD06.norwich.edu...
>> You might be interested in the Early parsing algorithm. It is
>> more efficient than the naive approach used in your prototype,
>> and still handles ambiguous grammars.
>

I'll take this opportunity to correct my misspelling. It's
"Earley".

> I think I might be interested in this algorithm, thank you!
>

>> http://www.cpsc.ucalgary.ca/~aycock/spark/
>
> You know, I tried this thing but, for the life of me, I
> can't figure out how to use it and the few tutorials out there
> are less than illuminating...

I'll send you the code I composed.

The tricky part of Spark is the Token and AST classes you have to
use. A type used as a token class is required to provide a
__cmp__ function that behaves in the following confounding
manner:

class Token(object):
    def __cmp__(self, o):
        return cmp(self.type, o)

If you attempt to use a list, string or a tuple as token, it just
barfs. AST's are required to provide an even weirder interface.

In effect, you have to write badly designed wrappers around
tuples and lists, respectively to take advantage of the generic
classes.

Go to the examples directory of the distribution to find working
versions of these stupid classes.

Once you get over that hurdle it becomes easier. Be sure to
provide your Parser and Scanner classes with an error method to
prevent the library from raising SystemExit(!) on errors. Scanner
classes are also required to override the t_default method to
prevent this mishap.

In short, it hasn't really evolved into a user-friendly package
yet.

--
Neil Cerutti

Just Another Victim of the Ambient Morality

unread,
Nov 7, 2007, 4:15:50 PM11/7/07
to

"Neil Cerutti" <hor...@yahoo.com> wrote in message
news:ILpYi.39972$G23....@newsreading01.news.tds.net...

Thank you.
How is it that I seem to be the only one in the market for a correct
parser? Earley has a runtime of O(n^3) in the worst case and O(n^2)
typically. I have trouble believing that everyone else in the world has
such intense run-time requirements that they're willing to forego
correctness. Why can't I find a pyparsing-esque library with this
implementation? I'm tempted to roll my own except that it's a fairly
complicated algorithm and I don't really understand how it's any more
efficient than the naive approach...
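(For the record, a bare-bones Earley recognizer really does fit in a few dozen lines. The sketch below is my own, not from SPARK or any library mentioned here; it accepts the "words then 'end'" grammar from the start of this thread even though 'end' is itself a valid word, which is exactly the case the greedy approach chokes on:)

```python
def earley_recognize(grammar, start, tokens):
    """Earley recognizer. grammar maps each nonterminal to a list of
    alternatives; each alternative is a tuple of symbols. A symbol that
    is a key of grammar is a nonterminal; anything else is a terminal.
    Chart items are (lhs, rhs, dot, origin)."""
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for alt in grammar[start]:
        chart[0].add((start, alt, 0, 0))
    for i in range(n + 1):
        changed = True
        while changed:              # run predict/scan/complete to a fixpoint
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:                   # predict
                        for alt in grammar[sym]:
                            item = (sym, alt, 0, i)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
                    elif i < n and tokens[i] == sym:     # scan
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:               # complete: advance items waiting on lhs
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])
```

With the thread's problem grammar (note that 'end' is one of word's alternatives), the recognizer still finds the parse, because every item that could explain a prefix is kept in the chart rather than committed to greedily.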


Chris Mellon

unread,
Nov 7, 2007, 4:43:08 PM11/7/07
to Just Another Victim of the Ambient Morality, pytho...@python.org
On Nov 7, 2007 3:15 PM, Just Another Victim of the Ambient Morality
<ihat...@hotmail.com> wrote:

You have an unusual definition of correctness. Many people would say
that an ambiguous grammar is a bug, not something to support.

In fact, I often use pyparsing precisely in order to disambiguate
(according to specific rules, which are embodied by the parser)
ambiguous input, like bizarre hand-entered datetime value.

Steven D'Aprano

unread,
Nov 7, 2007, 5:16:37 PM11/7/07
to
On Wed, 07 Nov 2007 21:15:50 +0000, Just Another Victim of the Ambient
Morality wrote:

> Why can't I find a pyparsing-esque library with this implementation?
> I'm tempted to roll my own except that it's a fairly complicated
> algorithm and I don't really understand how it's any more efficient than
> the naive approach...

I think you may have just answered your own question :)


--
Steven.

Just Another Victim of the Ambient Morality

unread,
Nov 7, 2007, 6:15:40 PM11/7/07
to

"Chris Mellon" <ark...@gmail.com> wrote in message
news:mailman.945.11944717...@python.org...

> On Nov 7, 2007 3:15 PM, Just Another Victim of the Ambient Morality
> <ihat...@hotmail.com> wrote:
>
>> > In short, it hasn't really evolved into a user-friendly package
>> > yet.
>>
>> Thank you.
>> How is it that I seem to be the only one in the market for a correct
>> parser? Earley has a runtime of O(n^3) in the worst case and O(n^2)
>> typically. I have trouble believing that everyone else in the world has
>> such intense run-time requirements that they're willing to forego
>> correctness. Why can't I find a pyparsing-esque library with this
>> implementation? I'm tempted to roll my own except that it's a fairly
>> complicated algorithm and I don't really understand how it's any more
>> efficient than the naive approach...
>
> You have an unusual definition of correctness. Many people would say
> that an ambiguous grammar is a bug, not something to support.

I don't think I do. Besides, you assume too much...
First off, we've already established that there are unambiguous grammars
for which pyparsing will fail to parse. One might consider that a bug in
pyparsing...
Secondly, I get the impression you want to consider ambiguous grammars,
in some sense, "wrong." They are not. Even if they were, if you are
parsing something for which you are not the creator and that something
employs an ambiguous grammar, what choice do you have? Furthermore, given a
set of possible parsings, you might be able to decide which one you favour
given the context of what was parsed! There's a plethora of applications
for parsing ambiguous grammars yet there are no tools for doing so?


> In fact, I often use pyparsing precisely in order to disambiguate
> (according to specific rules, which are embodied by the parser)
> ambiguous input, like bizarre hand-entered datetime value.

What do you mean? How do you use pyparsing to disambiguate:


01-01-08


...?


Just Another Victim of the Ambient Morality

unread,
Nov 7, 2007, 6:16:35 PM11/7/07
to

"Steven D'Aprano" <st...@REMOVE-THIS-cybersource.com.au> wrote in message
news:13j4ea5...@corp.supernews.com...

Yes, but my own answer lacks detail that I was hoping you could
provide...

Chris Mellon

unread,
Nov 7, 2007, 7:01:31 PM11/7/07
to pytho...@python.org
On Nov 7, 2007 5:15 PM, Just Another Victim of the Ambient Morality

<ihat...@hotmail.com> wrote:
>
> "Chris Mellon" <ark...@gmail.com> wrote in message
> news:mailman.945.11944717...@python.org...
> > On Nov 7, 2007 3:15 PM, Just Another Victim of the Ambient Morality
> > <ihat...@hotmail.com> wrote:
> >
> >> > In short, it hasn't really evolved into a user-friendly package
> >> > yet.
> >>
> >> Thank you.
> >> How is it that I seem to be the only one in the market for a correct
> >> parser? Earley has a runtime of O(n^3) in the worst case and O(n^2)
> >> typically. I have trouble believing that everyone else in the world has
> >> such intense run-time requirements that they're willing to forego
> >> correctness. Why can't I find a pyparsing-esque library with this
> >> implementation? I'm tempted to roll my own except that it's a fairly
> >> complicated algorithm and I don't really understand how it's any more
> >> efficient than the naive approach...
> >
> > You have an unusual definition of correctness. Many people would say
> > that an ambiguous grammar is a bug, not something to support.
>
> I don't think I do.

There are an enormous variety of parsing tools, and it's the subject
of much research. And in all those tools, not one meets your
definition of correctness? You don't think that might make it unusual?


> Besides, you assume too much...
> First off, we've already established that there are unambiguous grammars
> for which pyparsing will fail to parse. One might consider that a bug in
> pyparsing...

You might. Or you might not, since it's well known that there are lots
of types of parsers that can't parse all possible grammars, but that
doesn't make those parsers useless.


> Secondly, I get the impression you want to consider ambiguous grammars,
> in some sense, "wrong." They are not.

Sure they are, at least in many contexts. I understand that you want
support for them, but it's by far more common to want one and only one
set of results from parsing a particular document.

>Even if they were, if you are
> parsing something for which you are not the creator and that something
> employs an ambiguous grammar, what choice do you have?

You either disambiguate, or you don't accept ambiguous input. The
third option seems to be what you want, which is to find all possible
solutions and return all of them (and wouldn't this be NP-hard in the
general case?) but that's not a satisfactory result in most
applications.

> Furthermore, given a
> set of possible parsings, you might be able to decide which one you favour
> given the context of what was parsed! There's a plethora of applications
> for parsing ambiguous grammars yet there are no tools for doing so?
>

If you can do this, isn't this really a sign that your grammar is
context sensitive?

>
> > In fact, I often use pyparsing precisely in order to disambiguate
> > (according to specific rules, which are embodied by the parser)
> > ambiguous input, like bizarre hand-entered datetime value.
>
> What do you mean? How do you use pyparsing to disambiguate:
>
>
> 01-01-08

The same way a human would - given an ambiguous date such as this, I
(arbitrarily) decided what it would mean. The alternative is to refuse
the input.

Just Another Victim of the Ambient Morality

unread,
Nov 8, 2007, 12:07:00 AM11/8/07
to

"Chris Mellon" <ark...@gmail.com> wrote in message
news:mailman.948.11944800...@python.org...

It doesn't appear to be common, I'll grant you that!
However, there is some research. For instance, the Earley parser
appears to be what I want (in conjunction with a parse tree builder). A CYK
parser would probably do, too. The algorithms are out there yet no one has
chosen to use any of them. At the same time, there are several LALR
parsers. Why did anyone need to write the second one after the first one
was written?!
In fact, in a sense, my problem is solved. There exists a solution to
my problem. It's just that no one has implemented that solution. I guess
you're right in that it really does appear to be an unusual problem but I
don't understand how...


>> Besides, you assume too much...
>> First off, we've already established that there are unambiguous
>> grammars
>> for which pyparsing will fail to parse. One might consider that a bug in
>> pyparsing...
>
> You might. Or you might not, since it's well known that there are lots
> of types of parsers that can't parse all possible grammars, but that
> doesn't make those parsers useless.

No one said they were useless. I only said that a correct parser is
useful. Many people in this thread seem to disagree and I find this
incredible...


>> Secondly, I get the impression you want to consider ambiguous
>> grammars,
>> in some sense, "wrong." They are not.
>
> Sure they are, at least in many contexts. I understand that you want
> support for them, but it's by far more common to want one and only one
> set of results from parsing a particular document.

Okay, in some contexts, an ambiguous grammar may be considered
erroneous. However, in many other contexts, it's merely a fact of life.
How is it that there are no tools to address this? If nothing else,
pyparsing throws the same error it does when there is no valid parsing of
the string. Having no solution and having several solutions are not the
same thing...


>>Even if they were, if you are
>> parsing something for which you are not the creator and that something
>> employs an ambiguous grammar, what choice do you have?
>
> You either disambiguate, or you don't accept ambiguous input. The
> third option seems to be what you want, which is to find all possible
> solutions and return all of them (and wouldn't this be NP-hard in the
> general case?) but that's not a satisfactory result in most
> applications.

What do you mean by "disambiguate?" Do you mean disambiguate the
grammar? One of the conditions of the problem is that you have no control
over the grammar, so that's really not an option. Also, an implicit
condition of solving a problem is that the problem be... solved, so not
accepting the input is not an option, either.
While there are many applications that can't deal with multiple
solutions, surely there are some? Again, perhaps you can pick one solution
over the other through the context of the parse results? That's kind of
hard to do if your parser refuses to return any results...
Just because a grammar is ambiguous doesn't mean the input string is.
It can easily be the case that most or all the input you're expecting will,
in practice, only produce one correct parse tree. In this case, it would be
useful for your parser to return a correct solution, even at random!
Finally, who cares if something is NP-Hard? Okay, in some situations,
you'd care. In many others, you don't. For instance, suppose your input
length has an upper bound? Unless that's a really high bound or a really
complex grammar, its runtime is not likely relevant...


>> Furthermore, given a
>> set of possible parsings, you might be able to decide which one you
>> favour
>> given the context of what was parsed! There's a plethora of applications
>> for parsing ambiguous grammars yet there are no tools for doing so?
>
> If you can do this, isn't this really a sign that your grammar is
> context sensitive?

I don't think so. I'm using the word "context" colloquially, not
grammatically. If you know what kind of output you're expecting and only
one of several parsings produces that kind of output, then you know which
one to run with. That doesn't mean the grammar is context sensitive; just
that the data is...
As an aside, do you know what tools are available for parsing context
sensitive grammars?


>> > In fact, I often use pyparsing precisely in order to disambiguate
>> > (according to specific rules, which are embodied by the parser)
>> > ambiguous input, like bizarre hand-entered datetime value.
>>
>> What do you mean? How do you use pyparsing to disambiguate:
>>
>> 01-01-08
>
> The same way a human would - given an ambiguous date such as this, I
> (arbitrarily) decided what it would mean. The alternative is to refuse
> the input.

You use pyparsing to "arbitrarily decide what it would mean?"
You said you "often use pyparsing" to "disambiguate ambiguous input" and
I was wondering what you meant by this.
On one hand, it's not exactly a fair question since that date string is
not grammatically ambiguous. On the other hand, you're the one who brought
up "bizarre hand-entered datetime value," and I was only trying to give an
example of what you meant. So, what do you mean?


Kay Schluehr

unread,
Nov 8, 2007, 1:40:22 AM11/8/07
to
> What do you mean by "disambiguate?" Do you mean disambiguate the
> grammar? One of the conditions of the problem is that you have no control
> over the grammar, so that's really not an option. Also, an implicit
> condition of solving a problem is that the problem be... solved, so not
> accepting the input is not an option, either.

Could you please learn some parser theory 101 and then come back when
you have complaints about one or the other implementation of a
particular Python parser? When some guy enters a forum with a "newbie
question" most of the time people are willing to give a fair and
comprehensive answer but they don't want to mess around with him
endlessly when he is not willing to learn and to listen.


JimJJ...@gmail.com

unread,
Nov 8, 2007, 11:58:02 AM11/8/07
to
On Nov 3, 10:49 am, Paul McGuire <pt...@austin.rr.com> wrote:

> grammar << ((word + grammar) | (word + Literal(end)))

> which works.

[Clarifies that the common (and similar) solution doesn't work -- this
works only because the literal binds tightly to the word, so you can't
get a word that matches on its own.]


> Unfortunately, when the OneOrMore gets constructed, it does not have
> any visibility beyond knowing what is to be repeated. Again, here is
> the data structure that is being built:
>
> - And
> - OneOrMore
> - Word(alphas)
> - Literal('end')
>
> Only at the level of the And is there any awareness that the OneOrMore
> is followed by anything, let alone by something which could be
> misinterpreted as something matching the OneOrMore's repetition
> expression.

> Can you suggest a way I could generalize this, so that OneOrMore stops
> matching before it gets to 'end'?

Not efficiently.

I think JAVotAM's point was that you could make it at least an option
if OneOrMore (and ZeroOrMore ...) could return fallback results as
well.

Then, when the And failed, it could retry with the next fallback
result from OneOrMore before it gives up completely.

-jJ

[david]

unread,
Nov 8, 2007, 6:28:59 PM11/8/07
to
Kay Schluehr wrote:
>
> Could you please learn some parser theory 101 and then come back when
> you have complaints about one or the other implemententation of a
> particular Python parser? When some guy enters a forum with a "newbie
> question" most of the time people are willing to give a fair and
> comprehensive answer but they don't want to mess around with him
> endlessly when he is not willing to learn and to listen.
>
>
Immaturity is a mark of this group, isn't it? When all else fails,
resort to insults.

My suggestion is, if you can't be polite, just go back to your own room.

(david)

Kay Schluehr

unread,
Nov 8, 2007, 11:19:15 PM11/8/07
to
On Nov 9, 12:28 am, "[david]" <da...@nospam.spam> wrote:
> Kay Schluehr wrote:
>
> > Could you please learn some parser theory 101 and then come back when
> > you have complaints about one or the other implementation of a
> > particular Python parser? When some guy enters a forum with a "newbie
> > question" most of the time people are willing to give a fair and
> > comprehensive answer but they don't want to mess around with him
> > endlessly when he is not willing to learn and to listen.
>
> Immaturity is a mark of this group, isn't it? When all else fails,
> resort to insults.

Dear [david], when all else fails, be rigorous. Tell someone that it
is not o.k. to keep people busy with his complaints when he doesn't
know what he is talking about. This is *my* rejection of inpoliteness
and has nothing to do with "this group" in the first place. But I'm
not entirely sure I want to discuss about polite behaviour with
someone not using his full name.

Kay
