I want to parse this string:
version 3.5.1 {
    $pid_dir = /opt/samba-3.5.1/var/locks/
    $bin_dir = /opt/samba-3.5.1/bin/
    service smbd {
        bin = ${bin_dir}smbd -D
        pid = ${pid_dir}smbd.pid
    }
    service nmbd {
        bin = ${bin_dir}nmbd -D
        pid = ${pid_dir}nmbd.pid
    }
    service winbindd {
        bin = ${bin_dir}winbindd -D
        pid = ${pid_dir}winbindd.pid
    }
}
version 3.2.14 {
    $pid_dir = /opt/samba-3.5.1/var/locks/
    $bin_dir = /opt/samba-3.5.1/bin/
    service smbd {
        bin = ${bin_dir}smbd -D
        pid = ${pid_dir}smbd.pid
    }
    service nmbd {
        bin = ${bin_dir}nmbd -D
        pid = ${pid_dir}nmbd.pid
    }
    service winbindd {
        bin = ${bin_dir}winbindd -D
        pid = ${pid_dir}winbindd.pid
    }
}
Step 1: get a single version block:
version 3.2.14 {
    $pid_dir = /opt/samba-3.5.1/var/locks/
    $bin_dir = /opt/samba-3.5.1/bin/
    service smbd {
        bin = ${bin_dir}smbd -D
        pid = ${pid_dir}smbd.pid
    }
    service nmbd {
        bin = ${bin_dir}nmbd -D
        pid = ${pid_dir}nmbd.pid
    }
    service winbindd {
        bin = ${bin_dir}winbindd -D
        pid = ${pid_dir}winbindd.pid
    }
}
Step 2: get each service block out of it:
service smbd {
    bin = ${bin_dir}smbd -D
    pid = ${pid_dir}smbd.pid
}
Step 3: get the variable definitions:
$pid_dir = /opt/samba-3.5.1/var/locks/
$bin_dir = /opt/samba-3.5.1/bin/
Step 4: get the settings of a service:
bin = ${bin_dir}smbd -D
pid = ${pid_dir}smbd.pid
My regular expressions:
version[\s]*[\w\.]*[\s]*\{[\w\s\n\t\{\}=\$\.\-_\/]*\}
service[\s]*[\w]*[\s]*\{([\n\s\w\=]*(\$\{[\w_]*\})*[\w\s\-=\.]*)*\}
I don't think this is a good solution. I'm now trying with groups:
(service[\s\w]*)\{([\n\w\s=\$\-_\.]*)
but this part causes problems: ${bin_dir}
Kind Regards
Richi
Regular expressions != Parsers
Every time someone tries to parse nested structures using regular
expressions, Jamie Zawinski kills a puppy.
Try using an *actual* parser, such as Pyparsing:
http://pyparsing.wikispaces.com/
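Rough, untested sketch of what a pyparsing grammar for your format might look
like (the element names are just mine; values are taken as the rest of the
line, so the ${...} substitution would be a separate post-processing step):

from pyparsing import (Word, alphas, alphanums, Keyword, Suppress,
                       Group, ZeroOrMore, restOfLine)

LBRACE, RBRACE, EQ = map(Suppress, "{}=")
identifier = Word(alphas, alphanums + "_")

# $pid_dir = /opt/samba-3.5.1/var/locks/
var_def = Group(Suppress("$") + identifier("name") + EQ + restOfLine("value"))

# bin = ${bin_dir}smbd -D
setting = Group(identifier("key") + EQ + restOfLine("value"))

# service smbd { ... }
service = Group(Keyword("service") + identifier("name") + LBRACE
                + ZeroOrMore(setting)("settings") + RBRACE)

# version 3.5.1 { ... }
version = Group(Keyword("version") + Word(alphanums + ".")("number") + LBRACE
                + ZeroOrMore(var_def)("vars")
                + ZeroOrMore(service)("services") + RBRACE)

config = ZeroOrMore(version)
# result = config.parseString(text)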
Cheers,
Chris
--
Some people, when confronted with a problem, think:
"I know, I'll use regular expressions." Now they have two problems.
http://blog.rebertia.com
(snip)
I think you'd be better off writing a specific parser here. Paul McGuire's
PyParsing package might help:
http://pyparsing.wikispaces.com/
My 2 cents.
Well, after some more attempts with regexes, you're both right. I will use
pyparsing; it seems to be the better solution.
Kind Regards
> Regular expressions != Parsers
True, but lots of parsers *use* regular expressions in their
tokenizers. In fact, if you have a pure Python parser, you can often
get huge performance gains by rearranging your code slightly so that
you can use regular expressions in your tokenizer, because that
effectively gives you access to a fast, specialized C library that is
built into practically every Python interpreter on the planet.
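For example, something along these lines (a throwaway sketch, not tied to any
particular grammar) -- one combined pattern with named groups, so the compiled
re engine does the character-level work and the Python layer only ever sees
whole tokens:

import re

TOKEN_RE = re.compile(r"""
    (?P<NUMBER> \d+(?:\.\d*)? )
  | (?P<NAME>   [A-Za-z_]\w*  )
  | (?P<OP>     [{}=$]        )
  | (?P<SKIP>   \s+           )
""", re.VERBOSE)

def tokens(text):
    # the name of the group that matched is the token type
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != 'SKIP':
            yield m.lastgroup, m.group()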
> Every time someone tries to parse nested structures using regular
> expressions, Jamie Zawinski kills a puppy.
And yet, if you are parsing stuff in Python, and your parser doesn't
use some specialized C code for tokenization (which will probably be
regular expressions unless you are using mxtexttools or some other
specialized C tokenizer code), your nested structure parser will be
dog slow.
Now, for some applications, the speed just doesn't matter, and for
people who don't yet know the difference between regexps and parsing,
pointing them at PyParsing is certainly doing them a valuable service.
But that's twice today that I've seen people warned off regular
expressions without a cogent explanation that, while the re module is
good at what it does, it really only handles the very lowest level of
a parsing problem.
My 2 cents is that something like PyParsing is absolutely great for
people who want a simple parser without a lot of work. But if people
use PyParsing, and then find out that (for their particular
application) it isn't fast enough, and then wonder what to do about
it, if all they remember is that somebody told them not to use regular
expressions, they will just come to the false conclusion that pure
Python is too painfully slow for any real world task.
Regards,
Pat
>> Regular expressions != Parsers
>
> True, but lots of parsers *use* regular expressions in their
> tokenizers. In fact, if you have a pure Python parser, you can often
> get huge performance gains by rearranging your code slightly so that
> you can use regular expressions in your tokenizer, because that
> effectively gives you access to a fast, specialized C library that is
> built into practically every Python interpreter on the planet.
Unfortunately, a typical regexp library (including Python's) doesn't allow
you to match against a set of regexps, returning the index of which one
matched. Which is what you really want for a tokeniser.
>> Every time someone tries to parse nested structures using regular
>> expressions, Jamie Zawinski kills a puppy.
>
> And yet, if you are parsing stuff in Python, and your parser doesn't
> use some specialized C code for tokenization (which will probably be
> regular expressions unless you are using mxtexttools or some other
> specialized C tokenizer code), your nested structure parser will be
> dog slow.
The point is that you *cannot* match arbitrarily-nested expressions using
regexps. You could, in theory, write a regexp which will match any valid
syntax up to N levels of nesting, for any finite N. But in practice, the
regexp is going to look horrible (and is probably going to be quite
inefficient if the regexp library uses backtracking rather than a DFA).
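A toy illustration (not something I'd actually recommend): each extra level of
nesting means wrapping the previous pattern once more, so the expression keeps
growing with N.

import re

# one brace level: {...} with no braces inside
level1 = r'\{[^{}]*\}'
# two levels: non-brace characters, or a complete one-level block
level2 = r'\{(?:[^{}]|' + level1 + r')*\}'

re.findall(level2, "version 3.5.1 { service smbd { bin = x } }")
# -> ['{ service smbd { bin = x } }']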
Even tokenising with Python's regexp interface is inefficient if the
number of token types is large, as you have to test against each regexp
sequentially.
Ultimately, if you want an efficient parser, you need something with a C
component, e.g. Plex.
I have taken a look at the different parsing modules and I'm reading the
source code to understand how they work. Since yesterday I have been writing my
own small engine - just for fun and to understand how I can get what I need.
It seems that this is hard to code, because the logic is complex and sometimes
confusing. It's not easy to find a "perfect" solution.
If someone knows good links on this topic, or can explain how parsers
should/could work, please post or explain it.
Thanks for the information and the help!
Kind Regards
Richi
Really!
I am only a Python newbie, but what about ...
import re

# (token type, sub-pattern) pairs; each sub-pattern keeps its own group
rr = [
    ( "id",    r'([a-zA-Z][a-zA-Z0-9]*)' ),
    ( "int",   r'([+-]?[0-9]+)' ),
    ( "float", r'([+-]?[0-9]+\.[0-9]*)' ),
    ( "float", r'([+-]?[0-9]+\.[0-9]*[eE][+-]?[0-9]+)' )
]
tlist = [ t[0] for t in rr ]
# one combined, anchored pattern made of all the alternatives
pat = '^ *(' + '|'.join([ t[1] for t in rr ]) + ') *$'
p = re.compile(pat)
ss = [ ' annc', '1234', 'abcd', ' 234sz ', '-1.24e3', '5.' ]
for s in ss:
    m = p.match(s)
    if m:
        # groups 2..5 correspond to the entries of rr; only one of them matches
        ix = [ i-2 for i in range(2,6) if m.group(i) ]
        print "'"+s+"' matches and has type", tlist[ix[0]]
    else:
        print "'"+s+"' does not match"
output:
' annc' matches and has type id
'1234' matches and has type int
'abcd' matches and has type id
' 234sz ' does not match
'-1.24e3' matches and has type float
'5.' matches and has type float
This seems to me to match a (small) set of regular expressions and
indirectly return the index of the matched expression, without
doing a sequential loop over the regular expressions.
Of course there is a loop over the results of the match to determine
which sub-expression matched, but a good regexp library (which
I presume Python has) should match the sub-expressions without
looping over them. The techniques to do this were well known in
the 1970s when the first versions of lex were written.
Not that I would recommend tricks like this. The regular
expression would quickly get out of hand for any non-trivial
list of regular expressions to match.
Charles
I liked Crenshaw's "Let's Build a Compiler!". It's pretty trivial
to convert his Pascal to Python, and you'll get to basic parsing
in no time. URL:http://compilers.iecc.com/crenshaw/
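To give a taste of the basic idea, here's a deliberately tiny throwaway sketch
(my own names, no error handling) of a recursive parser for the block syntax
from the original post: a block is a header line ending in '{', a body of
"key = value" lines and nested blocks, and a closing '}' line.

def parse_blocks(lines, pos=0):
    items = []
    while pos < len(lines):
        line = lines[pos].strip()
        pos += 1
        if not line:
            continue                   # skip blank lines
        if line == '}':                # end of the enclosing block
            return items, pos
        if line.endswith('{'):         # nested block: recurse
            body, pos = parse_blocks(lines, pos)
            items.append((line[:-1].strip(), body))
        else:                          # plain "key = value" line
            key, _, value = line.partition('=')
            items.append((key.strip(), value.strip()))
    return items, pos

# tree, _ = parse_blocks(config_text.splitlines())
# tree[0] -> ('version 3.5.1', [('$pid_dir', '/opt/...'), ...,
#             ('service smbd', [('bin', '${bin_dir}smbd -D'), ...]), ...])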
--
Neil Cerutti
> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched. Which is what you really want for a tokeniser.
Actually, there is some not very-well-documented code in the re module
that will let you do exactly that. But even not using that code,
using a first cut of re.split() or re.finditer() to break the string
apart into tokens (without yet classifying them) is usually a huge
performance win over anything else in the standard library (or that
you could write in pure Python) for this task.
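(I'm thinking of re.Scanner -- undocumented, but it pairs each pattern with a
callback and hands back the tokens plus any unmatched remainder. Something
like this, sketched from memory:)

import re

scanner = re.Scanner([
    (r'[A-Za-z_]\w*', lambda s, tok: ('NAME', tok)),
    (r'\d+',          lambda s, tok: ('NUMBER', tok)),
    (r'[{}=$]',       lambda s, tok: ('PUNCT', tok)),
    (r'\s+',          None),          # skip whitespace
])

tokens, remainder = scanner.scan("service smbd { bin = 42 }")
# tokens -> [('NAME', 'service'), ('NAME', 'smbd'), ('PUNCT', '{'), ...]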
> >> Every time someone tries to parse nested structures using regular
> >> expressions, Jamie Zawinski kills a puppy.
>
> > And yet, if you are parsing stuff in Python, and your parser doesn't
> > use some specialized C code for tokenization (which will probably be
> > regular expressions unless you are using mxtexttools or some other
> > specialized C tokenizer code), your nested structure parser will be
> > dog slow.
>
> The point is that you *cannot* match arbitrarily-nested expressions using
> regexps. You could, in theory, write a regexp which will match any valid
> syntax up to N levels of nesting, for any finite N. But in practice, the
> regexp is going to look horrible (and is probably going to be quite
> inefficient if the regexp library uses backtracking rather than a DFA).
Trust me, I already knew that. But what you just wrote is a much more
useful thing to tell the OP than "Every time someone tries to parse
nested structures using regular expressions, Jamie Zawinski kills a
puppy" which is what I was responding to. And right after
regurgitating that inside joke, Chris Rebert then went on to say "Try
using an *actual* parser, such as Pyparsing". Which is all well and
good, except then the OP will download pyparsing, take a look, realize
that it uses regexps under the hood, and possibly be very confused.
> Even tokenising with Python's regexp interface is inefficient if the
> number of token types is large, as you have to test against each regexp
> sequentially.
It's not that bad if you do it right. You can first rip things apart,
then use a dict-based scheme to categorize them.
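Roughly along these lines (illustrative sketch only): one coarse pattern rips
the text into pieces, and a dict lookup classifies them, instead of testing
type-specific regexes in a loop.

import re

token_pat = re.compile(r'\$\{\w+\}|[{}=]|[^\s{}=]+')

KEYWORDS = {'version': 'VERSION', 'service': 'SERVICE',
            '{': 'LBRACE', '}': 'RBRACE', '=': 'EQUALS'}

def classify(text):
    for raw in token_pat.findall(text):
        kind = KEYWORDS.get(raw)
        if kind is None:
            kind = 'SUBST' if raw.startswith('${') else 'WORD'
        yield kind, raw

# list(classify("service smbd { bin = ${bin_dir}smbd -D }"))
# -> [('SERVICE', 'service'), ('WORD', 'smbd'), ('LBRACE', '{'),
#     ('WORD', 'bin'), ('EQUALS', '='), ('SUBST', '${bin_dir}'), ...]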
> Ultimately, if you want an efficient parser, you need something with a C
> component, e.g. Plex.
There is no doubt that you can get better performance with C than with
Python. But, for a lot of tasks, the Python performance is
acceptable, and, as always, algorithm, algorithm, algorithm...
A case in point. My pure Python RSON parser is faster on my computer
on a real-world dataset of JSON data than the json decoder that comes
with Python 2.6, *even with* the json decoder's C speedups enabled.
Having said that, the current subversion pure Python simplejson parser
is slightly faster than my RSON parser, and the C reimplementation of
the parser in current subversion simplejson completely blows the doors
off my RSON parser.
So, a naive translation to C, even by an experienced programmer, may
not do as much for you as an algorithm rework.
Regards,
Pat
I don't agree with that. If a person is trying to ski using
pieces of wood that they carved themselves, I don't expect them
to be surprised that the skis they buy are made out of similar
materials.
--
Neil Cerutti
But, in this case, the guy ASKED how to make the skis in his
woodworking shop, and was told not to be silly -- you don't use wood
to make skis -- and then directed to go buy some skis that are, in
fact, made out of wood.
I think it would have been perfectly appropriate to point out that it
might take some additional woodworking equipment and a bit of
experience and/or study and/or extra work to make decent skis out of
wood (and, oh, by the way, here is where you can buy some ready-made
skis cheap), but the original response didn't explain it like this.
Regards,
Pat
Running a Python program in CPython eventually boils down to a sequence of
commands being executed by the CPU. That doesn't mean you should write
those commands manually, even if you can. It's perfectly ok to write the
program in Python instead.
Stefan
And it's even more perfectly okay to use Python when it's the best tool
for the job, and re when *it's* the best tool for the job.
~Ethan~
> On Apr 10, 11:35 am, Neil Cerutti <ne...@norwich.edu> wrote:
>> On 2010-04-10, Patrick Maupin <pmau...@gmail.com> wrote:
>> > as Pyparsing". Which is all well and good, except then the OP will
>> > download pyparsing, take a look, realize that it uses regexps under
>> > the hood, and possibly be very confused.
>>
>> I don't agree with that. If a person is trying to ski using pieces of
>> wood that they carved themselves, I don't expect them to be surprised
>> that the skis they buy are made out of similar materials.
>
> But, in this case, the guy ASKED how to make the skis in his woodworking
> shop, and was told not to be silly -- you don't use wood to make skis --
> and then directed to go buy some skis that are, in fact, made out of
> wood.
As entertaining as this is, the analogy is rubbish. Skis are far too
simple to use as an analogy for a parser (he says, having never seen skis
up close in his life *wink*). Have you looked at PyParsing's source code?
Regexes are only a small part of the parser, and not analogous to the
wood of skis.
Perhaps a better analogy would be a tennis racket, with regexes being the
strings. You have a whole lot of strings, not just one, and they are held
together with a strong framework. Without the framework the strings are
useless, and without the strings the racket doesn't do anything useful.
Using this analogy, I would say the OP was wanting to play tennis with a
single piece of string, and asking for advice on beefing it up to make it
work better. Perhaps a knot tied in one end will help?
--
Steven
The impression that I have (from a distance) is that Pyparsing is a good
interface abstraction with a kludgy and slow implementation. That the
implementation uses regexps just goes to show how kludgy it is. One
hopes that someday there will be a more serious implementation, perhaps
using llvm-py (I wonder whatever happened to that project, by the way)
so that your parser script will compile to executable machine code on
the fly.
I am definitely flattered that pyparsing stirs up so much interest,
and among such a distinguished group. But I have to take some umbrage
at Paul Rubin's left-handed compliment, "Pyparsing is a good
interface abstraction with a kludgy and slow implementation,"
especially since he forms his opinions "from a distance".
I actually *did* put some thought into what I wanted in pyparsing
before designing it, and this thinking forms a chapter of "Getting Started
with Pyparsing" (available here as a free online excerpt:
http://my.safaribooksonline.com/9780596514235/what_makes_pyparsing_so_special#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTk3ODA1OTY1MTQyMzUvMTYmaW1hZ2VwYWdlPTE2),
the "Zen of Pyparsing" as it were. My goals were:
- build parsers using explicit constructs (such as words, groups,
repetition, alternatives), vs. expression encoding using specialized
character sequences, as found in regexen
- easy parser construction from primitive elements to complex groups
and alternatives, using Python's operator overloading for ease of
direct implementation of parsers using ordinary Python syntax; include
mechanisms for defining recursive parser expressions
- implicit skipping of whitespace between parser elements
- results returned not just as a list of strings, but as a rich data
object, with access to parsed fields by name or by list index, taking
interfaces from both dicts and lists for natural adoption into common
Python idioms
- no separate code-generation steps, a la lex/yacc
- support for parse-time callbacks, for specialized token handling,
conversion, and/or construction of data structures
- 100% pure Python, to be runnable on any platform that supports
Python
- liberal licensing, to permit easy adoption into any user's projects
anywhere
So raw performance really didn't even make my short-list, beyond the
obvious "should be tolerably fast enough."
I have found myself reading posts on c.l.py with wording like "I'm
trying to parse <blah-blah> and I've been trying for hours/days to get
this regex working." For kicks, I'd spend 5-15 minutes working up a
pyparsing solution, which *does* run comparatively slowly,
perhaps taking a few minutes to process the poster's data file. But
the net solution is developed and running in under 1/2 an hour, which
to me seems like an overall gain compared to hours of fruitless
struggling with backslashes and regex character sequences. On top of
which, the pyparsing solutions are still readable when I come back to
them weeks or months later, instead of staring at some line-noise
regex and just scratching my head wondering what it was for. And
sometimes "comparatively slowly" means that it runs 50x slower than a
compiled method that runs in 0.02 seconds - that's still getting the
job done in just 1 second.
And is the internal use of regexes with pyparsing really a "kludge"?
Why? They are almost completely hidden from the parser developer. And
yet by using compiled regexes, I retain the portability of 100% Python
while leveraging the compiled speed of the re engine.
It does seem that there have been many posts of late (either on c.l.py
or the related posts on Stackoverflow) where the OP is trying to
either scrape content from HTML, or parse some type of recursive
expression. HTML scrapers implemented using re's are terribly
fragile, since HTML in the wild often contains little surprises
(unexpected whitespace; upper/lower case inconsistencies; tag
attributes in unpredictable order; attribute values with double,
single, or no quotation marks) which completely frustrate any re-based
approach. Granted, there are times when an re-parsing-of-HTML
endeavor *isn't* futile or doomed from the start - the OP may be
working with a very restricted set of HTML, generated from some other
script so that the output is very consistent. Unfortunately, this
poster usually gets thrown under the same "you'll never be able to
parse HTML with re's" bus. I can't explain the surge in these posts,
other than to wonder if we aren't just seeing a skewed sample - that
is, the many cases where people *are* successfully using re's to solve
their text extraction problems aren't getting posted to c.l.py, since
no one posts questions they already have the answers to.
So don't be too dismissive of pyparsing, Mr. Rubin. I've gotten many e-
mails, wiki, and forum posts from Python users at all levels of the
expertise scale, saying that pyparsing has helped them to be very
productive in one or another aspect of creating a command parser, or
adding safe expression evaluation to an app, or just extracting some
specific data from a log file. I am encouraged that most report that
they can get their parsers working in reasonably short order, often by
reworking one of the examples that comes with pyparsing. If you're
offering to write that extension to pyparsing that generates the
parser runtime in fast machine code, it sounds totally bitchin' and
I'd be happy to include it when it's ready.
-- Paul
> Running a Python program in CPython eventually boils down to a sequence of
> commands being executed by the CPU. That doesn't mean you should write
> those commands manually, even if you can. It's perfectly ok to write the
> program in Python instead.
Absolutely. But (as I seem to have posted many times recently) if
somebody asks how to do "x" it may be useful to point out that it
sounds like he really wants "y" and there are already several canned
solutions that do "y", but if he really wants "x", here is how he
should do it, or here is why he will have problems if he attempts to
do it (hint: whether Jamie Zawinski decides to kill a puppy or not is
not really a problem for somebody just asking a programming question
-- that's really up to Jamie).
Regards,
Pat
You should have seen the car engine analogy I thought up at
first. ;)
> Skis are far too simple to use as an analogy for a parser (he
> says, having never seen skis up close in his life *wink*).
> Have you looked at PyParsing's source code? Regexes are only a
> small part of the parser, and not analogous to the wood of
> skis.
I was mainly trying to get across my incredulity that somebody
should be surprised a parsing package uses regexes under the
hood. But for the record, a set of downhill skis comes with a
really fancy interface layer:
URL:http://images03.olx.com/ui/1/85/66/13147966_1.jpg
--
Neil Cerutti