
Comparison of parsers in python?


Peng Yu

Sep 19, 2009, 4:21:58 PM
Hi,

I did a Google search and found various parsers in Python that can be
used to parse different files in various situations. I don't see a page
that summarizes and compares all the available parsers in Python, from
simple and easy-to-use ones to complex and powerful ones.

I am wondering if somebody could list all the available parsers and
compare them.

Regards,
Peng

Robert Kern

Sep 19, 2009, 7:05:18 PM
to pytho...@python.org
Peng Yu wrote:
> Hi,
>
> I did a Google search and found various parsers in Python that can be
> used to parse different files in various situations. I don't see a page
> that summarizes and compares all the available parsers in Python, from
> simple and easy-to-use ones to complex and powerful ones.

Second hit for "python parser":

http://nedbatchelder.com/text/python-parsers.html

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Peng Yu

Sep 19, 2009, 9:34:24 PM
On Sep 19, 6:05 pm, Robert Kern <robert.k...@gmail.com> wrote:
> Peng Yu wrote:
> > Hi,
>
> > I did a Google search and found various parsers in Python that can be
> > used to parse different files in various situations. I don't see a page
> > that summarizes and compares all the available parsers in Python, from
> > simple and easy-to-use ones to complex and powerful ones.
>
> Second hit for "python parser":
>
> http://nedbatchelder.com/text/python-parsers.html

This is more or less just a list of parsers. I would like some detailed
guidelines on which one to choose for various parsing problems.

Regards,
Peng

TerryP

Sep 19, 2009, 11:39:16 PM
Peng Yu wrote:
> This is more or less just a list of parsers. I would like some detailed
> guidelines on which one to choose for various parsing problems.
>
> Regards,
> Peng


It depends on the parsing problem.

Obviously you're not going to use an INI parser to work with XML, or
vice versa. Likewise, some formats can be parsed in different ways; XML
parsers, for example, are often built around a SAX or DOM model. The
differences between them (hit Wikipedia) can affect the performance of
your application more than learning how to use an XML parser's API can
affect the hair on your head.
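
For instance, a minimal sketch with the SAX parser from Python's
standard library (the <item> element and the counting are made up
purely for illustration); SAX streams events instead of building a
tree, so memory use stays flat even on huge files:

import xml.sax

class ItemCounter(xml.sax.ContentHandler):
    # SAX calls startElement once per opening tag as it reads
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.count = 0
    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1

handler = ItemCounter()
xml.sax.parseString(b"<doc><item/><item/></doc>", handler)
print(handler.count)  # 2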

For flat data, a simple Unix-style rc or DOS-style INI file will often
suffice, and writing a parser is fairly trivial; in fact, writing a
config file parser is an excellent learning exercise, to get a feel
for a given language's standard I/O, string handling, and type
conversion features. These kinds of parsers tend to be pretty quick
because of their simplicity, and writing a small but extremely fast
one can be enjoyable at times; one of these days I need to do it in
x86 assembly just for the hell of it. Python includes an INI parser in
the standard library.
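
For example, a minimal sketch with that standard library module
(spelled configparser in Python 3; the Python 2.x current as of this
thread spells it ConfigParser, with a slightly older API):

from configparser import ConfigParser

config = ConfigParser()
config.read_string("[server]\nhost = example.org\nport = 8080\n")
print(config.get("server", "host"))     # example.org
print(config.getint("server", "port"))  # 8080, converted for you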

XML serves well for hierarchical data models, but writing code around
the parsers can be a royal pain (IMHO anyway!), though it is often
handy. Popular parsers for XML include expat and libxml2; there is
also a more "Pythonic" wrapper for libxml/libxslt called py-lxml, and
Python itself comes with parsers for XML. Other formats such as JSON,
YAML, heck, even S-expressions could be used and parsed. Some programs
only parse enough to slurp up code and eval it (not always smart, but
sometimes useful).
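
For example, a minimal tree-style sketch with the XML parser that
ships in Python's standard library (the tags and attribute here are
invented for illustration):

import xml.etree.ElementTree as ET

root = ET.fromstring('<config><item name="depth">3</item></config>')
for item in root.iter("item"):          # walk every <item> element
    print(item.get("name"), item.text)  # depth 3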

In general, the issues to consider when selecting a parser for a given
format involve speed, size, and time: how long does it take to
process the data set, how much memory (size) does it consume, and how
much bloody time will it take to learn the API ;).


The best way to choose a parser is to experiment with several, test
(and profile!) them against the project, then pick the one you like
best out of those that are suitable for the task. Profiling can be
very important.
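
For instance, a quick sketch with the profiler in the standard
library (parse_file here is just a stand-in for whichever candidate
parser you are testing):

import cProfile

def parse_file(path):
    # stand-in parser: split every line into fields
    with open(path) as f:
        return [line.split() for line in f]

cProfile.run('parse_file("sample.dat")', sort="cumulative")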


--
TerryP

andrew cooke

Sep 20, 2009, 7:50:53 AM

it would be simpler if you described what you want to do - parsers can
be used for a lot of problems.

also, the parsers do not exist in isolation - you need to worry about
whether they are supported, how good the documentation is, etc.

and different parsers handle different grammars - see
http://wiki.python.org/moin/LanguageParsing - so if you already have a
particular grammar then your life is simpler if you choose a parser
that matches.


these are the three that i know most about - i think all three are
currently maintained:

for simple parsing problems, i think pyparsing is the most commonly
used - http://pyparsing.wikispaces.com/

my own lepl - http://www.acooke.org/lepl/ - tries to combine ease of
use with some more advanced features

the nltk - http://www.nltk.org/ - is particularly targeted at parsing
natural languages and includes a wide variety of tools.


but for a large parsing project you might be better off interfacing to
a compiled parser (lepl has memoisation, so should scale quite well,
but it's not something i've looked at in detail yet).

andrew

andrew cooke

Sep 20, 2009, 7:55:25 AM
On Sep 19, 11:39 pm, TerryP <bigboss1...@gmail.com> wrote:
[...]

> For flat data, simple unix style rc or dos style ini file will often
> suffice, and writing a parser is fairly trivial; in fact writing a
[...]

python already includes parsers for ".ini" configuration files.

[...]


> The best way to choose a parser, is experiment with several, test (and
> profile!) them according to the project, then pick the one you like
> best, out of those that are suitable for the task. Profiling can be
> very important.

profiling is going to show you the constant factors, but - unless
you think hard - it's not going to explain how a parser will scale
(how performance changes as the amount of text to be parsed
increases). for that, you need to look at the algorithm used, which
is usually documented somewhere. there are going to be trade-offs -
parsers that handle large texts better could well be more complex and
slower on small texts.
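
a rough sketch of the kind of check i mean (parse and make_input are
hypothetical stand-ins for the parser under test and a function that
builds an input of a given size):

import time

def scaling_probe(parse, make_input, sizes=(1000, 10000, 100000)):
    # time the same parser on growing inputs; if the time per
    # character grows with the input size, the parser is not linear
    for n in sizes:
        text = make_input(n)
        start = time.time()
        parse(text)
        print(n, time.time() - start)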

andrew

Peng Yu

Sep 20, 2009, 8:11:50 AM
to pytho...@python.org
On Sun, Sep 20, 2009 at 6:50 AM, andrew cooke <and...@acooke.org> wrote:
> On Sep 19, 9:34 pm, Peng Yu <pengyu...@gmail.com> wrote:
>> On Sep 19, 6:05 pm, Robert Kern <robert.k...@gmail.com> wrote:
>> >http://nedbatchelder.com/text/python-parsers.html
>>
>> This is more or less just a list of parsers. I would like some detailed
>> guidelines on which one to choose for various parsing problems.
>
> it would be simpler if you described what you want to do - parsers can
> be used for a lot of problems.

I have never used any parser. The task at hand right now is to
parse this http://genome.ucsc.edu/goldenPath/help/wiggle.html, which
is fairly simple even without any parser package.

I think that it is worthwhile for me to learn some parser packages by
trying to parse this format, so that I may parse more complex syntaxes
in the future. Do you have any suggestions on which parser I should use
for now?

Regards,
Peng

> also, the parsers do not exist in isolation - you need to worry about
> whether they are supported, how good the documentation is, etc.
>
> and different parsers handle different grammars - see
> http://wiki.python.org/moin/LanguageParsing - so if you already have a
> particular grammar then your life is simpler if you choose a parser
> that matches.
>
>
> these are the three that i know most about - i think all three are
> currently maintained:
>
> for simple parsing problems, i think pyparsing is the most commonly
> used - http://pyparsing.wikispaces.com/
>
> my own lepl - http://www.acooke.org/lepl/ - tries to combine ease of
> use with some more advanced features
>
> the nltk - http://www.nltk.org/ - is particularly targeted at parsing
> natural languages and includes a wide variety of tools.
>
>
> but for a large parsing project you might be better off interfacing to
> a compiled parser (lepl has memoisation, so should scale quite well,
> but it's not something i've looked at in detail yet).
>
> andrew


andrew cooke

Sep 20, 2009, 8:20:38 AM
On Sep 20, 8:11 am, Peng Yu <pengyu...@gmail.com> wrote:
> On Sun, Sep 20, 2009 at 6:50 AM, andrew cooke <and...@acooke.org> wrote:
> > On Sep 19, 9:34 pm, Peng Yu <pengyu...@gmail.com> wrote:
> >> On Sep 19, 6:05 pm, Robert Kern <robert.k...@gmail.com> wrote:
> >> >http://nedbatchelder.com/text/python-parsers.html
>
> >> This is more or less just a list of parsers. I would like some detailed
> >> guidelines on which one to choose for various parsing problems.
>
> > it would be simpler if you described what you want to do - parsers can
> > be used for a lot of problems.
>
> I have never used any parser. The task at hand right now is to
> parse this http://genome.ucsc.edu/goldenPath/help/wiggle.html, which
> is fairly simple even without any parser package.
>
> I think that it is worthwhile for me to learn some parser packages by
> trying to parse this format, so that I may parse more complex syntaxes
> in the future. Do you have any suggestions on which parser I should use
> for now?

pyparsing would work fine for that, and has a broad community of users
that will probably be helpful.

i am currently working on an extension to lepl that is related, and i
may use that format as an example. if so, i'll tell you. but for
now, i think pyparsing makes more sense for you.

andrew

andrew cooke

Sep 20, 2009, 8:58:06 AM

One word of warning - the documentation for that format says at the
beginning that it is compressed in some way. I am not sure if that
means within some program, or on disk. But most parsers will not be
much use with a compressed file - you will need to uncompress it first.

Peng Yu

Sep 20, 2009, 9:04:22 AM
to pytho...@python.org

The file size of a wig file can be very large (GB). Most tasks on this
file format do not need the parser to keep all the lines read from
the file in memory to produce the parsing result. I'm wondering if
pyparsing is capable of parsing large wig files by keeping only the
minimum required information in memory.

Regards,
Peng

andrew cooke

Sep 20, 2009, 9:12:22 AM
> The file size of a wig file can be very large (GB). Most tasks on this
> file format do not need the parser to keep all the lines read from
> the file in memory to produce the parsing result. I'm wondering if
> pyparsing is capable of parsing large wig files by keeping only the
> minimum required information in memory.

ok, now you are getting into the kind of detail where you will need to
ask the authors of individual packages.

lepl is stream oriented and should behave as you want (it will only
keep in memory what it needs, and will read data gradually from a
file) but (1) it's fairly new and i have not tested the memory use -
there may be some unexpected memory leak; (2) it's python 2.6/3 only;
(3) parsing line-based formats like this is not yet supported very
well (you can do it, but you have to explicitly match the newline
character to find the end of line); (4) the community for support is
small.

so i would suggest asking on the pyparsing list for advice on using
that with large data files (you are getting closer to the point where
i would recommend lepl - but of course i am biased as i wrote it).

andrew

ps is there somewhere i can download example files? this would be
useful for my own testing. thanks.

andrew cooke

Sep 20, 2009, 9:19:21 AM

also, parsing large files may be slow, in which case you may be better
off with a non-python solution (even if you call it from python).

your file format is so simple that you may find a lexer is enough for
what you want, and they should be stream oriented. have a look at the
"shlex" package that is already in python. will that help?

alternatively, perhaps plex - http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
- that is pure python, but greg ewing is a good programmer and he says
on that page it is as fast as possible for python, so it is probably
going to be quite fast.

andrew

ps maybe you already know, but a lexer is simpler than a parser in
that it doesn't use the context to decide how to treat things. so it
can recognise that something is a number, or a word, or a quoted
string, but not whether it is part of a track definition line or a
data value, for example. but in this case the format is so simple
that a lexer might do quite a lot of what you want, and would make
the remaining plain python program very simple.

Peng Yu

Sep 20, 2009, 9:41:05 AM
to pytho...@python.org

I don't quite understand this point. If I don't use a parser, since
Python can read numbers line by line, why would I need a lexer package?

Regards,
Peng

andrew cooke

Sep 20, 2009, 9:49:12 AM
> I don't quite understand this point. If I don't use a parser, since
> Python can read numbers line by line, why would I need a lexer package?

for the lines of numbers it would make no difference; for the track
definition lines it would save you some work.

as you said, this is a simple format, so the case for any tool is
marginal - i'm just exploring the options.

andrew

Peng Yu

Sep 20, 2009, 10:08:30 AM
to pytho...@python.org
On Sun, Sep 20, 2009 at 8:49 AM, andrew cooke <and...@acooke.org> wrote:
>> I don't quite understand this point. If I don't use a parser, since
>> Python can read numbers line by line, why would I need a lexer package?
>
> for the lines of numbers it would make no difference; for the track
> definition lines it would save you some work.

So for the track definition, using a lexer package would be better
than using regex in python, right?

> as you said, this is a simple format, so the case for any tool is
> marginal - i'm just exploring the options.
>
> andrew


andrew cooke

Sep 20, 2009, 10:36:49 AM
> So for the track definition, using a lexer package would be better
> than using regex in python, right?

they are similar. a lexer is really just a library that packages
regular expressions in a certain way. so you could write your own
code and you would really be writing a simple lexer. the advantage of
writing your own code is that it will be easier to modify and you will
get more experience with regular expressions. the advantage of using
a library (a lexer) is that it has already been tested by other
people, it is already packaged and so your code will be better
structured, and it may have features (perhaps logging, or handling of
quoted strings, for example) that will save you some work in the
future.
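
to make that concrete, here is a minimal hand-rolled sketch - really
just a packaged regular expression - for the key=value pairs on a
track definition line (the regexp and the helper name are my own):

import re

PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')  # key=value, value possibly quoted

def track_attributes(line):
    # e.g. {'type': 'wiggle_0', 'name': '...'} with quotes stripped
    return {key: value.strip('"') for key, value in PAIR.findall(line)}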

andrew

Robert Kern

Sep 20, 2009, 2:10:49 PM
to pytho...@python.org
Peng Yu wrote:

> The file size of a wig file can be very large (GB). Most tasks on this
> file format do not need the parser to keep all the lines read from
> the file in memory to produce the parsing result. I'm wondering if
> pyparsing is capable of parsing large wig files by keeping only the
> minimum required information in memory.

I cannot recommend pyparsing for large amounts of text. Even before you hit
memory limits, you will run into the problem that pyparsing runs many functions
for each character of text. Python function calls are expensive.

Since the format is line-oriented, one option is to use pyparsing or
another parser to handle the track definition lines and just
str.split(), float(), and int() for the data lines.
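
For example, a minimal sketch of that split (read_wig is a
hypothetical helper, and it only handles the variableStep subset of
the format, not fixedStep or the full track line syntax):

def read_wig(path):
    # Stream (chrom, position, value) triples without holding the
    # whole file in memory; track lines are skipped here but could
    # be handed to a real parser instead.
    chrom = None
    with open(path) as f:
        for line in f:
            if line.startswith("track"):
                continue
            if line.startswith("variableStep"):
                fields = dict(p.split("=") for p in line.split()[1:])
                chrom = fields["chrom"]
            else:
                position, value = line.split()
                yield chrom, int(position), float(value)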

andrew cooke

Sep 20, 2009, 2:35:44 PM
On Sep 20, 9:12 am, andrew cooke <and...@acooke.org> wrote:
> ps is there somewhere i can download example files? this would be
> useful for my own testing. thanks.

i replied to a lot of your questions here; any chance you could reply
to this one of mine?

the wig format looks like it could be a good test for lepl.

thanks,
andrew

Peng Yu

Sep 20, 2009, 3:16:29 PM
to pytho...@python.org

I missed your question. I only have some files, and they use only a
subset of the wig syntax. Here is one example.

track type=wiggle_0 name="MACS_counts_after_shifting" description="H3K4me1B"
variableStep chrom=chr10 span=10
3001871 1
3001881 1
3001891 1
3001901 1
track type=wiggle_0 name="MACS_counts_after_shifting" description="H3K4me1B"
variableStep chrom=chr11 span=10
3000331 3
3000341 3
3000351 3
3000361 3
3000371 3
3000381 3

andrew cooke

Sep 20, 2009, 6:11:42 PM
On Sep 20, 3:16 pm, Peng Yu <pengyu...@gmail.com> wrote:
> On Sun, Sep 20, 2009 at 1:35 PM, andrew cooke <and...@acooke.org> wrote:
> > On Sep 20, 9:12 am, andrew cooke <and...@acooke.org> wrote:
> >> ps is there somewhere i can download example files? this would be
> >> useful for my own testing. thanks.
>
> > i replied to a lot of your questions here; any chance you could reply
> > to this one of mine?
>
> > the wig format looks like it could be a good test for lepl.
>
> I missed your question. I only have some files, and they use only a
> subset of the wig syntax. Here is one example.

ah, thanks. i'll see if i can find something on the 'net - i am
hoping to test how / whether gigabytes of data can be parsed.

andrew

Simon Forman

Sep 21, 2009, 1:01:43 AM

Check out http://navarra.ca/?p=538

(FWIW I like and use SPARK, but I haven't used or tried many others.)

Nobody

Sep 21, 2009, 10:59:49 AM

I have a similar question.

What I want: a tokeniser generator which can take a lex-style grammar (not
necessarily lex syntax, but a set of token specifications defined by
REs, BNF, or whatever), generate a DFA, then run the DFA on sequences of
bytes. It must allow the syntax to be defined at run-time.

What I don't want: anything written by someone who doesn't understand the
field (i.e. anything which doesn't use a DFA).

greg

Sep 22, 2009, 1:07:33 AM
Nobody wrote:

> What I want: a tokeniser generator which can take a lex-style grammar (not
> necessarily lex syntax, but a set of token specifications defined by
> REs, BNF, or whatever), generate a DFA, then run the DFA on sequences of
> bytes. It must allow the syntax to be defined at run-time.

You might find my Plex package useful:

http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

It was written some time ago, so it doesn't know about
the new bytes type yet, but it shouldn't be hard to
adapt it for that if you need to.

> What I don't want: anything written by someone who doesn't understand the
> field (i.e. anything which doesn't use a DFA).

Plex uses a DFA.

--
Greg

Nobody

Sep 25, 2009, 4:36:25 PM
On Tue, 22 Sep 2009 17:07:33 +1200, greg wrote:

>> What I want: a tokeniser generator which can take a lex-style grammar (not
>> necessarily lex syntax, but a set of token specifications defined by
>> REs, BNF, or whatever), generate a DFA, then run the DFA on sequences of
>> bytes. It must allow the syntax to be defined at run-time.
>
> You might find my Plex package useful:
>
> http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

I haven't had time to play around with this yet, but it appears to be
essentially what I'm looking for.

andrew cooke

Sep 26, 2009, 7:43:10 AM
On Sep 21, 10:59 am, Nobody <nob...@nowhere.com> wrote:
> I have a similar question.
>
> What I want: a tokeniser generator which can take a lex-style grammar (not
> necessarily lex syntax, but a set of token specifications defined by
> REs, BNF, or whatever), generate a DFA, then run the DFA on sequences of
> bytes. It must allow the syntax to be defined at run-time.
>
> What I don't want: anything written by someone who doesn't understand the
> field (i.e. anything which doesn't use a DFA).

lepl will do this, but it's integrated with the rest of the parser
(which is recursive descent).

for example:

float = Token(Float())           # token built from a lepl matcher
word = Token(Word(Lower()))      # likewise (matchers compose)
punctuation = ~Token(r'[\.,]')   # token from a regexp; ~ drops it from results

line = (float | word)[:, punctuation]
parser = line.string_parser()
parser = line.string_parser()

will generate a lexer with three tokens. here two are specified using
lepl's matchers and one using a regexp, but in all three cases they
are converted to dfas internally.

then a parser is generated that will match a sequence of floats and
words, separated by punctuation. spaces are discarded by the lexer by
default, but that can be changed through the configuration (which
would be passed to the string_parser method).

it's also possible to specify everything using matchers and then get
lepl to compile "as much as possible" of the matcher graph to nfas
before matching (nfas rather than dfas because they are implemented
with a stack to preserve the backtracking abilities of the recursive
descent parser they replace). the problem here is that not all
matchers can be converted (matchers can contain arbitrary python
functions, while my nfa+dfa implementations cannot, and also my
"compiler" isn't very smart), while using tokens explicitly gives you
an error if the automatic compilation fails (in which case the simple
fix is to just give the regexp).

(also, you say "sequence of bytes" rather than strings - lepl will
parse the bytes type in python 3 and even has support for matching
binary values).

disclaimer: newish library, python 2.6+ only, and while i have quite a
few users (or, at least, downloads), i doubt that many use these more
advanced features, and everything is pure python with little
performance tuning so far.

andrew
