Fast parsing


Felix Schlesinger

Jul 2, 2010, 3:07:43 PM
to cython-users
Hi,

I have a bunch of programs for which the slowest part is reading text-
based data into an internal datastructure. A simple situation might
look something like this:

for l in input:
    line = l.rstrip().split()
    foo = int(line[1])
    bar = float(line[2])
    data[foo] = bar

Where data might be a dict, a numpy array or even a pytables object
(hdf5). The processing can be more complex, but the files are almost
always in simple 'table' like formats.
Even very simple examples are often CPU and not disk-limited in
standard python. Does anyone here have experience with speeding such
loops up? Just dropping the whole thing into cython does not seem to
do much (probably because of all the python calls and conversions).
Does using C string routines help? Are there already fast parsers for
this kind of simple situation around? Just looking for some experience
from others before reinventing the wheel.

Thanks
Felix

Stefan Behnel

Jul 2, 2010, 3:28:19 PM
to cython...@googlegroups.com
Felix Schlesinger, 02.07.2010 21:07:

> I have a bunch of programs for which the slowest part is reading text-
> based data into an internal datastructure. A simple situation might
> look something like this:
>
> for l in input:
>     line = l.rstrip().split()
>     foo = int(line[1])
>     bar = float(line[2])
>     data[foo] = bar
>
> Where data might be a dict, a numpy array or even a pytables object
> (hdf5). The processing can be more complex, but the files are almost
> always in simple 'table' like formats.
> Even very simple examples are often CPU and not disk-limited in
> standard python.

First thing I'd check is that it's not the last line that eats up the
processing cycles. Maybe not for dicts, but insertion into more complex
data structures may well take a moment.

Next, check if you opened the file in text or binary mode. Text mode may be
more CPU heavy, but you may or may not need that additional overhead.

Then, your lines seem to be short, and if you have a lot of them, the
processing can seriously add up. Note that you are creating several objects
in your code above:

- the loop creates one for l
- rstrip() may create another one
- split() creates three in your example
- int() creates one
- float() creates one

And most of them get thrown away after each iteration. You can avoid
instantiating some of the intermediate strings as well as the split tuple
by doing the parsing manually, but the other objects (l/foo/bar) are really
used.
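
To illustrate what 'manual parsing' could look like in pure Python (just a
sketch, assuming tab-separated fields; note that int() and float() tolerate
the trailing newline, so the rstrip() call can be dropped):

for l in input:
    start = l.find('\t') + 1          # skip field 0
    end = l.find('\t', start)
    foo = int(l[start:end])           # field 1
    start = end + 1
    end = l.find('\t', start)
    if end < 0:
        end = len(l)                  # field 2 may be the last field
    bar = float(l[start:end])
    data[foo] = bar

This still creates the two field slices, but it avoids the list from
split() and the strings for fields you never look at.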


> Does anyone here have experience with speeding such
> loops up? Just dropping the whole thing into cython does not seem to
> do much (probably because of all the python calls and conversions).

Right. All of the above processing happens inside of the CPython runtime
(ok, except for the indexing...).


> Does using C string routines help?

I'd say they'll at least help a bit, yes. If you target Python 2 (as
opposed to Python 3), you can also do the file reading in C code, thus
avoiding another object instantiation. See the PyFile_... functions in the
C-API. But try (and benchmark) the other options first.
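
For instance, PyFile_AsFile() returns the underlying FILE* in Python 2; a
rough, untested sketch (the function name and buffer size are my own):

cdef extern from "stdio.h":
    ctypedef struct FILE:
        pass
    char* fgets(char* s, int size, FILE* stream)

cdef extern from "Python.h":
    FILE* PyFile_AsFile(object f)

def load(f):
    # read each line into a C buffer, so no per-line Python string
    # object is created (Python 2 only)
    cdef FILE* fp = PyFile_AsFile(f)
    cdef char buf[4096]
    while fgets(buf, 4096, fp) != NULL:
        pass  # parse buf here with atoi()/strtok() etc.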

Stefan

Yoav Goldberg

Jul 2, 2010, 5:01:27 PM
to cython...@googlegroups.com
You could probably speed things up a little by getting rid of some of
the indexing and name lookups in Python. This is a bit kludgy, but I
found it to help somewhat (not nearly as good as a pure-Cython solution,
but much easier to write/maintain).

Specifically I would re-write the code as:

lines = (line.rstrip().split() for line in input)
items = ((int(foo), float(bar)) for foo, bar in lines)
data = dict(items)

Depending on your case, it may be better (or worse) to replace some of
the generator expressions with list comprehensions.

Y

Felix Schlesinger

Jul 2, 2010, 5:16:28 PM
to cython-users


On Jul 2, 3:28 pm, Stefan Behnel <stefan...@behnel.de> wrote:
> Felix Schlesinger, 02.07.2010 21:07:
> > for l in input:
> >      line = l.rstrip().split()
> >      foo = int(line[1])
> >      bar = float(line[2])
> >      data[foo] = bar
> Then, your lines seem to be short, and if you have a lot of them, the
> processing can seriously add up. Note that you are creating several objects
> in your code above:
>
> - the loop creates one for l
> - rstrip() may create another one
> - split() creates three in your example
> - int() creates one
> - float() creates one

I think this is the main problem. Hence simple Cython does not help
much.

> And most of them get thrown away after each iteration. You can avoid
> instantiating some of the intermediate strings as well as the split tuple
> by doing the parsing manually, but the other objects (l/foo/bar) are really
> used.

I am not sure what you have in mind here. How would I do manual
parsing in Python without creating those objects?
In the example I am currently testing, using C string functions leads
to a big speed-up, i.e. if I replace

line = l.split()
foo = int(line[1])
bar = int(line[2])

with
from libc.stdlib cimport atoi
from libc.string cimport strtok

cdef char* sp

sp = l
pos = atoi(strtok(sp, "\t"))
val = atoi(strtok(NULL, "\t"))

I get 6x faster code. This is even though there is some other stuff
going on in this inner loop of the script, including two calls to
l.startswith(). Hence it seems to me that str.split() and int() on the
resulting list are the problematic part.

Christopher Barker

Jul 2, 2010, 8:26:03 PM
to cython...@googlegroups.com
Felix Schlesinger wrote:
> I have a bunch of programs for which the slowest part is reading text-
> based data into an internal datastructure. A simple situation might
> look something like this:
>
> for l in input:
>     line = l.rstrip().split()
>     foo = int(line[1])
>     bar = float(line[2])
>     data[foo] = bar

I've done tests on this sort of stuff, and no matter how you slice it,
parsing a bunch of text into numbers can only go so fast in Python.

Ideally, there would be a fairly generic way to have a bunch of text
parsed in C.

numpy.fromfile() can parse text, but it is very limited as to the format
(no comments, can only read multiple lines if the separator is text,
etc.). If it fits your needs, it's pretty darn fast.

I've spent some time looking at fixing some of those limitations, but
boy does it get to be some ugly C! (and there are less than trivial (to
me anyway) bugs in the current implementation as well)

You can also read one line at a time with Python, and use
numpy.fromstring() to parse it. That should be faster than what you're
doing.
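
Roughly like this (the file name is made up; fromstring with a sep
argument parses the text into a flat float array):

import numpy as np

data = {}
for line in open('table.txt'):            # hypothetical input file
    row = np.fromstring(line, sep=' ')    # one float array per line
    data[int(row[1])] = row[2]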

I've written my own parser in C (before fromfile existed). It only does
floats and doubles, and generates a numpy array -- but it's fast -- let
me know if you want it.

Ideally, I'd love to see a more general purpose text parser written in
cython -- I think it would be pretty useful.

-Chris


> Where data might be a dict, a numpy array or even a pytables object
> (hdf5). The processing can be more complex, but the files are almost
> always in simple 'table' like formats.
> Even very simple examples are often CPU and not disk-limited in
> standard python. Does anyone here have experience with speeding such
> loops up? Just dropping the whole thing into cython does not seem to
> do much (probably because of all the python calls and conversions).
> Does using C string routines help? Are there already fast parsers for
> this kind of simple situation around? Just looking for some experience
> from others before reinventing the wheel.
>
> Thanks
> Felix


--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris....@noaa.gov

Stefan Behnel

Jul 3, 2010, 2:55:17 AM
to cython...@googlegroups.com
Felix Schlesinger, 02.07.2010 23:16:
> [...]

> Hence it seems to me str.split and int() on the
> resulting list are the problematic part.

That's exactly what I meant.


> This is even though there is some other stuff
> going on in this inner loop of the script, including two calls to
> l.startswith().

Recent Cython will optimise away the method call here if it knows that
the object 'startswith' is called on is a bytes or unicode object.

Make sure you use the latest developer version of Cython, it has a lot of
great features for text processing, including 'in' tests for characters
against literals and optimised string looping. I don't think it's well
documented yet, but here's a start:

http://behnel.de/cgi-bin/weblog_basic/index.php?cat=11

Stefan

Stefan Behnel

Jul 3, 2010, 4:20:28 AM
to cython...@googlegroups.com
Stefan Behnel, 03.07.2010 08:55:

Actually, the future 0.13 docs aren't all that bad:

http://hg.cython.org/cython-docs/raw-file/tip/src/tutorial/strings.rst

Stefan

Felix Schlesinger

Jul 5, 2010, 4:13:16 PM
to cython-users
> Ideally,  there would be a fairly generic way to have a bunch of text
> parsed in C.
>
> numpy.fromfile() can parse text, but it is very limited as to the format
> (no comments, can only read multiple lines if the separator is text,
> etc.) If it fits your needs, it's pretty darn fast.

In addition to the limited format, the problem for me is that often
I do not want to read the entire file into an array, but rather
perform some action on each line (i.e. add a value to a dict).
Maybe a good general solution would be a Cython-compiled function that
uses C parsing to generate a struct from each line and returns that to
the Python code. However, what would be the best way to make this
general (in the sense of not having to write a new parsing function by
hand to change the number of columns)?
In specific cases one can of course just write the whole thing in C
(including the 'action' per line), but a fast, simple line parser
usable from Python would be nicer.

Btw: I just saw that Python 2.7 has a new IO module and performance
improvements in str.split. Does anyone here know if that applies to this
kind of simple parsing task?

Felix

Yoav Goldberg

Jul 5, 2010, 4:26:11 PM
to cython...@googlegroups.com
I think a good way of making it generic would be to make the
resulting code support the iteration protocol (next() and raising
StopIteration at the end) and return one parsed line at each step.
When used from within Cython this could return a C struct for fast
processing, and from within Python it could return either a numpy
array or a Python class with __slots__.
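
To make that concrete, a minimal Cython sketch of the interface (the
class name and the two-column, tab-separated format are my own
assumptions; error handling omitted, and the file must be opened in
binary mode):

from libc.stdlib cimport atoi, atof
from libc.string cimport strtok

cdef class LineParser:
    cdef object _file

    def __init__(self, f):
        self._file = f

    def __iter__(self):
        return self

    def __next__(self):
        cdef bytes l = self._file.readline()
        cdef char* sp
        if not l:
            raise StopIteration
        sp = l
        # note: strtok() writes into the line buffer, the same trick
        # as in the atoi()/strtok() snippet earlier in the thread
        strtok(sp, "\t")  # skip field 0
        return atoi(strtok(NULL, "\t")), atof(strtok(NULL, "\t"))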

Stefan Behnel

Jul 5, 2010, 4:51:20 PM
to cython...@googlegroups.com, Felix Schlesinger
Felix Schlesinger, 05.07.2010 22:13:

> Btw: I just saw that python 2.7 has a new IO module

Only the one that was in Py3.1 already. I don't have hard numbers, but it
is supposed to be a lot faster, with the drawback of not having a C-API
that you could hook into from Cython.


> and performance
> improvements in str.split. Does anyone here know if that applies to this
> kind of simple parsing task?

You will still not get around the object allocations. String search
performance is not the bottleneck in your code.

Stefan

Christopher Barker

Jul 5, 2010, 6:28:22 PM
to cython...@googlegroups.com
Felix Schlesinger wrote:
>> Ideally, there would be a fairly generic way to have a bunch of text
>> parsed in C.
>>
>> numpy.fromfile() can parse text, but it is very limited as to the format
>> (no comments, can only read multiple lines if the separator is text,
>> etc.) If it fits your needs, it's pretty darn fast.
>
> In the addition to the limited format the problem for me is that often
> I do not want to read the entire file into an array, but rather
> perform some action on each line (i.e. add a value to a dict).

You can do that with fromfile() and/or fromstring(). With fromfile(),
you can do it if you know how many numbers are on a particular line.

with fromstring, you can do:

struct = fromstring(myfile.readline(), sep=' ')

which does one more object allocation than you "need", but it is
faster than a few calls to split(), etc.

Also, you can read a whole bunch of lines into an array with fromfile,
then loop through that to do something if you want.
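
For example (a sketch; assumes a file of whitespace-separated floats):

import numpy as np

with open('numbers.txt') as f:       # hypothetical file name
    chunk = np.fromfile(f, dtype=float, count=10000, sep=' ')
for x in chunk:
    pass  # per-value Python work goes here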

In my tests, I'm trying to read, say, 10,000 numbers into an array all
at once. But if you're going to do a bunch of Python on each line (or
any Python!), there may be no point in doing more than the above --
profile carefully!

> Maybe a good general solution would be a cython compiled function that
> uses C parsing to generate a struct from each line and returns that to
> the python code. However what would be the best way to make this
> general (in the sense of not having to write a new parsing function by
> hand to change the number of columns)?

I'm imagining something similar to what MATLAB does: you pass in a C
format string, and it uses that to parse the file, recycling it as need
be to read as much as you are asking for.

If you have a "struct" with multiple data types, it might make the most
sense to pass in a numpy struct array, and have your Cython code
construct a format string from that.
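
A hedged sketch of deriving such a format string from a structured dtype
(the kind-to-specifier table is my own and only covers ints and doubles):

import numpy as np

dt = np.dtype([('foo', np.int32), ('bar', np.float64)])
spec = {'i': '%d', 'f': '%lf'}  # assumed mapping: dtype kind -> scanf specifier
fmt = '\t'.join(spec[dt[name].kind] for name in dt.names)
# fmt is now '%d\t%lf', ready to hand to fscanf()/sscanf() on the C side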

That is almost what fromfile() does -- except it counts on the dtype
object itself to know how to parse itself, and the implementation is
ugly C with bugs (using atof() and the like).

see this thread:

http://mail.scipy.org/pipermail/numpy-discussion/2010-January/047753.html

for all the gory details.

Another option is to write numpy.genfromtxt in Cython, but you'd have
to make it pretty C-heavy to get the performance you're looking for, so
maybe that's the same thing!

-Chris

Felix Schlesinger

Jul 6, 2010, 12:06:36 AM
to cython-users
> You can do that with fromfile() and/or fromstring(). With fromfile(),
> you can do it if you know how many numbers are on a particular line.
>
> with fromstring, you can do:
>
> struct = fromstring(myfile.readline(), sep=' ')

This works for the special (but common) case of a single datatype.
However, all my files have different types of columns, including ints
and strings (record arrays), and both fromstring and fromfile seem to
accept only simple dtypes.

> In my tests, I'm trying to read, say 10,000 numbers into an array, all
> at once. But if you're going to do a bunch of python on each line (or
> any python!), there may be no point in doing more than the above --
> profile carefully!

As long as the python is just an access to a dict or a numpy array the
parsing is still the bottleneck for me.


> I'm imagining something similar to what MATLAB does: you pass in a C
> format string, and it uses that to parse the file, recycling it as need
> be to read as much as you are asking for.
>
> If you have a "struct" with multiple data types, it might make the most
> sense to pass in a numpy struct array, and have your Cython code
> construct a format string from that.

That sounds excellent. Even though I see no big problem with the user
explicitly providing the format string.

Felix

Christopher Barker

Jul 6, 2010, 1:01:17 PM
to cython...@googlegroups.com
Felix Schlesinger wrote:
>> You can do that with fromfile() and/or fromstring(). With fromfile(),
>> you can do it if you know how many numbers are on a particular line.
>>
>> with fromstring, you can do:
>>
>> struct = fromstring(myfile.readline(), sep=' ')
>
> This works for the special (but common) case of a single datatype.
> However all my files have different types of columns including ints
> and strings (record array). And both fromstring and fromfile seem to
> accept only simple dtypes.

Right you are -- I hadn't tested that. They are written to let the dtype
do the parsing, so I imagine the idea was that they could be extended
to have custom dtypes know how to parse themselves, but no one has
written that code.

Given that, I'm all the more inclined to write something totally different,
or significantly refactor the fromfile/fromstring code -- it's really
fragile and hard to maintain as it is.

> As long as the python is just an access to a dict or a numpy array the
> parsing is still the bottleneck for me.

Are you sure? Anyway, a fromfile() that could handle complex dtypes
would be good for what you need.


>> I'm imaging something simlilar to what MATLAB does: you pass in a C
>> format string, and it uses that to parse the file, recycling it as need
>> be to read as much as you are asking for.
>>
>> If you have a "struct" with multiple data types, it might make the most
>> sense to pass in a numpy struct array, and have you Cython code
>> construct a format string from that.
>
> That sounds excellent. Even though I see no big problem with the user
> explicitly providing the format string.

That's probably a good first step anyway.


Another issue -- IIUC, with Python 2.6 you can get a regular old C file
handle from the Python file object. But with 3.0 (and 2.7??) Python
handles files differently, so you can't get a regular old C file handle --
so I don't know if you can simply use fscanf the same way.

I'm sure someone on this list knows about that.

Felix Schlesinger

Jul 6, 2010, 5:00:50 PM
to cython-users
> Recent Cython will optimise away the method call here if it knows that
> 'startswith' is a bytes or unicode object.

This does not seem to work for me. I just downloaded cython-devel and
compiled a script containing the following:

def load(file inp):
    cdef:
        bytes l

    for l in inp:
        if l.startswith(b'track'):
            ...

Judging by the annotated source in the HTML cython still generates a
python method call (see below). What else do I need to declare to get
this optimization?

Felix

__pyx_t_3 = PyObject_GetAttr(((PyObject *)__pyx_v_l), __pyx_n_s__startswith);
if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 64; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_GOTREF(__pyx_t_3);
__pyx_t_4 = PyTuple_New(1);
if (unlikely(!__pyx_t_4)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 64; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_GOTREF(__pyx_t_4);
__Pyx_INCREF(((PyObject *)__pyx_n_b__track));
PyTuple_SET_ITEM(__pyx_t_4, 0, ((PyObject *)__pyx_n_b__track));
__Pyx_GIVEREF(((PyObject *)__pyx_n_b__track));
__pyx_t_5 = PyObject_Call(__pyx_t_3, __pyx_t_4, NULL);
if (unlikely(!__pyx_t_5)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 64; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_GOTREF(__pyx_t_5);
__Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
__Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
__pyx_t_6 = __Pyx_PyObject_IsTrue(__pyx_t_5);
if (unlikely(__pyx_t_6 < 0)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 64; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
if (__pyx_t_6) {

Stefan Behnel

Jul 7, 2010, 12:53:28 AM
to cython...@googlegroups.com
Christopher Barker, 06.07.2010 19:01:

> Another issue -- IIUC, with Python 2.6 you can get
> a regular old C file handle from the Python file object. But with 3.0 (and
> 2.7??) Python handles files differently, so you can't get a regular old
> C file handle -- so I don't know if you can simply use fscanf the same way.

In 2.7, that's only the case for the new io module; the file object works
as before. It's true that there isn't currently a public C-API for the io
module, though, so the I/O layers in Py3 can't be bypassed at the moment.

Stefan

Stefan Behnel

Jul 8, 2010, 3:05:29 AM
to cython...@googlegroups.com
Felix Schlesinger, 06.07.2010 23:00:

>> Recent Cython will optimise away the method call here if it knows that
>> the object 'startswith' is called on is a bytes or unicode object.
>
> This does not seem to work for me. I just downloaded cython-devel and
> compiled a script containing the following:
>
> def load(file inp):
>     cdef:
>         bytes l
>
>     for l in inp:
>         if l.startswith(b'track'):
>             ...
>
> Judging by the annotated source in the HTML cython still generates a
> python method call (see below). What else do I need to declare to get
> this optimization?

Ah, sorry, bytes.startswith() doesn't map to a C-API call in Py3, so we
can't optimise it that way (it works for the unicode type, though). However,
we could just call strncmp() instead in this case. I'll make a mental note
for now.
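
For what it's worth, a minimal sketch of the strncmp() variant (the
helper name is made up):

from libc.string cimport strncmp

cdef inline bint starts_with_track(char* s):
    # prefix test without any Python method call
    return strncmp(s, "track", 5) == 0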

Stefan

Lisandro Dalcin

Jul 8, 2010, 6:45:05 PM
to cython...@googlegroups.com

However, we could easily obtain fileobj.fileno() and call the C stdlib
fdopen() function to get a FILE* stream pointer, right?
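
Something like this, perhaps (a rough, untested sketch; note the
buffering caveat in the comment):

cdef extern from "stdio.h":
    ctypedef struct FILE:
        pass
    FILE* fdopen(int fd, char* mode)
    char* fgets(char* s, int size, FILE* stream)

def read_lines(f):
    # hand the OS-level descriptor to C stdio; anything the Python file
    # object has already buffered will be skipped, so the Python-level
    # position and the descriptor's position can disagree
    cdef FILE* fp = fdopen(f.fileno(), "r")
    cdef char buf[4096]
    while fgets(buf, 4096, fp) != NULL:
        pass  # parse buf here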

--
Lisandro Dalcin
---------------
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169

Christopher Barker

Jul 8, 2010, 8:39:56 PM
to cython...@googlegroups.com
Lisandro Dalcin wrote:
> On 7 July 2010 01:53, Stefan Behnel <stef...@behnel.de> wrote:
>> Christopher Barker, 06.07.2010 19:01:
>>> Another issue -- IIUC, with Python 2.6 you can get
>>> a regular old C file handle from the Python file object. But with 3.0 (and
>>> 2.7??) Python handles files differently, so you can't get a regular old
>>> C file handle -- so I don't know if you can simply use fscanf the same
>>> way.
>> In 2.7, it's only the case for the new io module, the file object works as
>> before. It's true that there isn't currently a public C-API for the io
>> module, though, so the I/O layers in Py3 can't be bypassed at the moment.
>>
>
> However, we could easily obtain fileobj.fileno() and call the C stdlib
> fdopen() function to get a FILE* stream pointer, right?

Would that give you a pointer to an already-opened file, one that may have
already had some data read, so you could just pick up from there?
