First thing I'd check is that it's not the last line (the insertion into
the data structure) that eats up the processing cycles. Maybe not for
dicts, but insertion into more complex data structures may well take a
moment.
Next, check if you opened the file in text or binary mode. Text mode may
be more CPU-heavy, and you may or may not actually need that additional
processing.
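If in doubt, a quick comparison is easy to run, e.g. (a rough sketch;
'data.txt' stands in for your file):

import timeit
# time a bare iteration over the same file in text vs. binary mode
t_text = timeit.timeit("for l in open('data.txt', 'r'): pass", number=10)
t_bin = timeit.timeit("for l in open('data.txt', 'rb'): pass", number=10)
print(t_text, t_bin)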
Then, your lines seem to be short, and if you have a lot of them, the
processing can seriously add up. Note that you are creating several objects
in your code above:
- the loop creates one for l
- lstrip() may create another one
- split() creates three in your example
- int() creates one
- float() creates one
And most of them get thrown away after each iteration. You can avoid
instantiating some of the intermediate strings as well as the split list
by doing the parsing manually, but the other objects (l/foo/bar) are
genuinely needed.
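For two fields, the manual version could look like this (a rough
sketch, assuming a single space separator and no leading whitespace):

for l in input:
    sep = l.index(' ')    # locate the one separator ourselves
    foo = int(l[:sep])    # int() tolerates surrounding whitespace
    bar = float(l[sep:])  # ditto float(), so no lstrip() copy needed
    data[foo] = bar

This still creates the two slices, but saves the lstrip() copy and the
split list.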
> Does anyone here have experience with speeding such
> loops up? Just dropping the whole thing into cython does not seem to
> do much (probably because of all the python calls and conversions).
Right. All of the above processing happens inside of the CPython runtime
(ok, except for the indexing...).
> Does using C string routines help?
I'd say they'll at least help a bit, yes. If you target Python 2 (as
opposed to Python 3), you can also do the file reading in C code, thus
avoiding another object instantiation. See the PyFile_... functions in the
C-API. But try (and benchmark) the other options first.
Stefan
Specifically, I would rewrite the code as:
lines = (line.rstrip().split() for line in input)
items = ((int(foo), float(bar)) for foo, bar in lines)
data = dict(items)
Depending on your case, it may be better (or worse) to replace some of
the generator expressions with list comprehensions.
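For instance, materialising the first stage eagerly (a sketch):

lines = [line.rstrip().split() for line in input]  # list, built eagerly
items = ((int(foo), float(bar)) for foo, bar in lines)
data = dict(items)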
Y
I've done tests on this sort of stuff, and no matter how you slice it,
parsing a bunch of text into numbers can only go so fast in Python.
Ideally, there would be a fairly generic way to have a bunch of text
parsed in C.
numpy.fromfile() can parse text, but it is very limited as to the format
(no comments, can only read multiple lines if the separator is text,
etc.) If it fits your needs, it's pretty darn fast.
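For the simple case it is a one-liner (a sketch, assuming plain
whitespace-separated numbers and no comment lines):

import numpy as np
# a non-empty 'sep' switches fromfile() into text-parsing mode
arr = np.fromfile('data.txt', dtype=float, sep=' ')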
I've spent some time looking at fixing some of those limitations, but
boy does it get to be some ugly C! (And there are non-trivial (to me,
anyway) bugs in the current implementation as well.)
You can also read one line at a time with Python, and use
numpy.fromstring() to parse it. That should be faster than what you're
doing.
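For the dict-building example, that might look like this (a sketch):

import numpy as np
data = {}
for line in open('data.txt'):
    pair = np.fromstring(line, dtype=float, sep=' ')  # text mode via sep
    data[int(pair[0])] = pair[1]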
I've written my own parser in C (before fromfile existed). It only does
floats and doubles, and generates a numpy array -- but it's fast -- let
me know if you want it.
Ideally, I'd love to see a more general purpose text parser written in
cython -- I think it would be pretty useful.
-Chris
> Where data might be a dict, a numpy array or even a pytables object
> (hdf5). The processing can be more complex, but the files are almost
> always in simple 'table' like formats.
> Even very simple examples are often CPU and not disk-limited in
> standard python. Does anyone here have experience with speeding such
> loops up? Just dropping the whole thing into cython does not seem to
> do much (probably because of all the python calls and conversions).
> Does using C string routines help? Are there already fast parsers for
> this kind of simple situation around? Just looking for some experience
> from others before reinventing the wheel.
>
> Thanks
> Felix
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
That's exactly what I meant.
> This is even though there is some other stuff
> going on in this inner loop of the script, including two calls to
> l.startswith().
Recent Cython will optimise away the method call here if it knows that
the object you call startswith() on is a bytes or unicode object.
Make sure you use the latest developer version of Cython, it has a lot of
great features for text processing, including 'in' tests for characters
against literals and optimised string looping. I don't think it's well
documented yet, but here's a start:
http://behnel.de/cgi-bin/weblog_basic/index.php?cat=11
Stefan
Actually, the future 0.13 docs aren't all that bad:
http://hg.cython.org/cython-docs/raw-file/tip/src/tutorial/strings.rst
Stefan
Only the one that was in Py3.1 already. I don't have hard numbers, but it
is supposed to be a lot faster, with the drawback of not having a C-API
that you could hook into from Cython.
> and performance
> improvements in str.split. Does anyone here know if that applies to this
> kind of simple parsing task?
You will still not get around the object allocations. String search
performance is not the bottleneck in your code.
Stefan
You can do that with fromfile() and/or fromstring(). With fromfile(),
you can do it if you know how many numbers are on a particular line.
With fromstring(), you can do:
struct = fromstring(myfile.readline(), sep=' ')
which does do one more object allocation than you "need", but it is
faster than a few calls to split(), etc.
Also, you can read a whole bunch of lines into an array with fromfile,
then loop through that to do something if you want.
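For example (a sketch, assuming you know the shape up front):

import numpy as np
f = open('data.txt')
# pull 10,000 rows of 2 numbers out of the file in a single call
arr = np.fromfile(f, dtype=float, count=10000 * 2, sep=' ')
arr = arr.reshape(10000, 2)
for foo, bar in arr:
    pass  # whatever per-row Python processing you still need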
In my tests, I'm trying to read, say, 10,000 numbers into an array, all
at once. But if you're going to do a bunch of Python on each line (or
any Python!), there may be no point in doing more than the above --
profile carefully!
> Maybe a good general solution would be a cython compiled function that
> uses C parsing to generate a struct from each line and returns that to
> the python code. However what would be the best way to make this
> general (in the sense of not having to write a new parsing function by
> hand to change the number of columns)?
I'm imagining something similar to what MATLAB does: you pass in a C
format string, and it uses that to parse the file, recycling it as
needed to read as much as you are asking for.
If you have a "struct" with multiple data types, it might make the most
sense to pass in a numpy struct array, and have your Cython code
construct a format string from that.
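A sketch of that idea in Python (the FMT mapping here is hypothetical
and only covers a few dtypes):

import numpy as np

# hypothetical dtype -> scanf-conversion mapping
FMT = {np.dtype(np.int32): '%d', np.dtype(np.int64): '%ld',
       np.dtype(np.float32): '%f', np.dtype(np.float64): '%lf'}

def scanf_format(dtype):
    # one conversion per field, in declaration order
    return ' '.join(FMT[dtype.fields[name][0]] for name in dtype.names)

rec = np.dtype([('foo', np.int32), ('bar', np.float64)])
print(scanf_format(rec))  # -> %d %lf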
That is almost what fromfile() does -- except it counts on the dtype
object itself to know how to parse itself, and the implementation is
ugly C with bugs (using atof() and the like).
see this thread:
http://mail.scipy.org/pipermail/numpy-discussion/2010-January/047753.html
for all the gory details.
Another option is to rewrite numpy.genfromtxt in Cython, but you'd have
to make it pretty C-heavy to get the performance you're looking for, so
maybe that's the same thing!
-Chris
Right you are -- I hadn't tested that. They are written to let the dtype
do the parsing, so I imagine the idea was that they could be extended
to have custom dtypes know how to parse themselves, but no one has
written that code.
Given that, I'm all the more inclined to write something totally
different, or significantly refactor the fromfile/fromstring code --
it's really fragile and hard to maintain as it is.
> As long as the python is just an access to a dict or a numpy array the
> parsing is still the bottleneck for me.
Are you sure? Anyway, a fromfile() that could handle complex dtypes
would be good for what you need.
>> I'm imagining something similar to what MATLAB does: you pass in a C
>> format string, and it uses that to parse the file, recycling it as
>> needed to read as much as you are asking for.
>>
>> If you have a "struct" with multiple data types, it might make the most
>> sense to pass in a numpy struct array, and have your Cython code
>> construct a format string from that.
>
> That sounds excellent. Even though I see no big problem with the user
> explicitly providing the format string.
That's probably a good first step anyway.
Another issue -- IIUC, with Python 2.6 you can get a regular old C
file handle from the Python file object. But with 3.0 (and
2.7??) Python handles files differently, so you can't get a regular old
C file handle -- so I don't know if you can simply use fscanf the same way.
I'm sure someone on this list knows about that.
In 2.7, that's only the case for the new io module; the plain file object
works as before. It's true that there isn't currently a public C-API for
the io module, though, so the I/O layers in Py3 can't be bypassed at the
moment.
Stefan
Ah, sorry, bytes.startswith() doesn't map to a C-API call in Py3, so we
can't optimise it that way (works for the unicode type, though). However,
we could just call strncmp() instead in this case. I'll add a mental note
for now.
Stefan
However, we could easily obtain fileobj.fileno() and call the C stdlib
fdopen() function to get the FILE* stream pointer, right?
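At the Python level, the idea would look something like this (a rough
sketch, using ctypes purely for illustration):

import ctypes, ctypes.util
libc = ctypes.CDLL(ctypes.util.find_library('c'))
libc.fdopen.restype = ctypes.c_void_p  # FILE *

f = open('data.txt', 'rb')
fp = libc.fdopen(f.fileno(), b'r')  # FILE* on the same descriptor
# caveat: Python's own buffering may already have read ahead, so the
# stream position of fp need not match the Python file object's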
--
Lisandro Dalcin
---------------
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169
Would that give you a pointer to an already-opened file that may already
have had some data read, so you could just pick up from there?