[SciPy-User] Alternatives to genfromtxt and loadtxt?


Giorgos Tzampanakis

May 13, 2011, 6:40:00 PM
to scipy...@scipy.org
I have numeric data in ASCII files, each about 800 MB. Loading such a
file into Octave takes about 30 seconds. With numpy it is so slow that
I've never had the patience to see it through to the end.

A faster way I have found is to convert the data to HDF5 and then
load it into numpy; however, this is an extra step that I would like to
avoid if possible.
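
For reference, the round trip looks roughly like this (a sketch using
h5py; the file and dataset names are made up, and the initial parse can
be whatever reader one has the patience for):

import h5py
import numpy as np

# Stand-in for the slow one-time parse of the ASCII file:
data = np.loadtxt("data.txt")

# One-off conversion to HDF5:
with h5py.File("data.h5", "w") as f:
    f.create_dataset("matrix", data=data)

# From then on, loading from HDF5 is fast:
with h5py.File("data.h5", "r") as f:
    matrix = f["matrix"][:]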

Any suggestions welcome.


Yury V. Zaytsev

May 14, 2011, 6:53:57 AM
to SciPy Users List
On Fri, 2011-05-13 at 22:40 +0000, Giorgos Tzampanakis wrote:
> I have numeric data in ASCII files, each about 800 MB. Loading such a
> file into Octave takes about 30 seconds. With numpy it is so slow that
> I've never had the patience to see it through to the end.

If the layout is more or less simple, you may have better luck reading
the files with Python's built-in CSV reader and only then converting
the lists to NumPy arrays.

I know it is definitely not the best solution out there, but it takes
zero effort, and I have been able to load 500 MB files in a matter of
tens of seconds without any problems:

import csv

import numpy as np

# The filename is illustrative; any readable text file will do
fp = open("data.csv")

# Auto-detect the CSV dialect that is being used
dialect = csv.Sniffer().sniff(fp.read(1024))
fp.seek(0)

reader = csv.reader(fp, dialect)

data = []

for row in reader:
    # Filter out empty fields
    row = [x for x in row if x != ""]

    # ... (any per-row pre-processing goes here) ...

    data.append(row)

fp.close()

matrix = np.asarray(data, dtype=float)

--
Sincerely yours,
Yury V. Zaytsev

Giorgos Tzampanakis

May 14, 2011, 8:25:27 AM
to scipy...@scipy.org
On 2011-05-14, Yury V. Zaytsev wrote:

> On Fri, 2011-05-13 at 22:40 +0000, Giorgos Tzampanakis wrote:
>> I have numeric data in ASCII files, each about 800 MB. Loading such a
>> file into Octave takes about 30 seconds. With numpy it is so slow that
>> I've never had the patience to see it through to the end.
>
> If the layout is more or less simple, you may have better luck reading
> the files with Python's built-in CSV reader and only then converting
> the lists to NumPy arrays.
>
> I know it is definitely not the best solution out there, but it takes
> zero effort, and I have been able to load 500 MB files in a matter of
> tens of seconds without any problems:

Thanks for the suggestion! It wasn't quite as fast as Octave; in fact,
it was about six times slower, but I think it'll do for an initial
load. Then I can save to numpy's native binary format.
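
Something along these lines, I suppose (the filenames are made up):

import numpy as np

# The one-time slow parse (this, or any other reader):
matrix = np.loadtxt("data.txt")

# Cache in numpy's native binary format:
np.save("data.npy", matrix)

# Later sessions reload it almost instantly:
matrix = np.load("data.npy")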

The question now is, why aren't genfromtxt and loadtxt using this approach
if it is faster than what they're doing?

Yury V. Zaytsev

May 14, 2011, 8:45:21 AM
to SciPy Users List
Hi!

On Sat, 2011-05-14 at 12:25 +0000, Giorgos Tzampanakis wrote:

> Thanks for the suggestion! It wasn't quite as fast as Octave; in fact,
> it was about six times slower, but I think it'll do for an initial
> load. Then I can save to numpy's native binary format.

That's also what I do for >1 GB matrices: just save them in the native
NumPy format for later use, and the load times become negligible,
especially from /dev/shm mounts ;-)
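
For example (Linux-specific; the filename is made up):

import numpy as np

# /dev/shm is a RAM-backed tmpfs on most Linux systems; files there
# live in memory, so reloading is about as fast as it gets.
matrix = np.arange(12.0).reshape(3, 4)   # stand-in for the real data
np.save("/dev/shm/matrix.npy", matrix)
matrix = np.load("/dev/shm/matrix.npy")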

> The question now is, why aren't genfromtxt and loadtxt using this approach
> if it is faster than what they're doing?

I think it all comes down to post-processing and heuristics. It seems
that these functions do quite a lot of extra work to make sure that the
data is loaded correctly, precision isn't lost, etc.

I suspect there is a way to speed them up by specifying formats in the
function calls, but I never really found the time to figure it out and
went for my extra-simple reader instead, since I had to perform some
weird pre-processing on the data row by row anyway.
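
For example, something like this might spare loadtxt some of its
guesswork (untested, and the parameters are illustrative):

import numpy as np

# Spelling out the format up front should skip some of loadtxt's
# per-field heuristics; whether it helps at this scale I haven't
# measured.
matrix = np.loadtxt(
    "data.txt",
    dtype=np.float64,  # fixed element type
    delimiter=",",     # fixed field separator
    comments="#",      # lines to skip
)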

--
Sincerely yours,
Yury V. Zaytsev
