A faster way that I have found is to convert the data to HDF5 and then
load it into NumPy; however, this is an extra step that I would like to
avoid, if possible.
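
(For reference, the loading side of the HDF5 route is roughly the
following; this is a minimal sketch using h5py, and the file and dataset
names are placeholders.)

import h5py
import numpy as np

# Assuming the ASCII data has already been converted to an HDF5 file;
# "data.h5" and "matrix" are placeholder names.
with h5py.File("data.h5", "r") as f:
    matrix = f["matrix"][:]  # reads the whole dataset into a NumPy array
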
Any suggestions welcome.
If the layout is more or less simple, you may have better luck reading
the files with Python's built-in CSV reader and only then converting the
resulting lists to NumPy arrays.
I know it is definitely not the best solution out there, but it takes
zero effort, and I have been able to load 500 MB files in a matter of
tens of seconds without any problems:
import csv
import numpy as np

# fp: the ASCII data file; the name here is just a placeholder
fp = open("data.csv")

# Auto-detect the CSV dialect that is being used
dialect = csv.Sniffer().sniff(fp.read(1024))
fp.seek(0)
reader = csv.reader(fp, dialect)

data = []
for row in reader:
    # Filter out empty fields
    row = [x for x in row if x != ""]
    ...
    data.append(row)
    ...

fp.close()

matrix = np.asarray(data, dtype=float)
--
Sincerely yours,
Yury V. Zaytsev
> On Fri, 2011-05-13 at 22:40 +0000, Giorgos Tzampanakis wrote:
>> I have numeric data in ASCII files, each file about 800 MB. Loading such
>> a file into Octave takes about 30 seconds. In NumPy it is so slow that
>> I've never had the patience to see it through to the end.
>
> If the layout is more or less simple, you may have better luck reading
> the files with Python's built-in CSV reader and only then converting the
> resulting lists to NumPy arrays.
>
> I know it is definitely not the best solution out there, but it takes
> zero effort, and I have been able to load 500 MB files in a matter of
> tens of seconds without any problems:
Thanks for the suggestion! It wasn't quite as fast as Octave; in fact, it
was about 6 times slower, but I think it'll do for an initial load. Then
I can save to NumPy's native binary format.
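
The save/reload step I have in mind is just something like this (a
sketch; `matrix` is the array from the CSV-based loader above, and the
file name is a placeholder):

import numpy as np

# One-off: store the parsed array in NumPy's native binary format.
# "matrix.npy" is a placeholder file name.
np.save("matrix.npy", matrix)

# Later runs: reloading the .npy file takes a fraction of the parsing time.
matrix = np.load("matrix.npy")
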
The question now is: why aren't genfromtxt and loadtxt using this
approach, if it is faster than what they're doing?
On Sat, 2011-05-14 at 12:25 +0000, Giorgos Tzampanakis wrote:
> Thanks for the suggestion! It wasn't quite as fast as Octave; in fact, it
> was about 6 times slower, but I think it'll do for an initial load. Then
> I can save to NumPy's native binary format.
That's also what I do for >1 GB matrices: just save them in the native
NumPy format for later use, and the load times become negligible,
especially from /dev/shm mounts ;-)
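
Roughly what I mean, as a sketch rather than my exact setup (the paths
are placeholders, and the memory-mapped load is optional):

import shutil
import numpy as np

# Copy the .npy file to a tmpfs (RAM-backed) mount once per session;
# "/dev/shm/matrix.npy" is a placeholder path.
shutil.copy("matrix.npy", "/dev/shm/matrix.npy")

# Loading from tmpfs avoids disk I/O altogether; mmap_mode="r" maps the
# file instead of reading it into memory up front.
matrix = np.load("/dev/shm/matrix.npy", mmap_mode="r")
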
> The question now is: why aren't genfromtxt and loadtxt using this
> approach, if it is faster than what they're doing?
I think it all comes down to post-processing and heuristics. It seems
that these functions do quite a lot of extra work to make sure that the
data is loaded correctly, no precision is lost, etc.

I suppose there is a way to speed them up by specifying formats in the
function calls, but I never really found the time to figure it out and
went for my extra-simple reader instead, since I had to perform some
weird pre-processing on the data row by row anyway.
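
A sketch of what I mean by specifying formats up front (I haven't
benchmarked this myself; the file name, delimiter and dtype below are
just placeholders):

import numpy as np

# Telling loadtxt the dtype and delimiter up front spares it some of the
# per-field guessing; "data.csv", float and "," are placeholder choices.
matrix = np.loadtxt("data.csv", dtype=float, delimiter=",")

# genfromtxt takes the same kind of hints, and also per-column dtypes.
matrix = np.genfromtxt("data.csv", dtype=float, delimiter=",")
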
--
Sincerely yours,
Yury V. Zaytsev