I did a quick test and using a naive approach h5py took 246 seconds
where PyTables took only 0.005 seconds. Are there any tricks to
improve this?
Thanks.
The test read a 7.2 MB array of ints that was already in the disk
cache, so we're really measuring only CPU overhead:
>>> import time, numpy
>>> import h5py
>>> f = h5py.File('stack.h5')
>>> s = time.time(); numpy.array(f['segment']); print time.time() - s
array([[ 1110, 0, 0],
[ 1011, 11, 495],
[ 1011, 17, 497],
...,
[ 1092, 39769, 1035238],
[ 1092, 39769, 1035243],
[ 1092, 39769, 1035246]], dtype=int32)
246.247138977
>>> import tables
>>> f = tables.openFile('stack.h5')
>>> s = time.time(); f.getNode('/segment').read(); print time.time() - s
array([[ 1110, 0, 0],
[ 1011, 11, 495],
[ 1011, 17, 497],
...,
[ 1092, 39769, 1035238],
[ 1092, 39769, 1035243],
[ 1092, 39769, 1035246]], dtype=int32)
0.00539898872375
That issue is described here:
http://code.google.com/p/h5py/wiki/CommonProblems#Performance
Also, doesn't f['segment'] already return a numpy array?
> Is there any way to speed up conversion of datasets to numpy arrays?
> I saw the FAQ, it more or less just says to "convert once only" but
> even a single conversion is too slow in our case. If the bytes are
> sitting in memory isn't there some trick to get numpy to slurp them up
> directly?
>
> I did a quick test and using a naive approach h5py took 246 seconds
> where PyTables took only 0.005 seconds. Are there any tricks to
> improve this?
Yes, this pathological performance is a side effect of using the
expression "data = numpy.array(<hdf5 dataset>)"; NumPy doesn't know
that the dataset works like an array and resorts to iteration over the
first axis to create the array. Try the syntax "data =
f['something'][...]" instead. H5py 1.3 (now in beta) contains an
__array__ method to better communicate its capabilities to NumPy.
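A rough way to see the effect without any HDF5 file is a NumPy-only sketch. RowProxy and ArrayAwareProxy are hypothetical stand-ins invented for illustration, not h5py's actual classes; they only count how they are accessed. It assumes NumPy falls back to the sequence protocol for an object that defines only __len__ and __getitem__:

```python
import numpy as np


class RowProxy:
    """Hypothetical stand-in for an on-disk dataset (not h5py's real class)."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self.row_reads = 0    # number of single-row accesses
        self.bulk_reads = 0   # number of whole-dataset accesses

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        if key is Ellipsis:
            self.bulk_reads += 1
            return self._data.copy()
        self.row_reads += 1
        # An IndexError past the last row is what ends plain iteration.
        return self._data[key].copy()


class ArrayAwareProxy(RowProxy):
    """Same proxy, but it tells NumPy how to convert it in one step."""

    def __array__(self, dtype=None, copy=None):
        out = self[...]
        return out if dtype is None else out.astype(dtype)


source = np.arange(3000, dtype=np.int32).reshape(1000, 3)

slow = RowProxy(source)
a = np.array(slow)        # NumPy walks the first axis row by row

fast = RowProxy(source)
b = fast[...]             # one bulk read, as with f['something'][...]

aware = ArrayAwareProxy(source)
c = np.array(aware)       # NumPy calls __array__ instead of iterating

print(slow.row_reads, fast.bulk_reads, aware.row_reads)
```

With a real dataset every one of those row accesses is a separate read through the HDF5 library, which is presumably where the time goes.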
To further clarify, the data is actually still on disk until you ask
for a slice (both h5py and PyTables work this way), but this doesn't
really matter for small datasets.
To address Keith's comment:
> Also, doesn't f['segment'] already return a numpy array?
It actually returns an "array-like" Dataset object, which has dtype
and shape attributes and supports slicing, but is only a proxy for an
HDF5 dataset. Dataset objects are not intended to be replacements for
NumPy arrays and can't participate in NumPy mathematical operations;
they're just a convenient proxy for reading and writing data. To get
a NumPy array, do e.g. f['segment'][...].
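As a small sketch of the difference between the proxy and the array, the following writes a tiny throwaway file and reads it back. The file name demo.h5 and the 4x3 contents are made up for illustration; only the dataset name 'segment' comes from the thread:

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway file in a temp directory (hypothetical name, not stack.h5).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with h5py.File(path, "w") as f:
    data = np.arange(12, dtype=np.int32).reshape(4, 3)
    f.create_dataset("segment", data=data)

with h5py.File(path, "r") as f:
    dset = f["segment"]               # Dataset proxy; nothing read yet
    proxy_type = type(dset).__name__
    shape, dtype = dset.shape, dset.dtype
    first_rows = dset[0:2]            # reads only the first two rows
    whole = dset[...]                 # reads everything into an ndarray

print(proxy_type, shape, dtype)
print(type(whole).__name__, first_rows.tolist())
```

Both first_rows and whole are ordinary NumPy arrays and remain usable after the file is closed; dset itself is not.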
Andrew
> Try the syntax "data = f['something'][...]" instead.
myarray = dataset[...]
But I didn't realize it meant literally use an ellipsis. I thought it meant "just index dataset directly instead of converting it to a numpy array"!