numpy conversion speed

Philip Winston

Mar 12, 2010, 2:43:10 PM
to h5py
Is there any way to speed up conversion of datasets to numpy arrays?
I saw the FAQ, it more or less just says to "convert once only" but
even a single conversion is too slow in our case. If the bytes are
sitting in memory isn't there some trick to get numpy to slurp them up
directly?

I did a quick test and using a naive approach h5py took 246 seconds
where PyTables took only 0.005 seconds. Are there any tricks to
improve this?

Thanks.

The test read a 7.2 MB array of ints that was already in the disk
cache, so we're really measuring only CPU overhead:

>>> import time
>>> import numpy
>>> import h5py
>>> f = h5py.File('stack.h5')
>>> s = time.time(); numpy.array(f['segment']); print(time.time() - s)
array([[   1110,       0,       0],
       [   1011,      11,     495],
       [   1011,      17,     497],
       ...,
       [   1092,   39769, 1035238],
       [   1092,   39769, 1035243],
       [   1092,   39769, 1035246]], dtype=int32)
246.247138977

>>> import tables
>>> f = tables.openFile('stack.h5')
>>> s = time.time(); f.getNode('/segment').read(); print(time.time() - s)
array([[   1110,       0,       0],
       [   1011,      11,     495],
       [   1011,      17,     497],
       ...,
       [   1092,   39769, 1035238],
       [   1092,   39769, 1035243],
       [   1092,   39769, 1035246]], dtype=int32)
0.00539898872375

Keith Goodman

Mar 12, 2010, 4:43:17 PM
to h5...@googlegroups.com
On Fri, Mar 12, 2010 at 11:43 AM, Philip Winston <pwin...@gmail.com> wrote:
> Is there any way to speed up conversion of datasets to numpy arrays?
> I saw the FAQ, it more or less just says to "convert once only" but
> even a single conversion is too slow in our case.  If the bytes are
> sitting in memory isn't there some trick to get numpy to slurp them up
> directly?
>
> I did a quick test and using a naive approach h5py took 246 seconds
> where PyTables took only 0.005 seconds. Are there any tricks to
> improve this?
>
> Thanks.
>
> Test was reading a 7.2MB array of ints, it was already in the disk
> cache so we're really measuring only CPU overhead:
>
> >>> import h5py
> >>> f = h5py.File('stack.h5')
> >>> s = time.time(); numpy.array(f['segment']); print time.time() - s

That issue is described here:

http://code.google.com/p/h5py/wiki/CommonProblems#Performance

Also, doesn't f['segment'] already return a numpy array?

Andrew Collette

Mar 12, 2010, 4:49:05 PM
to h5...@googlegroups.com
Hi,

> Is there any way to speed up conversion of datasets to numpy arrays?
> I saw the FAQ, it more or less just says to "convert once only" but
> even a single conversion is too slow in our case.  If the bytes are
> sitting in memory isn't there some trick to get numpy to slurp them up
> directly?
>
> I did a quick test and using a naive approach h5py took 246 seconds
> where PyTables took only 0.005 seconds. Are there any tricks to
> improve this?

Yes, this pathological performance is a side effect of using the
expression "data = numpy.array(<hdf5 dataset>)"; NumPy doesn't know
that the dataset works like an array and resorts to iteration over the
first axis to create the array. Try the syntax "data =
f['something'][...]" instead. H5py 1.3 (now in beta) contains an
__array__ method to better communicate its capabilities to NumPy.
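
The fallback described above can be reproduced without any HDF5 file. The following is only a rough sketch using NumPy alone: `RowProxy` and `ArrayAwareProxy` are hypothetical stand-ins invented for illustration, not h5py classes, and the read counter merely approximates the per-row work a real dataset would do.

```python
import numpy as np


class RowProxy:
    """Hypothetical sliceable proxy that NumPy does not recognise as an array."""

    def __init__(self, data):
        self._data = data
        self.reads = 0  # number of separate read requests received

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        self.reads += 1  # every indexing call counts as one read
        return self._data[key]


class ArrayAwareProxy(RowProxy):
    """Same proxy, plus an __array__ method that hands over all data at once."""

    def __array__(self, dtype=None, copy=None):
        # The copy argument is accepted for newer NumPy versions and ignored here.
        self.reads += 1
        return np.asarray(self._data, dtype=dtype)


source = np.arange(3000, dtype=np.int32).reshape(1000, 3)

slow = RowProxy(source)
via_array = np.array(slow)    # NumPy iterates over the first axis
print("numpy.array(proxy) reads:", slow.reads)

direct = RowProxy(source)
via_ellipsis = direct[...]    # a single request for the whole dataset
print("proxy[...] reads:", direct.reads)

aware = ArrayAwareProxy(source)
via_method = np.array(aware)  # NumPy uses __array__ instead of iterating
print("numpy.array(aware proxy) reads:", aware.reads)
```

With a real dataset each of those row-by-row reads goes through the HDF5 library, which would be consistent with the timings reported at the top of the thread.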

To further clarify, the data is actually still on disk until you ask
for a slice (both h5py and PyTables work this way), but this doesn't
really matter for small datasets.

To address Keith's comment:

> Also, doesn't f['segment'] already return a numpy array?

It actually returns an "array-like" Dataset object, which has dtype
and shape attributes, and supports slicing, but is a proxy for an HDF5
dataset. However, these are not intended to be replacements for NumPy
arrays and can't participate in NumPy mathematical operations.
They're just a convenient proxy for reading and writing data. To get
a NumPy array, do e.g. f['segment'][...].
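
As a minimal sketch of the distinction, assuming h5py and NumPy are installed: the scratch file name and the small stand-in data below are made up for the example, and only the dataset name 'segment' comes from the thread.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file holding a small stand-in for the 'segment' dataset.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("segment", data=np.arange(12, dtype=np.int32).reshape(4, 3))

with h5py.File(path, "r") as f:
    dset = f["segment"]  # proxy object; no data has been read yet
    print(type(dset).__name__, dset.shape, dset.dtype)

    whole = dset[...]    # reads the entire dataset into a NumPy array
    part = dset[1:3]     # reads only rows 1 and 2

print(type(whole).__name__, whole.shape, part.shape)
print(whole.sum())       # ordinary NumPy operations work on the result
```

The arrays returned by slicing are independent copies in memory, so they remain usable after the file is closed.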

Andrew

Philip Winston

Mar 12, 2010, 9:19:51 PM
to h5...@googlegroups.com
> Try the syntax "data = f['something'][...]" instead.

Aha!  Great, yes, then it is exactly the same speed as PyTables, which is to say about 45,000 times faster than calling numpy.array().
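
As a rough check of that figure, the ratio can be worked out from the two timings printed in the interpreter sessions at the top of the thread. The variable names are just labels chosen for this sketch.

```python
# Timings copied from the two interpreter sessions in the first post.
h5py_seconds = 246.247138977          # numpy.array(f['segment'])
pytables_seconds = 0.00539898872375   # f.getNode('/segment').read()

speedup = h5py_seconds / pytables_seconds
print(f"{speedup:,.0f}x")
```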

I saw this line in the FAQ:
myarray = dataset[...]
but didn't realize it meant to literally type an ellipsis.  I thought it meant "just index the dataset directly instead of converting it to a numpy array"!

Thanks for the quick help.

-Philip
