Using h5py Dataset objects with numpy

Josh Hemann

Feb 5, 2011, 3:22:10 PM

to h5py

I am new to using h5py but am really impressed so far.

All of my Python coding involves use of numpy/scipy/matplotlib for
statistical analysis. Given the size of some data sets, I need the
option of working from disk as opposed to strictly in-memory
processing, so discovering h5py has been hugely helpful. I'd like to
write as general numpy/scipy/matplotlib code as possible though, such
that whether the data are in memory, or on disk, my code will work the
same. Tall order, I know, and I am trying to see how far I can take
this.

In the following example it looks like I can easily hand off h5py
Datasets to scipy as if they were numpy ndarrays:

In [30]: f = h5py.File('tx.hdf5', 'r')

In [31]: tx = f['tx']

In [32]: tx
Out[32]: <HDF5 dataset "tx": shape (8443090,), type "<f8">

In [33]: import scipy.stats.mstats_extras as mstats

# Compute the Harrell-Davis quantiles of the array...
In [34]: mstats.hdquantiles(tx, prob=[0.25, 0.5, 0.75])
Out[34]:
masked_array(data = [ 8.1428571428571175 16.281780492823938 29.428571428571413 ],
             mask = False,
       fill_value = 1e+20)

So far so good. But of course, I may need to do simple arithmetic
operations on my arrays, like

In [35]: tx = tx + 5

which results in an error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

C:\\<ipython console> in <module>()

TypeError: unsupported operand type(s) for +: 'Dataset' and 'int'

So, obviously my code will not be 1:1 as if I were working with numpy
arrays, but can anyone point out comprehensive examples of working
with h5py Datasets and the key scientific Python packages? I am
looking for guidance on how to work with Datasets as much as possible
before having to extract h5py data into memory via numpy arrays.

Andrew Collette

Feb 8, 2011, 12:38:03 AM

to h5...@googlegroups.com

Hi,

> I am new to using h5py but am really impressed so far.

Thanks!

> So, obviously my code will not be 1:1 as if I were working with numpy
> arrays, but can anyone point out comprehensive examples of working
> with h5py Datasets and the key scientific Python packages? I am
> looking for guidance on how to work with Datasets as much as possible
> before having to extract h5py data into memory via numpy arrays.

Yes, the "Numpy-array-like" abstraction is a bit leaky in the case of
Datasets, and most of this is intentional. Apart from attribute
access, the only NumPy-like features which are guaranteed to work are
a subset of indexing, and the __array__ capability. This means that
things like addition, multiplication and other operations are not
supported. The reason for this is that h5py is designed as a data
storage system, with some NumPy-like features bolted on for
convenience. There are serious performance issues which arise when
trying to pretend that Dataset objects are "real" NumPy arrays,
because of the cost of performing I/O. It's important to clarify that
no operations in h5py/HDF5 are performed on-disk. For example, an
implementation which evaluated "tx = tx + 5" in the example you
provide would have to do something like "tx[:] = tx[:] + 5", requiring
the entire dataset to be read and then written back to disk. If you
have many of these "simple" statements then you spend most of your
time shuffling data back and forth from storage.
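
To make the explicit-I/O pattern concrete, here is a minimal sketch of "read once, compute in memory, write once". The file name, dataset contents and the in-memory "core" driver are assumptions chosen so the example leaves nothing on disk; with a real file you would open it in "r+" mode (not "r" as in the session above) before writing back.

```python
import numpy as np
import h5py

# Assumption: an in-memory HDF5 file (driver="core", backing_store=False)
# stands in for a real file such as tx.hdf5.
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as f:
    tx = f.create_dataset("tx", data=np.arange(10, dtype="f8"))

    # One explicit read: slicing a Dataset returns an ordinary ndarray.
    data = tx[...]

    # np.asarray goes through the Dataset's __array__ support and also
    # reads the whole dataset into memory.
    same_data = np.asarray(tx)

    # All arithmetic happens in memory, on the ndarray.
    data = data + 5
    data *= 2

    # One explicit write back to the dataset.
    tx[...] = data

    result = tx[...]
```

However many arithmetic statements sit between the read and the write, the dataset is transferred only twice.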

In practice, any operation which expects slicing to be cheap will run
into this problem. Attractive as it might be to have code which looks
the same for NumPy/HDF5, it's better to do your I/O explicitly so you
know what's happening where. Your goal should be to perform at most
one disk read and one disk write for each piece of data that
participates in your operations.
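
When a dataset is too large to read in one go, the same rule can be applied block by block. The sketch below is one possible way to do that, not an h5py feature: the helper name add_constant_blockwise, the block size and the dataset names are all hypothetical. Because it relies only on .shape and basic slicing, it should accept either h5py Datasets or plain NumPy arrays.

```python
import numpy as np
import h5py


def add_constant_blockwise(src, dst, value, block=1_000_000):
    """Store src + value in dst, one block at a time (hypothetical helper).

    src and dst may be h5py Datasets or NumPy arrays with the same
    length along their first axis.
    """
    n = src.shape[0]
    for start in range(0, n, block):
        stop = min(start + block, n)
        chunk = src[start:stop]           # one read per block
        dst[start:stop] = chunk + value   # one write per block


# Assumption: an in-memory file so the example leaves nothing on disk.
with h5py.File("demo_blocks.hdf5", "w", driver="core", backing_store=False) as f:
    src = f.create_dataset("tx", data=np.arange(10, dtype="f8"))
    dst = f.create_dataset("tx_plus_5", shape=src.shape, dtype=src.dtype)

    add_constant_blockwise(src, dst, 5, block=4)
    on_disk_result = dst[...]

# The same function works unchanged on in-memory arrays.
a = np.arange(10, dtype="f8")
b = np.empty_like(a)
add_constant_blockwise(a, b, 5, block=4)
```

Each element is read once and written once, and only one block is held in memory at a time.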