> I am new to using h5py but am really impressed so far.
Thanks!
> So, obviously my code will not be 1:1 as if I was working with numpy
> arrays, but can anyone point out comprehensive examples of working
> with h5py Datasets and the key scientific Python packages? I am
> looking for guidance on how to work with Datasets as much as possible
> before having to extract h5py data into memory via numpy arrays.
Yes, the "Numpy-array-like" abstraction is a bit leaky in the case of
Datasets, and most of this is intentional. Apart from attribute
access, the only NumPy-like features guaranteed to work are a subset
of indexing and the __array__ interface. This means that
things like addition, multiplication and other operations are not
supported. The reason for this is that h5py is designed as a data
storage system, with some NumPy-like features bolted on for
convenience. There are serious performance issues which arise when
trying to pretend that Dataset objects are "real" NumPy arrays,
because of the cost of performing I/O. It's important to clarify that
no operations in h5py/HDF5 are performed on disk; any computation has
to happen on data in memory. For example, an
implementation which evaluated "tx = tx + 5" in the example you
provide would have to do something like "tx[:] = tx[:] + 5", requiring
the entire dataset to be read and then written back to disk. If you
have many of these "simple" statements then you spend most of your
time shuffling data back and forth from storage.
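To make that cost concrete, here is a small sketch (the file name and
dataset size are made up for illustration; "tx" mirrors the example
above) of what each "simple" statement actually has to do:

```python
import numpy as np
import h5py

# Illustrative file and dataset; "tx" mirrors the example above.
with h5py.File("example.h5", "w") as f:
    tx = f.create_dataset("tx", data=np.arange(1000, dtype="f8"))

    # What an implicit "tx = tx + 5" would have to do under the hood:
    # read the entire dataset into memory, add 5, write it all back.
    tx[:] = tx[:] + 5

    # A second "simple" statement costs another complete round trip,
    # so N statements mean N full reads and N full writes.
    tx[:] = tx[:] * 2
```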
In practice, any operation which expects slicing to be cheap will run
into this problem. Attractive as it might be to have code which looks
the same for NumPy/HDF5, it's better to do your I/O explicitly so you
know what's happening where. Your goal should be to perform a maximum
of one disk read and one disk write operation for each bit of data
that participates in your operations.
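As a sketch of that rule, assuming a 1-D dataset and a made-up file
name: read once into a NumPy array, do all arithmetic in memory, and
write once. The blocked loop at the end covers data too large to hold
in memory at once (the block size here is an arbitrary choice; in
practice you would align it with the dataset's chunk shape):

```python
import numpy as np
import h5py

# Illustrative setup: a 1-D dataset we want to transform in place.
with h5py.File("explicit_io.h5", "w") as f:
    f.create_dataset("tx", data=np.arange(10_000, dtype="f8"))

with h5py.File("explicit_io.h5", "a") as f:
    dset = f["tx"]
    data = dset[:]           # the one disk read
    data = (data + 5) * 2    # pure in-memory NumPy operations
    dset[:] = data           # the one disk write

# For data too large for memory, process block by block, still
# touching each element only once per direction:
with h5py.File("explicit_io.h5", "a") as f:
    dset = f["tx"]
    step = 4096  # assumed block length; match your chunk layout
    for start in range(0, dset.shape[0], step):
        block = dset[start:start + step]      # read this block once
        dset[start:start + step] = block - 5  # write it back once
```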
HTH,
Andrew