Is there a way to make views of h5py datasets?


stuarteberg

Feb 15, 2013, 1:59:40 PM
to h5...@googlegroups.com
Hi,

I'm wondering if there's a way to create h5py dataset views, much like numpy array views.

Ideally, I would like to use h5py datasets in place of numpy arrays, without needing to distinguish between the two.  Is that possible?

This annotated transcript shows what I mean.

Thanks,
Stuart

# Make a zero-valued dataset
In [51]: f = h5py.File('example.h5', 'w')
In [52]: d = f.create_dataset('dset', data=numpy.zeros((4,4)))

# Make a zero-valued array for comparison
In [53]: a = numpy.zeros((4,4))

# This function modifies all entries of the provided array view
In [54]: def update_view(view):
   ....:     view[:] = 1
   ....:    

# Array is initially zero....
In [55]: a
Out[55]:
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

In [56]: update_view(a[0])

# ... and now it's been updated with some ones.
In [57]: a
Out[57]:
array([[ 1.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

# dataset is initially zero....
In [58]: d[:]
Out[58]:
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

In [59]: update_view(d[0])

# .... but it isn't updated.  (d[0] returned a copy, not a view)
In [60]: d[:]
Out[60]:
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

Andrew Collette

Feb 15, 2013, 3:59:36 PM
to h5...@googlegroups.com
Hi Stuart,

> I'm wondering if there's a way to create h5py dataset views, much like numpy
> array views.
>
> Ideally, I would like to use h5py datasets in place of numpy arrays, without
> needing to distinguish between the two. Is that possible?

There is no mechanism in h5py for this, although I bet you could build
one using dataspaces.

However, there are more subtle issues with this approach. Since the
view would point back to data in the file, there are efficiency
concerns. After adding the __array__ method to Dataset objects, we
got lots of complaints from people who mixed Dataset objects with
NumPy arrays and were surprised that it was slow. Although the
objects have similar syntax for reads and writes, code written for
NumPy arrays makes all kinds of assumptions about speed which Datasets
break.

That said, if you're really dead-set on this I could give you some
suggestions on how to write a DatasetView class using HDF5 dataspace
support.

Andrew
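For readers who want to see the copy-versus-view difference concretely, here is a minimal sketch of a write-through wrapper. It is not the dataspace-based approach mentioned above; it simply forwards indexing to the backing object. The names `CopyOnReadStore` and `DatasetView` are hypothetical, and the store is a NumPy-backed stand-in that imitates a dataset by returning copies on every read.

```python
import numpy as np


class CopyOnReadStore:
    """Hypothetical stand-in for a dataset: every read returns a copy."""

    def __init__(self, data):
        self._data = np.array(data, dtype=float)

    def __getitem__(self, key):
        # Like a dataset read, hand back a fresh array, never a view.
        return np.array(self._data[key], copy=True)

    def __setitem__(self, key, value):
        self._data[key] = value


class DatasetView:
    """Hypothetical write-through view onto integer leading indices of `base`."""

    def __init__(self, base, prefix):
        self._base = base
        # Normalize a single index such as 0 into the tuple (0,).
        self._prefix = prefix if isinstance(prefix, tuple) else (prefix,)

    def _full_key(self, key):
        key = key if isinstance(key, tuple) else (key,)
        return self._prefix + key

    def __getitem__(self, key):
        return self._base[self._full_key(key)]

    def __setitem__(self, key, value):
        # Forward the write to the backing object rather than to a temporary copy.
        self._base[self._full_key(key)] = value


def update_view(view):
    view[:] = 1


store = CopyOnReadStore(np.zeros((4, 4)))
update_view(store[0])               # writes into a temporary copy; store is unchanged
unchanged = store[:]
update_view(DatasetView(store, 0))  # writes through to row 0 of the store
updated = store[:]
```

The sketch only handles integer prefixes; composing slices of slices would need real index arithmetic. Every access through the wrapper still goes to the backing object, so the efficiency concerns described above apply in full.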

Stuart Berg

Feb 15, 2013, 4:39:59 PM
to h5...@googlegroups.com
Hi Andrew,

Thanks for your thoughts on this.  I wasn't even aware of the h5py.Dataset.__array__ method.  Is there documentation about this method anywhere (besides the docstring)?

I thought this would enable a specific optimization in my code.  But now that you mention it, I can see how this might not really be an optimization for me at all.  In some simple cases, my "update" functions merely write to the view. However, in other cases, my "update" functions actually use the view as temporary storage space for intermediate computations.  Using an hdf5 dataset for that would likely be slow.  Any memory I saved would probably not be worth the performance hit of reading/writing hdf5 for intermediate results.

I won't need suggestions for how to implement a DatasetView class.  Thanks for offering, though.

By the way, do your efficiency concerns still hold for a dataset that is entirely in-memory?  That is, if the dataset's file was created like so:

mem_file = h5py.File(filename, driver='core', backing_store=False, mode='w')

Best,
Stuart

Andrew Collette

Feb 15, 2013, 5:22:36 PM
to h5...@googlegroups.com
Hi Stuart,

> Thanks for your thoughts on this. I wasn't even aware of the
> h5py.Dataset.__array__ method. Is there documentation about this method
> anywhere (besides the docstring)?

It's mentioned here:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html

and I think in the official NumPy Book (PDF), which is now free. We
added it because np.array(dataset) was slow; if __array__ isn't
present, np.array iterates over the first axis.
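As a rough illustration of the `__array__` hook described here, the sketch below defines a small container that hands NumPy its whole contents in one call. `RowStore` and its `array_calls` counter are hypothetical names used only for this example; this is not h5py's actual implementation.

```python
import numpy as np


class RowStore:
    """Hypothetical container that exposes its data through __array__."""

    def __init__(self, rows):
        self._rows = [list(r) for r in rows]
        self.array_calls = 0

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        return self._rows[index]

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook to obtain the full contents in one step,
        # instead of walking the object row by row.
        self.array_calls += 1
        return np.array(self._rows, dtype=dtype)


store = RowStore([[0.0, 1.0], [2.0, 3.0]])
arr = np.array(store)
```

The `copy` parameter is accepted only so the signature matches what newer NumPy releases pass; the sketch always builds a new array.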

> By the way, do your efficiency concerns still hold for a dataset that is
> entirely in-memory? That is, if the dataset's file was created like so:
>
> mem_file = h5py.File(filename, driver='core', backing_store=False, mode='w')

Although the data will be in memory, there's some overhead for reading
and writing. Code that handles slicing, for example, is written
entirely in Python. And you still have to do all the HDF5 things
required to talk to a file; set up types, create and select
dataspaces, etc., all of which takes time and is done on a per-read or
per-write basis.

Andrew
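Because that overhead is paid on every read and every write, one common way to limit it is to read a block once, do the intermediate work in memory, and write the result back once. The sketch below only counts accesses on a hypothetical NumPy-backed stand-in (`CountingStore`); it is not a timing of h5py itself.

```python
import numpy as np


class CountingStore:
    """Hypothetical stand-in for a dataset that counts read and write calls."""

    def __init__(self, shape):
        self._data = np.zeros(shape)
        self.reads = 0
        self.writes = 0

    def __getitem__(self, key):
        self.reads += 1
        return np.array(self._data[key], copy=True)

    def __setitem__(self, key, value):
        self.writes += 1
        self._data[key] = value


def increment_elementwise(store, n_rows, n_cols):
    # One read and one write per element: the per-access cost is paid
    # n_rows * n_cols times.
    for i in range(n_rows):
        for j in range(n_cols):
            store[i, j] = store[i, j] + 1


def increment_batched(store):
    # One read, the computation in memory, then one write.
    block = store[...]
    block += 1
    store[...] = block


elementwise = CountingStore((4, 4))
increment_elementwise(elementwise, 4, 4)

batched = CountingStore((4, 4))
increment_batched(batched)
```

Both functions leave the same values in the store; only the number of round trips to the backing object differs.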