compact layout


Pierre Complex

May 21, 2015, 4:13:47 AM
to h5...@googlegroups.com
Hi,

I sometimes create compact datasets. In h5py, there is no facility for this in "create_dataset", so I duplicate create_dataset's logic just to add the setting to the dcpl.

The function that creates the dataset (make_new_dset in dataset.py) does not handle this option.

Is there a plan to add this? If not, may I add an option (a named argument to make_new_dset, say layout) to which one could pass "h5d.COMPACT"?
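For the record, my current workaround amounts to roughly the following, expressed here directly with the low-level API (a sketch; the file name, dataset name and dtype are just placeholders):

    import numpy as np
    import h5py

    with h5py.File("params.h5", "w") as f:
        # Build the dcpl by hand, since create_dataset offers no way
        # to request a compact layout.
        dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
        dcpl.set_layout(h5py.h5d.COMPACT)

        space = h5py.h5s.create_simple((1,))
        dset = h5py.h5d.create(f.id, b"time", h5py.h5t.NATIVE_DOUBLE,
                               space, dcpl)
        dset.write(h5py.h5s.ALL, h5py.h5s.ALL, np.array([0.0]))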

Regards,

Pierre

Andrew Collette

May 21, 2015, 10:54:39 AM
to h5...@googlegroups.com
Hi Pierre,

> Is there a plan to add this? If not, may I add an option (a named argument
> to make_new_dset, say layout) to which one could pass "h5d.COMPACT"?

I am not aware of anyone working on this at the moment. The trouble
is that right now the layout strategy is determined automatically by
the choice of compression, the "chunks" argument, etc. A new "layout"
keyword to create_dataset would interact with that in complicated
ways.
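For instance, with current h5py (a quick illustration; the file name is arbitrary):

    import h5py

    with h5py.File("auto.h5", "w") as f:
        a = f.create_dataset("a", (100,))                      # contiguous by default
        b = f.create_dataset("b", (100,), compression="gzip")  # compression forces chunked storage
        c = f.create_dataset("c", (100,), chunks=(10,))        # explicit chunks, also chunked
        print(a.chunks, b.chunks, c.chunks)  # None for "a", chunk shapes for "b" and "c"

A hypothetical layout="compact" keyword would have to be rejected (or silently ignored) in combination with the last two.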

May I ask the specific motivation for using "compact"? By the way, these
datasets are limited to 64 KiB of data and don't work correctly in parallel mode.

Andrew

Pierre Complex

May 21, 2015, 12:56:21 PM
to h5...@googlegroups.com
Hi Andrew,

The point is to store small datasets efficiently. I must say that I have not
compared memory or performance for my applications, but the use case is
to store scalar parameters. Attributes cannot be used, as I would need to
attach an attribute to the data itself, and attributes cannot carry attributes
of their own.

<element>
 \-- step: Integer[]
 \-- time: Integer[]
 \-- value: <type>[variable][...]
 

Here element is a group; step, time and value are datasets. I want to be
able to add an attribute "unit" to the dataset "time". As "time" is a scalar
parameter, I guess that making it compact would be a plus.
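In h5py terms, something like this (a sketch; the names, dtypes and the
"unit" string are just illustrative):

    import numpy as np
    import h5py

    with h5py.File("steps.h5", "w") as f:
        elem = f.create_group("element")
        elem.create_dataset("step", data=np.array([0]))
        time = elem.create_dataset("time", data=np.array([0]))
        time.attrs["unit"] = "s"   # the reason "time" must be a dataset, not an attribute
        elem.create_dataset("value", data=np.zeros((4, 3)))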

In parallel, trouble occurs only if you write different data from different
CPUs, according to
http://www.hdfgroup.org/HDF5/doc/RM/RM_H5D.html#Dataset-Write

If that makes sense, I can look into the argument logic.

Regards,

Pierre

Andrew Collette

May 21, 2015, 2:04:15 PM
to h5...@googlegroups.com
Hi Pierre,

> Here element is a group; step, time and value are datasets. I want to be
> able to add an attribute "unit" to the dataset "time". As "time" is a
> scalar parameter, I guess that making it compact would be a plus.

I tend to be a bit suspicious of such optimizations. Before we
discuss options for adding things to h5py, it would be interesting to
see, for some reasonable case, what the performance benefits are.

> In parallel, trouble occurs only if you write different data from different
> CPUs, according to
> http://www.hdfgroup.org/HDF5/doc/RM/RM_H5D.html#Dataset-Write

That seems like a pretty big problem... generally the point of
parallel processing is that one writes different data from different
CPUs, and datasets with compact storage will silently record incorrect
data if this is done.

To be frank, it looks to me like compact dataset storage is a
misfeature in HDF5. But I'm open to being convinced otherwise.

Andrew

Pierre Complex

May 21, 2015, 4:50:44 PM
to h5...@googlegroups.com
Hi Andrew,


On Thursday, May 21, 2015 at 8:04:15 PM UTC+2, Andrew Collette wrote:
> Hi Pierre,
>
> > Here element is a group; step, time and value are datasets. I want to be
> > able to add an attribute "unit" to the dataset "time". As "time" is a
> > scalar parameter, I guess that making it compact would be a plus.
>
> I tend to be a bit suspicious of such optimizations. Before we
> discuss options for adding things to h5py, it would be interesting to
> see, for some reasonable case, what the performance benefits are.

OK, I will do some tests and come back if the results are interesting. Otherwise, I will forget about the idea.
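I am thinking of something along these lines (a rough sketch; the file
names, dataset count and timing approach are all just placeholders):

    import time
    import numpy as np
    import h5py

    N = 1000  # number of small scalar datasets

    def build(path, layout):
        # Create N one-element datasets with the given layout.
        with h5py.File(path, "w") as f:
            for i in range(N):
                dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
                dcpl.set_layout(layout)
                space = h5py.h5s.create_simple((1,))
                d = h5py.h5d.create(f.id, ("p%d" % i).encode(),
                                    h5py.h5t.NATIVE_DOUBLE, space, dcpl)
                d.write(h5py.h5s.ALL, h5py.h5s.ALL, np.array([float(i)]))

    def read_all(path):
        # Time a full read of every dataset in the file.
        t0 = time.time()
        with h5py.File(path, "r") as f:
            for name in f:
                f[name][()]
        return time.time() - t0

    build("compact.h5", h5py.h5d.COMPACT)
    build("contiguous.h5", h5py.h5d.CONTIGUOUS)
    print("compact   : %.4f s" % read_all("compact.h5"))
    print("contiguous: %.4f s" % read_all("contiguous.h5"))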

 
> > In parallel, trouble occurs only if you write different data from different
> > CPUs, according to
> > http://www.hdfgroup.org/HDF5/doc/RM/RM_H5D.html#Dataset-Write
>
> That seems like a pretty big problem... generally the point of
> parallel processing is that one writes different data from different
> CPUs, and datasets with compact storage will silently record incorrect
> data if this is done.
>
> To be frank, it looks to me like compact dataset storage is a
> misfeature in HDF5. But I'm open to being convinced otherwise.

Honestly, I have no real expectations here with respect to HDF5. It just
seemed like a reasonable idea :-)

Cheers,

Pierre