I'd like to get more input on an API design question. Numpy has a datetime64 dtype to represent timestamps, but HDF5 has no corresponding type (there is one called H5T_TIME, but it's not portable and not supported). It is therefore proposed to store datetimes as HDF5 OPAQUE data with a tag recording the numpy dtype.
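To make the "opaque data plus a tag" idea concrete, here is a rough numpy-only sketch. It does not use h5py or HDF5 at all: the helpers to_opaque and from_opaque are hypothetical names made up for illustration, and the real implementation may well differ. The point is only that the raw bytes carry no meaning by themselves, and the tag is what lets Python rebuild the array.

```python
import numpy as np

def to_opaque(arr):
    """Hypothetical helper: split an array into (tag, raw bytes, shape)."""
    tag = arr.dtype.str.encode("ascii")  # dtype string, e.g. b'<M8[s]' on little-endian
    return tag, arr.tobytes(), arr.shape

def from_opaque(tag, raw, shape):
    """Hypothetical helper: rebuild the array from the bytes and the dtype tag."""
    dtype = np.dtype(tag.decode("ascii"))
    return np.frombuffer(raw, dtype=dtype).reshape(shape)

stamps = np.array(["2019-01-01T00:00:00", "2019-06-15T12:30:00"],
                  dtype="datetime64[s]")
tag, raw, shape = to_opaque(stamps)
restored = from_opaque(tag, raw, shape)
```

A tool that doesn't know how to interpret the tag sees only 8-byte blobs, which is the portability concern here.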
If you use h5py to store and retrieve arbitrary numpy data, this is an easy win. But if you're writing data for other, non-Python tools to consume, opaque data probably won't mean anything to them. So I'd like to require some extra step to create these datasets, to remind the user that they are probably h5py-specific.
We have now gone through three designs for this:
1. Allow it with no restrictions, so you can create a dataset like f['data'] = datetime_array
2. Require the dtype to be globally registered before storing this kind of data: h5py.register_dtype(datetime_array.dtype); f['data'] = datetime_array
3. Require an opaque wrapper when creating the dataset: f.create_dataset('data', data=datetime_array, dtype=h5py.opaque_dtype(datetime_array.dtype))
I prefer 3 as an API: unlike 2, it doesn't add global mutable state, and I think it's clearer. But it's proving tricky to implement, and I'm starting to wonder if I'm swimming against the design of h5py. What do other people prefer? Do you have an idea for a better option?
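To illustrate the difference between 2 and 3, here is a toy model using a plain dict in place of a file. Everything in it is a stand-in written for this message: store_registered, store_wrapped, and the toy register_dtype and opaque_dtype are hypothetical and are not h5py's implementation. Marking the dtype through numpy's dtype metadata is just one assumed way to carry the "opaque" flag.

```python
import numpy as np

_REGISTERED = set()  # design 2: process-wide mutable state

def register_dtype(dt):
    """Toy version of design 2: opt in globally for this dtype."""
    _REGISTERED.add(np.dtype(dt))

def opaque_dtype(dt):
    """Toy version of design 3: return a copy of the dtype flagged as opaque."""
    return np.dtype(dt, metadata={"opaque": True})

def store_registered(store, name, arr):
    # Design 2: whether this succeeds depends on earlier calls anywhere in the process.
    if arr.dtype.kind == "M" and arr.dtype not in _REGISTERED:
        raise TypeError("datetime dtype not registered")
    store[name] = arr.copy()

def store_wrapped(store, name, arr, dtype=None):
    # Design 3: the opt-in is visible at the call site that creates the dataset.
    meta = dtype.metadata if dtype is not None and dtype.metadata else {}
    if arr.dtype.kind == "M" and not meta.get("opaque"):
        raise TypeError("wrap the dtype with opaque_dtype() to store datetimes")
    store[name] = arr.copy()

datetime_array = np.array(["2019-01-01", "2019-01-02"], dtype="datetime64[D]")
toy_file = {}
store_wrapped(toy_file, "data", datetime_array,
              dtype=opaque_dtype(datetime_array.dtype))
```

With the wrapper, a reader of the code can see the opt-in next to the dataset it applies to; with the registry, the same store call can pass or fail depending on what ran earlier.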
For context, there are a few other numpy types with no equivalent HDF5 type, and this is how h5py deals with them:
- boolean - makes an HDF5 enum with two values
- complex - makes an HDF5 compound type for (real, imaginary) parts
- fixed-width UTF-32 unicode - h5py explicitly refuses to handle it
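For the first two, the mapping can be pictured in numpy terms roughly as below. This is only a sketch of the idea, not what h5py runs internally; the field names "r" and "i" and the enum names FALSE and TRUE are assumptions on my part.

```python
import numpy as np

# complex -> compound of (real, imaginary) parts; field names are assumed
z = np.array([1 + 2j, 3 - 4j], dtype=np.complex128)
compound = np.dtype([("r", np.float64), ("i", np.float64)])
as_compound = z.view(compound)  # same 16 bytes per element, reinterpreted

# boolean -> integer enum with two named values; names are assumed
flags = np.array([True, False, True])
enum_values = {"FALSE": 0, "TRUE": 1}
as_enum = flags.astype(np.int8)

# Both mappings are reversible, so nothing is lost on the way back.
back_to_complex = as_compound.view(np.complex128)
back_to_bool = as_enum.astype(bool)
```

Unlike opaque data, a compound or an enum is something other HDF5 tools can read without knowing anything about numpy.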
This has now been discussed over two PRs:
Thanks,
Thomas