API design: storing numpy datetimes in HDF5 datasets

Thomas Kluyver

Jul 16, 2019, 8:04:34 AM

to h5py

I'd like to get more input on an API design question. Numpy has a datetime64 dtype to represent timestamps, but HDF5 has no corresponding type (there is one called H5T_TIME, but it's not portable and not supported). It is therefore proposed to store datetimes as HDF5 OPAQUE data with a tag recording the numpy dtype.
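As a rough illustration of what "opaque data with a tag" amounts to, here is a minimal numpy-only sketch. The tag format and the variable names are assumptions made for the example, not necessarily what h5py writes to the file.

```python
import numpy as np

datetime_array = np.array(
    ["2019-07-16T08:04:34", "2019-07-17T05:34:45"], dtype="datetime64[s]"
)

# What would be stored: uninterpreted bytes, plus a tag naming the numpy dtype.
# (Illustrative tag format only; the real on-disk tag may differ.)
tag = datetime_array.dtype.str
raw = datetime_array.tobytes()

# A reader that understands the tag can rebuild the original array...
restored = np.frombuffer(raw, dtype=np.dtype(tag))

# ...while a reader that does not sees only 8-byte blobs.
blobs = np.frombuffer(raw, dtype="V8")
```

This is why the data round-trips cleanly through numpy-aware code but is likely meaningless to other tools.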

If you use h5py to store and retrieve arbitrary numpy data, this is an easy win. But if you're writing data for other, non-Python tools to consume, opaque data probably won't mean anything to them. So I'd like to require some extra step to create these datasets, to remind the user that they are probably h5py-specific.

We have now passed through three designs for this concept:

1. Allow it with no restrictions, so you can create a dataset like f['data'] = datetime_array

2. Require the dtype to be globally registered before storing this kind of data: h5py.register_dtype(datetime_array.dtype); f['data'] = datetime_array

3. Require an opaque wrapper when creating the dataset: f.create_dataset('data', data=datetime_array, dtype=h5py.opaque_dtype(datetime_array.dtype))

I prefer 3 as an API - it avoids adding global mutable state (like 2), and I think it's clearer. But it's proving tricky to implement, and I'm starting to wonder if I'm swimming against the design of h5py. What do other people prefer? Do you have an idea for a better option?
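For reference, a sketch of what option 3 looks like in use. It assumes an h5py release that ships h5py.opaque_dtype (h5py 3.0 and later do; it was not yet released when this thread was written), and it uses an in-memory file with a made-up name so that nothing is written to disk.

```python
import numpy as np
import h5py

datetime_array = np.array(
    ["2019-07-16T08:04:34", "2019-07-17T05:34:45"], dtype="datetime64[s]"
)

# driver="core" with backing_store=False keeps the file purely in memory.
with h5py.File("opaque_demo.h5", "w", driver="core", backing_store=False) as f:
    # The explicit wrapper is the "extra step": it marks the data as opaque.
    wrapped = datetime_array.astype(h5py.opaque_dtype(datetime_array.dtype))
    f["data"] = wrapped

    # Reading back yields numpy datetimes again.
    read_back = f["data"][:]
    is_opaque = h5py.check_opaque_dtype(f["data"].dtype)
```

Without the wrapper, h5py refuses to store the array, which is the reminder to the user that the result is probably h5py-specific.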

For context, there are a few other numpy types h5py handles with no equivalent HDF5 type:

- boolean - makes an HDF5 enum with two values

- complex - makes an HDF5 compound type for (real, imaginary) parts

- Fixed-width UTF-32 unicode - explicitly refuses to handle it

This has now been discussed over two PRs:

https://github.com/h5py/h5py/pull/1232

https://github.com/h5py/h5py/pull/1262

Thanks,

Thomas

Gareth Williams

Jul 16, 2019, 5:56:20 PM

to h5...@googlegroups.com

What approach does xarray take? It might be good to be consistent, but you might need to handle more general cases.

Gareth

--
You received this message because you are subscribed to the Google Groups "h5py" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h5py+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/h5py/9e684bfe-a819-4664-a712-dec06d4dfe8f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Thomas Kluyver

Jul 17, 2019, 5:34:45 AM

to h5...@googlegroups.com

- Xarray doesn't write plain HDF5 files, but it can write netCDF4, which is a layer on top of HDF5. Datetimes get stored as int64, with some rather verbose attributes describing them (units: b'days since 2007-07-13 00:00:00', calendar: b'proleptic_gregorian'). I haven't looked into whether this is part of the netCDF4 format, or something specific to xarray.

- Pytables does not appear to support datetimes. Attempting to store them with f.create_array() got me "ValueError: unknown type: 'datetime64[D]'".

- pandas writes HDF5 files with a wrapper around pytables. Datetimes are stored as int64 with an attribute value_type: b'datetime64'. There are no explicit units or epoch in the file. Pandas always uses nanosecond resolution and the Unix epoch, so it can read them, but any other tool would either have to guess or recognise that the file was written by pandas.

This highlights the problem: there are sensible ways to represent datetimes which could be meaningful to other tools, but there isn't one obvious standard, especially for the units. And there's no common way to express the units and the epoch in metadata.
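The two integer encodings described above can be sketched with numpy alone. This is only an illustration of the idea: the attribute dictionary mirrors the netCDF-style attributes quoted above, the sample dates and variable names are made up, and nothing here is written to a file.

```python
import numpy as np

times = np.array(["2007-07-13", "2007-07-15", "2007-08-01"], dtype="datetime64[D]")

# netCDF-style: integer offsets, with the unit and epoch spelled out in attributes.
epoch = np.datetime64("2007-07-13", "D")
cf_values = (times - epoch).astype("int64")
cf_attrs = {
    "units": "days since 2007-07-13 00:00:00",
    "calendar": "proleptic_gregorian",
}

# pandas-style: nanoseconds since the Unix epoch; the unit is implied, not stored.
ns_values = times.astype("datetime64[ns]").astype("int64")

# Decoding requires knowing the unit and epoch in both cases.
decoded_cf = epoch + cf_values.astype("timedelta64[D]")
decoded_ns = ns_values.astype("datetime64[ns]").astype("datetime64[D]")
```

Both arrays are plain int64 that any HDF5 tool can read; the difference is whether the file itself says how to interpret them.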

Thomas


Gareth Williams

Jul 17, 2019, 6:36:19 AM

to h5...@googlegroups.com

Yes, it is netCDF that is setting and supporting the semantic standard. Actually that is not quite true... netCDF provides a particular technical structure on top of HDF to lay out metadata. Extra conventions exist to specify what metadata should be present, and in what form, for a clear semantic representation of data. Note the CF (Climate and Forecast) conventions in particular. Real-world time is tricky, hence the support for various units and reference times and calendars and ...

xarray logically maps very closely to netCDF, with Python structures including pandas for the arrays. You might look at (or ask) what the xarray authors do and what they assume if the user is not explicit.

Of course, if you follow their lead you end up reinventing part of netCDF, that is, keeping specific extra metadata in the HDF file and converting on load and store.

Gareth


Thomas Caswell

Jul 17, 2019, 9:19:58 AM

to h5py

Folks,

I spent some time talking with Anthony Scopatz about this in person at scipy last week. As a caveat, I have not read the source in either PR carefully yet.

I agree that we should not go with option 1 (automatic registration). Although I don't think it will break any current users (who are special casing this in application land), it could be surprising for future users and will paint us into a weird corner if in the future hdf5 _does_ get a standard dtype for datetime.

Option 2 looks pretty good (particularly the context manager version), but runs the risk of some downstream project doing the registration and now we are back to effectively option 1.
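A toy sketch of that risk, in plain Python with no h5py involved. Every name here (register_dtype, registered, can_store) is hypothetical and only mimics the shape of the proposed API.

```python
from contextlib import contextmanager

# Global mutable state: shared by every module in the process.
_registered = set()


def register_dtype(name):
    """Hypothetical global registration, as in option 2."""
    _registered.add(name)


@contextmanager
def registered(name):
    """Hypothetical context-manager variant: registration is scoped."""
    already = name in _registered
    _registered.add(name)
    try:
        yield
    finally:
        if not already:
            _registered.discard(name)


def can_store(name):
    return name in _registered


# Scoped use: the dtype is only accepted inside the block.
with registered("datetime64[s]"):
    inside = can_store("datetime64[s]")
after = can_store("datetime64[s]")

# A downstream library registering at import time affects everyone afterwards.
register_dtype("datetime64[s]")
leaked = can_store("datetime64[s]")
```

Once any imported package makes the global call, every other user of the process silently gets option 1 behaviour.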

Option 3 has its appeal in terms of locality and explicitness, but it is pretty verbose and breaks some of the "feels right"-ness of the API. On the other hand, if users want to control details of chunking / compression / extendability, they have to go to explicitly creating the dataset, and "do weird opaque python-only things for a dtype" arguably falls under the same umbrella. How would this work for compound types with datetime in it? This also has the advantage of clearly being an "at dataset creation time" decision.

We also talked about using attributes to track "this is a datetime" and I think using the opaque type is a better option.

I am between 2 and 3 (or both!). I have a very slight preference for 3 but would be :+1: on either.

Tom

PS thought I sent this email ~24hrs ago


--

Thomas Caswell
tcas...@gmail.com

Thomas Kluyver

Jul 17, 2019, 12:48:03 PM

to h5...@googlegroups.com

Thanks Thomas C,

Just to keep this thread up to date, Anthony mentioned on #1232 that you agreed this could be postponed to a later release so long as that can happen relatively soon. We're discussing there how soon and what the release would be called.

> How would this work for compound types with datetime in it?

My thinking was that you'd have to specify the compound dtype with the opaque wrapper around the datetime part of it. I realise this could be rather verbose, but it does make it explicit.

A couple of different variations on the idea:

A. Extra parameter to opaque-ify any unrecognised dtype: f.create_dataset(..., allow_opaque=True)

B. The same, but specific to datetimes: f.create_dataset(..., opaque_datetime=True)

Either of these could error if they were used with a dtype for which it's not relevant.

Thomas K


James Tocknell

Jul 21, 2019, 1:07:18 AM

to h5...@googlegroups.com

I created a branch (https://github.com/aragilar/h5py/tree/try-opaque;
it's a PoC rather than something that's PR-worthy) where I wrap up the
original numpy dtype inside a np.void dtype (i.e. do the same thing as
all the other "special dtypes"). To me this is probably the best option
in terms of writing, as it composes with other dtypes in the same way
as the other special dtypes. This doesn't solve the reading problem,
though. Would people be fine with having registration for reading, but
not writing? This registry could also be used for other type
conversions, e.g. https://github.com/h5py/h5py/issues/1258, and for
controlling the types read out.
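A numpy-only sketch of the general idea of carrying the original dtype inside a void dtype. This is not the code from the branch; the metadata key "wrapped_dtype" is a made-up name for illustration.

```python
import numpy as np

datetime_array = np.array(["2019-07-21T01:07:18"], dtype="datetime64[s]")

# A void dtype of the same item size, remembering the real dtype in its metadata.
# ("wrapped_dtype" is a hypothetical key, not something h5py defines.)
void_dt = np.dtype("V8", metadata={"wrapped_dtype": datetime_array.dtype})

# Writing side: reinterpret the same bytes as opaque 8-byte items.
as_void = datetime_array.view(void_dt)

# With the wrapper dtype in hand, the original values are recoverable.
original_dtype = void_dt.metadata["wrapped_dtype"]
restored = as_void.view(original_dtype)
```

The round trip works here only because the wrapper dtype is still in hand; a reader starting from the file has just the opaque type, which is the reading problem.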

James


--

Don't send me files in proprietary formats (.doc(x), .xls, .ppt etc.).
It isn't good enough for Tim Berners-Lee, and it isn't good enough for
me either. For more information visit
http://www.gnu.org/philosophy/no-word-attachments.html.

Truly great madness cannot be achieved without significant intelligence.
- Henrik Tikkanen

If you're not messing with your sanity, you're not having fun.
- James Tocknell

In theory, there is no difference between theory and practice; In
practice, there is.
