String datasets in h5py 3.0rc1

Ray Osborn

Oct 14, 2020, 8:04:26 PM
to h5py
The new release candidate appears to handle strings differently from v2.1.0. I was going to post a GitHub issue, but I thought I had better check that this wasn't a planned feature I didn't know about. With v3.0.0rc1, I get:

>>> import h5py as h5
>>> f=h5.File('text1.h5', 'a')
>>> f.create_dataset('text', data='some text')
<HDF5 dataset "text": shape (), type "|O">
>>> f['text'][()]
b'some text'

With v2.1.0, I get:

>>> import h5py as h5
>>> f=h5.File('text1.h5', 'a')
>>> f.create_dataset('text', data='some text')
<HDF5 dataset "text": shape (), type "|O">
>>> f['text'][()]
'some text'

Is there some reason why 3.0.0rc1 now returns a byte string rather than unicode? Is text no longer stored as a UTF-8 variable-length string by default?

Ray

Ray Osborn

Oct 14, 2020, 8:30:42 PM
to h5py
I've just come across https://github.com/h5py/h5py/issues/1338, so it appears to have been planned. Without having read the whole issue carefully, it looks as if this is going to be a major pain in the neck. Before the change, I would create a dataset with a string and I would get the string back when I read it. I thought that was the whole point of having a special dtype for variable length strings. Now I have to add an extra decode whenever I read it back.

I will try to spend some time reading the issue more carefully. 

Ray

arag...@gmail.com

Oct 15, 2020, 4:17:22 AM
to h5py
Hi Ray

There is documentation about the new way to handle strings at https://docs.h5py.org/en/latest/strings.html, which may make things clearer (you can access the different versions of the docs by clicking on the "v: stable" link on the bottom left of the page). For datasets where the encoding metadata in the HDF5 file matches the actual encoding of the strings, calling asstr() with no arguments gives you a correctly decoded str. When the metadata and the actual encoding do not match (which may happen when reading files not produced by h5py), asstr() accepts the same arguments as bytes.decode, so you can control how the data is decoded. This is similar to the older astype method, but is designed to make working with string datasets (especially those not encoded in ASCII) more robust.
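
As a rough illustration of the above, here is a minimal sketch assuming h5py 3.0 or later. The file name and the use of an in-memory file are arbitrary choices for the example, not something the docs require:

```python
import h5py

# In-memory file (core driver, no backing store), so nothing touches disk.
# The file name "asstr_demo.h5" is arbitrary.
with h5py.File("asstr_demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("text", data="some text")

    raw = ds[()]                    # bytes, as stored in the file
    text = ds.asstr()[()]           # str, decoded with the declared encoding

    # The arguments mirror bytes.decode: override the codec or error handling.
    as_latin1 = ds.asstr("latin-1")[()]
    lenient = ds.asstr(errors="replace")[()]
```

For plain ASCII content like this, all three decoded values are the same str. The codec and error-handling arguments only matter when the stored bytes do not match the encoding the file claims.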

James

Thomas Kluyver

Oct 15, 2020, 4:27:17 AM
to h5...@googlegroups.com
Hi Ray,

Indeed, it is deliberate. Sorry, I know the changes will break some stuff. It's mentioned in the release notes, but perhaps I should have mentioned this specifically in the announcement. There's a wrapper you can use to make it more convenient to get strings: f['text'].asstr()[()]

Why did we make these changes? We wanted to make string access more consistent, with less of a difference between strings tagged ASCII or UTF-8 in HDF5 files, and less of a difference between fixed-length and variable-length strings. Translating HDF5 ASCII & UTF-8 strings to different Python types kind of made sense in Python 2, where str & unicode were (awkwardly) interchangeable, but it's a weird artifact in Python 3, where bytes & str are much more distinct.

So why does it return bytes by default, rather than Python str? In short, because that's what HDF5 stores. There's no guarantee that a string is actually stored in the encoding it's tagged with, whether that's ASCII or UTF-8. It's easier to deal with such cases if h5py gives you the raw data, rather than decoding it incorrectly. For fixed-length strings, we can also read them more efficiently as numpy bytes arrays.
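
The mismatch case can be shown without h5py at all. The sketch below uses a made-up value: Latin-1 bytes of the kind another tool might write into a field tagged UTF-8. It only illustrates why raw bytes are easier to recover from than a failed automatic decode:

```python
# Hypothetical example value: Latin-1 bytes sitting in a field tagged UTF-8.
raw = "café".encode("latin-1")

try:
    raw.decode("utf-8")             # what automatic decoding would attempt
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False              # the tag was wrong, so decoding fails

# With the raw bytes in hand, the caller can pick the right codec,
# or decode leniently and carry on.
text = raw.decode("latin-1")
lenient = raw.decode("utf-8", errors="replace")
```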

I hope that makes it clearer - thanks for testing this!

Thomas

--
You received this message because you are subscribed to the Google Groups "h5py" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h5py+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/h5py/758e7ddc-45ce-4f63-8386-5e25def51f4dn%40googlegroups.com.

Thomas Kluyver

Oct 15, 2020, 5:06:24 AM
to h5...@googlegroups.com
If you need to read strings as str and support both h5py 3.0 and 2.x, you'll need something like this to bridge the difference:
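
A minimal sketch of such a bridge, offered as an illustration rather than the exact snippet from this message. The helper name read_str is hypothetical, and it assumes a scalar string dataset:

```python
def read_str(dataset, encoding="utf-8"):
    """Return a scalar string dataset as str on both h5py 2.x and 3.x.

    Hypothetical helper; assumes `dataset` holds a single string.
    """
    if hasattr(dataset, "asstr"):
        # h5py >= 3.0: let h5py decode using the dataset's declared encoding.
        return dataset.asstr()[()]
    # h5py 2.x: variable-length UTF-8 strings already come back as str,
    # while ASCII-tagged and fixed-length strings come back as bytes.
    value = dataset[()]
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

It would be called as read_str(f['text']), so the version check lives in one place instead of at every read.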


If you know that your strings are tagged UTF-8, it's slightly more verbose in the new version. But if you need to handle ASCII strings in the same way, it's much simpler now.

Thomas

Raymond Osborn

Oct 15, 2020, 8:42:15 AM
to 'Alan Robinson' via h5py
Hi everyone,
Thanks for all the quick replies. Ironically, I’m not much affected by this because my code has always been designed to read both unicode and byte strings, so I run whatever I read through an automatic converter, but I suspect that this will cause a lot of problems for others. I am thinking of libraries that have a large amount of embedded h5py code that needs to be backward-compatible.

Have you considered adding a keyword argument to h5py.File to revert to the old behavior when reading strings with h5.special_dtype(vlen=str)? Assuming the File constructor ignores unknown keywords, this would give everyone a backward-compatible solution that doesn’t require them to do a version check on every single read.

With regards,
Ray

Raymond Osborn

Oct 15, 2020, 9:02:52 AM
to 'Alan Robinson' via h5py
Since I only discovered the issue because it caused a unit test to fail, I’m not going to worry about this any more, but I would encourage you to highlight this in future announcements, as you suggested earlier. 

Thanks for keeping h5py running.

Ray

On Oct 15, 2020, at 7:54 AM, Thomas Kluyver <tak...@gmail.com> wrote:

> Assuming the File constructor ignores unknown keywords,

It doesn't. ;-)

I'm sure this change will break some code. But I don't think there's any easy way to make the change without breaking backwards compatibility at some point, so 3.0 seems like the time to do it. I initially made the case for leaving the string conversions alone, but when we discussed it in detail last year, I was won over to the idea of moving to a more consistent model, and I'm pretty happy with what we ended up with.

Thomas


Thomas Kluyver

Oct 15, 2020, 9:15:01 AM
to h5...@googlegroups.com
Thanks Ray. I'll make sure to mention it when 3.0 is released.

Thomas Kluyver

Oct 15, 2020, 9:27:08 AM
to h5...@googlegroups.com
I've made a PR to describe the incompatibility more clearly in the h5py 3.0 release notes:
