dtype for HDF5 strings


Ray Osborn

Oct 9, 2014, 2:37:41 PM
to h5...@googlegroups.com
I'm puzzled by the object dtype that is assigned to strings within h5py. Here's an example:

>>> import h5py as h5
>>> a = h5.File('test.h5', 'w')
>>> a['string'] = 'string'
>>> a['string'].dtype
dtype('O')

When I read in the string value, using a['string'][()], a Python string is returned, but if I needed to check the dtype before reading the value, then a value of 'O' is not particularly helpful. Is there a reason why a string dtype is not assigned?


Ray

Andrew Collette

Oct 9, 2014, 3:46:31 PM
to h5...@googlegroups.com
Hi Ray,

> When I read in the string value, using a['string'][()], a Python string is
> returned, but if I needed to check the dtype before reading the value, then
> a value of 'O' is not particularly helpful. Is there a reason why a string
> dtype is not assigned?

The "O" type is used in this case because the dataset was created
using a Python string. In contrast to NumPy "S" dtype strings, Python
strings are stored in HDF5 as "variable-length" strings, and
represented with the "O" type.

To get a dataset using the NumPy "S" string type, use the
create_dataset method explicitly, or just assign a numpy.string_:

>>> import numpy
>>> a['npstring'] = numpy.string_("Hello")
>>> a['npstring'][()].dtype
dtype('S5')
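
For checking the string type before reading any data, the two cases above can be told apart from the dataset's dtype alone. The following is only a minimal sketch: string_storage is a hypothetical helper written for this illustration (it is not part of h5py), the dataset names are made up, and the file is kept in memory with the "core" driver so nothing is written to disk. The fixed-length dataset is built from an explicit "S5" array because numpy.string_ was removed in NumPy 2.0 (numpy.bytes_ is the same type).

```python
import h5py
import numpy as np


def string_storage(dset):
    """Hypothetical helper: classify string storage from the dtype alone."""
    # Variable-length strings are object dtypes tagged by h5py;
    # check_dtype returns the tagged base type, or None if untagged.
    if h5py.check_dtype(vlen=dset.dtype) in (str, bytes):
        return "variable-length"
    # Fixed-length byte strings use the ordinary NumPy "S" kind.
    if dset.dtype.kind == "S":
        return "fixed-length"
    return "not a string"


# In-memory file: the name is only a label, nothing is saved.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f["vlen"] = "string"                                   # Python str
    f.create_dataset("fixed", data=np.array(b"Hello", dtype="S5"))
    f["number"] = 42                                       # non-string control
    vlen_dtype = f["vlen"].dtype
    fixed_dtype = f["fixed"].dtype
    kinds = {name: string_storage(f[name]) for name in ("vlen", "fixed", "number")}

print(vlen_dtype, fixed_dtype)
print(kinds)
```

The check never indexes the dataset, so no values are read from the file.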

Andrew

Ray Osborn

Oct 14, 2014, 3:10:22 PM
to h5...@googlegroups.com
Hi Andrew,
I'm interested in how HDF5 (or h5py) knows to return a string when returning the value of a dataset with dtype 'object'. I had a look at the h5py code, but I think it must be handled in the underlying HDF5 code. Presumably, something with dtype 'object' doesn't have to be a string.

Ray

Andrew Collette

Oct 14, 2014, 3:35:53 PM
to h5...@googlegroups.com
Hi Ray,

> I'm interested in how HDF5 (or h5py) knows to return a string when returning
> the value of a dataset with dtype 'object'. I had a look at the h5py code,
> but I think it must be handled in the underlying HDF5 code. Presumably,
> something with dtype 'object' doesn't have to be a string.

We attach a small amount of metadata to the dtype. Originally we used
field names, but a user contributed a patch to use the .metadata
field in the dtype. The process is entirely transparent, using
h5py.special_dtype and h5py.check_dtype:

http://docs.h5py.org/en/latest/special.html

This approach is used for variable-length strings & arrays, enums,
object references, etc., which don't have a precise NumPy equivalent.
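
As a rough illustration of the mechanism described above, the sketch below builds a tagged dtype with h5py.special_dtype and queries it with h5py.check_dtype. The variable names are made up for this example, and the exact contents of .metadata are an h5py implementation detail that may differ between versions, so it is only printed here, not relied on.

```python
import h5py
import numpy as np

# An object dtype tagged as "variable-length str".
tagged = h5py.special_dtype(vlen=str)

# An ordinary object dtype carries no such tag.
plain = np.dtype("O")

# Both look like dtype('O') to NumPy ...
print(tagged.kind, plain.kind)

# ... but check_dtype recovers the base type from the tag,
# and returns None when there is nothing attached.
tagged_base = h5py.check_dtype(vlen=tagged)
plain_base = h5py.check_dtype(vlen=plain)
print(tagged_base, plain_base)

# The tag lives in the dtype's metadata field.
print(tagged.metadata)
print(plain.metadata)
```

This is why a dataset's dtype can print as 'O' and still be recognized as a string type: the distinction is carried alongside the dtype rather than in its kind.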

Andrew

Ray Osborn

Oct 14, 2014, 4:30:19 PM
to h5...@googlegroups.com
I apologize. I completely missed the online documentation, which explains it all very clearly. I think I was fixated on finding the answer in the dataset page, and neglected to check the other pages before posting.

Ray