numpy array from MATLAB file string array


Angus

Oct 28, 2011, 12:25:24 PM
to h5py
Hi all,

I'm trying to read an array of strings from an HDF5-format MATLAB file,
into numpy. The field I want is an array of strings. H5py shows this
as an array of 'HDF5 object references'.

In[203]: d['field']
Out[203]: <HDF5 dataset "trans_threshold": shape (11, 1), type "|O8">

In[204]: d['field'].value
Out[204]:
array([[<HDF5 object reference>],
       [<HDF5 object reference>],
       [<HDF5 object reference>],
       [<HDF5 object reference>],
       [<HDF5 object reference>],
       [<HDF5 object reference>],
       [<HDF5 object reference>],
       [<HDF5 object reference>],
       [<HDF5 object reference>],
       [<HDF5 object reference>],
       [<HDF5 object reference>]], dtype=object)

If I try to dereference one of these (I haven't worked out how to
dereference them all at once yet), I'm getting the wrong type out.

In[207]: d[d['field'][0].item()]
Out[207]: <HDF5 dataset "4b": shape (1, 1), type "<u2">

In[208]: d[d['field'][0].item()].value
Out[208]: array([[45]], dtype=uint16)

Is there an easier way to get at the values, and how can I determine,
from the file, the correct typecast required to get my data?

Thanks,

Angus

Andrew Collette

Oct 31, 2011, 1:20:42 PM
to h5...@googlegroups.com
Hi Angus,

> I'm trying to read an array of strings from an HDF5-format MATLAB file,
> into numpy. The field I want is an array of strings. H5py shows this
> as an array of 'HDF5 object references'.

MATLAB does use some odd conventions for storing things. Could you
post a (simple) example file?

Andrew

Angus McMorland

Oct 31, 2011, 2:18:29 PM
to h5...@googlegroups.com

This file contains just my problem variable.

Thanks for taking a look.

Best,

Angus.
--
AJC McMorland
Post-doctoral research fellow
Neurobiology, University of Pittsburgh

trans_threshold.mat

Andrew Collette

Nov 7, 2011, 12:40:42 PM
to h5...@googlegroups.com
Hi,

>> MATLAB does use some odd conventions for storing things. Could you
>> post a (simple) example file?
>
> This file contains just my problem variable.

It looks like MATLAB is really abusing the HDF5 format here. The good
news is you're dereferencing the datasets correctly. It looks like
each element of trans_threshold is a reference to a dataset under the
group "/#refs#". Each dataset is a 1D collection of ints. By
inspection it looks like the ints are meant to be ASCII characters.
There's also a little bit of metadata in the form of attributes on
each "#refs#" dataset which I think is meant to indicate the character
set.

Given the (silly) complexity of what MATLAB is doing here, I don't
think there's a way to simplify your code that much. If you'll be
dealing with this kind of data a lot you might want to write a little
wrapper which recognizes the attributes, reads the datasets and spits
out a Python string.
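
For illustration, the decoding step described above might look roughly
like the sketch below. It is not code from the thread: the function name
matlab_chars_to_str is made up, and it assumes each referenced dataset
holds the characters of one string as uint16 code units (UTF-16, which
covers plain ASCII), stored as a single row or column.

```python
import numpy as np


def matlab_chars_to_str(codes):
    """Decode an array of uint16 character codes into a Python str.

    Assumption: `codes` holds one string as a single row or column of
    UTF-16 code units, e.g. the array read from a "#refs#" dataset.
    """
    # Force little-endian uint16 so the byte layout matches "utf-16-le".
    arr = np.asarray(codes, dtype="<u2")
    # Flatten (N, 1) or (1, N) to 1D, then decode the raw bytes.
    return arr.ravel().tobytes().decode("utf-16-le")


# The value from the original post: array([[45]], dtype=uint16)
print(matlab_chars_to_str(np.array([[45]], dtype=np.uint16)))  # prints "-"
```

A multi-row character matrix would need its axes handled explicitly, so
this sketch only fits the one-string-per-dataset layout seen in this file.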

HTH,
Andrew

Angus McMorland

Nov 9, 2011, 8:12:42 AM
to h5...@googlegroups.com
Hi Andrew and others,

Thanks very much for taking a look. I'll post here my wrapper routine,
when I get round to writing it, to convert these arrays of references
into the corresponding array of values.
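
No such routine was posted in this thread. As a hedged sketch only, a
wrapper along these lines could dereference every element at once. The
name deref_strings is hypothetical, and the MATLAB_class attribute check
is an assumption about how MATLAB tags character data; drop it if the
file's attributes look different. The sketch reads data with [()] rather
than .value, since .value was removed in h5py 3.0.

```python
import numpy as np


def deref_strings(f, ref_dataset):
    """Return a list of Python strings, one per object reference.

    f:           an open h5py.File (any mapping indexable by the
                 stored references works too)
    ref_dataset: dataset or array whose elements are object references
    """
    out = []
    # Read all references into memory and walk them in order.
    for ref in np.asarray(ref_dataset[()]).ravel():
        target = f[ref]
        # Assumption: character data carries MATLAB_class == "char".
        cls = getattr(target, "attrs", {}).get("MATLAB_class")
        if cls is not None and cls not in (b"char", "char"):
            raise TypeError(f"expected char data, found {cls!r}")
        # Character codes are read as little-endian uint16 (UTF-16).
        codes = np.asarray(target[()], dtype="<u2").ravel()
        out.append(codes.tobytes().decode("utf-16-le"))
    return out


# Usage with a real file (not run here):
#   import h5py
#   with h5py.File("trans_threshold.mat", "r") as f:
#       strings = deref_strings(f, f["trans_threshold"])
```

Wrapping the result in np.array(strings) gives a numpy string array if
that is the form needed downstream.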

Angus


faizan shaikh

Feb 13, 2016, 12:25:08 PM
to h5py
Did you write the routine?

Ziang Yan

Aug 3, 2016, 10:09:05 PM
to h5py
This works for me; it saved my day.