Reading variable-length byte datasets


David Eklund

Aug 26, 2011, 2:36:33 PM
to h5py
Hello,

I understand that h5py does not support variable-length data types
other than strings. I have a dataset with a variable-length 8-bit
unsigned character column that I need to read somehow, and I'm
wondering about the best way to do it. Is there any kind of workaround
at the h5py level (including the low-level interface), or is my only
option to write a C function and wrap it in Python?

Thanks,
David

Gareth Williams

Aug 29, 2011, 5:26:43 AM
to h5py
Hi David,

I think I can help. HDF5 is mostly for storing (variable-length)
arrays of uniformly sized data. It may help to think of your data as a
1-d array of 8-bit unsigned characters rather than as a variable-length
string.

If your problem is more complicated then you might need to post more
detail to get a useful response.

Gareth

David Eklund

Aug 29, 2011, 3:49:59 PM
to h5py
Thanks Gareth. The reason for the variable-length columns is to
accommodate our data source, a hardware device that emits packets of
data. We have scripts that decode those packets into something
meaningful (and fixed-width), but we would like to store the raw data
in HDF as well, and in general we have to accept heterogeneous packet
lengths. This has worked well for us so far; in particular, with the
latest release of HDFView we can now open and view raw packet data as
a debugging tool. The one hiccup is that we're trying to incorporate
Python more into our data processing, and with h5py we can't get it to
read the raw data. I was hoping someone might have an idea that would
still let us get at the data through the h5py API, even if it's just
one row at a time; otherwise I'll probably have to write something in
C to return the data in a readable form.

David
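The workflow above (heterogeneous packets, read back one row at a time) can be handled without any variable-length type by storing one flat uint8 array plus an offset index. This is only an illustrative layout sketched with NumPy, not the actual format used on this list, and all names and byte values below are made up:

```python
import numpy as np

# Hypothetical packets of differing lengths, as raw bytes.
packets = [b"\x01\x02\x03", b"\xff", b"\x10\x20\x30\x40"]

# Concatenate into a single 1-d uint8 array, plus start offsets.
# Both arrays have fixed dtypes, so each could be written as an
# ordinary (non-variable-length) HDF5 dataset.
flat = np.frombuffer(b"".join(packets), dtype=np.uint8)
offsets = np.cumsum([0] + [len(p) for p in packets])

def read_packet(flat, offsets, i):
    # Recover packet i ("one row at a time") from the flat buffer.
    return flat[offsets[i]:offsets[i + 1]].tobytes()
```

Writing `flat` and `offsets` as two ordinary `create_dataset` calls would then keep the raw packets readable from h5py without touching the variable-length machinery.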

Gareth Williams

Aug 30, 2011, 7:55:38 AM
to h5...@googlegroups.com
Hi David,

In the script I've recently been writing, I take bytes (well, a string) that are the whole content of a file (it can be a binary file) and save them to an array of uint8 in HDF5 with code like:

    ds = myfile.create_dataset(name,
             data=numpy.fromstring(content, numpy.uint8))

and can extract the data with:

    content = myfile[name][()].tostring()

I can do this in chunks if the file is big, but this describes the simplest case.
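On newer NumPy versions the same round trip is spelled with `frombuffer` and `tobytes` (the `fromstring`/`tostring` names used above were later deprecated); a minimal sketch, assuming only NumPy:

```python
import numpy as np

content = b"\x00\x01binary file contents"

# bytes -> 1-d uint8 array (this is what gets handed to create_dataset)
arr = np.frombuffer(content, dtype=np.uint8)

# 1-d uint8 array -> bytes (what .tostring() used to return)
roundtrip = arr.tobytes()
```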

I can also extract the 'raw' data to 'binfile' with h5dump like so:

    h5dump -d name -b -o binfile file.h5

This seems to match your use case well! At the same time you can decode your data and save it to separate data structures or attributes.

Gareth



David Eklund

Sep 7, 2011, 5:06:45 PM
to h5...@googlegroups.com
Gareth,

Thanks for the suggestion. For now, too many other systems rely on the existing dataset format for me to change it, so I went ahead and wrote some C code to do the heavy lifting (it wasn't as bad as I thought it might be). In the future I'll definitely consider alternative approaches like yours to make things easier on the Python side.

Thanks,
David