Storing a large number of images in an HDF5 file


rodrigob

Jul 26, 2011, 5:00:06 AM
to h5...@googlegroups.com
Dear h5py community,

now that my blocking issue has been solved, it is time to put h5py into action.

My usage scenario is as follows:
- Store groups/datasets with ~10,000 images, each of ~1 MB (in some cases the images will not all have the same dimensions)
- Store per-image metadata in separate datasets (the idea is to have dataset-specific columns plus one reference to the original image)

So the question is: how do you suggest I store the images?

The "natural way" of doing it would be to follow the HDF5 image specification and store one dataset per image.
However, this related post seems to indicate that HDF5 performance breaks down after ~1,000 images.

I could also store all the images in one single dataset, where each entry would be an encoded binary string (e.g. a PNG file).
Based on the mentioned post this would be a more scalable approach, with the additional benefit of image-specific compression.
This, however, feels less "in the HDF5 spirit".
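A minimal sketch of that second option, using a variable-length uint8 dataset so each entry can hold one encoded file of arbitrary size (the file name and byte strings below are toy stand-ins, not real PNG data):

```python
import numpy as np
import h5py

# Stand-ins for PNG-encoded files; real code would read these bytes from disk.
blobs = [b"\x89PNG-fake-image-1", b"\x89PNG-fake-but-longer-image-2"]

with h5py.File("png_images.h5", "w") as f:
    # Each entry of the dataset is a variable-length array of uint8,
    # so differently sized encoded files fit in one dataset.
    dt = h5py.special_dtype(vlen=np.dtype("uint8"))
    dset = f.create_dataset("png_images", (len(blobs),), dtype=dt)
    for i, blob in enumerate(blobs):
        dset[i] = np.frombuffer(blob, dtype=np.uint8)

with h5py.File("png_images.h5", "r") as f:
    restored = f["png_images"][1].tobytes()  # decode with PIL/imageio afterwards
```

The stored bytes round-trip exactly, so any image decoder can be applied after reading.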

What is your suggestion?
Does any of you have similar experience?
Where could I find information on this topic?
Is there any benchmark that analyses the performance of such scenarios?

Thanks for your comments and suggestions.
Regards,
rodrigob.


Paul Anton Letnes

Jul 26, 2011, 5:38:52 AM
to h5...@googlegroups.com
Hi rodrigob!

What about using a single dataset with an added dimension? All of your images are probably of the same size, say, j * k. Then, if you have N pictures, make a dataset with shape (j, k, N). The final index will be the index of the image. Store a "metadata" dataset (a regular dataset in h5py speak) of length N with some sort of information about the images - say, text strings, floating point numbers to describe a parameter, or similar. Then, the third index in the image dataset and the index of the "metadata" dataset are the link between the metadata and the image.

This is a commonly used trick in programming, too - for instance, I regularly store a large number of 4x4 matrices in an Nx4x4 array. This should work for HDF5 datasets, too.
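A minimal sketch of this layout (toy sizes; the "metadata" entries here are just fixed-length byte strings standing in for whatever per-image information you need):

```python
import numpy as np
import h5py

# Toy numbers: N images, each j x k pixels, stacked along the last axis.
N, j, k = 5, 4, 6
images = np.arange(j * k * N, dtype="float32").reshape(j, k, N)
names = np.array([("image-%d" % i).encode() for i in range(N)])

with h5py.File("stacked.h5", "w") as f:
    f.create_dataset("images", data=images)
    # Entry i of "metadata" describes images[:, :, i].
    f.create_dataset("metadata", data=names)

with h5py.File("stacked.h5", "r") as f:
    img2 = f["images"][:, :, 2]   # image number 2
    meta2 = f["metadata"][2]      # its metadata, same index
```

The shared index is the only link between the two datasets, which is exactly why this breaks down when images differ in size.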

Cheers,
Paul.

> --
> You received this message because you are subscribed to the Google Groups "h5py" group.
> To view this discussion on the web visit https://groups.google.com/d/msg/h5py/-/2BI0hMuNuygJ.
> To post to this group, send email to h5...@googlegroups.com.
> To unsubscribe from this group, send email to h5py+uns...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/h5py?hl=en.

rodrigob

Jul 26, 2011, 6:18:16 AM
to h5...@googlegroups.com
Yes, indeed this would work, but your suggestion relies on two assumptions that do not hold in my application scenario:

1) Images are all the same size: as indicated in my original message, some datasets will not have images of the same size. For images of the same size, your proposal seems reasonable. What about data compression? Any idea of what I can expect using "raw bytes inside HDF5" versus using PNG encoding?
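For the compression question, HDF5's own gzip filter on a chunked dataset is one point of comparison; a rough sketch follows (the chunk shape and compression level here are arbitrary choices, and whether gzip on raw pixels beats PNG depends entirely on the image content, so measuring on real data is the only reliable answer):

```python
import numpy as np
import h5py

img = (np.arange(256 * 256) % 251).astype("uint8").reshape(256, 256)

with h5py.File("compress_test.h5", "w") as f:
    # gzip is HDF5's built-in lossless filter; it requires a chunked dataset.
    f.create_dataset("img_gzip", data=img, chunks=(64, 64),
                     compression="gzip", compression_opts=4)
    f.create_dataset("img_raw", data=img)   # uncompressed baseline

with h5py.File("compress_test.h5", "r") as f:
    roundtrip = f["img_gzip"][:]            # decompression is transparent
```

Comparing the file size against a directory of PNGs of the same images gives the actual answer for a given dataset.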

2) There is one row of metadata per image: an example of the data I am referring to as "meta" is bounding boxes. I want to store a number of boxes that indicate objects in the image. The number of boxes per image is unknown, but around ~10. Since the number is variable and unbounded, the indexes of the two datasets would be "out of sync".
Here it is unclear what is best: I could either store an array of variable-length arrays (keeping the indexes synchronized), or I could use multiple entries per image and have one column keeping references (more RDBMS style).

I am more inclined towards the second option since it is more "format agnostic" (no matter how I store the images, I can always keep a reference to them).

But since I have no real experience with h5py (or HDF5 in general), I would much appreciate your opinions on these matters.

Regards,
rodrigob.

Xialei Liu

Jan 24, 2016, 11:31:50 AM
to h5py

Hey, did you solve the problem of differently sized input images using HDF5? If so, please tell me how you did it. Thanks.
On Tuesday, July 26, 2011 at 11:00:06 AM UTC+2, rodrigob wrote:

Vijetha Reddy

Feb 21, 2016, 9:27:26 PM
to h5py
Hi all,

I am trying to do a similar thing; however, I am stuck at reading the data from the file. It has been quite slow (taking around 10 minutes to read 20 GB of data).

I have image data as a numpy array of dimension (22500, 3, 224, 224). I stored all of it using the commands below:

h5f = h5py.File('data.h5', 'w')
h5f.create_dataset('data' , data=myNumpyArray)
h5f.close()

At a later time, I try to load the data using these commands:

h5f = h5py.File('data.h5', 'r')
myNumpyArray = h5f['data'][:]
h5f.close()

However, loading the data into the numpy array takes a lot of time (more than 10 minutes, sometimes even 15).
I am very new here and don't know if I am doing the correct thing. Can someone throw some light here?
(@rodrigob could you help?)
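(One approach commonly suggested for this situation: chunk the dataset when writing and read it in batches rather than with a single `[:]`, which materialises everything at once. A sketch with toy dimensions standing in for the real (22500, 3, 224, 224) array; the chunk shape is an arbitrary choice to tune:)

```python
import numpy as np
import h5py

data = np.zeros((100, 3, 8, 8), dtype="float32")  # toy stand-in array

with h5py.File("big.h5", "w") as f:
    # Chunking along the first axis lets HDF5 fetch a batch of images
    # without touching the rest of the file.
    f.create_dataset("data", data=data, chunks=(10, 3, 8, 8))

with h5py.File("big.h5", "r") as f:
    batch = f["data"][0:10]   # reads only the chunks covering these rows
```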

Thank you, 
Vijetha. 

chang Robben

Apr 18, 2016, 4:33:11 AM
to h5py
Hi Vijetha,
    Did you solve your problem? If you did, can you help me?

    Thank you.
    chang

On Monday, February 22, 2016 at 10:27:26 AM UTC+8, Vijetha Reddy wrote:

Rodrigo Benenson

Apr 18, 2016, 5:39:48 AM
to h5...@googlegroups.com
You might want to look into Fuel
https://github.com/mila-udem/fuel

which also provides an HDF5 backend.

regards,
rodrigob.

Poornachandra Sandur

Jun 15, 2016, 12:36:59 PM
to h5py
Hi Rodrigob,
Were you able to create the datasets using Fuel? If so, please enlighten me on how to do it.

Poornachandra Sandur

Jan 11, 2017, 11:12:07 AM
to h5py
Hi Vijetha,
In your mail you wrote "I have image data as numpy array of dimension (22500,3,224,224)". I want to know what 22500, 3, 224, 224 stand for. I am a newbie to this; please give me some helpful pointers.

Ishrat Badami

Feb 6, 2017, 8:12:09 AM
to h5py
Hi Poornachandra,

Here is an example of Fuel converter code for ILSVRC 2010 (the ImageNet dataset).

Hope it helps.

Best
Ishrat

Poornachandra Sandur

Feb 6, 2017, 8:41:20 AM
to h5...@googlegroups.com
Thank you very much Ishrat





--
warm regards,
Poornachandra S
+91-9901913049