[BUG] Wrong values from first read of Dataset, correct on second read

Claudio Cimarelli

Sep 24, 2020, 9:54:28 AM

to h5py

Dear all,

I am using this library to load image data into a PyTorch Dataset class. When the custom Dataset class is initialized, I open an ".hdf5" file from disk that contains all the images saved as NumPy arrays, and I keep a reference to it as an instance variable.

At training/test time, when I index the dataset to retrieve an image, the values of the image array differ from what they should be. I verified this by reading the image file directly with Pillow (Image) and converting it to an ndarray. Stranger still, if I index the h5py dataset a second time, I get the expected array.

Does anyone have an idea why this might be happening?

Thanks a lot for your help and I hope that the problem description is clear enough.

Valentyn Stadnytskyi

Sep 24, 2020, 10:01:18 AM

to h5...@googlegroups.com

Claudio,

Can you send more information about how you read the file? When you say it is different, how does that manifest? How different is the image? It is possible that you saved the data as one datatype and are reading it as another (int16 vs. float32). I don’t remember for sure, but Pillow might be converting your image to an 8-bit format, since that is the more standard way to visualize images. Keep in mind that arrays have a datatype, and your HDF5 dataset may have its own datatype specified. For example, if your image (NumPy array) is int16 and the dataset in your HDF5 file has an 8-bit datatype, the values will be converted when they are written to the file.

I would need more information but I have outlined some potential things to check first.
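As an illustration of the checks suggested above, here is a minimal sketch of a helper that compares two image arrays by dtype, shape, and value. The name compare_arrays and the layout of its report are hypothetical and not part of h5py, Pillow, or NumPy; it assumes both inputs are NumPy arrays holding numeric pixel data.

```python
import numpy as np


def compare_arrays(a, b):
    """Summarize how two arrays differ (hypothetical helper for debugging)."""
    report = {
        "same_dtype": bool(a.dtype == b.dtype),
        "same_shape": a.shape == b.shape,
    }
    if report["same_shape"]:
        # Widen to float64 before subtracting so that uint8 values
        # do not wrap around when the difference is negative.
        diff = np.abs(a.astype(np.float64) - b.astype(np.float64))
        report["num_different"] = int(np.count_nonzero(diff))
        report["max_abs_diff"] = float(diff.max()) if diff.size else 0.0
    return report
```

Calling it on the array read from the HDF5 dataset and the array obtained through Pillow would show whether the mismatch is a dtype issue or a genuine difference in values, and how many elements are affected.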

Good luck,

Valentyn

--
You received this message because you are subscribed to the Google Groups "h5py" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h5py+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/h5py/3975a929-80e1-4cd7-8347-78dc81a380e9n%40googlegroups.com.

Claudio Cimarelli

Sep 25, 2020, 5:28:00 AM

to h5py

Dear Valentyn,

Thanks for your reply and suggestions. I create the dataset by passing a NumPy array of type uint8 that contains all the images, so the dtype of the Dataset is inferred as '|u1'. I have also tried storing it as float64 and converting afterward, when I index a specific image from the Dataset. To make the context clearer, here are some lines of code:

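# Excerpt; assumes `import numpy as np` and `from PIL import Image`
path = self.images_paths[idx]
image2 = Image.open(path)
img_arr2 = np.asarray(image2)  # reference array read directly from the image file

img_arr = self.dataset[idx, ...]  # first read from the h5py Dataset
np.array_equal(img_arr, img_arr2)  # returns False many times

img_arr = self.dataset[idx, ...]  # second read of the same index
np.array_equal(img_arr, img_arr2)  # returns True (rarely False)

image = Image.fromarray(img_arr)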

As you can see, I first open the image directly with PIL and convert it to an array (uint8). Then I read the same image from the Dataset and, before converting it to an Image, compare the two. The problem is not the type but small divergences between the values. Roughly, there is an average difference of 50-150 px in pixel values per image (in the cases where they are not equal).

I actually found the problem just today. It is related to multiprocess data loading, which is enabled by setting num_workers > 0 on the PyTorch DataLoader (torch.utils.data.DataLoader). With num_workers=0, which disables it, there is no problem. Do you have any idea how to fix this?

Thomas Kluyver

Sep 25, 2020, 5:55:32 AM

to h5...@googlegroups.com

What version of HDF5 do you have? Check h5py.version.hdf5_version.

HDF5 used to have problems if you forked the process with the file open, and then tried to use it on both sides of the fork. This is meant to be improved in HDF5 1.10.5 and above, if your system provides the pread/pwrite functions. More details here: https://forum.hdfgroup.org/t/hdf5-and-parallelism-with-fork-2/5225
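To illustrate why pread/pwrite matter here, the following sketch uses only the Python standard library and does not involve HDF5 at all. It shows that a file descriptor inherited through fork shares its file offset with the parent, so a seek in one process moves the position seen by the other, whereas os.pread reads at an explicit offset and leaves the shared position alone. The function names are hypothetical, and the code assumes a POSIX system where os.fork and os.pread are available.

```python
import os
import tempfile


def offset_seen_by_parent_after_child_seek():
    """Return the parent's file offset after a forked child seeks on the same fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.lseek(fd, 0, os.SEEK_SET)
        pid = os.fork()
        if pid == 0:
            # Child: move the offset of the shared open file, then exit
            # immediately without running the parent's cleanup code.
            os.lseek(fd, 7, os.SEEK_SET)
            os._exit(0)
        os.waitpid(pid, 0)
        # Parent: ask where the offset is now, without moving it.
        return os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.close(fd)
        os.remove(path)


def pread_leaves_offset_untouched():
    """Read at an absolute position and return the data and the resulting offset."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.pread(fd, 3, 7)  # 3 bytes starting at absolute offset 7
        return data, os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.close(fd)
        os.remove(path)
```

With separate seek and read calls, two processes sharing a descriptor can interleave so that one reads from the position the other just set, which would be consistent with occasionally wrong values on the first read.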

Thomas

Claudio Cimarelli

Sep 25, 2020, 6:13:48 AM

to h5py

Solved: once I understood that it was caused by reading the dataset from multiple processes, I managed to find a related issue. Here is the simple solution to the problem: https://github.com/pytorch/pytorch/issues/11929#issuecomment-649760983

TL;DR: the h5py file must not be opened in the __init__ of the PyTorch Dataset class, but at the first __getitem__ call.
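A minimal sketch of this lazy-open pattern is shown below; it is not the code from the linked comment. The class name LazyH5Dataset, the dataset_name argument, and the open_fn hook are all hypothetical. In real use the class would presumably subclass torch.utils.data.Dataset; the base class is left out here so that the sketch runs without PyTorch, and the default opener imports h5py only when it is actually called.

```python
def _open_with_h5py(path):
    # Imported inside the function so the sketch can be loaded without h5py.
    import h5py
    return h5py.File(path, "r")


class LazyH5Dataset:
    """Map-style dataset that opens its HDF5 file on first access.

    Hypothetical sketch; names and arguments are illustrative only.
    """

    def __init__(self, path, dataset_name, open_fn=None):
        self.path = path
        self.dataset_name = dataset_name
        self._open_fn = open_fn or _open_with_h5py
        self._file = None  # no long-lived handle is created in __init__
        # Read the length through a short-lived handle that is closed
        # again before any worker process can be started.
        f = self._open_fn(self.path)
        try:
            self._length = len(f[self.dataset_name])
        finally:
            f.close()

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        # The first call in each process opens the file; with a DataLoader
        # using num_workers > 0, that happens inside each worker.
        if self._file is None:
            self._file = self._open_fn(self.path)
        return self._file[self.dataset_name][idx]
```

Because no handle exists when the worker processes are created, each worker ends up with its own handle and nothing is shared across the fork.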

Thanks everyone

For Thomas: the version is 2.10.0

Thomas Kluyver

Sep 25, 2020, 6:16:30 AM

to h5...@googlegroups.com

Thanks! 2.10 is your h5py version, though. I'm asking about the HDF5 version, which will start with 1.

Thomas

Claudio Cimarelli

Sep 25, 2020, 8:14:30 AM

to h5py

Sorry, I misread it. That is 1.10.5.

Thomas Kluyver

Sep 25, 2020, 8:25:08 AM

to h5...@googlegroups.com

OK, that implies that either it was compiled without pread support, or there's another similar bug which using pread/pwrite didn't fix.
