Problem with hdf5 data input

Hongtao Yang

May 16, 2016, 10:45:32 PM
to Caffe Users
Hi All,

I am currently using HDF5 as my data input, but I don't really understand how Caffe reads my HDF5 files.

Let's say I have 200 HDF5 files, each containing 100 data samples. If I set my training batch size to 64, how will Caffe loop over all 20,000 data samples across the 200 files?

My guess is that Caffe will load one HDF5 file, then randomly select 64 data samples as a batch. After updating the network, it will select another 64 samples, since not all samples in the file were covered by the previous batch. Once all data samples in the current file have been fed to the network, it will load the next file, and this goes on until the end of training. Is that correct?

Thanks for any help!
Hongtao


Ilya Zhenin

May 17, 2016, 4:31:15 AM
to Caffe Users

The HDF5 data layer has a parameter "shuffle", whose description states that if you set it to true, the order of your HDF5 data files will be shuffled, as will the order of the data within each of these files. So I believe the files are read in sequence, and the order of that sequence can be random.
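
For reference, the layer definition looks roughly like this (a sketch; "train_h5_list.txt" is just an example name for the text file that lists your .h5 file paths, one per line):

layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  hdf5_data_param {
    source: "train_h5_list.txt"   # text file listing the HDF5 file paths
    batch_size: 64
    shuffle: true                 # shuffle the file order and the within-file order
  }
}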

On Tuesday, May 17, 2016 at 5:45:32 UTC+3, Hongtao Yang wrote:

Hongtao Yang

May 17, 2016, 5:37:01 AM
to Caffe Users
Thanks Ilya for your kind reply. I'm not really concerned about 'shuffle' or 'randomness', because I already do the shuffling when I create the HDF5 files, so whatever order the files are read in, the samples are always random. What I really don't understand is how Caffe ensures that all my data samples are fed to the network. Let me explain using an example:

I have 200 HDF5 files: 'dataset_1.hd', 'dataset_2.hd', ..., 'dataset_200.hd'.
Each of those files contains 100 data samples, i.e. the data dimension is (100, channel, width, height).
The batch size for training is 64.
The total number of iterations is 50,000.

Training begins. In the first iteration, Caffe will choose one of the 200 HDF5 files (let's say 'dataset_1.hd' for convenience) and select 64 data samples as a batch (whether they are selected randomly or in order does not matter). But when the second iteration begins, will Caffe choose another 64 data samples from another file? Or will it select another 64 samples from the same file, because the network hasn't seen the remaining 36 data samples from 'dataset_1.hd' yet?
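
For reference, each of my files is written roughly like this (a minimal sketch using h5py; the shapes and file name are just placeholders for my real data):

import h5py
import numpy as np

# 100 samples per file, with dummy channel/spatial sizes for illustration
data = np.random.rand(100, 3, 32, 32).astype(np.float32)
labels = np.random.randint(0, 10, size=(100,)).astype(np.float32)

perm = np.random.permutation(100)                 # shuffle once at creation time
with h5py.File('dataset_1.hd', 'w') as f:
    f.create_dataset('data', data=data[perm])
    f.create_dataset('label', data=labels[perm])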

Thanks,
Hongtao

Eli Gibson

May 17, 2016, 9:06:00 AM
to Caffe Users
Hi Hongtao, 
When shuffle is false, Caffe will load the HDF5 files in the order listed in the source file. Within each file, it will pick samples in order, in groups of 64 (your batch size). After all samples in the current file are used, it will move on to the next file.
When shuffle is true, Caffe will shuffle the order of the HDF5 files before going through the list each time. That is, it will go through the whole list in one random order, then go through the whole list again in a different random order, and so on. Within each file, it will pick samples in a random order; it will still use every sample exactly once, but the order will be random.

Because of this implementation, Caffe will use every sample once in one pass through the data, then use every sample again in a second pass, and so on.
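
In pseudocode the reading order is roughly the following (a simplified Python sketch of the behaviour described above, not the actual C++ implementation; num_rows() is a placeholder for reading a file's sample count; note that a batch can span the boundary between two consecutive files):

import numpy as np

def hdf5_batches(file_list, batch_size, shuffle):
    # Yield batches of (file_name, row_index) pairs in the order
    # the HDF5Data layer would visit them.
    batch = []
    while True:                                 # one outer loop per pass over the data
        files = list(file_list)
        if shuffle:
            np.random.shuffle(files)            # new file order on every pass
        for fname in files:
            rows = np.arange(num_rows(fname))   # num_rows() is a placeholder
            if shuffle:
                np.random.shuffle(rows)         # new within-file order each time the file is loaded
            for r in rows:
                batch.append((fname, r))
                if len(batch) == batch_size:
                    yield batch                 # a batch may contain rows from two consecutive files
                    batch = []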

Eli