hdf5, training example order, batch size, and shuffling

Sancho

Oct 23, 2014, 2:09:05 PM
to caffe...@googlegroups.com
I'm using HDF5 as input, and the training examples are stored grouped by label (all of label 0 first, then all of label 1, then all of label 2, and so on).

As I understand it, there is currently no way to shuffle HDF5 input, so I should shuffle the examples before writing them to the HDF5 training file. Is this correct? I'm planning to address this.
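
For concreteness, something like this is what I have in mind (a rough h5py sketch; the arrays, shapes, and dataset names are just placeholders for my setup):

import h5py
import numpy as np

# Placeholder arrays standing in for my label-grouped data:
# 3 labels x 500 examples each, stored label 0 first, then 1, then 2.
X = np.random.randn(1500, 3, 32, 32).astype(np.float32)
y = np.repeat(np.arange(3), 500).astype(np.float32)

# Shuffle once before writing, since the HDF5 input is read sequentially.
perm = np.random.permutation(len(y))
X, y = X[perm], y[perm]

with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=X)    # names assumed to match the layer's top blobs
    f.create_dataset('label', data=y)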

However, given my current setup, am I correct that training suffers when the batch size is smaller than one of the label groups? That is, if each label has 500 examples, but my batch size is 250, does the network overfit to each label in succession, never really getting a stable gradient that is representative of the entire training set?

Is the optimal batch size from an optimization perspective (ignoring performance considerations) equal to the size of the training set?

Would a shuffle option for the hdf5 input layer be useful? Or would this not be efficient given that we're reading from a serial file?

Sancho

Oct 23, 2014, 2:17:28 PM
to caffe...@googlegroups.com
Another question. If I can set the batch size to the entire training set, does that obviate the need to shuffle?

Jason Yosinski

Oct 23, 2014, 2:29:05 PM
to Sancho, caffe...@googlegroups.com
> As I understand it, there is no way to shuffle hdf5 input right now, so I
> should shuffle it before storing it into the hdf5 training file. Is this
> correct? I'm planning on addressing this.

I believe this is correct.

> However, given my current setup, am I correct that training suffers when the
> batch size is smaller than one of the label groups? That is, if each label
> has 500 examples, but my batch size is 250, does the network overfit to each
> label in succession, never really getting a stable gradient that is
> representative of the entire training set?

Right.

> Is the optimal batch size from an optimization perspective (ignoring
> performance considerations) equal to the size of the training set?

From a theoretical standpoint, the easiest case to analyze is using the
entire training set at once. But empirically, people have found that
training on mini-batches can sometimes work better when the
mini-batches are small (much smaller than the training set) rather
than large (a better approximation of the whole training set). One can
give hand-wavy reasons for why this is the case -- e.g. "adding a
little noise during training helps the optimizer escape local minima"
-- but to my knowledge this phenomenon has not yet been explained
conclusively.

> Would a shuffle option for the hdf5 input layer be useful? Or would this not
> be efficient given that we're reading from a serial file?

Yes, but it might be slow, as you mention. An alternative would be to
read a single random contiguous slice (or N random slices, for small N)
from the dataset, which keeps the reads mostly sequential.
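
Something along these lines, roughly (an h5py sketch; the file name, dataset names, and slice length are placeholders):

import h5py
import numpy as np

slice_len = 250  # examples pulled per read; placeholder value

with h5py.File('train.h5', 'r') as f:
    n = f['data'].shape[0]
    start = np.random.randint(0, n - slice_len + 1)
    # One contiguous slice per read keeps disk access mostly sequential,
    # while the random starting offset still mixes up which examples you see.
    data = f['data'][start:start + slice_len]
    labels = f['label'][start:start + slice_len]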

> Another question. If I can set the batch size to the entire training set,
> does that obviate the need to shuffle?

It would. But see the point above about small mini-batches being desirable.

cheers,
jason


---------------------------
Jason Yosinski, Cornell Computer Science Ph.D. student
http://yosinski.com/ +1.719.440.1357

Guillaume Chevalier

Dec 25, 2015, 4:07:28 PM
to Caffe Users, san...@gmail.com
So, if I understand correctly: even if a single loaded HDF5 file contains 50,000 pre-shuffled images and I set the batch_size to 100, then all 50,000 images will be trained on iteratively in mini-batches of 100 images, so the weights will be updated every 100 images. In this scenario, the solver won't loop over just the first 100 images, but will go through the entire dataset.
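
Just to make the arithmetic behind my assumption concrete (not a claim about what the solver actually does):

num_images = 50000
batch_size = 100
updates_per_pass = num_images // batch_size  # 500 weight updates per full pass over the file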

Is that right?