Actual behavior of 'iter_size' parameter


thecro...@gmail.com

Aug 25, 2015, 6:51:33 PM
to Caffe Users
I tried to use the 'iter_size' parameter while finetuning VGGNet. I'm just renaming the last layer (fc8) and training it for my specific problem.

I would like Caffe to compute the gradients using a batch size of 128. However, 4 GB of GPU RAM is not much for VGGNet, so I want to set a small batch_size and rely on iter_size instead. Say, batch_size = 8 and iter_size = 16.

I started the training process and saw the loss decreasing (noisily) at every iteration. But how can the parameters be updated if the gradient will only be computed at the 16th iteration? What am I missing?

Evan Shelhamer

Aug 25, 2015, 6:59:29 PM
to thecro...@gmail.com, Caffe Users
A weight update / iteration is done for batch_size * iter_size inputs at a time. Each reported solver iteration accumulates gradients over `iter_size` calls to forward + backward. See the solver code for the details.
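To picture it, here is a rough sketch of what one reported solver iteration does. This is a paraphrase, not the actual Caffe source; the stub type and values are only for illustration.

    // Rough sketch, not verbatim Caffe code: one reported solver iteration
    // with gradient accumulation over iter_size forward/backward passes.
    #include <cstdio>

    struct NetStub {                      // stand-in for caffe::Net, illustration only
      void ClearParamDiffs() {}           // zero every parameter's gradient buffer
      float ForwardBackward() {           // run one minibatch of batch_size examples;
        return 0.5f;                      // gradients are *added* into the diffs
      }
    };

    void SolverIteration(NetStub& net, int iter_size) {
      net.ClearParamDiffs();              // done once per solver iteration
      float loss = 0;
      for (int i = 0; i < iter_size; ++i)
        loss += net.ForwardBackward();    // accumulate loss and gradients
      loss /= iter_size;                  // this averaged loss is what the log shows
      std::printf("iteration loss = %f\n", loss);
      // The diffs are then scaled by 1 / iter_size and the weight update is applied,
      // so the parameters change once per batch_size * iter_size inputs.
    }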


Evan Shelhamer




Flavio Ferrara

Aug 26, 2015, 10:19:56 AM
to Evan Shelhamer, Caffe Users
Evan, thanks for the answer and your amazing work with Caffe!

I think I got it. Just to clarify:

- a solver iteration is equivalent to a weight update

- with 'iter_size' > 1, net_->ForwardBackward() is called several times.
Each call loads 'batch_size' examples, computes the loss and gradients, and then discards the examples, so it needs less memory than a single larger 'batch_size'.
The losses (and gradients) are accumulated and then averaged by 'iter_size'.

So the loss reported for each solver iteration effectively covers 'iter_size' * 'batch_size' examples.
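To convince myself, here is a tiny standalone check (plain C++, nothing to do with Caffe's internals) that averaging the iter_size minibatch gradients matches one big-batch gradient, assuming the loss is a plain mean over the examples:

    #include <cstdio>
    #include <vector>

    int main() {
      const int batch_size = 8, iter_size = 16;

      // Per-example gradients of some scalar parameter (arbitrary numbers).
      std::vector<double> g(batch_size * iter_size);
      for (size_t i = 0; i < g.size(); ++i) g[i] = 0.01 * i - 0.3;

      // One big batch of 128 examples: a single mean over all of them.
      double big = 0;
      for (double gi : g) big += gi;
      big /= g.size();

      // iter_size accumulation: mean within each minibatch of 8,
      // then mean over the 16 minibatches.
      double accum = 0;
      for (int k = 0; k < iter_size; ++k) {
        double mini = 0;
        for (int i = 0; i < batch_size; ++i) mini += g[k * batch_size + i];
        accum += mini / batch_size;
      }
      accum /= iter_size;

      std::printf("big batch: %f  accumulated: %f\n", big, accum);  // identical
      return 0;
    }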

Etienne Perot

Jan 11, 2016, 12:59:55 PM
to Caffe Users, thecro...@gmail.com
Hello there!

Is this taken into account by the batch normalization layer? (I suppose there is little chance that the BN layer gives the same gradient for multiple small passes as for one giant pass?)

Yifan Wang

Feb 22, 2016, 7:27:30 AM
to Caffe Users, thecro...@gmail.com
Hey Evan,

I'm not quite convinced.
From what I see, although ForwardBackward is called iter_size times, the accumulated loss is only used for display.

Does that mean the solver takes only the last of the iter_size ForwardBackward computations in a solver step into account?
Or does the network accumulate the gradients and loss internally? (Although I don't see this in the actual layer implementations.)

Yifan

Evan Shelhamer

Feb 22, 2016, 5:12:27 PM
to Yifan Wang, Caffe Users, thecro...@gmail.com
The loss and gradients are accumulated. The gradient accumulation implementation is correct, has unit tests, and has been used effectively in research and practice.

Please read the code for more details:

1. The solver calls the Net's ForwardBackward() `iter_size` times: https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp#L222-L224

2. Computing the gradient is done with accumulation (note that the gradient is not reset to zero between these calls).
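In other words (a sketch only, with names that mimic Caffe's but are not the actual source): every learnable parameter has a single diff buffer that each backward pass adds into, and that buffer is cleared only once at the start of a solver iteration.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct ParamBlob {
      std::vector<float> data;   // the weights
      std::vector<float> diff;   // the running gradient sum
    };

    // Called once per solver iteration, before the iter_size loop.
    void ClearParamDiffs(std::vector<ParamBlob>& params) {
      for (ParamBlob& p : params)
        std::fill(p.diff.begin(), p.diff.end(), 0.0f);
    }

    // Called from each backward pass: note +=, not =, so repeated
    // ForwardBackward() calls accumulate rather than overwrite.
    void AccumulateGradient(ParamBlob& p, const std::vector<float>& minibatch_grad) {
      for (std::size_t i = 0; i < p.diff.size(); ++i)
        p.diff[i] += minibatch_grad[i];
    }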



Evan Shelhamer





McCaffe

Feb 9, 2017, 9:56:50 AM
to Caffe Users, yifanw...@gmail.com, thecro...@gmail.com
Do you keep all the gradients of all layers (which would consume a lot of memory) or only the ones needed for the weight update?
Or is the memory consumption the same as without iter_size > 1, because you just add up the gradients of all weights multiplied by 1 / iter_size?


I can only see that ForwardBackward is called `iter_size` times in a loop before `ApplyUpdate()` is called, which presumably updates the weights:

    Dtype loss = 0;
    for (int i = 0; i < param_.iter_size(); ++i) {
      loss += net_->ForwardBackward();
    }

The latter seems to apply an averaged weight update inside its normalization function.
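If I read it right, that normalization amounts to something like this sketch (a paraphrase, not the verbatim source): each parameter's accumulated diff is scaled once by 1 / iter_size before the update is applied, so only one diff buffer per parameter is ever needed.

    #include <vector>

    // diffs: one accumulated gradient buffer per learnable parameter.
    void NormalizeAccumulatedDiffs(std::vector<std::vector<float> >& diffs,
                                   int iter_size) {
      const float accum_normalization = 1.0f / iter_size;
      for (std::vector<float>& diff : diffs)
        for (float& d : diff)
          d *= accum_normalization;
    }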


Evan Shelhamer

Feb 9, 2017, 9:45:16 PM
to McCaffe, Caffe Users, Yifan Wang, Eric Draven
On Thu, Feb 9, 2017 at 6:56 AM, 'McCaffe' via Caffe Users <caffe...@googlegroups.com> wrote:
Or is the memory consumption the same as without iter_size > 1, because you just add up the gradients of all weights multiplied by 1 / iter_size?

That's right.

Evan Shelhamer




Ali Aliev

Sep 21, 2017, 7:06:28 AM
to Caffe Users
If you still wonder how calling ForwardBackward() multiple times accumulates the gradients, the clue is in the semantics of the gemm function that is used to compute the parameter gradients. If gemm is called with beta = 1, the result of the matrix multiplication is added to the matrix C, which already holds the current value of the gradient.
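For concreteness, generic BLAS gemm computes C = alpha * op(A) * op(B) + beta * C, so a call shaped roughly like the one below (illustrative names and dimensions, not a verbatim Caffe call) adds the new minibatch's gradient on top of whatever the diff buffer already holds:

    #include <cblas.h>

    // weight_diff (N x K) already holds the gradient accumulated so far;
    // beta = 1 makes gemm add this minibatch's contribution on top of it.
    void accumulate_weight_gradient(const float* top_diff,     // M x N
                                    const float* bottom_data,  // M x K
                                    float* weight_diff,        // N x K
                                    int M, int N, int K) {
      cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                  N, K, M,
                  1.0f, top_diff, N,
                  bottom_data, K,
                  1.0f /* beta = 1: accumulate */, weight_diff, K);
    }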

Gaurav Yengera

Mar 5, 2018, 5:31:40 PM
to Caffe Users
Hello Evan,

I am trying to optimize a CNN-LSTM network on very long sequences such that iter_size * batch_size = sequence length. When using iter_size with the LSTM network, are the gradients accumulated correctly through time when batch_size is smaller than the sequence length? It seems to me that the gradients for backpropagation through time are computed over batch_size-length sequences and then averaged over iter_size such sequences.