In some of the examples, like the CIFAR demo, learning is done in a pseudo-batch fashion, iterating over the mini batch one sample at a time:
-- reset gradients and the accumulated loss
gradParameters:zero()
local f = 0
-- evaluate function for complete mini batch
for i = 1,#inputs do
   -- estimate f
   local output = model:forward(inputs[i])
   local err = criterion:forward(output, targets[i])
   f = f + err
   -- estimate df/dW
   local df_do = criterion:backward(output, targets[i])
   model:backward(inputs[i], df_do)
   -- update confusion
   confusion:add(output, targets[i])
end
-- normalize gradients and f(X)
gradParameters:div(#inputs)
f = f/#inputs
However, most of the above can be done in a truly batched fashion, instead of iterating over each sample, as follows:
-- evaluate function for complete mini batch
-- (here inputs and targets are batched tensors rather than tables of samples)
local outputs = model:forward(inputs)
local f = criterion:forward(outputs, targets)
-- estimate df/dW
local df_do = criterion:backward(outputs, targets)
model:backward(inputs, df_do)
-- update confusion
confusion:batchAdd(outputs, targets)
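For context, here is a minimal sketch of how such batch tensors could be assembled; the trainData and shuffle names are my own placeholders, not part of the demo above:

-- hypothetical sketch: pack CIFAR-sized samples into batch tensors so the
-- whole mini batch goes through forward/backward in a single call
local batchSize = 32
local inputs = torch.Tensor(batchSize, 3, 32, 32)  -- batch of 3x32x32 images
local targets = torch.Tensor(batchSize)
for i = 1, batchSize do
   inputs[i]:copy(trainData.data[shuffle[i]])   -- assumed dataset layout
   targets[i] = trainData.labels[shuffle[i]]
end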
My question is whether we still need to normalize the gradients:
-- normalize gradients and f(X)
gradParameters:div(#inputs)
f = f/#inputs
Criterion modules usually have a sizeAverage field, so they will provide the averaged value of f, but what about gradParameters? Do we just scale the learning rate according to the batch size to compensate?
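For reference, this is the sizeAverage behavior I mean, using nn.MSECriterion as an arbitrary example:

require 'nn'

-- most criteria average by default
local criterion = nn.MSECriterion()
print(criterion.sizeAverage)  -- true

-- with sizeAverage enabled, both the loss from forward and the gradInput
-- from backward are normalized (by batch size or element count, depending
-- on the criterion), so the gradients fed into model:backward are already
-- scaled down; setting it to false yields summed loss/gradients instead
criterion.sizeAverage = false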
Additionally, what would be the difference between the following two calls?
model:backward(inputs, df_do)
model:backward(inputs, df_do, 1/batchsize)
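As far as I can tell from nn/Module.lua, the optional third argument is a scale factor that is forwarded to accGradParameters, roughly along these lines (paraphrased sketch, not verbatim):

-- the scale argument multiplies the parameter gradients accumulated into
-- gradParameters, but leaves the gradInput propagated backwards untouched
function Module:backward(input, gradOutput, scale)
   scale = scale or 1
   self:updateGradInput(input, gradOutput)
   self:accGradParameters(input, gradOutput, scale)
   return self.gradInput
end

So, if I understand correctly, model:backward(inputs, df_do, 1/batchsize) would accumulate parameter gradients already divided by the batch size, which for a single accumulation looks equivalent to calling gradParameters:div(batchsize) afterwards, except that the intermediate gradInputs are not scaled.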