correct procedure for fine-tuning a pretrained model


Arulkumar

unread,
Jan 20, 2016, 1:21:18 PM1/20/16
to torch7
I would like to clarify whether I am following the correct procedure to fine-tune a pretrained model.

I follow the steps below.
  1. Load the model
  2. Reset all the fully connected weights (call the :reset() method on the nn.Linear modules)
  3. Keep the learning rate the same for all layers (the same as it was during the original training)
  4. Train the model
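For reference, steps 1 and 2 might look like this in Torch (a minimal sketch; the model file name is illustrative):

```lua
require 'nn'

-- Step 1: load the pretrained model (path is illustrative)
local model = torch.load('pretrained_model.t7')

-- Step 2: re-initialize every fully connected layer.
-- findModules walks the container hierarchy and returns all nn.Linear modules.
for _, linear in ipairs(model:findModules('nn.Linear')) do
   linear:reset()  -- re-randomizes weights and biases
end
```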

Are there any other things I need to take care of when fine-tuning?

Specifically, do I need to modify any optim parameters, such as learning rates, momentum, or weight decay?

Could you please clarify?


Arulkumar

unread,
Jan 20, 2016, 1:42:20 PM1/20/16
to torch7
Also, I have looked into this post:

https://groups.google.com/forum/#!topic/torch7/nuecUMknfSg

I am not clear on whether I should allow backpropagation through all the layers, or stop backpropagation at the fully connected layers.

In the CNN literature, I have read about two types of fine-tuning:

1. Do not backpropagate the gradient to the initial conv layers, as the low-level features they learn will stay (mostly) the same for similar datasets. (credits: http://cs231n.github.io/transfer-learning/)

(or) equivalently, set the learning rate of the initial conv layers to 0

2. Set a lower learning rate for the initial conv layers and a higher learning rate for the linear classifiers (fully connected layers). This will make sure that the fully connected layers learn and adapt more than the initial conv layers.

Are these equivalent? (or) Do I need to try both to find out which one works better?
Could you please share your experience?

Vislab

unread,
Jan 20, 2016, 5:46:22 PM1/20/16
to torch7
Depending on your model's depth (number of convolution layers), the first two layers can be left out when backpropagating, because you don't gain much from updating them: they already encode a very general view of lines and edges. I don't recall the right function name to disable backpropagation in a module, but when I find it out I'll update this post.

For training the classifier, I've tried the following strategies, and the results don't vary that much (this heavily depends on the dataset you train on, ofc):
1. Reshape/replace the last linear layer to match the number of desired outputs, OR
2. Replace the entire classifier section with a new one (initialized with random weights)

About the learning rates: updating the convolution layers provides some extra accuracy, which is why a small learning rate is used to fine-tune the feature maps to the new dataset. You can:
a) Set LR=1e-4 for the conv features and LR=1e-2 for the classifier features, and keep reducing the learning rate when the error converges or doesn't decrease for X epochs, OR
b) In a lazy way, set a global LR=1e-3 and reduce the learning rate over time (I like this way :))
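The "reduce when the error stops improving" schedule can be sketched like this (framework-agnostic Lua; the constants are illustrative):

```lua
-- Halve the learning rate whenever the validation error has not
-- improved for `patience` consecutive epochs.
local optimState = {learningRate = 1e-3, momentum = 0.9}
local best, patience, wait = math.huge, 5, 0

local function afterEpoch(valError)
   if valError < best then
      best, wait = valError, 0         -- improvement: reset the counter
   else
      wait = wait + 1
      if wait >= patience then
         optimState.learningRate = optimState.learningRate * 0.5
         wait = 0
      end
   end
end
```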

The point I'm trying to make is: test some strategies on the dataset you are trying to train on and see which gives the best result for you. Start with the simplest one.

Vislab

unread,
Jan 20, 2016, 6:14:20 PM1/20/16
to torch7
m.updateGradInput = function(self,i,o) end -- stops the gradient from flowing to earlier layers
m.accGradParameters = function(self,i,o) end -- freezes this module's parameters

These should freeze the parameters. Props to massa for the info.
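For example, to freeze the first two modules of a sequential `model` with this trick (the indices are illustrative; inspect your model first):

```lua
-- Only safe for the earliest layers: overriding updateGradInput also
-- stops the gradient from flowing to anything below the frozen module.
local function freeze(m)
   m.updateGradInput = function(self, i, o) end   -- no gradient w.r.t. input
   m.accGradParameters = function(self, i, o) end -- no parameter gradients
end

for idx = 1, 2 do
   freeze(model:get(idx))
end
```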


On Wednesday, January 20, 2016 at 18:21:18 UTC, Arulkumar wrote:

Arulkumar

unread,
Jan 20, 2016, 8:59:18 PM1/20/16
to torch7
Thanks for the response.

I am using SGD for optimization. Does it still hold good to modify m.accGradParameters?

I ask because I had a look at the optim/sgd.lua code, and it seems the parameters are updated directly rather than by calling accGradParameters(). I am not sure about the control flow of how this parameter update happens. Can you clarify SGD vs. accGradParameters(), if possible?

I will try your suggestions and post the results again.

Francisco Vitor Suzano Massa

unread,
Jan 21, 2016, 5:24:08 AM1/21/16
to torch7
You can still use SGD even after modifying accGradParameters.
What will happen is that the gradients for the modules you changed will be 0, so there will be no update of those parameters.
There is one caveat, though: if you are doing SGD with weight decay (L2 regularization), then even if your gradients are 0 you will still change the weights, because the decay term is computed from the current weights; see https://github.com/torch/optim/blob/master/sgd.lua#L48

If you really don't want to change some layers of the network, check https://gist.github.com/szagoruyko/1e994e713fce4a41773e#gistcomment-1583826

Adam Tow

unread,
Feb 23, 2016, 4:04:52 AM2/23/16
to torch7
Hi, 

If I want to fix the weights in a certain layer of the network (i.e., set that layer's learning rate to 0), do I need to overload both updateGradInput and accGradParameters to achieve this?

Overloading accGradParameters definitely keeps the weights in that specific layer fixed, but I get a different set of weights in the other layers depending on whether or not I also overload updateGradInput.

Vislab

unread,
Feb 23, 2016, 5:48:28 AM2/23/16
to torch7
Weight decay will still be an issue even if you overload those functions. The only working solution for me is either to run the sgd optimizer separately per module, or to define learningRates/weightDecays vectors and set the entries to 0 for the layers you want to freeze.
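A sketch of the learningRates-vector option, assuming optim.sgd and a flattened parameter vector (`nFrozen` is a hypothetical count of the leading parameters to freeze; compute it from your model's layout):

```lua
require 'optim'

local params, gradParams = model:getParameters()

-- Per-parameter rate multipliers: 1 everywhere, 0 for frozen parameters.
local lrs = torch.ones(params:size(1))
lrs:narrow(1, 1, nFrozen):zero()

local optimState = {
   learningRate  = 1e-3,
   learningRates = lrs,  -- optim.sgd scales each parameter's update by this
   weightDecay   = 5e-4, -- the decay term is scaled by lrs too, so the
                         -- frozen parameters are not decayed either
}

-- inside the training loop, with feval returning loss and gradParams:
optim.sgd(feval, params, optimState)
```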

Adam Tow

unread,
Feb 29, 2016, 5:38:47 AM2/29/16
to torch7
You say above to "Set a LR=1e-4 to the conv features and a LR=1e-2 to the classifier features", i.e. to set different learning rates for different layers of the network. Is this possible in Torch?

Thank you for your time. 

Vislab

unread,
Feb 29, 2016, 6:09:30 AM2/29/16
to torch7
Everything is possible with torch! :)

There are two ways for you to set up different learning rates/weight decays in Torch. You can either iteratively update each module of your network with a different optimState via optim.sgd (or another optimization method), or you can define a vector over all the weights and toggle different rates per layer, or even for a single weight. Both should be relatively easy to set up, although the second one requires more memory, because you have to store one or two extra tensors for the learning rates and/or weight decays.
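A sketch of the first approach (module indices and constants are illustrative; the model is assumed to be split into a conv section and a classifier section):

```lua
require 'optim'

-- Flatten each section's parameters ONCE, outside the training loop
-- (and do not also call getParameters on the whole model).
local convParams, convGrads = model:get(1):getParameters() -- conv section
local fcParams,   fcGrads   = model:get(2):getParameters() -- classifier

local convState = {learningRate = 1e-4, momentum = 0.9}
local fcState   = {learningRate = 1e-2, momentum = 0.9}

-- inside the training loop, after model:forward and model:backward
-- have filled the gradients:
optim.sgd(function() return loss, convGrads end, convParams, convState)
optim.sgd(function() return loss, fcGrads end,   fcParams,   fcState)
```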

Adam Tow

unread,
Feb 29, 2016, 7:13:46 AM2/29/16
to torch7
Sorry if this is obvious, but I'm still not sure where exactly I can set a matrix of learning rates when using nn without optim. From what you said, I thought that calling mlp:updateParameters(learningRate) with a tensor of learning rates sized to match the number of parameters in the network would work, but it doesn't. I haven't been able to find anything in the nn documentation about setting the learning rate for individual layers; I can only see where it is used for the entire network. Now that I think about it, though, I could just do the parameter update myself instead of using mlp:updateParameters.

Mata Fu

unread,
Apr 10, 2018, 12:29:17 PM4/10/18
to torch7
For me, the network is quite deep, so since I just want to fine-tune the fc layers, I chose to freeze the conv layers to make training faster. But I found that setting the conv modules to evaluate mode does not make the procedure faster. Does anyone know how to make training faster in this case?

On Wednesday, January 20, 2016 at 11:46:22 PM UTC+1, Vislab wrote: