set learning_rate_factor to 0 and the weights are still changed


Yi Liu
Sep 2, 2019, 5:48:27 AM
to kaldi-help
Hi Dan,

I'm confused by the behavior of learning-rate-factor. I'm doing something like transfer learning with a pre-trained model. I want to freeze the model's weights, so I do something like:

steps/nnet3/xconfig_to_configs.py --existing-model $pretrain_mdl \
    --xconfig-file $dir/configs/network.xconfig \
    --config-dir $dir/configs/

$train_cmd $dir/log/generate_input_mdl.log \
    nnet3-copy --edits="set-learning-rate-factor name=* learning-rate-factor=0" $pretrain_mdl - \| \
    nnet3-init --srand=1 - $dir/configs/final.config $dir/input.raw || exit 1;

to generate input.raw as the initial model. Then I use steps/nnet3/train_raw_dnn.py to train the model.

However, I found that the weights of the pre-trained components in final.raw are very different from those in input.raw. I checked input.raw and 0.raw as well; their weights are identical to each other (and to the pre-trained model). I'm really confused: why do the weights still change significantly even though I set the learning-rate factor to 0? Is there anything I missed?

An example of the weights:

0.raw:
<ComponentName> tdnn1.affine <NaturalGradientAffineComponent> <LearningRateFactor> 0 <MaxChange> 0.75 <LearningRate> 0.0012 <LinearParams>  [
  2.568433 0.7693889 -0.2381797 0.1525945 -2.246516 0.4466618 1.102006 -0.1432067 -0.5903091 0.8782812 0.1022643 0.08371142 -0.02404941 0.1363059 -0.2316905 -0.2228654 -0.1153994 0.04606071 -0.6212884 0.0971213 -0.4538765 -0.2163398 -0.1846409 0.5193074 1.637326 -0.5269247 0.2706615 -0.4929326 -0.8492494 0.2991662 -0.06685083 -0.484638 -0.07633615 0.04551812 -0.07778752 0.06136333 -0.04509452 0.1947115 -0.1971418 0.2260058 0.2135421 0.1612404 0.01028283 -0.1597385 0.00885525 -0.1910186 -2.443084 1.568953 -2.43982 0.1278329 3.425874 -0.5619374 -2.047221 1.148477 0.7843929 -0.8205558 -0.1938612 0.09076887 0.4777972 -0.6429375 0.2837325 0.4889021 -0.4847438 -0.2476536 0.6346303 0.02845256 -0.6005114 -0.1370015 -0.2807251 -6.516876 1.262585 -1.125957 -2.160191 3.790924 1.487685 -2.502697 -1.167388 2.776567 -0.1657089 -1.498531 -0.3129631 1.025429 -0.1579675 -0.5479382 0.5111938 0.2005295 -0.0936794 0.02654545 0.2907091 0.2999536 -0.07730728 -0.6610407 -7.990298 -1.967343 2.860101 1.211731 -1.953299 0.6001937 1.771805 -1.756328 -0.6340543 1.667708 -0.01905632 -0.6712452 -0.4922445 1.067875 0.1867559 -0.4521364 0.1057445 0.3543701 -0.09882455 0.2526414 0.7321079 0.4256617 -0.6819053
...

 
final.raw:
<ComponentName> tdnn1.affine <NaturalGradientAffineComponent> <LearningRateFactor> 0 <MaxChange> 0.75 <LearningRate> 0 <LinearParams>  [
  0.02261571 0.006774665 -0.002097233 0.001343634 -0.01978116 0.003932972 0.009703459 -0.001260972 -0.005197822 0.007733486 0.0009004634 0.0007371005 -0.0002117614 0.001200207 -0.002040094 -0.001962385 -0.001016122 0.0004055761 -0.005470607 0.0008551774 -0.003996501 -0.001904927 -0.001625811 0.004572635 0.01441706 -0.004639709 0.002383247 -0.0043404 -0.007477856 0.002634235 -0.0005886379 -0.004267363 -0.0006721598 0.0004007989 -0.0006849389 0.0005403198 -0.0003970687 0.001714486 -0.001735885 0.001990038 0.001880291 0.001419765 9.054293e-05 -0.00140654 7.797277e-05 -0.001681966 -0.02151195 0.01381505 -0.02148324 0.001125602 0.03016569 -0.004948 -0.01802629 0.01011263 0.006906783 -0.007225201 -0.001706998 0.0007992426 0.004207127 -0.00566123 0.002498338 0.004304907 -0.00426829 -0.002180652 0.00558808 0.0002505321 -0.005287655 -0.001206334 -0.002471855 -0.05738275 0.01111738 -0.009914332 -0.01902101 0.03338008 0.01309947 -0.02203688 -0.01027915 0.02444839 -0.001459108 -0.01319495 -0.002755718 0.009029168 -0.001390943 -0.004824736 0.004501191 0.001765712 -0.0008248709 0.0002337395 0.002559767 0.002641168 -0.0006807104 -0.005820629 -0.07035661 -0.01732296 0.02518389 0.0106696 -0.01719929 0.005284857 0.01560119 -0.01546491 -0.005583003 0.0146846 -0.0001677957 -0.005910477 -0.004334336 0.009402923 0.001644435 -0.00398118 0.0009311081 0.003120321 -0.0008701761 0.002224572 0.006446391 0.003748061 -0.006004354
...


I found another thread discussing this problem, but it used a chain model and I don't think it is related to my case.

Thank you so much.


Daniel Povey
Sep 2, 2019, 8:37:03 AM
to kaldi-help
Make sure you aren't using any kind of model-shrinkage option on the command line of train.py.
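A minimal sketch of what to check, assuming your recipe passes Kaldi's proportional-shrink option (the exact flag your script uses may differ); setting it to 0 disables the per-iteration scaling of the weights, which is what was overriding the frozen learning rate:

```shell
# Sketch: disable model shrinkage when launching training.
# --trainer.optimization.proportional-shrink is assumed to be the
# shrinkage option in use; check your own invocation and recipe defaults.
steps/nnet3/train_raw_dnn.py \
    --trainer.optimization.proportional-shrink 0 \
    ...   # remaining options unchanged
```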


Yi Liu
Sep 2, 2019, 11:41:37 AM
to kaldi-help
Ah, I do use shrinkage.

Thanks, Dan! 

I also found that even if the model weights can be fixed (by disabling shrinkage), the batchnorm components will be updated anyway, since there are no flags to control the updating of their statistics. I checked my model: the shrinkage scales the weights, and the mean and variance of the batchnorm components also become smaller to match the new weights. So I think that after training the new model, the activations of the pre-trained part cannot be exactly the same as those of the original model (due to the batchnorm layers). Is that correct?

Daniel Povey
Sep 2, 2019, 8:04:01 PM
to kaldi-help
Hm, yeah, it won't be exactly the same, I think. If you run nnet3-am-copy with --prepare-for-test=true before freezing the weights, it should merge the batchnorms with the affine components (in most cases, although with TDNN-F layers there may be ones it can't do that to; you'll have to check). That would remove the batchnorm part of the picture.
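A sketch of that step, assuming an acoustic model (for a raw nnet3 model, nnet3-copy should accept the same option; the file names here are illustrative):

```shell
# Sketch: fold batchnorm statistics into the adjacent affine components
# so the frozen part of the network has no trainable-in-effect batchnorm.
nnet3-am-copy --prepare-for-test=true $pretrain_mdl $dir/prepared.mdl
```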
Also, in place of shrinkage, in more modern recipes we normally use the l2-regularize options on the layers in the xconfig.  If done that way, the shrinkage wouldn't be applied in layers that have the learning rate set to zero.
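For illustration, an l2-regularize option can be attached per layer in the xconfig; the layer name, dimension, and constant below are hypothetical and would need tuning:

```
# Hypothetical xconfig line; l2-regularize replaces global shrinkage
# and is not applied to layers whose learning rate is zero.
relu-batchnorm-layer name=tdnn1 dim=512 l2-regularize=0.004
```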

Dan


Yi Liu
Sep 2, 2019, 9:23:29 PM
to kaldi-help
Yeah, that would freeze the batchnorm layers. I'm not sure whether it is necessary to do that; it may need some experiments to validate the effect.

Also, I'm doing x-vector training, where shrinkage is the default option; I think l2-regularization would be the better choice.

Daniel Povey
Sep 2, 2019, 11:16:21 PM
to kaldi-help
Yeah, shrinkage is deprecated and all recipes should be moving toward l2, although it may take a little tuning to find the right constant.

