Hi, everyone. My neural network structure is as follows:
input dim=43 name=input
# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-1,0,1) affine-transform-file=$dir/configs/lda.mat
# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-layer name=asr_tdnn1 dim=625
tdnnf-layer name=asr_tdnnf2 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=1
tdnnf-layer name=asr_tdnnf3 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=1
tdnnf-layer name=asr_tdnnf4 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=3
tdnnf-layer name=asr_tdnnf5 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=3
tdnnf-layer name=asr_tdnnf6 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=3
relu-batchnorm-layer name=asv_tdnn1 input=Append(Offset(input,-2),Offset(input,-1),input,Offset(input,1),Offset(input,2)) dim=512
relu-batchnorm-layer name=asv_tdnn2 input=Append(Offset(asv_tdnn1,-2),asv_tdnn1,Offset(asv_tdnn1,2)) dim=512
relu-batchnorm-layer name=asv_tdnn3 input=Append(Offset(asv_tdnn2,-2),asv_tdnn2,Offset(asv_tdnn2,2)) dim=512
relu-batchnorm-layer name=asv_tdnn4 input=Append(Offset(asv_tdnn3,-3),asv_tdnn3,Offset(asv_tdnn3,3)) dim=512
relu-batchnorm-layer name=asv_tdnn5 input=Append(Offset(asv_tdnn4,-3),asv_tdnn4,Offset(asv_tdnn4,3)) dim=512
relu-batchnorm-layer name=asv_tdnn6 input=asv_tdnn5 dim=512
relu-batchnorm-layer name=combine input=Append(asr_tdnnf6,asv_tdnn6) dim=625
tdnnf-layer name=asr_tdnnf7 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=0
tdnnf-layer name=asr_tdnnf8 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=0
## adding the layers for chain branch
relu-batchnorm-layer name=prefinal-chain input=asr_tdnnf8 dim=625 target-rms=0.5
output-layer name=output include-log-softmax=false dim=$num_targets max-change=1.5
# adding the layers for xent branch
# This block prints the configs for a separate output that will be
# trained with a cross-entropy objective in the 'chain' models... this
# has the effect of regularizing the hidden parts of the model. we use
# 0.5 / args.xent_regularize as the learning rate factor; the factor of
# 0.5 / args.xent_regularize is suitable as it means the xent
# final-layer learns at a rate independent of the regularization
# constant; and the 0.5 was tuned so as to make the relative progress
# similar in the xent and regular final layers.
relu-batchnorm-layer name=prefinal-xent input=asr_tdnnf8 dim=625 target-rms=0.5
output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5
I tried to freeze the asv_tdnn* components, so I converted each asv_tdnn*.affine to a fixed-affine component and set each asv_tdnn*.batchnorm to test mode.
But when I run the training script, I get an error. The command and log are below.
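For completeness, the conversion step looked roughly like this. This is only a sketch, not the exact file I used: the command name and wildcard pattern follow Kaldi's nnet3 edit-config syntax, and the file name edits.config is illustrative.

# illustrative edits.config, passed to nnet3-am-copy via --edits-config=edits.config
# convert every affine component in the asv branch to a FixedAffineComponent;
# the name pattern is matched with the usual edit-config wildcard rules
convert-to-fixed-affine name=asv_tdnn*.affine

The asv_tdnn*.batchnorm components were then switched to test mode in a separate step (nnet3-am-copy's --prepare-for-test flag would switch every batchnorm/dropout component, not just the asv branch, so it was not used here).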
# nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=0.141421356237 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.002 --edits='' --scale=1.0 exp/chain/tdnn_xvector_2c_sp/0.mdl - |" exp/chain/tdnn_xvector_2c_sp/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=2 ark:exp/chain/tdnn_1c_sp/egs/cegs.2.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=64 ark:- ark:- |" exp/chain/tdnn_xvector_2c_sp/1.2.raw
# Started at Tue Nov 24 11:17:20 CST 2020
#
nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=0.141421356237 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.002 --edits='' --scale=1.0 exp/chain/tdnn_xvector_2c_sp/0.mdl - |" exp/chain/tdnn_xvector_2c_sp/den.fst 'ark,bg:nnet3-chain-copy-egs --frame-shift=2 ark:exp/chain/tdnn_1c_sp/egs/cegs.2.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=64 ark:- ark:- |' exp/chain/tdnn_xvector_2c_sp/1.2.raw
WARNING (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 4 GPUs
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce RTX 2080 Ti free:10738M, used:281M, total:11019M, free/total:0.9745
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(1): GeForce RTX 2080 Ti free:10738M, used:281M, total:11019M, free/total:0.9745
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(2): GeForce RTX 2080 Ti free:10742M, used:277M, total:11019M, free/total:0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(3): GeForce RTX 2080 Ti free:10742M, used:277M, total:11019M, free/total:0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:471) Device: 2, mem_ratio: 0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuId():cu-device.cc:352) Trying to select device: 2
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 2 free mem ratio: 0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [2]: GeForce RTX 2080 Ti free:10156M, used:863M, total:11019M, free/total:0.921684 version 7.5
nnet3-am-copy --raw=true --learning-rate=0.002 --edits= --scale=1.0 exp/chain/tdnn_xvector_2c_sp/0.mdl -
LOG (nnet3-am-copy[5.5.824~1-63c32]:main():nnet3-am-copy.cc:153) Copied neural net from exp/chain/tdnn_xvector_2c_sp/0.mdl to raw format as -
nnet3-chain-merge-egs --minibatch-size=64 ark:- ark:-
nnet3-chain-shuffle-egs --buffer-size=5000 --srand=0 ark:- ark:-
nnet3-chain-copy-egs --frame-shift=2 ark:exp/chain/tdnn_1c_sp/egs/cegs.2.ark ark:-
ASSERTION_FAILED (nnet3-chain-train[5.5.824~1-63c32]:HouseBackward():qr.cc:124) Assertion failed: (KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs.")
[ Stack-Trace: ]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(+0x1fbbb) [0x7f4bb3f66bbb]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x28f) [0x7f4bb3f6738d]
nnet3-chain-train(kaldi::MessageLogger::Log::operator=(kaldi::MessageLogger const&)+0x1c) [0x459cc2]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0xb0) [0x7f4bb3f675a5]
/home/luoxiaojie/kaldi/src/lib/libkaldi-matrix.so(void kaldi::HouseBackward<float>(int, float const*, float*, float*)+0x2a3) [0x7f4bb4240850]
/home/luoxiaojie/kaldi/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Tridiagonalize(kaldi::MatrixBase<float>*)+0x1ec) [0x7f4bb423e148]
/home/luoxiaojie/kaldi/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Eig(kaldi::VectorBase<float>*, kaldi::MatrixBase<float>*) const+0xe6) [0x7f4bb423e958]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x75f) [0x7f4bb67f0495]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x199) [0x7f4bb67eeb75]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::Init(kaldi::CuMatrixBase<float> const&)+0x137) [0x7f4bb67ee90f]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x84) [0x7f4bb67eea60]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x16e) [0x7f4bb679fc72]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0xe4) [0x7f4bb6793420]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x7e3) [0x7f4bb6869f2d]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x1e5) [0x7f4bb686b6ad]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0xe8) [0x7f4bb6907996]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0x245) [0x7f4bb6907803]
nnet3-chain-train(main+0x516) [0x45912f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f4bb2d97840]
nnet3-chain-train(_start+0x29) [0x458299]
ERROR (nnet3-chain-merge-egs[5.5.824~1-63c32]:Write():kaldi-matrix.cc:1404) Failed to write matrix to stream.
However, when I run the training script without freezing, it runs normally.
Thank you.