ASSERTION_FAILED (nnet3-chain-train[5.5.824~1-63c32]:HouseBackward():qr.cc:124) Assertion failed: (KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs.")


laine...@gmail.com

Nov 23, 2020, 10:33:57 PM
to kaldi-help
Hi, everyone. My neural network structure is as follows:
input dim=43 name=input

  # please note that it is important to have input layer with the name=input
  # as the layer immediately preceding the fixed-affine-layer to enable
  # the use of short notation for the descriptor
  fixed-affine-layer name=lda input=Append(-1,0,1) affine-transform-file=$dir/configs/lda.mat

  # the first splicing is moved before the lda layer, so no splicing here
  relu-batchnorm-layer name=asr_tdnn1 dim=625
  tdnnf-layer name=asr_tdnnf2 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=1  
  tdnnf-layer name=asr_tdnnf3 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=1
  tdnnf-layer name=asr_tdnnf4 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=3
  tdnnf-layer name=asr_tdnnf5 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=3
  tdnnf-layer name=asr_tdnnf6 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=3

  relu-batchnorm-layer name=asv_tdnn1 input=Append(Offset(input,-2),Offset(input,-1),input,Offset(input,1),Offset(input,2)) dim=512
  relu-batchnorm-layer name=asv_tdnn2 input=Append(Offset(asv_tdnn1,-2),asv_tdnn1,Offset(asv_tdnn1,2)) dim=512
  relu-batchnorm-layer name=asv_tdnn3 input=Append(Offset(asv_tdnn2,-2),asv_tdnn2,Offset(asv_tdnn2,2)) dim=512
  relu-batchnorm-layer name=asv_tdnn4 input=Append(Offset(asv_tdnn3,-3),asv_tdnn3,Offset(asv_tdnn3,3)) dim=512
  relu-batchnorm-layer name=asv_tdnn5 input=Append(Offset(asv_tdnn4,-3),asv_tdnn4,Offset(asv_tdnn4,3)) dim=512
  relu-batchnorm-layer name=asv_tdnn6 input=asv_tdnn5 dim=512
  
  relu-batchnorm-layer name=combine input=Append(asr_tdnnf6,asv_tdnn6) dim=625 
  tdnnf-layer name=asr_tdnnf7 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=0
  tdnnf-layer name=asr_tdnnf8 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=0  

  
  ## adding the layers for chain branch
  relu-batchnorm-layer name=prefinal-chain input=asr_tdnnf8 dim=625 target-rms=0.5
  output-layer name=output include-log-softmax=false dim=$num_targets max-change=1.5

  # adding the layers for xent branch
  # This block prints the configs for a separate output that will be
  # trained with a cross-entropy objective in the 'chain' models... this
  # has the effect of regularizing the hidden parts of the model.  we use
  # 0.5 / args.xent_regularize as the learning rate factor- the factor of
  # 0.5 / args.xent_regularize is suitable as it means the xent
  # final-layer learns at a rate independent of the regularization
  # constant; and the 0.5 was tuned so as to make the relative progress
  # similar in the xent and regular final layers.
  relu-batchnorm-layer name=prefinal-xent input=asr_tdnnf8 dim=625 target-rms=0.5
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5

I tried to freeze the asv_tdnn* components, so I converted each asv_tdnn*.affine to a fixed affine component and set each asv_tdnn*.batchnorm to test mode.
But when I run the training script, I get an error. The command and log follow.

# nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=0.141421356237 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.002 --edits='' --scale=1.0 exp/chain/tdnn_xvector_2c_sp/0.mdl - |" exp/chain/tdnn_xvector_2c_sp/den.fst "ark,bg:nnet3-chain-copy-egs                          --frame-shift=2                         ark:exp/chain/tdnn_1c_sp/egs/cegs.2.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=0 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64 ark:- ark:- |" exp/chain/tdnn_xvector_2c_sp/1.2.raw 
# Started at Tue Nov 24 11:17:20 CST 2020
#
nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=0.141421356237 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.002 --edits='' --scale=1.0 exp/chain/tdnn_xvector_2c_sp/0.mdl - |" exp/chain/tdnn_xvector_2c_sp/den.fst 'ark,bg:nnet3-chain-copy-egs                          --frame-shift=2                         ark:exp/chain/tdnn_1c_sp/egs/cegs.2.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=0 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64 ark:- ark:- |' exp/chain/tdnn_xvector_2c_sp/1.2.raw 
WARNING (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 4 GPUs
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce RTX 2080 Ti free:10738M, used:281M, total:11019M, free/total:0.9745
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(1): GeForce RTX 2080 Ti free:10738M, used:281M, total:11019M, free/total:0.9745
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(2): GeForce RTX 2080 Ti free:10742M, used:277M, total:11019M, free/total:0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(3): GeForce RTX 2080 Ti free:10742M, used:277M, total:11019M, free/total:0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:471) Device: 2, mem_ratio: 0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuId():cu-device.cc:352) Trying to select device: 2
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 2 free mem ratio: 0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [2]: GeForce RTX 2080 Ti free:10156M, used:863M, total:11019M, free/total:0.921684 version 7.5
nnet3-am-copy --raw=true --learning-rate=0.002 --edits= --scale=1.0 exp/chain/tdnn_xvector_2c_sp/0.mdl - 
LOG (nnet3-am-copy[5.5.824~1-63c32]:main():nnet3-am-copy.cc:153) Copied neural net from exp/chain/tdnn_xvector_2c_sp/0.mdl to raw format as -
nnet3-chain-merge-egs --minibatch-size=64 ark:- ark:- 
nnet3-chain-shuffle-egs --buffer-size=5000 --srand=0 ark:- ark:- 
nnet3-chain-copy-egs --frame-shift=2 ark:exp/chain/tdnn_1c_sp/egs/cegs.2.ark ark:- 
ASSERTION_FAILED (nnet3-chain-train[5.5.824~1-63c32]:HouseBackward():qr.cc:124) Assertion failed: (KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs.")
[ Stack-Trace: ]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(+0x1fbbb) [0x7f4bb3f66bbb]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x28f) [0x7f4bb3f6738d]
nnet3-chain-train(kaldi::MessageLogger::Log::operator=(kaldi::MessageLogger const&)+0x1c) [0x459cc2]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0xb0) [0x7f4bb3f675a5]
/home/luoxiaojie/kaldi/src/lib/libkaldi-matrix.so(void kaldi::HouseBackward<float>(int, float const*, float*, float*)+0x2a3) [0x7f4bb4240850]
/home/luoxiaojie/kaldi/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Tridiagonalize(kaldi::MatrixBase<float>*)+0x1ec) [0x7f4bb423e148]
/home/luoxiaojie/kaldi/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Eig(kaldi::VectorBase<float>*, kaldi::MatrixBase<float>*) const+0xe6) [0x7f4bb423e958]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x75f) [0x7f4bb67f0495]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x199) [0x7f4bb67eeb75]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::Init(kaldi::CuMatrixBase<float> const&)+0x137) [0x7f4bb67ee90f]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x84) [0x7f4bb67eea60]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x16e) [0x7f4bb679fc72]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0xe4) [0x7f4bb6793420]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x7e3) [0x7f4bb6869f2d]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x1e5) [0x7f4bb686b6ad]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0xe8) [0x7f4bb6907996]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0x245) [0x7f4bb6907803]
nnet3-chain-train(main+0x516) [0x45912f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f4bb2d97840]
nnet3-chain-train(_start+0x29) [0x458299]

ERROR (nnet3-chain-merge-egs[5.5.824~1-63c32]:Write():kaldi-matrix.cc:1404) Failed to write matrix to stream.

However, when I run the training script without freezing, training is normal.

Thank you

Daniel Povey

Nov 23, 2020, 10:50:58 PM
to kaldi-help
Try reducing the learning rates, could be instability.


laine...@gmail.com

Nov 23, 2020, 11:12:16 PM
to kaldi-help
Thank you for your reply, but now I get another error.

nnet3-chain-train --use-gpu=yes --verbose=3 --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn_xvector_2c_sp/cache.4 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --srand=4 "nnet3-am-copy --raw=true --learning-rate=0.0002 --edits='' --scale=1.0 exp/chain/tdnn_xvector_2c_sp/4.mdl - |" exp/chain/tdnn_xvector_2c_sp/den.fst 'ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:exp/chain/tdnn_1c_sp/egs/cegs.10.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=4 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=128 ark:- ark:- |' exp/chain/tdnn_xvector_2c_sp/5.2.raw

VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.096164
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.0668569
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.173937
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.0832209
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.137356
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.200008
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.165137
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.0713673
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.213587
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.0751119
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.000491708
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.101577
WARNING (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1055) Ratio is nan (should be >= 1.0); component is asr_tdnnf4.linear
ASSERTION_FAILED (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1057) Assertion failed: (ratio > 0.9)

[ Stack-Trace: ]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(+0x1fbbb) [0x7f4575e0fbbb]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x28f) [0x7f4575e1038d]
nnet3-chain-train(kaldi::MessageLogger::Log::operator=(kaldi::MessageLogger const&)+0x1c) [0x459cc2]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0xb0) [0x7f4575e105a5]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::ConstrainOrthonormalInternal(float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float>*)+0x2fc) [0x7f4578705376]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::ConstrainOrthonormal(kaldi::nnet3::Nnet*)+0x2f6) [0x7f45787059dc]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x1f2) [0x7f45787b0aa0]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0x245) [0x7f45787b0803]
nnet3-chain-train(main+0x516) [0x45912f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f4574c40840]
nnet3-chain-train(_start+0x29) [0x458299]

Daniel Povey

Nov 23, 2020, 11:53:10 PM
to kaldi-help
You could.  I suspect you may have called CollapseModel() at some point (or --prepare-for-test, which calls that), which
would merge the batchnorm into the affine components.  It's probably not ideal if you plan to train afterward.

On Tue, Nov 24, 2020 at 12:15 PM laine...@gmail.com <laine...@gmail.com> wrote:
Should I set orthonormal-constraint=0.0?

laine...@gmail.com

Nov 24, 2020, 12:09:08 AM
to kaldi-help
I want to freeze some of the batchnorm components, so I wrote a program that sets them to test mode by just calling the SetTestMode method.

for (int32 i = 0; i < static_cast<int32>(batchnorm_names.size()); i++) {
  int32 index = nnet.GetComponentIndex(batchnorm_names[i]);
  KALDI_ASSERT(index != -1 && "Expected batchnorm component to exist");
  Component *component = nnet.GetComponent(index);
  BatchNormComponent *bc = dynamic_cast<BatchNormComponent*>(component);
  if (bc != NULL) {
    KALDI_LOG << "Setting " << batchnorm_names[i] << " to test mode";
    bc->SetTestMode(true);
  }
}
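For what it's worth, an alternative way to freeze the updatable (affine) parts — leaving the batchnorm question aside — is an edits config that zeroes their learning-rate factor, applied via nnet3-am-copy's --edits option. The directive names and glob matching below are my assumption from nnet3's edit-config support, so double-check them against your Kaldi version:

```
# Hypothetical edits config (syntax assumed from nnet3's ReadEditConfig;
# check nnet-utils.h in your Kaldi version).
# Zero the learning rate of the asv_tdnn* components instead of
# changing their type:
set-learning-rate-factor name=asv_tdnn* learning-rate-factor=0.0
```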

How should I freeze the batchnorm components correctly? Thank you.


Daniel Povey

Nov 24, 2020, 12:17:38 AM
to kaldi-help
That should be fine.
Actually those errors in orthogonality are probably just due to training getting a bit unstable.
Reducing the learning rate may help.

laine...@gmail.com

Nov 25, 2020, 8:44:56 AM
to kaldi-help
I found that whenever I set the batchnorm to test mode, the above error occurs...

Jan Trmal

Nov 25, 2020, 3:48:08 PM
to kaldi-help
Yeah, after setting test mode, the network might not be completely suitable for continuing training, AFAIK.
y.
