ASSERTION_FAILED (nnet3-chain-train[5.5.824~1-63c32]:HouseBackward():qr.cc:124) Assertion failed: (KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs.")


laine...@gmail.com

Nov 23, 2020, 10:33:57 PM
to kaldi-help
Hi, everyone. My neural network structure is as follows:
input dim=43 name=input

  # please note that it is important to have input layer with the name=input
  # as the layer immediately preceding the fixed-affine-layer to enable
  # the use of short notation for the descriptor
  fixed-affine-layer name=lda input=Append(-1,0,1) affine-transform-file=$dir/configs/lda.mat

  # the first splicing is moved before the lda layer, so no splicing here
  relu-batchnorm-layer name=asr_tdnn1 dim=625
  tdnnf-layer name=asr_tdnnf2 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=1  
  tdnnf-layer name=asr_tdnnf3 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=1
  tdnnf-layer name=asr_tdnnf4 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=3
  tdnnf-layer name=asr_tdnnf5 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=3
  tdnnf-layer name=asr_tdnnf6 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=3

  relu-batchnorm-layer name=asv_tdnn1 input=Append(Offset(input,-2),Offset(input,-1),input,Offset(input,1),Offset(input,2)) dim=512
  relu-batchnorm-layer name=asv_tdnn2 input=Append(Offset(asv_tdnn1,-2),asv_tdnn1,Offset(asv_tdnn1,2)) dim=512
  relu-batchnorm-layer name=asv_tdnn3 input=Append(Offset(asv_tdnn2,-2),asv_tdnn2,Offset(asv_tdnn2,2)) dim=512
  relu-batchnorm-layer name=asv_tdnn4 input=Append(Offset(asv_tdnn3,-3),asv_tdnn3,Offset(asv_tdnn3,3)) dim=512
  relu-batchnorm-layer name=asv_tdnn5 input=Append(Offset(asv_tdnn4,-3),asv_tdnn4,Offset(asv_tdnn4,3)) dim=512
  relu-batchnorm-layer name=asv_tdnn6 input=asv_tdnn5 dim=512
  
  relu-batchnorm-layer name=combine input=Append(asr_tdnnf6,asv_tdnn6) dim=625 
  tdnnf-layer name=asr_tdnnf7 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=0
  tdnnf-layer name=asr_tdnnf8 $tdnnf_opts dim=625 bottleneck-dim=256 time-stride=0  

  
  ## adding the layers for chain branch
  relu-batchnorm-layer name=prefinal-chain input=asr_tdnnf8 dim=625 target-rms=0.5
  output-layer name=output include-log-softmax=false dim=$num_targets max-change=1.5

  # adding the layers for xent branch
  # This block prints the configs for a separate output that will be
  # trained with a cross-entropy objective in the 'chain' models... this
  # has the effect of regularizing the hidden parts of the model.  we use
  # 0.5 / args.xent_regularize as the learning rate factor- the factor of
  # 0.5 / args.xent_regularize is suitable as it means the xent
  # final-layer learns at a rate independent of the regularization
  # constant; and the 0.5 was tuned so as to make the relative progress
  # similar in the xent and regular final layers.
  relu-batchnorm-layer name=prefinal-xent input=asr_tdnnf8 dim=625 target-rms=0.5
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5

I tried to freeze the asv_tdnn* components, so I converted each asv_tdnn*.affine to a fixed affine component and set each asv_tdnn*.batchnorm to test mode.
But when I run the training script, I get an error. The command and log follow.

# nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=0.141421356237 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.002 --edits='' --scale=1.0 exp/chain/tdnn_xvector_2c_sp/0.mdl - |" exp/chain/tdnn_xvector_2c_sp/den.fst "ark,bg:nnet3-chain-copy-egs                          --frame-shift=2                         ark:exp/chain/tdnn_1c_sp/egs/cegs.2.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=0 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64 ark:- ark:- |" exp/chain/tdnn_xvector_2c_sp/1.2.raw 
# Started at Tue Nov 24 11:17:20 CST 2020
#
nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=0.141421356237 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.002 --edits='' --scale=1.0 exp/chain/tdnn_xvector_2c_sp/0.mdl - |" exp/chain/tdnn_xvector_2c_sp/den.fst 'ark,bg:nnet3-chain-copy-egs                          --frame-shift=2                         ark:exp/chain/tdnn_1c_sp/egs/cegs.2.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=0 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64 ark:- ark:- |' exp/chain/tdnn_xvector_2c_sp/1.2.raw 
WARNING (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 4 GPUs
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce RTX 2080 Ti free:10738M, used:281M, total:11019M, free/total:0.9745
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(1): GeForce RTX 2080 Ti free:10738M, used:281M, total:11019M, free/total:0.9745
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(2): GeForce RTX 2080 Ti free:10742M, used:277M, total:11019M, free/total:0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(3): GeForce RTX 2080 Ti free:10742M, used:277M, total:11019M, free/total:0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:471) Device: 2, mem_ratio: 0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuId():cu-device.cc:352) Trying to select device: 2
LOG (nnet3-chain-train[5.5.824~1-63c32]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 2 free mem ratio: 0.974863
LOG (nnet3-chain-train[5.5.824~1-63c32]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [2]: GeForce RTX 2080 Ti free:10156M, used:863M, total:11019M, free/total:0.921684 version 7.5
nnet3-am-copy --raw=true --learning-rate=0.002 --edits= --scale=1.0 exp/chain/tdnn_xvector_2c_sp/0.mdl - 
LOG (nnet3-am-copy[5.5.824~1-63c32]:main():nnet3-am-copy.cc:153) Copied neural net from exp/chain/tdnn_xvector_2c_sp/0.mdl to raw format as -
nnet3-chain-merge-egs --minibatch-size=64 ark:- ark:- 
nnet3-chain-shuffle-egs --buffer-size=5000 --srand=0 ark:- ark:- 
nnet3-chain-copy-egs --frame-shift=2 ark:exp/chain/tdnn_1c_sp/egs/cegs.2.ark ark:- 
ASSERTION_FAILED (nnet3-chain-train[5.5.824~1-63c32]:HouseBackward():qr.cc:124) Assertion failed: (KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs.")
[ Stack-Trace: ]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(+0x1fbbb) [0x7f4bb3f66bbb]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x28f) [0x7f4bb3f6738d]
nnet3-chain-train(kaldi::MessageLogger::Log::operator=(kaldi::MessageLogger const&)+0x1c) [0x459cc2]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0xb0) [0x7f4bb3f675a5]
/home/luoxiaojie/kaldi/src/lib/libkaldi-matrix.so(void kaldi::HouseBackward<float>(int, float const*, float*, float*)+0x2a3) [0x7f4bb4240850]
/home/luoxiaojie/kaldi/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Tridiagonalize(kaldi::MatrixBase<float>*)+0x1ec) [0x7f4bb423e148]
/home/luoxiaojie/kaldi/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Eig(kaldi::VectorBase<float>*, kaldi::MatrixBase<float>*) const+0xe6) [0x7f4bb423e958]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x75f) [0x7f4bb67f0495]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x199) [0x7f4bb67eeb75]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::Init(kaldi::CuMatrixBase<float> const&)+0x137) [0x7f4bb67ee90f]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x84) [0x7f4bb67eea60]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x16e) [0x7f4bb679fc72]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0xe4) [0x7f4bb6793420]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x7e3) [0x7f4bb6869f2d]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x1e5) [0x7f4bb686b6ad]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0xe8) [0x7f4bb6907996]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0x245) [0x7f4bb6907803]
nnet3-chain-train(main+0x516) [0x45912f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f4bb2d97840]
nnet3-chain-train(_start+0x29) [0x458299]

ERROR (nnet3-chain-merge-egs[5.5.824~1-63c32]:Write():kaldi-matrix.cc:1404) Failed to write matrix to stream.

However, when I run the training script without freezing, training is normal.

Thank you

Daniel Povey

Nov 23, 2020, 10:50:58 PM
to kaldi-help
Try reducing the learning rates, could be instability.


laine...@gmail.com

Nov 23, 2020, 11:12:16 PM
to kaldi-help
Thank you for your reply, but now I get another error.

nnet3-chain-train --use-gpu=yes --verbose=3 --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn_xvector_2c_sp/cache.4 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --srand=4 "nnet3-am-copy --raw=true --learning-rate=0.0002 --edits='' --scale=1.0 exp/chain/tdnn_xvector_2c_sp/4.mdl - |" exp/chain/tdnn_xvector_2c_sp/den.fst 'ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:exp/chain/tdnn_1c_sp/egs/cegs.10.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=4 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=128 ark:- ark:- |' exp/chain/tdnn_xvector_2c_sp/5.2.raw

VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.096164
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.0668569
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.173937
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.0832209
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.137356
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.200008
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.165137
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.0713673
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.213587
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.0751119
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.000491708
VLOG[2] (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1093) Error in orthogonality is 0.101577
WARNING (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1055) Ratio is nan (should be >= 1.0); component is asr_tdnnf4.linear
ASSERTION_FAILED (nnet3-chain-train[5.5.824~1-63c32]:ConstrainOrthonormalInternal():nnet-utils.cc:1057) Assertion failed: (ratio > 0.9)

[ Stack-Trace: ]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(+0x1fbbb) [0x7f4575e0fbbb]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x28f) [0x7f4575e1038d]
nnet3-chain-train(kaldi::MessageLogger::Log::operator=(kaldi::MessageLogger const&)+0x1c) [0x459cc2]
/home/luoxiaojie/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0xb0) [0x7f4575e105a5]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::ConstrainOrthonormalInternal(float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float>*)+0x2fc) [0x7f4578705376]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::ConstrainOrthonormal(kaldi::nnet3::Nnet*)+0x2f6) [0x7f45787059dc]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x1f2) [0x7f45787b0aa0]
/home/luoxiaojie/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0x245) [0x7f45787b0803]
nnet3-chain-train(main+0x516) [0x45912f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f4574c40840]
nnet3-chain-train(_start+0x29) [0x458299]

Daniel Povey

Nov 23, 2020, 11:53:10 PM
to kaldi-help
You could.  I suspect you may have called CollapseModel() at some point (or --prepare-for-test, which calls that), which
would merge the batchnorm into the affine components.  It's probably not ideal if you plan to train afterward.

On Tue, Nov 24, 2020 at 12:15 PM laine...@gmail.com <laine...@gmail.com> wrote:
Should I set orthonormal-constraint=0.0?

laine...@gmail.com

Nov 24, 2020, 12:09:08 AM
to kaldi-help
I want to freeze some of the batchnorm components, so I wrote a program that sets them to test mode by just calling the SetTestMode method.

for (int32 i = 0; i < static_cast<int32>(batchnorm_names.size()); i++) {
  int32 index = nnet.GetComponentIndex(batchnorm_names[i]);
  KALDI_ASSERT(index != -1 && "Expected batchnorm component to exist");
  Component *component = nnet.GetComponent(index);
  BatchNormComponent *bc = dynamic_cast<BatchNormComponent*>(component);
  if (bc != NULL) {
    KALDI_LOG << "Setting " << batchnorm_names[i] << " to test mode";
    bc->SetTestMode(true);
  }
}
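For what it's worth, an alternative way to freeze the updatable (affine) parts — leaving the batchnorm question aside — is an edits config that zeroes their learning-rate factor, applied via nnet3-am-copy's --edits option. The directive names and glob matching below are my assumption from nnet3's edit-config support, so double-check them against your Kaldi version:

```
# Hypothetical edits config (syntax assumed from nnet3's ReadEditConfig;
# check nnet-utils.h in your Kaldi version).
# Zero the learning rate of the asv_tdnn* components instead of
# changing their type:
set-learning-rate-factor name=asv_tdnn* learning-rate-factor=0.0
```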

How should I freeze the batchnorm components correctly? Thank you.


Daniel Povey

Nov 24, 2020, 12:17:38 AM
to kaldi-help
That should be fine.
Actually those errors in orthogonality are probably just due to training getting a bit unstable.
Reducing the learning rate may help.

laine...@gmail.com

Nov 25, 2020, 8:44:56 AM
to kaldi-help
I found that whenever I set the batchnorm to test mode, the above error occurs...

Jan Trmal

Nov 25, 2020, 3:48:08 PM
to kaldi-help
Yeah, after setting test mode, the network might not be completely suitable for continuing training, AFAIK.
y.
