Another assert failed in nnet3-chain-train


Xiang Li

Jan 11, 2016, 10:22:04 PM
to kaldi-help
Hi, Dan,
I've updated the code after you fixed the bug in the chain derivative computation.
All tests in matrix/ and cudamatrix/ are OK.
Here is the log with the new code:
... 
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 4.0073 > --max-param-change=1, scaling by 0.249545 
KALDI_ASSERT: at nnet3-chain-train:HouseBackward:qr.cc:124, failed: KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs."
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
void kaldi::HouseBackward<float>(int, float const*, float*, float*)
kaldi::SpMatrix<float>::Tridiagonalize(kaldi::MatrixBase<float>*)
kaldi::SpMatrix<float>::Eig(kaldi::VectorBase<float>*, kaldi::MatrixBase<float>*) const
kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(int, float, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*, kaldi::CuVectorBase<float>*, float*)
kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, kaldi::CuVectorBase<float>*, float*)
kaldi::nnet3::NaturalGradientAffineComponent::Update(std::string const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)
kaldi::nnet3::AffineComponent::Backprop(std::string const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const
kaldi::nnet3::NnetComputer::ExecuteCommand(int)
kaldi::nnet3::NnetComputer::Backward()
kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)
nnet3-chain-train(main+0x3f0) [0x8b3f6c]
... 
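The assertion message above names two possible causes: NaNs in the matrix, or values that are too large. As a rough illustration only (this is not Kaldi's code; the helper name `sum_of_squares_is_finite` is hypothetical, and it assumes the checked quantity behaves like an accumulated sum of squares), the following Python sketch shows how such a finiteness check can fail in either case:

```python
import math

def sum_of_squares_is_finite(values):
    """Return True if the sum of squared entries is a finite number.

    Hypothetical helper, loosely modelled on a check such as
    KALDI_ISFINITE(sigma); it is not taken from Kaldi.
    """
    sigma = 0.0
    for v in values:
        # Squaring a very large value overflows to inf; a NaN propagates.
        sigma += v * v
    return math.isfinite(sigma)

ok = sum_of_squares_is_finite([1.0, -2.0, 3.0])
has_nan = sum_of_squares_is_finite([1.0, float("nan")])
too_large = sum_of_squares_is_finite([1e200, 1.0])  # 1e200 squared overflows a double
```

Only the first call passes the check. The other two fail for the two reasons given in the assertion text, so the assert alone does not say which of them occurred here.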

Before you fixed the bug, my job had failed due to another assert failure.
Here is the log with the old code:
...
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 4.00642 > --max-param-change=1, scaling by 0.2496
ERROR (nnet3-chain-train:HouseBackward():qr.cc:146) NaN encountered in HouseBackward
(no stack trace here) 
...

Both runs failed at the same final-affine Backprop command, while computing the natural gradient.
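For readers wondering about the "Parameter change too big" lines in both logs: the logged scaling factor is consistent with simply dividing --max-param-change by the size of the proposed change. Here is a minimal Python sketch under that assumption; the function name `param_change_scale` is hypothetical and the code is not taken from Kaldi.

```python
import math

def param_change_scale(delta, max_param_change):
    """Scale factor that caps the L2 norm of `delta` at `max_param_change`.

    Assumption: when the norm is too big, the scale is
    max_param_change / norm, which matches the numbers in the logs.
    """
    norm = math.sqrt(sum(d * d for d in delta))
    if not math.isfinite(norm):
        raise ValueError("non-finite parameter change")
    if norm > max_param_change:
        return max_param_change / norm
    return 1.0  # change is already small enough; leave it unscaled

# Norms taken from the two log excerpts, with --max-param-change=1.
new_code_scale = param_change_scale([4.0073], 1.0)
old_code_scale = param_change_scale([4.00642], 1.0)
```

With the norms from the two logs this gives roughly 0.2495 and 0.2496, in line with the logged "scaling by" values.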


 

Daniel Povey

Jan 11, 2016, 10:28:06 PM
to kaldi-help
Make sure that you recompiled all the binaries in cudamatrix.  If it still happens, please run with --verbose=1 and show me what the last part of the log looks like.
Dan



Daniel Povey

Jan 11, 2016, 10:38:04 PM
to kaldi-help
Sorry, I meant the binaries in chainbin, and obviously the code in chain/. You may have to run 'make depend' to make sure everything gets compiled correctly.

Xiang Li

Jan 11, 2016, 11:14:03 PM
to kaldi-help, dpo...@gmail.com
Is this enough?
LOG (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:282) c62: # begin backward commands
LOG (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:282) c63: m35 = undefined(8192,3233)
LOG (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:282) c64: m35 = m36(128:8319, 0:3232)
LOG (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:282) c65: m36 = []
LOG (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:282) c66: m33 = undefined(8192,3233)
LOG (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:282) c67: final-log-softmax.Backprop(NULL, [], m34(128:8319, 0:3232), m35, [component-pointer], &m33)
LOG (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:282) c68: m35 = []
LOG (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:282) c69: m31 = zeros(8192,850)
ERROR (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:286) Error running command c70: final-affine.Backprop(NULL, m30(128:8319, 0:849), [], m33, [component-pointer], &m31)
WARNING (nnet3-chain-train:Close():kaldi-io.cc:496) Pipe nnet3-chain-copy-egs --truncate-deriv-weights=0 --frame-shift=2 ark:exp/chain/tdnn_o/egs/cegs.75.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=610 ark:- ark:-| nnet3-chain-merge-egs --minibatch-size=128 ark:- ark:- | had nonzero return status 36096
WARNING (nnet3-chain-train:~Mutex():kaldi-mutex.cc:45) Error destroying pthread mutex; ignoring it as it could be a known issue that affects Haswell processors, see https://sourceware.org/bugzilla/show_bug.cgi?id=16657 If your processor is not Haswell and you see this message, it could be a bug in Kaldi.  However it could be that multi-threaded code terminated messily.
ERROR (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:286) Error running command c70: final-affine.Backprop(NULL, m30(128:8319, 0:849), [], m33, [component-pointer], &m31)
[stack trace: ]
kaldi::KaldiGetStackTrace()
kaldi::KaldiErrorMessage::~KaldiErrorMessage()

kaldi::nnet3::NnetComputer::ExecuteCommand(int)
kaldi::nnet3::NnetComputer::Backward()
kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)
nnet3-chain-train(main+0x3f0) [0x8b3f6c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd) [0x7f0287c95ead]
nnet3-chain-train() [0x8b3a99]



On Tuesday, January 12, 2016 at 11:28:06 AM UTC+8, Dan Povey wrote:

Daniel Povey

Jan 11, 2016, 11:18:02 PM
to Xiang Li, kaldi-help
No, it's not. After it fails, it prints out a lot of unrelated errors. Look for a log message printed from chain-training.cc.
Dan
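One way to follow that advice on a long log is to keep only the lines whose header names a given source file. Below is a small Python sketch; the header pattern is inferred from the log lines quoted in this thread, and the names `LOG_HEADER` and `lines_from_source` are hypothetical, not part of any Kaldi tool.

```python
import re

# Assumed header shape, inferred from the log lines in this thread:
#   LEVEL (program:Function():source-file.cc:line) message
LOG_HEADER = re.compile(
    r"^(?P<level>LOG|WARNING|ERROR) "
    r"\((?P<program>[^:]+):(?P<func>[^:]*):(?P<file>[^:]+):(?P<line>\d+)\) "
    r"(?P<message>.*)$"
)

def lines_from_source(log_lines, source_suffix):
    """Return the log lines whose source file name ends with `source_suffix`."""
    selected = []
    for line in log_lines:
        match = LOG_HEADER.match(line.strip())
        if match and match.group("file").endswith(source_suffix):
            selected.append(line)
    return selected

# Sample lines copied from the logs earlier in this thread.
sample = [
    "LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) "
    "Parameter change too big: 4.0073 > --max-param-change=1, scaling by 0.249545",
    "LOG (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:282) "
    "c69: m31 = zeros(8192,850)",
    "ERROR (nnet3-chain-train:HouseBackward():qr.cc:146) "
    "NaN encountered in HouseBackward",
]
matches = lines_from_source(sample, "chain-training.cc")
```

Matching on a suffix means that "chain-training.cc" also selects lines from nnet-chain-training.cc. Lines that do not follow the assumed header shape, such as the KALDI_ASSERT line or stack-trace frames, are skipped.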

Xiang Li

Jan 11, 2016, 11:28:02 PM
to kaldi-help, heibaid...@gmail.com, dpo...@gmail.com
Please check the attachment.

On Tuesday, January 12, 2016 at 12:18:02 PM UTC+8, Dan Povey wrote:
train.log

Daniel Povey

Jan 11, 2016, 11:43:06 PM
to Xiang Li, kaldi-help
I'm not sure what this is. If you send me the files necessary to reproduce this, I should be able to fix it, though. That would be useful.
Dan

Daniel Povey

Jan 12, 2016, 7:12:55 PM
to Xiang Li, kaldi-help
Thanks...
Issue fixed now.
Dan

Xiang Li

Jan 12, 2016, 7:29:18 PM
to kaldi-help, heibaid...@gmail.com, dpo...@gmail.com
Thank you very much, Dan.

On Wednesday, January 13, 2016 at 8:12:55 AM UTC+8, Dan Povey wrote: