Hi Dan,
I've updated my code to include your fix for the bug in the chain derivative computation.
All tests in matrix/ and cudamatrix/ are OK.
Here is the log with the new code:
...
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 4.0073 > --max-param-change=1, scaling by 0.249545
KALDI_ASSERT: at nnet3-chain-train:HouseBackward:qr.cc:124, failed: KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs."
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
void kaldi::HouseBackward<float>(int, float const*, float*, float*)
kaldi::SpMatrix<float>::Tridiagonalize(kaldi::MatrixBase<float>*)
kaldi::SpMatrix<float>::Eig(kaldi::VectorBase<float>*, kaldi::MatrixBase<float>*) const
kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(int, float, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*, kaldi::CuVectorBase<float>*, float*)
kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, kaldi::CuVectorBase<float>*, float*)
kaldi::nnet3::NaturalGradientAffineComponent::Update(std::string const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)
kaldi::nnet3::AffineComponent::Backprop(std::string const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const
kaldi::nnet3::NnetComputer::ExecuteCommand(int)
kaldi::nnet3::NnetComputer::Backward()
kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)
nnet3-chain-train(main+0x3f0) [0x8b3f6c]
...
Before you fixed the bug, my job had failed with a different assertion failure;
here is the log with the old code:
...
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 4.00642 > --max-param-change=1, scaling by 0.2496
ERROR (nnet3-chain-train:HouseBackward():qr.cc:146) NaN encountered in HouseBackward
(no stack trace here)
...
Both runs failed on the same final-affine backprop command, while computing the natural gradient.