Assert failed when training nnet3 chain

617 views

remi....@gmail.com
Jan 4, 2016, 6:43:56 AM
to kaldi-help
I tried to train a TDNN with the chain script run_tdnn_2o.sh, from version 13e855735b45f8629cdbebf3ac955b99d2bfbaee of Kaldi, on my own data, and I got the following errors:
steps/nnet3/chain/train_tdnn.sh: Will train for 4 epochs = 1069 iterations
On iteration 0, learning rate is 0.003.
Training neural net (pass 0)
On iteration 1, learning rate is 0.00299796180388041.
Training neural net (pass 1)
On iteration 2, learning rate is 0.00299592499250862.
Training neural net (pass 2)
On iteration 3, learning rate is 0.00299388956494385.
Training neural net (pass 3)
On iteration 4, learning rate is 0.00299185552024593.
Training neural net (pass 4)
On iteration 5, learning rate is 0.00298982285747534.
Training neural net (pass 5)
queue.pl: job failed with status 255, log is in exp/chain/tdnn_2o_sp/log/train.5.1.log


In exp/chain/tdnn_2o_sp/log/train.5.1.log there is:
KALDI_ASSERT: at nnet3-chain-train:PreconditionDirectionsInternal:natural-gradient-online.cc:348, failed: tr_Xhat_XhatT == tr_Xhat_XhatT
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(int, float, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*, kaldi::CuVectorBase<float>*, float*)
kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, kaldi::CuVectorBase<float>*, float*)
kaldi::nnet3::NaturalGradientAffineComponent::Update(std::string const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)
kaldi::nnet3::AffineComponent::Backprop(std::string const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const
kaldi::nnet3::NnetComputer::ExecuteCommand(int)
kaldi::nnet3::NnetComputer::Backward()
kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)
nnet3-chain-train(main+0x3f6) [0x820093]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x2ac01ae9aec5]
nnet3-chain-train() [0x81fbd9]
WARNING (nnet3-chain-train:ExecuteCommand():nnet-compute.cc:279) Printing some background info since error was detected

The log is quite big because of the background info, but I can paste it somewhere if that would help in finding what went wrong.

Vijayaditya Peddinti
Jan 4, 2016, 1:32:38 PM
to kaldi-help
Could you check the magnitude of the gradients in exp/chain/tdnn_2o_sp/log/train.5.1.log? By default it is printed every 100 minibatches. You could print it at a finer resolution using the --print-interval option of nnet3-train.

--Vijay


remi....@gmail.com
Jan 4, 2016, 2:03:53 PM
to kaldi-help
I get something like this:
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 9.08328 > --max-param-change=1, scaling by 0.110092
LOG (nnet3-chain-train:PrintStatsForThisPhase():nnet-training.cc:147) Average objective function for 'output' for minibatches 23-23 is -0.545507 over 6400 frames.
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 8.26246 > --max-param-change=1, scaling by 0.121029
LOG (nnet3-chain-train:PrintStatsForThisPhase():nnet-training.cc:147) Average objective function for 'output' for minibatches 24-24 is -0.53988 over 6400 frames.
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 7.5673 > --max-param-change=1, scaling by 0.132148
LOG (nnet3-chain-train:PrintStatsForThisPhase():nnet-training.cc:147) Average objective function for 'output' for minibatches 25-25 is -0.493549 over 6400 frames.
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 11.1135 > --max-param-change=1, scaling by 0.0899806
LOG (nnet3-chain-train:PrintStatsForThisPhase():nnet-training.cc:147) Average objective function for 'output' for minibatches 26-26 is -0.498687 over 6400 frames.
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 18.0382 > --max-param-change=1, scaling by 0.055438
LOG (nnet3-chain-train:PrintStatsForThisPhase():nnet-training.cc:147) Average objective function for 'output' for minibatches 27-27 is -0.524272 over 6400 frames.
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 10.5323 > --max-param-change=1, scaling by 0.0949463
LOG (nnet3-chain-train:PrintStatsForThisPhase():nnet-training.cc:147) Average objective function for 'output' for minibatches 28-28 is -0.536255 over 6400 frames.
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 14.1433 > --max-param-change=1, scaling by 0.0707047
LOG (nnet3-chain-train:PrintStatsForThisPhase():nnet-training.cc:147) Average objective function for 'output' for minibatches 29-29 is -0.519953 over 6400 frames.
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 9.60148 > --max-param-change=1, scaling by 0.104151
LOG (nnet3-chain-train:PrintStatsForThisPhase():nnet-training.cc:147) Average objective function for 'output' for minibatches 30-30 is -0.523895 over 6400 frames.
LOG (nnet3-chain-train:Train():nnet-chain-training.cc:80) Parameter change too big: 8.13516 > --max-param-change=1, scaling by 0.122923
LOG (nnet3-chain-train:PrintStatsForThisPhase():nnet-training.cc:147) Average objective function for 'output' for minibatches 31-31 is -0.518401 over 6400 frames.
KALDI_ASSERT: at nnet3-chain-train:PreconditionDirectionsInternal:natural-gradient-online.cc:348, failed: tr_Xhat_XhatT == tr_Xhat_XhatT

Vijayaditya Peddinti
Jan 4, 2016, 2:20:33 PM
to kaldi-help
I usually see errors like this during LSTM training, due to gradient explosion, but I don't see such an issue here. So we will have to wait for Dan to respond.

--Vijay

Daniel Povey
Jan 4, 2016, 3:01:01 PM
to kaldi-help
I have never seen that particular assert failing before, but I've seen similar issues.  Please do the following to try to debug it:
  - First rerun the same job to test whether it's repeatable.
  - Then update your Kaldi and see whether it's still repeatable.
  - Then run the tests in matrix/ and cudamatrix/.  Some versions of BLAS have bugs where NaNs just past the end of a matrix row can propagate into the computation (e.g. the current version of El Capitan on Mac has this issue, which apparently will be fixed in a point update).
  - If it's repeatable and if the tests passed, then in natural-gradient-online.cc, change  self_debug_(false) to  self_debug_(true)  and send me the log file by email (dpo...@gmail.com - too big for the list).
Dan


remi....@gmail.com
Jan 5, 2016, 6:47:31 AM
to kaldi-help, dpo...@gmail.com
I have updated Kaldi and it seems to work for now.

remi....@gmail.com
Jan 5, 2016, 10:56:13 AM
to kaldi-help, dpo...@gmail.com, remi....@gmail.com
It actually failed at a later iteration.
All the tests passed. I recompiled natural-gradient-online.cc with self_debug_(true) and got a log; however, it doesn't seem to contain anything new. I'm sending it to you in a separate mail.

Daniel Povey
Jan 5, 2016, 1:43:22 PM
to Rémi Francis, kaldi-help
I didn't see anything in the logs either.  Possibly the reason it fails at that point is that it does an explicit test for NaNs.  I suspect it could be a hardware error.  See if it always tends to happen on the same node.  Just restart from the stage that failed using the --stage option, e.g. call train_whatever.sh with --stage 19.
Dan

remi....@gmail.com
Jan 7, 2016, 5:52:27 AM
to kaldi-help, remi....@gmail.com, dpo...@gmail.com
I wanted to debug this a bit more, so I added some asserts to find the point at which NaNs appear in natural-gradient-online.cc:
  if (!locked) {
    // We're not updating the parameters, either because another thread is
    // working on updating them, or because another thread already did so from
    // the same or later starting point (making our update stale), or because
    // update_period_ > 1.  We just apply the preconditioning and return.

    // note: we don't bother with any locks before incrementing
    // num_updates_skipped_ below, because the worst that could happen is that,
    // on very rare occasions, we could skip one or two more updates than we
    // intended.
    num_updates_skipped_++;

    BaseFloat tr_Xt_XtT = TraceMatMat(*X_t, *X_t, kTrans);
    KALDI_ASSERT(tr_Xt_XtT == tr_Xt_XtT);  // Check for NaN.
    // X_hat_t = X_t - H_t W_t
    X_t->AddMatMat(-1.0, H_t, kNoTrans, W_t, kNoTrans, 1.0);
    BaseFloat tr_Xt_XtT_bis = TraceMatMat(*X_t, *X_t, kTrans);
    KALDI_ASSERT(tr_Xt_XtT_bis == tr_Xt_XtT_bis);  // Check for NaN.
    // each element i of row_prod will be inner product of row i of X_hat_t
    // with itself.
    BaseFloat row_prod_sum = row_prod->Sum();
    KALDI_ASSERT(row_prod_sum == row_prod_sum);  // Check for NaN.
    row_prod->AddDiagMat2(1.0, *X_t, kNoTrans, 0.0);
    BaseFloat tr_Xhat_XhatT = row_prod->Sum();
    KALDI_ASSERT(tr_Xhat_XhatT == tr_Xhat_XhatT);  // Check for NaN.
    BaseFloat gamma_t = (tr_Xhat_XhatT == 0.0 ? 1.0 :
                         sqrt(tr_Xt_XtT / tr_Xhat_XhatT));
    *scale = gamma_t;
    return;
  }

And now all the asserts pass (I'm at iteration 500), which is strange, because it should still be computing exactly the same thing.

Daniel Povey
Jan 7, 2016, 3:15:42 PM
to Rémi Francis, kaldi-help
That's consistent with a hardware error. Remember, the GeForce cards don't have memory error correction.
Dan

remi....@gmail.com
Jan 8, 2016, 10:32:50 AM
to kaldi-help, remi....@gmail.com, dpo...@gmail.com
I have a bunch of 980s and Titan Xs, and when it happens, it always happens at the same stage, on any device.
I have never had a NaN issue like this in any of the other trainings I've done, so I find it odd that it would come from the hardware.

Jan Trmal
Jan 8, 2016, 11:02:07 AM
to kaldi-help, Rémi Francis, Dan Povey
It could be a hardware issue caused by overheating; in that case I could see why adding the tests for NaN (which in some sense may decrease the computational intensity) could help. But I'm just speculating.
BTW, as for the ECC memory on Teslas: AFAIK, you would have to specifically enable it (and "lose" some portion of the available memory, which is used to store the ECC bits). Most systems do not have this enabled.
y.

Daniel Povey
Jan 8, 2016, 1:51:29 PM
to Jan Trmal, kaldi-help, Rémi Francis
It could be an overheating issue that appears randomly, which would explain why it's not repeatable.
What do you mean by "at the same stage"? Do you mean at the same iteration of training, or at the same line of code?

Dan

Daniel Povey
Jan 8, 2016, 2:10:33 PM
to Jan Trmal, kaldi-help, Rémi Francis
BTW, I recently had a very similar problem on the Mac, on El Capitan, and spent a while debugging it.
In the end it turned out to be a BLAS library bug where NaNs from outside the matrix were propagating into the calculation; apparently it will be fixed in the next point release. I've seen this issue previously in OpenBLAS as well (NaNs propagating in from outside); I made a bug report and they fixed it. However, I have never seen it in CUBLAS. Anyway, the point is that the version of the CUBLAS library may be relevant.

Dan

remi....@gmail.com
Jan 8, 2016, 2:32:59 PM
to kaldi-help, jtr...@gmail.com, remi....@gmail.com, dpo...@gmail.com
When the assert fails, it is always in the same job, at the same minibatch, at the same line.
Currently it crashes in job 790.1 on minibatch 10, on any device it runs on (I've tried a bunch of 980s and Titan Xs; I don't have any Teslas).
I don't know which CUBLAS version is installed; I'll investigate next week to understand what is happening.

Daniel Povey
Jan 8, 2016, 2:39:45 PM
to Rémi Francis, kaldi-help, Jan Trmal
Oh, in that case it is more likely to be standard divergence.
Try running with --verbose=5 and send me the output.

Dan

Daniel Povey
Jan 8, 2016, 3:48:12 PM
to Rémi Francis, kaldi-help, Jan Trmal
Another possibility is that it is a bug in CuVector::Sum(). That code does not look entirely right to me: in _pvec_sum, it calls __syncthreads() even though some threads in the warp have returned, which is not right. According to the CUDA documentation:

"__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects."

So we have to suspect an error in Sum(). [I just wrote test code for it, not committed yet, but couldn't reproduce an error.] I think the reason your extra asserts make a difference is that some of the other functions you called may allocate memory, which may change the memory layout, and I think that could be critical to how this bug operates. It would be better, instead, if you revert your changes, change the assert to an if-statement, and, if the equality test fails, print out a bunch of information, such as the vector in question, the sum of X, and so on.

Dan




Daniel Povey
Jan 8, 2016, 6:30:23 PM
to Rémi Francis, kaldi-help, Jan Trmal
OK, I made some changes to the CUDA code that's called inside CuVector::Sum() and CuVector::AddDiagMatMat(). In both cases it looks to me like __syncthreads() was being called incorrectly, because some threads had been dropped before __syncthreads() was called. I can't find any documentation of exactly what is supposed to happen in this situation, but it's probably hardware-dependent. We never had a problem, but then we have newer GPUs.
I pushed this to the chain branch.  But please try to reproduce the problem before you pull the change, so we can have some clarity on whether it fixed it.
You can run the offending job from the command line, this should make it easier to reproduce.
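The rule in the quoted documentation means a reduction kernel must mask inactive threads with an if rather than returning early, so that every thread in the block still reaches each barrier. A generic sketch of the correct pattern (not Kaldi's actual kernel; assumes blockDim.x is a power of two and n <= blockDim.x):

```cuda
// Tree reduction where every thread reaches each __syncthreads():
// inactive threads are masked out with an if, never returned early.
__global__ void block_sum(const float *in, float *out, int n) {
  extern __shared__ float buf[];
  int tid = threadIdx.x;
  buf[tid] = (tid < n) ? in[tid] : 0.0f;  // inactive threads still participate
  __syncthreads();
  for (int half = blockDim.x / 2; half > 0; half /= 2) {
    if (tid < half)                 // mask, don't return:
      buf[tid] += buf[tid + half];  // all threads reach the barrier below
    __syncthreads();
  }
  if (tid == 0) *out = buf[0];
}
```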
Dan

remi....@gmail.com
Jan 11, 2016, 7:32:56 AM
to kaldi-help, remi....@gmail.com, jtr...@gmail.com, dpo...@gmail.com
Unfortunately that didn't fix the issue.
I've made a tarball with the files necessary to reproduce this: https://drive.google.com/file/d/0B3GgS_Icht4fYmhPNUF0U3JUNTA/view?usp=sharing 
It's about 100M and contains 3 files: exp/chain/tdnn_2o_sp/790.mdl exp/chain/tdnn_2o_sp/den.fst exp/chain/tdnn_2o_sp/egs/cegs.225.ark 
You just have to untar it, source the path, and run the following command:
nnet3-chain-train --verbose=5 --apply-deriv-weights=false --max-param-change=1.0 --print-interval=1 "nnet3-am-copy --raw=true --learning-rate=0.00322530991369705 exp/chain/tdnn_2o_sp/790.mdl -|" exp/chain/tdnn_2o_sp/den.fst "ark:nnet3-chain-copy-egs --truncate-deriv-weights=0 --frame-shift=1 ark:exp/chain/tdnn_2o_sp/egs/cegs.225.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=790 ark:- ark:-| nnet3-chain-merge-egs --minibatch-size=128 ark:- ark:- |" exp/chain/tdnn_2o_sp/791.1.raw

I've also uploaded the log I get with --verbose=5: https://drive.google.com/file/d/0B3GgS_Icht4fYVRfYnp5RUhKOGM/view?usp=sharing

Please tell me whether you can reproduce the issue.

Daniel Povey
Jan 11, 2016, 6:31:07 PM
to Rémi Francis, kaldi-help, Jan Trmal
Thanks.
I just pushed a fix for the issue.  It was an overflow in the chain derivative-computation code.
Dan
