==> I couldn't find anything in the logs that suggests anything went wrong at the earlier stage.
I tried aligning with the LDA+MLLT model instead of the LDA+MLLT+SAT model, and the alignment worked perfectly. Due to limited compute, I haven't yet tried to isolate the problematic utterances by recursively re-aligning the failed jobs with the LDA+MLLT+SAT model.
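For isolating the problematic utterances later, a simple bisection over a failed job's utterance list would avoid re-running alignment on the whole set. A minimal sketch; `align_ok` here is a hypothetical callable that would wrap an actual Kaldi alignment run (e.g. steps/align_fmllr.sh on a subset made with utils/subset_data_dir.sh) and report whether it succeeded:

```python
def find_bad_utterances(utts, align_ok):
    """Recursively bisect a list of utterance IDs to find the ones that
    make alignment fail. `align_ok(subset)` is assumed to run alignment
    on just that subset and return True when it succeeds."""
    if align_ok(utts):
        return []            # this whole subset aligns fine
    if len(utts) == 1:
        return utts          # narrowed down to a single failing utterance
    mid = len(utts) // 2
    return (find_bad_utterances(utts[:mid], align_ok)
            + find_bad_utterances(utts[mid:], align_ok))
```

Finding k bad utterances among n this way costs roughly O(k log n) alignment runs, which may be feasible even with limited compute.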
So, with the successful LDA+MLLT alignment, I went ahead and trained a TDNN chain model on it. I assume the alignment quality is not very sensitive to the WER of the model used for alignment.
During TDNN training I get warnings at many iterations that look like:
[steps/libs/nnet3/train/common.py:134 - get_successful_models - WARNING ] Only 4/5 of the models have been accepted for averaging, based on log files exp/chain/tdnn_sp_v3/log/train.1559.%.log
The corresponding values of objective function are:
exp/chain/tdnn_sp_v3/log/train.1559.1.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:349) Overall average objective function for 'output' is -0.235595 + -0.0127855 = -0.248381 over 497408 frames.
exp/chain/tdnn_sp_v3/log/train.1559.1.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:346) Overall average objective function for 'output-xent' is -1.38552 over 497408 frames.
exp/chain/tdnn_sp_v3/log/train.1559.2.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:349) Overall average objective function for 'output' is -0.106124 + -0.0127876 = -0.118911 over 497408 frames.
exp/chain/tdnn_sp_v3/log/train.1559.2.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:346) Overall average objective function for 'output-xent' is -1.39787 over 497408 frames.
exp/chain/tdnn_sp_v3/log/train.1559.3.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:349) Overall average objective function for 'output' is -0.106789 + -0.0127651 = -0.119554 over 497408 frames.
exp/chain/tdnn_sp_v3/log/train.1559.3.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:346) Overall average objective function for 'output-xent' is -1.39634 over 497408 frames.
exp/chain/tdnn_sp_v3/log/train.1559.4.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:349) Overall average objective function for 'output' is -3.48244 + -0.0124167 = -3.49486 over 491008 frames.
exp/chain/tdnn_sp_v3/log/train.1559.4.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:346) Overall average objective function for 'output-xent' is -0.920236 over 491008 frames.
exp/chain/tdnn_sp_v3/log/train.1559.5.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:349) Overall average objective function for 'output' is -0.106752 + -0.0127227 = -0.119475 over 497408 frames.
exp/chain/tdnn_sp_v3/log/train.1559.5.log:LOG (nnet3-chain-train[5.4.54~1-22fb]:PrintTotalStats():nnet-training.cc:346) Overall average objective function for 'output-xent' is -1.40838 over 497408 frames.
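Job 4 is clearly the outlier here (-3.49 vs. roughly -0.12 for the others), which is presumably why only 4/5 models were accepted for averaging. As an illustration (not the actual logic in steps/libs/nnet3/train/common.py, whose threshold differs), one can parse the 'output' objective per job from these log lines and flag jobs far below the best:

```python
import re

def flag_outlier_jobs(log_lines, margin=1.0):
    """Parse the 'output' objective per job from nnet3-chain-train log
    lines and return the job numbers whose objective is more than
    `margin` below the best job's objective. Illustrative only; the
    real acceptance test in Kaldi's train/common.py is different."""
    objs = {}
    pat = re.compile(r"train\.\d+\.(\d+)\.log.*'output' is .* = (-?[\d.]+) over")
    for line in log_lines:
        m = pat.search(line)
        if m:
            objs[int(m.group(1))] = float(m.group(2))
    best = max(objs.values())
    return sorted(j for j, o in objs.items() if best - o > margin)
```

On the five lines above this would flag job 4 only.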
It seems the model is diverging: the objective on job 4 (-3.49) is far worse than on the other jobs (around -0.12).
--------------------------------------------------
In the training logs of many iterations, I see warnings like:
WARNING (nnet3-chain-train[5.4.54~1-22fb]:BetaGeneralFrameDebug():chain-denominator.cc:412) On time 0, alpha-beta product nan != 128 alpha-dash-sum = 140.8, beta-dash-sum = nan
WARNING (nnet3-chain-train[5.4.54~1-22fb]:BetaGeneralFrameDebug():chain-denominator.cc:425) On time 0, log-prob-deriv sum 124.994 != 128
WARNING (nnet3-chain-train[5.4.54~1-22fb]:BetaGeneralFrameDebug():chain-denominator.cc:428) Excessive error detected, will abandon this minibatch
WARNING (nnet3-chain-train[5.4.54~1-22fb]:ComputeChainObjfAndDeriv():chain-training.cc:214) Objective function is nan and denominator computation (if done) returned false, setting objective function to -10 per frame.
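To see whether these failures are occasional or growing over iterations, one could count the abandoned minibatches per log file (the warning text below is taken verbatim from the log above; the glob in the usage note is just my experiment directory):

```python
def count_abandoned(log_text):
    """Count minibatches abandoned due to excessive error in a
    chain-training log, using the warning string that
    chain-denominator.cc emits."""
    return log_text.count("Excessive error detected, will abandon this minibatch")
```

Running this over each file matching exp/chain/tdnn_sp_v3/log/train.*.log would show whether the NaNs start at a particular iteration or are spread throughout.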
Could you please guide me in understanding why my model is diverging? Is it due to some issue in the alignment?