TDNN Training

Ansar K.A.

May 15, 2018, 4:03:55 AM
to kaldi-help
I am practicing building a TDNN model with around 4 hours of data, using the mini_librispeech recipe. After starting run_tdnn.sh, I got the following error:

steps/nnet3/chain/build_tree.sh --frame-subsampling-factor 3 --context-opts --context-width=2 --central-position=1 --cmd run.pl --mem 2G 3500 data/train data/lang_chain exp/tri3b_ali exp/chain/tree_sp
steps/nnet3/chain/build_tree.sh: feature type is lda
steps/nnet3/chain/build_tree.sh: Using transforms from exp/tri3b_ali
steps/nnet3/chain/build_tree.sh: Initializing monophone model (for alignment conversion, in case topology changed)
steps/nnet3/chain/build_tree.sh: Accumulating tree stats
steps/nnet3/chain/build_tree.sh: Wrong #tree-accs

I don't understand the reason for this error. Any clues?

Ansar K.A.

May 15, 2018, 5:59:53 AM
to kaldi-help
I have solved that issue; I had missed some calls in between.

But now I have run into a different error:

steps/nnet3/chain/build_tree.sh: Accumulating tree stats
run.pl: 19 / 20 failed, log is in exp/chain/tree_sp/log/acc_tree.*.log

Log:

LOG (transform-feats[5.4.100~1-1331a]:main():transform-feats.cc:158) Overall average [pseudo-]logdet is -91.4242 over 29129 frames.
LOG (transform-feats[5.4.100~1-1331a]:main():transform-feats.cc:161) Applied transform to 57 utterances; 0 had errors.
LOG (transform-feats[5.4.100~1-1331a]:main():transform-feats.cc:161) Applied transform to 0 utterances; 57 had errors.
LOG (subsample-feats[5.4.100~1-1331a]:main():subsample-feats.cc:115) Processed 0 feature matrices; 0 with errors.
LOG (subsample-feats[5.4.100~1-1331a]:main():subsample-feats.cc:117) Processed 0 input frames and 0 output frames.
LOG (acc-tree-stats[5.4.100~1-1331a]:main():acc-tree-stats.cc:118) Accumulated stats for 0 files, 0 failed due to no alignment, 0 failed for other reasons.
LOG (acc-tree-stats[5.4.100~1-1331a]:main():acc-tree-stats.cc:121) Number of separate stats (context-dependent states) is 0
WARNING (acc-tree-stats[5.4.100~1-1331a]:Close():kaldi-io.cc:515) Pipe apply-cmvn  --utt2spk=ark:data/train/split20/20/utt2spk scp:data/train/split20/20/cmvn.scp scp:data/train/split20/20/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/tri3b_ali/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split20/20/utt2spk ark,s,cs:exp/tri3b_ali/trans.20 ark:- ark:- | subsample-feats --n=3 ark:- ark:- | had nonzero return status 256
ERROR (acc-tree-stats[5.4.100~1-1331a]:~SequentialTableReaderArchiveImpl():util/kaldi-table-inl.h:678) TableReader: error detected closing archive 'apply-cmvn  --utt2spk=ark:data/train/split20/20/utt2spk scp:data/train/split20/20/cmvn.scp scp:data/train/split20/20/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/tri3b_ali/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split20/20/utt2spk ark,s,cs:exp/tri3b_ali/trans.20 ark:- ark:- | subsample-feats --n=3 ark:- ark:- |'

[ Stack-Trace: ]
acc-tree-stats() [0x8c11a8]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::~SequentialTableReaderArchiveImpl()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::~SequentialTableReaderArchiveImpl()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::~SequentialTableReader()
main
__libc_start_main
_start

terminate called after throwing an instance of 'std::runtime_error'
  what():  
bash: line 1: 17690 Broken pipe             convert-ali --frame-subsampling-factor=3 exp/tri3b_ali/final.mdl exp/chain/tree_sp/mono.mdl exp/chain/tree_sp/mono.tree "ark:gunzip -c exp/tri3b_ali/ali.20.gz|" ark:-
     17691 Aborted                 (core dumped) | acc-tree-stats --context-width=2 --central-position=1 --ci-phones=1:2:3:4:5:6:7:8:9:10 exp/chain/tree_sp/mono.mdl "ark,s,cs:apply-cmvn  --utt2spk=ark:data/train/split20/20/utt2spk scp:data/train/split20/20/cmvn.scp scp:data/train/split20/20/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/tri3b_ali/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split20/20/utt2spk ark,s,cs:exp/tri3b_ali/trans.20 ark:- ark:- | subsample-feats --n=3 ark:- ark:- |" ark:- exp/chain/tree_sp/20.treeacc
# Accounting: time=0 threads=1
# Ended (code 134) at Tue May 15 09:40:51 UTC 2018, elapsed time 0 seconds

Daniel Povey

May 15, 2018, 12:56:14 PM
to kaldi-help
Most likely you are using alignments that were made from different
data than you are supplying to the script. E.g. maybe you changed
your data (data/train/?) at some point, perhaps while resolving the
first issue.

Dan
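One way to check for this kind of mismatch is to compare the utterance IDs in the data directory against the ones the alignments were built from. The sketch below runs on toy key lists; with real data the lists would come from data/train/utt2spk and from the alignment archives (e.g. extracted with Kaldi's copy-int-vector), so treat the paths and the extraction step as assumptions, not the recipe's own diagnostics.

```shell
# Hedged sketch: flag utterance IDs that exist on one side but not the other.
# Toy files stand in for the real key lists.
tmp=$(mktemp -d)
printf 'spk1-utt1\nspk1-utt2\nspk2-utt1\n' > "$tmp/data_utts"  # keys from data/train/utt2spk
printf 'spk1-utt1\nspk2-utt1\nspk2-utt2\n' > "$tmp/ali_utts"   # keys from the alignments
# comm needs sorted input; lines unique to either file indicate a
# mismatch between the data and the alignments.
comm -3 "$tmp/data_utts" "$tmp/ali_utts"
```

If this prints anything, regenerate the alignments from the current data (or restore the data the alignments were made from).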
> --
> Go to http://kaldi-asr.org/forums.html find out how to join
> ---
> You received this message because you are subscribed to the Google Groups
> "kaldi-help" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kaldi-help+...@googlegroups.com.
> To post to this group, send email to kaldi...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/kaldi-help/CAB%2BepYJ4ZUzc4MCUiT7VK%2BTnEfPpKYTQ%2BGrFOEJyD7KBJOP2HQ%40mail.gmail.com.
>
> For more options, visit https://groups.google.com/d/optout.

Ansar K.A.

May 18, 2018, 7:56:07 AM
to kaldi-help
Hi,

I have restarted the training. The previous error is gone now.

But while running the steps/nnet3/chain/train.py script, it throws an exception. Out of 25 iterations, the failure happened on the 22nd.

2018-05-18 11:32:53,667 [steps/nnet3/chain/train.py:493 - train - INFO ] Iter: 22/24    Epoch: 11.83/15.0 (78.9% complete)    lr: 0.000813
    
run.pl: job failed, log is in exp/chain/tdnn1g_sp/log/train.22.5.log
2018-05-18 11:33:21,117 [steps/libs/common.py:231 - background_command_waiter - ERROR ] Command exited with status 1: run.pl --mem 4G --gpu 1 exp/chain/tdnn1g_sp/log/train.22.5.log                     nnet3-chain-train                       --apply-deriv-weights=False                     --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1                     --read-cache=exp/chain/tdnn1g_sp/cache.22  --xent-regularize=0.1                                          --print-interval=10 --momentum=0.0                     --max-param-change=2.0                     --backstitch-training-scale=0.0                     --backstitch-training-interval=1                     --l2-regularize-factor=0.2                     --srand=22                     "nnet3-am-copy --raw=true --learning-rate=0.000812982346941 --scale=1.0 exp/chain/tdnn1g_sp/22.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.126666666667' - - |" exp/chain/tdnn1g_sp/den.fst                     "ark,bg:nnet3-chain-copy-egs                         --frame-shift=0                         ark:exp/chain/tdnn1g_sp/egs/cegs.2.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=22 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=256,128,64 ark:- ark:- |"                     exp/chain/tdnn1g_sp/23.5.raw

Log:

# nnet3-chain-train --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn1g_sp/cache.22 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.2 --srand=22 "nnet3-am-copy --raw=true --learning-rate=0.000812982346941 --scale=1.0 exp/chain/tdnn1g_sp/22.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.126666666667' - - |" exp/chain/tdnn1g_sp/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=0 ark:exp/chain/tdnn1g_sp/egs/cegs.2.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=22 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=256,128,64 ark:- ark:- |" exp/chain/tdnn1g_sp/23.5.raw
# Started at Fri May 18 11:32:53 UTC 2018
#
nnet3-chain-train --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn1g_sp/cache.22 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.2 --srand=22 "nnet3-am-copy --raw=true --learning-rate=0.000812982346941 --scale=1.0 exp/chain/tdnn1g_sp/22.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.126666666667' - - |" exp/chain/tdnn1g_sp/den.fst 'ark,bg:nnet3-chain-copy-egs --frame-shift=0 ark:exp/chain/tdnn1g_sp/egs/cegs.2.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=22 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=256,128,64 ark:- ark:- |' exp/chain/tdnn1g_sp/23.5.raw
WARNING (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuId():cu-device.cc:196) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuIdAuto():cu-device.cc:315) Selecting from 1 GPUs
LOG (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuIdAuto():cu-device.cc:330) cudaSetDevice(0): Tesla K80 free:11185M, used:254M, total:11439M, free/total:0.977797
LOG (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuIdAuto():cu-device.cc:379) Trying to select device: 0 (automatically), mem_ratio: 0.977797
LOG (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuIdAuto():cu-device.cc:398) Success selecting device 0 free mem ratio: 0.977797
LOG (nnet3-chain-train[5.4.100~1-1331a]:FinalizeActiveGpu():cu-device.cc:247) The active GPU is [0]: Tesla K80  free:10890M, used:549M, total:11439M, free/total:0.951988 version 3.7
nnet3-copy '--edits=set-dropout-proportion name=* proportion=0.126666666667' - - 
nnet3-am-copy --raw=true --learning-rate=0.000812982346941 --scale=1.0 exp/chain/tdnn1g_sp/22.mdl - 
LOG (nnet3-am-copy[5.4.100~1-1331a]:main():nnet3-am-copy.cc:151) Copied neural net from exp/chain/tdnn1g_sp/22.mdl to raw format as -
LOG (nnet3-copy[5.4.100~1-1331a]:ReadEditConfig():nnet-utils.cc:1247) Set dropout proportions for 8 components.
LOG (nnet3-copy[5.4.100~1-1331a]:main():nnet3-copy.cc:114) Copied raw neural net from - to -
LOG (nnet3-chain-train[5.4.100~1-1331a]:NnetChainTrainer():nnet-chain-training.cc:53) Read computation cache from exp/chain/tdnn1g_sp/cache.22
nnet3-chain-merge-egs --minibatch-size=256,128,64 ark:- ark:- 
nnet3-chain-copy-egs --frame-shift=0 ark:exp/chain/tdnn1g_sp/egs/cegs.2.ark ark:- 
nnet3-chain-shuffle-egs --buffer-size=5000 --srand=22 ark:- ark:- 
ERROR (nnet3-chain-train[5.4.100~1-1331a]:RandUniform():cu-rand.cc:72) curandStatus_t 102 : "CURAND_STATUS_ALLOCATION_FAILED" returned from 'curandGenerateUniformWrap(gen_, tmp.Data(), s)'

[ Stack-Trace: ]
nnet3-chain-train() [0x124bf7a]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::CuRand<float>::RandUniform(kaldi::CuMatrixBase<float>*)
kaldi::nnet3::DropoutComponent::Propagate(kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float>*) const
kaldi::nnet3::NnetComputer::ExecuteCommand()
kaldi::nnet3::NnetComputer::Run()
kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)
kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)
main
__libc_start_main
_start

WARNING (nnet3-chain-train[5.4.100~1-1331a]:ExecuteCommand():nnet-compute.cc:435) Printing some background info since error was detected

I am using a p2.xlarge (AWS) instance for training.

Daniel Povey

May 18, 2018, 4:22:59 PM
to kaldi-help
You shouldn't really be running multiple jobs on a single GPU.
If you want to run that script on a machine that has just one GPU, one
way to do it is to set exclusive mode via
sudo nvidia-smi -c 3

and in the train.py script, change the option "--use-gpu=yes" to
"--use-gpu=wait", which will cause the GPU jobs to run sequentially,
as each waits until it can get exclusive use of the GPU.

Dan
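Concretely, the two changes can be applied as below. The sed edit runs on a toy copy of the relevant line, since the exact wording of the train.py invocation inside your run_tdnn.sh may differ; the nvidia-smi call needs root and is shown commented out.

```shell
# sudo nvidia-smi -c 3   # put the GPU in exclusive-process mode (needs root)

# Toy stand-in for the train.py invocation inside run_tdnn.sh:
script=$(mktemp)
echo 'steps/nnet3/chain/train.py --use-gpu=yes ...' > "$script"
# Switch the option so each GPU job waits for exclusive use of the device:
sed -i 's/--use-gpu=yes/--use-gpu=wait/' "$script"
cat "$script"   # now reads: steps/nnet3/chain/train.py --use-gpu=wait ...
```

With exclusive mode set and --use-gpu=wait, parallel training jobs queue up on the single GPU instead of competing for its memory, which avoids the CURAND_STATUS_ALLOCATION_FAILED error above.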

Ansar K.A.

May 19, 2018, 1:51:22 AM
to kaldi-help
Thanks. It solved the issue.



Daniel Povey

Feb 18, 2022, 8:58:16 AM
to kaldi-help
The number of files with the suffix .treeacc is not the number expected: either too many, because you re-ran the script with fewer jobs (in which case delete the old ones), or too few, which could be caused by various problems; you'd have to check the logs that produced those files.
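For context, the check behind the "Wrong #tree-accs" message is essentially a file count: build_tree.sh expects one <job>.treeacc per parallel job. A toy illustration, where the temp directory and nj stand in for exp/chain/tree_sp and the script's actual job count:

```shell
dir=$(mktemp -d)   # stand-in for exp/chain/tree_sp
nj=4               # stand-in for the number of parallel jobs
# Simulate one failed job: only 3 of the 4 expected accumulator files exist.
for i in 1 2 3; do touch "$dir/$i.treeacc"; done
n=$(ls "$dir"/*.treeacc 2>/dev/null | wc -l)
[ "$n" -eq "$nj" ] || echo "Wrong #tree-accs: found $n, expected $nj"
```

With real output, too few files means some acc_tree jobs failed (check exp/chain/tree_sp/log/acc_tree.*.log); too many usually means stale files left over from an earlier run with more jobs.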


On Fri, Feb 18, 2022 at 9:42 PM Wu Jason <jasons...@gmail.com> wrote:
Hi, how did you solve this problem?
 steps/nnet3/chain/build_tree.sh: Wrong #tree-accs
