TDNN Training

Ansar K.A.

May 15, 2018, 4:03:55 AM
to kaldi-help
I am practicing building a TDNN model with around 4 hours of data, using the mini_librispeech recipe. After starting run_tdnn.sh, I got the following error:

steps/nnet3/chain/build_tree.sh --frame-subsampling-factor 3 --context-opts --context-width=2 --central-position=1 --cmd run.pl --mem 2G 3500 data/train data/lang_chain exp/tri3b_ali exp/chain/tree_sp
steps/nnet3/chain/build_tree.sh: feature type is lda
steps/nnet3/chain/build_tree.sh: Using transforms from exp/tri3b_ali
steps/nnet3/chain/build_tree.sh: Initializing monophone model (for alignment conversion, in case topology changed)
steps/nnet3/chain/build_tree.sh: Accumulating tree stats
steps/nnet3/chain/build_tree.sh: Wrong #tree-accs

I don't understand the reason for this error. Any clues?

Ansar K.A.

May 15, 2018, 5:59:53 AM
to kaldi-help
I have solved that issue; I had missed some calls in between.

But now I have run into a different error:

steps/nnet3/chain/build_tree.sh: Accumulating tree stats
run.pl: 19 / 20 failed, log is in exp/chain/tree_sp/log/acc_tree.*.log

Log:

LOG (transform-feats[5.4.100~1-1331a]:main():transform-feats.cc:158) Overall average [pseudo-]logdet is -91.4242 over 29129 frames.
LOG (transform-feats[5.4.100~1-1331a]:main():transform-feats.cc:161) Applied transform to 57 utterances; 0 had errors.
LOG (transform-feats[5.4.100~1-1331a]:main():transform-feats.cc:161) Applied transform to 0 utterances; 57 had errors.
LOG (subsample-feats[5.4.100~1-1331a]:main():subsample-feats.cc:115) Processed 0 feature matrices; 0 with errors.
LOG (subsample-feats[5.4.100~1-1331a]:main():subsample-feats.cc:117) Processed 0 input frames and 0 output frames.
LOG (acc-tree-stats[5.4.100~1-1331a]:main():acc-tree-stats.cc:118) Accumulated stats for 0 files, 0 failed due to no alignment, 0 failed for other reasons.
LOG (acc-tree-stats[5.4.100~1-1331a]:main():acc-tree-stats.cc:121) Number of separate stats (context-dependent states) is 0
WARNING (acc-tree-stats[5.4.100~1-1331a]:Close():kaldi-io.cc:515) Pipe apply-cmvn  --utt2spk=ark:data/train/split20/20/utt2spk scp:data/train/split20/20/cmvn.scp scp:data/train/split20/20/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/tri3b_ali/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split20/20/utt2spk ark,s,cs:exp/tri3b_ali/trans.20 ark:- ark:- | subsample-feats --n=3 ark:- ark:- | had nonzero return status 256
ERROR (acc-tree-stats[5.4.100~1-1331a]:~SequentialTableReaderArchiveImpl():util/kaldi-table-inl.h:678) TableReader: error detected closing archive 'apply-cmvn  --utt2spk=ark:data/train/split20/20/utt2spk scp:data/train/split20/20/cmvn.scp scp:data/train/split20/20/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/tri3b_ali/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split20/20/utt2spk ark,s,cs:exp/tri3b_ali/trans.20 ark:- ark:- | subsample-feats --n=3 ark:- ark:- |'

[ Stack-Trace: ]
acc-tree-stats() [0x8c11a8]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::~SequentialTableReaderArchiveImpl()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::~SequentialTableReaderArchiveImpl()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::~SequentialTableReader()
main
__libc_start_main
_start

terminate called after throwing an instance of 'std::runtime_error'
  what():  
bash: line 1: 17690 Broken pipe             convert-ali --frame-subsampling-factor=3 exp/tri3b_ali/final.mdl exp/chain/tree_sp/mono.mdl exp/chain/tree_sp/mono.tree "ark:gunzip -c exp/tri3b_ali/ali.20.gz|" ark:-
     17691 Aborted                 (core dumped) | acc-tree-stats --context-width=2 --central-position=1 --ci-phones=1:2:3:4:5:6:7:8:9:10 exp/chain/tree_sp/mono.mdl "ark,s,cs:apply-cmvn  --utt2spk=ark:data/train/split20/20/utt2spk scp:data/train/split20/20/cmvn.scp scp:data/train/split20/20/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/tri3b_ali/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split20/20/utt2spk ark,s,cs:exp/tri3b_ali/trans.20 ark:- ark:- | subsample-feats --n=3 ark:- ark:- |" ark:- exp/chain/tree_sp/20.treeacc
# Accounting: time=0 threads=1
# Ended (code 134) at Tue May 15 09:40:51 UTC 2018, elapsed time 0 seconds

Daniel Povey

May 15, 2018, 12:56:14 PM
to kaldi-help
Most likely you are using alignments that were made from different
data than you are supplying to the script. E.g. maybe you changed
your data (data/train/?) at some point, perhaps while resolving the
first issue.

Dan
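One way to check for this kind of mismatch is to compare the utterance IDs in the data directory against the ones the alignments were built from. The sketch below runs on toy key lists; with real data the lists would come from data/train/utt2spk and from the alignment archives (e.g. extracted with Kaldi's copy-int-vector), so treat the paths and the extraction step as assumptions, not the recipe's own diagnostics.

```shell
# Hedged sketch: flag utterance IDs that exist on one side but not the other.
# Toy files stand in for the real key lists.
tmp=$(mktemp -d)
printf 'spk1-utt1\nspk1-utt2\nspk2-utt1\n' > "$tmp/data_utts"  # keys from data/train/utt2spk
printf 'spk1-utt1\nspk2-utt1\nspk2-utt2\n' > "$tmp/ali_utts"   # keys from the alignments
# comm needs sorted input; lines unique to either file indicate a
# mismatch between the data and the alignments.
comm -3 "$tmp/data_utts" "$tmp/ali_utts"
```

If this prints anything, regenerate the alignments from the current data (or restore the data the alignments were made from).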
> --
> Go to http://kaldi-asr.org/forums.html find out how to join
> ---
> You received this message because you are subscribed to the Google Groups
> "kaldi-help" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kaldi-help+...@googlegroups.com.
> To post to this group, send email to kaldi...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/kaldi-help/CAB%2BepYJ4ZUzc4MCUiT7VK%2BTnEfPpKYTQ%2BGrFOEJyD7KBJOP2HQ%40mail.gmail.com.
>
> For more options, visit https://groups.google.com/d/optout.

Ansar K.A.

May 18, 2018, 7:56:07 AM
to kaldi-help
Hi,

I have restarted the training. The previous error is gone now.

But while running the steps/nnet3/chain/train.py script, it throws an exception. Out of 25 iterations, the failure happened on the 22nd.

2018-05-18 11:32:53,667 [steps/nnet3/chain/train.py:493 - train - INFO ] Iter: 22/24    Epoch: 11.83/15.0 (78.9% complete)    lr: 0.000813
    
run.pl: job failed, log is in exp/chain/tdnn1g_sp/log/train.22.5.log
2018-05-18 11:33:21,117 [steps/libs/common.py:231 - background_command_waiter - ERROR ] Command exited with status 1: run.pl --mem 4G --gpu 1 exp/chain/tdnn1g_sp/log/train.22.5.log                     nnet3-chain-train                       --apply-deriv-weights=False                     --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1                     --read-cache=exp/chain/tdnn1g_sp/cache.22  --xent-regularize=0.1                                          --print-interval=10 --momentum=0.0                     --max-param-change=2.0                     --backstitch-training-scale=0.0                     --backstitch-training-interval=1                     --l2-regularize-factor=0.2                     --srand=22                     "nnet3-am-copy --raw=true --learning-rate=0.000812982346941 --scale=1.0 exp/chain/tdnn1g_sp/22.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.126666666667' - - |" exp/chain/tdnn1g_sp/den.fst                     "ark,bg:nnet3-chain-copy-egs                         --frame-shift=0                         ark:exp/chain/tdnn1g_sp/egs/cegs.2.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=22 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=256,128,64 ark:- ark:- |"                     exp/chain/tdnn1g_sp/23.5.raw

Log:

# nnet3-chain-train --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn1g_sp/cache.22 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.2 --srand=22 "nnet3-am-copy --raw=true --learning-rate=0.000812982346941 --scale=1.0 exp/chain/tdnn1g_sp/22.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.126666666667' - - |" exp/chain/tdnn1g_sp/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=0 ark:exp/chain/tdnn1g_sp/egs/cegs.2.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=22 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=256,128,64 ark:- ark:- |" exp/chain/tdnn1g_sp/23.5.raw
# Started at Fri May 18 11:32:53 UTC 2018
#
nnet3-chain-train --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn1g_sp/cache.22 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.2 --srand=22 "nnet3-am-copy --raw=true --learning-rate=0.000812982346941 --scale=1.0 exp/chain/tdnn1g_sp/22.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.126666666667' - - |" exp/chain/tdnn1g_sp/den.fst 'ark,bg:nnet3-chain-copy-egs --frame-shift=0 ark:exp/chain/tdnn1g_sp/egs/cegs.2.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=22 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=256,128,64 ark:- ark:- |' exp/chain/tdnn1g_sp/23.5.raw
WARNING (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuId():cu-device.cc:196) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuIdAuto():cu-device.cc:315) Selecting from 1 GPUs
LOG (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuIdAuto():cu-device.cc:330) cudaSetDevice(0): Tesla K80 free:11185M, used:254M, total:11439M, free/total:0.977797
LOG (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuIdAuto():cu-device.cc:379) Trying to select device: 0 (automatically), mem_ratio: 0.977797
LOG (nnet3-chain-train[5.4.100~1-1331a]:SelectGpuIdAuto():cu-device.cc:398) Success selecting device 0 free mem ratio: 0.977797
LOG (nnet3-chain-train[5.4.100~1-1331a]:FinalizeActiveGpu():cu-device.cc:247) The active GPU is [0]: Tesla K80  free:10890M, used:549M, total:11439M, free/total:0.951988 version 3.7
nnet3-copy '--edits=set-dropout-proportion name=* proportion=0.126666666667' - - 
nnet3-am-copy --raw=true --learning-rate=0.000812982346941 --scale=1.0 exp/chain/tdnn1g_sp/22.mdl - 
LOG (nnet3-am-copy[5.4.100~1-1331a]:main():nnet3-am-copy.cc:151) Copied neural net from exp/chain/tdnn1g_sp/22.mdl to raw format as -
LOG (nnet3-copy[5.4.100~1-1331a]:ReadEditConfig():nnet-utils.cc:1247) Set dropout proportions for 8 components.
LOG (nnet3-copy[5.4.100~1-1331a]:main():nnet3-copy.cc:114) Copied raw neural net from - to -
LOG (nnet3-chain-train[5.4.100~1-1331a]:NnetChainTrainer():nnet-chain-training.cc:53) Read computation cache from exp/chain/tdnn1g_sp/cache.22
nnet3-chain-merge-egs --minibatch-size=256,128,64 ark:- ark:- 
nnet3-chain-copy-egs --frame-shift=0 ark:exp/chain/tdnn1g_sp/egs/cegs.2.ark ark:- 
nnet3-chain-shuffle-egs --buffer-size=5000 --srand=22 ark:- ark:- 
ERROR (nnet3-chain-train[5.4.100~1-1331a]:RandUniform():cu-rand.cc:72) curandStatus_t 102 : "CURAND_STATUS_ALLOCATION_FAILED" returned from 'curandGenerateUniformWrap(gen_, tmp.Data(), s)'

[ Stack-Trace: ]
nnet3-chain-train() [0x124bf7a]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::CuRand<float>::RandUniform(kaldi::CuMatrixBase<float>*)
kaldi::nnet3::DropoutComponent::Propagate(kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float>*) const
kaldi::nnet3::NnetComputer::ExecuteCommand()
kaldi::nnet3::NnetComputer::Run()
kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)
kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)
main
__libc_start_main
_start

WARNING (nnet3-chain-train[5.4.100~1-1331a]:ExecuteCommand():nnet-compute.cc:435) Printing some background info since error was detected

I am using a p2.xlarge (AWS) instance for training.

Daniel Povey

May 18, 2018, 4:22:59 PM
to kaldi-help
You shouldn't really be running multiple jobs on a single GPU.
If you want to run that script on a machine that has just one GPU, one
way to do it is to set exclusive mode via
sudo nvidia-smi -c 3

and in the train.py script, change the option "--use-gpu=yes" to
"--use-gpu=wait", which will cause the GPU jobs to run sequentially,
as each waits until it can get exclusive use of the GPU.

Dan
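Concretely, the two changes can be applied as below. The sed edit runs on a toy copy of the relevant line, since the exact wording of the train.py invocation inside your run_tdnn.sh may differ; the nvidia-smi call needs root and is shown commented out.

```shell
# sudo nvidia-smi -c 3   # put the GPU in exclusive-process mode (needs root)

# Toy stand-in for the train.py invocation inside run_tdnn.sh:
script=$(mktemp)
echo 'steps/nnet3/chain/train.py --use-gpu=yes ...' > "$script"
# Switch the option so each GPU job waits for exclusive use of the device:
sed -i 's/--use-gpu=yes/--use-gpu=wait/' "$script"
cat "$script"   # now reads: steps/nnet3/chain/train.py --use-gpu=wait ...
```

With exclusive mode set and --use-gpu=wait, parallel training jobs queue up on the single GPU instead of competing for its memory, which avoids the CURAND_STATUS_ALLOCATION_FAILED error above.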

Ansar K.A.

May 19, 2018, 1:51:22 AM
to kaldi-help
Thanks. It solved the issue.



Daniel Povey

Feb 18, 2022, 8:58:16 AM
to kaldi-help
The number of files with the suffix .treeacc is not the number expected: either too many, because you re-ran the script with fewer jobs (in which case delete the old ones), or too few, which could be caused by various problems; you'd have to check the logs that produced those files.
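For context, the check behind the "Wrong #tree-accs" message is essentially a file count: build_tree.sh expects one <job>.treeacc per parallel job. A toy illustration, where the temp directory and nj stand in for exp/chain/tree_sp and the script's actual job count:

```shell
dir=$(mktemp -d)   # stand-in for exp/chain/tree_sp
nj=4               # stand-in for the number of parallel jobs
# Simulate one failed job: only 3 of the 4 expected accumulator files exist.
for i in 1 2 3; do touch "$dir/$i.treeacc"; done
n=$(ls "$dir"/*.treeacc 2>/dev/null | wc -l)
[ "$n" -eq "$nj" ] || echo "Wrong #tree-accs: found $n, expected $nj"
```

With real output, too few files means some acc_tree jobs failed (check exp/chain/tree_sp/log/acc_tree.*.log); too many usually means stale files left over from an earlier run with more jobs.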


On Fri, Feb 18, 2022 at 9:42 PM Wu Jason <jasons...@gmail.com> wrote:
Hi, how did you solve this problem?
 steps/nnet3/chain/build_tree.sh: Wrong #tree-accs
