Error while training / combining the model

Foren Zas

Feb 26, 2021, 5:45:09 AM

to kaldi...@googlegroups.com

Hi All,

I am stuck training the TDNN: the final model combination fails.

The error is in exp/chain/e2e_tdnn_1a/log/combine.log

2021-02-26 06:28:45,178 [steps/nnet3/chain/e2e/train_e2e.py:462 - train - INFO ] Iter: 80/80 Jobs: 1 Epoch: 2.96/3.0 (98.8% complete) lr: 0.000003
2021-02-26 06:29:10,468 [steps/nnet3/chain/e2e/train_e2e.py:515 - train - INFO ] Doing final combination to produce final.mdl
2021-02-26 06:29:10,885 [steps/libs/nnet3/train/chain_objf/acoustic_model.py:571 - combine_models - INFO ] Combining set([64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 62, 63]) models.
run.pl: job failed, log is in exp/chain/e2e_tdnn_1a/log/combine.log
Traceback (most recent call last):
  File "steps/nnet3/chain/e2e/train_e2e.py", line 558, in main
    train(args, run_opts)
  File "steps/nnet3/chain/e2e/train_e2e.py", line 524, in train
    run_opts=run_opts)
  File "steps/libs/nnet3/train/chain_objf/acoustic_model.py", line 622, in combine_models
    scp_or_ark=scp_or_ark, egs_suffix=egs_suffix))
  File "steps/libs/common.py", line 158, in execute_command
    p.returncode, command))
Exception: Command exited with status 1: run.pl --mem 4G --gpu 1 exp/chain/e2e_tdnn_1a/log/combine.log nnet3-chain-combine --max-objective-evaluations=30 --l2-regularize=0.0 --leaky-hmm-coefficient=0.1 --verbose=3 --use-gpu=wait exp/chain/e2e_tdnn_1a/den.fst exp/chain/e2e_tdnn_1a/81.mdl exp/chain/e2e_tdnn_1a/80.mdl exp/chain/e2e_tdnn_1a/79.mdl exp/chain/e2e_tdnn_1a/78.mdl exp/chain/e2e_tdnn_1a/77.mdl exp/chain/e2e_tdnn_1a/76.mdl exp/chain/e2e_tdnn_1a/75.mdl exp/chain/e2e_tdnn_1a/74.mdl exp/chain/e2e_tdnn_1a/73.mdl exp/chain/e2e_tdnn_1a/72.mdl exp/chain/e2e_tdnn_1a/71.mdl exp/chain/e2e_tdnn_1a/70.mdl

I have also attached a screenshot of the error in exp/chain/e2e_tdnn_1a/log/combine.log.

Thanks

F. Zas

Daniel Povey

Feb 26, 2021, 6:58:44 AM

to kaldi-help

The error is not shown there.

Please figure out how to paste it as text!

--
Go to http://kaldi-asr.org/forums.html to find out how to join the kaldi-help group
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/CAH_8jhWpgBaRWKeHW67x2R3eopCHj2iKnQH9LDLp-XUENhNRxA%40mail.gmail.com.

kamar majhi

Feb 26, 2021, 8:42:49 AM

to kaldi...@googlegroups.com

Sorry Dan,

2021-02-26 06:29:10,885 [steps/libs/nnet3/train/chain_objf/acoustic_model.py:571 - combine_models - INFO ] Combining set([64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 62, 63]) models.

run.pl: job failed, log is in exp/chain/e2e_tdnn_1a/log/combine.log
Traceback (most recent call last):
  File "steps/nnet3/chain/e2e/train_e2e.py", line 558, in main
    train(args, run_opts)
  File "steps/nnet3/chain/e2e/train_e2e.py", line 524, in train
    run_opts=run_opts)
  File "steps/libs/nnet3/train/chain_objf/acoustic_model.py", line 622, in combine_models
    scp_or_ark=scp_or_ark, egs_suffix=egs_suffix))
  File "steps/libs/common.py", line 158, in execute_command
    p.returncode, command))
Exception: Command exited with status 1: run.pl --mem 4G --gpu 1 exp/chain/e2e_tdnn_1a/log/combine.log nnet3-chain-combine --max-objective-evaluations=30 --l2-regularize=0.0 --leaky-hmm-coefficient=0.1 --verbose=3 --use-gpu=wait exp/chain/e2e_tdnn_1a/den.fst exp/chain/e2e_tdnn_1a/81.mdl exp/chain/e2e_tdnn_1a/80.mdl exp/chain/e2e_tdnn_1a/79.mdl exp/chain/e2e_tdnn_1a/78.mdl exp/chain/e2e_tdnn_1a/77.mdl exp/chain/e2e_tdnn_1a/76.mdl exp/chain/e2e_tdnn_1a/75.mdl exp/chain/e2e_tdnn_1a/74.mdl exp/chain/e2e_tdnn_1a/73.mdl exp/chain/e2e_tdnn_1a/72.mdl exp/chain/e2e_tdnn_1a/71.mdl exp/chain/e2e_tdnn_1a/70.mdl exp/chain/e2e_tdnn_1a/69.mdl exp/chain/e2e_tdnn_1a/68.mdl exp/chain/e2e_tdnn_1a/67.mdl exp/chain/e2e_tdnn_1a/66.mdl exp/chain/e2e_tdnn_1a/65.mdl exp/chain/e2e_tdnn_1a/64.mdl exp/chain/e2e_tdnn_1a/63.mdl exp/chain/e2e_tdnn_1a/62.mdl "ark,bg:nnet3-chain-copy-egs ark:exp/chain/e2e_tdnn_1a/egs/combine.cegs ark:- | nnet3-chain-merge-egs --minibatch-size=150=128,64/300=100,64,32/600=50,32,16/1200=16,8 ark:- ark:- |" - \| nnet3-am-copy --set-raw-nnet=- exp/chain/e2e_tdnn_1a/81.mdl exp/chain/e2e_tdnn_1a/final.mdl

Here is the error from exp/chain/e2e_tdnn_1a/log/combine.log:

nnet3-chain-copy-egs ark:exp/chain/e2e_tdnn_1a/egs/combine.cegs ark:-
nnet3-chain-merge-egs --minibatch-size=150=128,64/300=100,64,32/600=50,32,16/1200=16,8 ark:- ark:-
LOG (nnet3-chain-copy-egs[5.5.707~2-c9d8b]:main():nnet3-chain-copy-egs.cc:395) Read 24 neural-network training examples, wrote 24
LOG (nnet3-chain-merge-egs[5.5.707~2-c9d8b]:PrintSpecificStats():nnet-example-utils.cc:1159) Merged specific eg types as follows [format: <eg-size1>={<mb-size1>-><num-minibatches1>,<mbsize2>-><num-minibatches2>.../d=<num-discarded>},<egs-size2>={...},... (note,egs-size == number of input frames including context).
LOG (nnet3-chain-merge-egs[5.5.707~2-c9d8b]:PrintSpecificStats():nnet-example-utils.cc:1189) 161={,d=1},170={,d=3},227={,d=2},242={,d=16},260={,d=2}
LOG (nnet3-chain-merge-egs[5.5.707~2-c9d8b]:PrintAggregateStats():nnet-example-utils.cc:1155) Processed 24 egs of avg. size 229.9 into 0 minibatches, discarding 100% of egs. Avg minibatch size was -nan, #distinct types of egs/minibatches was 5/0
LOG (nnet3-chain-combine[5.5.707~2-c9d8b]:main():nnet3-chain-combine.cc:165) Read 0 examples.
ASSERTION_FAILED (nnet3-chain-combine[5.5.707~2-c9d8b]:main():nnet3-chain-combine.cc:166) Assertion failed: (!egs.empty())

[ Stack-Trace: ]
/root/usr/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb61) [0x7f6742bdf723]
/root/usr/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x6c) [0x7f6742be0415]
nnet3-chain-combine(main+0x8a8) [0x5648214758f4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f6741d03bf7]
nnet3-chain-combine(_start+0x2a) [0x564821474baa]

ERROR (nnet3-am-copy[5.5.707~2-c9d8b]:ExpectToken():io-funcs.cc:200) Failed to read token [started at file position -1], expected <Nnet3>

[ Stack-Trace: ]
/root/usr/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb61) [0x7f103a12c723]
/root/usr/kaldi/src/lib/libkaldi-nnet3.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x7f103ac08215]
/root/usr/kaldi/src/lib/libkaldi-base.so(kaldi::ExpectToken(std::istream&, bool, char const*)+0x160) [0x7f103a12e372]
/root/usr/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::Nnet::Read(std::istream&, bool)+0x69) [0x7f103aca0aef]
nnet3-am-copy(main+0xa1b) [0x5588f16fb68b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f10397b5bf7]
nnet3-am-copy(_start+0x2a) [0x5588f16fab6a]

kaldi::KaldiFatalError
# Accounting: time=11 threads=1
# Ended (code 255) at Fri Feb 26 06:29:22 UTC 2021, elapsed time 11 seconds


Daniel Povey

Feb 26, 2021, 9:14:15 AM

to kaldi-help

This happens when there are too many distinct eg sizes and too few egs to combine, so no single size has enough egs to form even one minibatch.
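To see why every eg was discarded here, the following is a simplified Python model of the size-based --minibatch-size rule. It assumes each eg is matched to the rule whose eg size is numerically closest, and that minibatches are filled greedily from the largest allowed size down. The function names are hypothetical and this is a sketch, not the Kaldi implementation. The rule string and the per-size counts are copied from the command and the combine.log above.

```python
from typing import Dict, List, Tuple


def parse_rules(spec: str) -> List[Tuple[int, List[int]]]:
    """Parse '150=128,64/300=100,64,32' into [(150, [128, 64]), (300, [100, 64, 32])]."""
    rules = []
    for part in spec.split("/"):
        eg_size, sizes = part.split("=")
        rules.append((int(eg_size), [int(s) for s in sizes.split(",")]))
    return rules


def merge_counts(spec: str, egs_per_size: Dict[int, int]) -> Dict[int, Tuple[int, int]]:
    """Return {eg_size: (num_minibatches, num_discarded)} under the simplified model."""
    rules = parse_rules(spec)
    result = {}
    for eg_size, count in sorted(egs_per_size.items()):
        # Assumption: the rule with the closest eg size applies.
        _, allowed = min(rules, key=lambda rule: abs(rule[0] - eg_size))
        minibatches, remaining = 0, count
        # Assumption: fill with the largest allowed minibatch size first.
        for mb_size in sorted(allowed, reverse=True):
            minibatches += remaining // mb_size
            remaining %= mb_size
        result[eg_size] = (minibatches, remaining)
    return result


# Values taken from the failing command and its log.
SPEC = "150=128,64/300=100,64,32/600=50,32,16/1200=16,8"
COUNTS = {161: 1, 170: 3, 227: 2, 242: 16, 260: 2}

for size, (n_mb, n_discarded) in merge_counts(SPEC, COUNTS).items():
    print(f"eg size {size}: {n_mb} minibatches, {n_discarded} discarded")
```

Under these assumptions the smallest allowed minibatch for any of these sizes is 32 egs, and the largest group holds only 16, so nothing can be merged.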

Please see whether this PR, https://github.com/kaldi-asr/kaldi/pull/4465, resolves it.

You should be able to invoke steps/nnet3/chain/e2e/train_e2e.py with --stage=N where N corresponds to the last iteration that ran, plus one.
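A small sketch for finding N, assuming iteration i writes (i+1).mdl into the experiment directory, so that the highest-numbered model equals the last iteration that ran plus one. That numbering is an assumption to check against your own training log, and the helper name is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

_MODEL_NAME = re.compile(r"^(\d+)\.mdl$")


def highest_model_number(exp_dir: str) -> Optional[int]:
    """Return the largest N for which <exp_dir>/N.mdl exists, or None if there is none."""
    directory = Path(exp_dir)
    if not directory.is_dir():
        return None
    numbers = []
    for path in directory.iterdir():
        match = _MODEL_NAME.match(path.name)
        if match:  # skips final.mdl, den.fst, log/ and so on
            numbers.append(int(match.group(1)))
    return max(numbers) if numbers else None


if __name__ == "__main__":
    stage = highest_model_number("exp/chain/e2e_tdnn_1a")
    if stage is not None:
        # Assumption: this value is the N to pass as --stage=N.
        print(f"--stage={stage}")
```

For the directory in this thread, where 81.mdl is the newest model, this would print --stage=81.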


Rafael Setyan

Apr 10, 2022, 9:26:05 AM

to kaldi-help

I also got stuck on this error, and https://github.com/kaldi-asr/kaldi/pull/4465 did not help. Can I use one of the *.mdl files as final.mdl?

Daniel Povey

Apr 10, 2022, 9:59:47 AM

to kaldi-help

Yes you can, e.g. just the last one. Next time show the full error message/warning though.
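A minimal sketch of that fallback, assuming a plain file copy of a numbered model is acceptable as final.mdl (the numbered models are the same kind of file that the nnet3-am-copy step in the failing command takes as input). The helper name and the overwrite guard are illustrative, not part of Kaldi.

```python
import shutil
from pathlib import Path


def promote_to_final(exp_dir: str, iteration: int, overwrite: bool = False) -> Path:
    """Copy <exp_dir>/<iteration>.mdl to <exp_dir>/final.mdl and return the new path."""
    src = Path(exp_dir) / f"{iteration}.mdl"
    dst = Path(exp_dir) / "final.mdl"
    if not src.is_file():
        raise FileNotFoundError(f"no such model: {src}")
    # Refuse to clobber a non-empty final.mdl unless explicitly asked to.
    if dst.exists() and dst.stat().st_size > 0 and not overwrite:
        raise FileExistsError(f"{dst} already exists; pass overwrite=True to replace it")
    shutil.copyfile(src, dst)
    return dst
```

For the first experiment in this thread the call would be promote_to_final("exp/chain/e2e_tdnn_1a", 81), since 81.mdl is the last model listed in the combine command.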


Rafael Setyan

Apr 11, 2022, 1:32:47 AM

to kaldi-help

Thanks a lot,

I have attached the training logs and combine.log.

Rafayel

combine.log

train-chain.txt

Daniel Povey

Apr 11, 2022, 1:56:25 AM

to kaldi-help

You are using steps/nnet3/chain/train.py, not steps/nnet3/chain/e2e/train_e2e.py.

You could make the same change from that PR in steps/nnet3/chain/train.py, at line 589.


AK Project

Jun 13, 2024, 6:04:14 AM

to kaldi-help

I encountered an error when running steps/libs/nnet3/train/chain_objf/acoustic_model.py; the details are in the attached logs. Could you explain what is happening, Dan?

Thank you

combine.log

error log.log

Daniel Povey

Jun 14, 2024, 7:46:23 AM

to kaldi...@googlegroups.com

I think your validation set is too small and did not form at least one training example of sufficient size.
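One way to check for this condition is to look for the aggregate statistics line that nnet3-chain-merge-egs writes to combine.log. The sketch below assumes the line has the same wording as in the log posted earlier in this thread ("Processed 24 egs of avg. size 229.9 into 0 minibatches"); the function names are hypothetical, and the log format may differ in other Kaldi versions.

```python
import re
from typing import Optional, Tuple

# Wording copied from the PrintAggregateStats() line in the log earlier in this thread.
_AGGREGATE = re.compile(r"Processed (\d+) egs of avg\. size ([\d.]+) into (\d+) minibatches")


def merge_stats(log_text: str) -> Optional[Tuple[int, float, int]]:
    """Return (num_egs, avg_eg_size, num_minibatches), or None if the line is absent."""
    match = _AGGREGATE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), int(match.group(3))


def no_minibatches_formed(log_text: str) -> bool:
    """True when the log shows that egs were read but none were merged into a minibatch."""
    stats = merge_stats(log_text)
    return stats is not None and stats[2] == 0


if __name__ == "__main__":
    sample = "Processed 24 egs of avg. size 229.9 into 0 minibatches, discarding 100% of egs."
    print(merge_stats(sample), no_minibatches_formed(sample))
```

If no_minibatches_formed returns True, the combine set produced too few egs of any one size, which matches the diagnosis above.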

