Error while training / combining the model


Foren Zas

Feb 26, 2021, 5:45:09 AM
to kaldi...@googlegroups.com
Hi All,
I am stuck training the TDNN.

The error is in exp/chain/e2e_tdnn_1a/log/combine.log

2021-02-26 06:28:45,178 [steps/nnet3/chain/e2e/train_e2e.py:462 - train - INFO ] Iter: 80/80   Jobs: 1   Epoch: 2.96/3.0 (98.8% complete)   lr: 0.000003
2021-02-26 06:29:10,468 [steps/nnet3/chain/e2e/train_e2e.py:515 - train - INFO ] Doing final combination to produce final.mdl
2021-02-26 06:29:10,885 [steps/libs/nnet3/train/chain_objf/acoustic_model.py:571 - combine_models - INFO ] Combining set([64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 62, 63]) models.
run.pl: job failed, log is in exp/chain/e2e_tdnn_1a/log/combine.log
Traceback (most recent call last):
  File "steps/nnet3/chain/e2e/train_e2e.py", line 558, in main
    train(args, run_opts)
  File "steps/nnet3/chain/e2e/train_e2e.py", line 524, in train
    run_opts=run_opts)
  File "steps/libs/nnet3/train/chain_objf/acoustic_model.py", line 622, in combine_models
    scp_or_ark=scp_or_ark, egs_suffix=egs_suffix))
  File "steps/libs/common.py", line 158, in execute_command
    p.returncode, command))
Exception: Command exited with status 1: run.pl --mem 4G --gpu 1 exp/chain/e2e_tdnn_1a/log/combine.log nnet3-chain-combine --max-objective-evaluations=30 --l2-regularize=0.0 --leaky-hmm-coefficient=0.1 --verbose=3 --use-gpu=wait exp/chain/e2e_tdnn_1a/den.fst exp/chain/e2e_tdnn_1a/81.mdl exp/chain/e2e_tdnn_1a/80.mdl exp/chain/e2e_tdnn_1a/79.mdl exp/chain/e2e_tdnn_1a/78.mdl exp/chain/e2e_tdnn_1a/77.mdl exp/chain/e2e_tdnn_1a/76.mdl exp/chain/e2e_tdnn_1a/75.mdl exp/chain/e2e_tdnn_1a/74.mdl exp/chain/e2e_tdnn_1a/73.mdl exp/chain/e2e_tdnn_1a/72.mdl exp/chain/e2e_tdnn_1a/71.mdl exp/chain/e2e_tdnn_1a/70.mdl

I have also attached a screenshot of the error in exp/chain/e2e_tdnn_1a/log/combine.log.


kaldi_error.PNG




Thanks
F. Zas

Daniel Povey

Feb 26, 2021, 6:58:44 AM
to kaldi-help
The error is not shown there.
Please figure out how to paste it as text!


kamar majhi

Feb 26, 2021, 8:42:49 AM
to kaldi...@googlegroups.com
Sorry Dan,

2021-02-26 06:29:10,885 [steps/libs/nnet3/train/chain_objf/acoustic_model.py:571 - combine_models - INFO ] Combining set([64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 62, 63]) models.

run.pl: job failed, log is in exp/chain/e2e_tdnn_1a/log/combine.log
Traceback (most recent call last):
  File "steps/nnet3/chain/e2e/train_e2e.py", line 558, in main
    train(args, run_opts)
  File "steps/nnet3/chain/e2e/train_e2e.py", line 524, in train
    run_opts=run_opts)
  File "steps/libs/nnet3/train/chain_objf/acoustic_model.py", line 622, in combine_models
    scp_or_ark=scp_or_ark, egs_suffix=egs_suffix))
  File "steps/libs/common.py", line 158, in execute_command
    p.returncode, command))
Exception: Command exited with status 1: run.pl --mem 4G --gpu 1 exp/chain/e2e_tdnn_1a/log/combine.log nnet3-chain-combine --max-objective-evaluations=30 --l2-regularize=0.0 --leaky-hmm-coefficient=0.1 --verbose=3 --use-gpu=wait exp/chain/e2e_tdnn_1a/den.fst exp/chain/e2e_tdnn_1a/81.mdl exp/chain/e2e_tdnn_1a/80.mdl exp/chain/e2e_tdnn_1a/79.mdl exp/chain/e2e_tdnn_1a/78.mdl exp/chain/e2e_tdnn_1a/77.mdl exp/chain/e2e_tdnn_1a/76.mdl exp/chain/e2e_tdnn_1a/75.mdl exp/chain/e2e_tdnn_1a/74.mdl exp/chain/e2e_tdnn_1a/73.mdl exp/chain/e2e_tdnn_1a/72.mdl exp/chain/e2e_tdnn_1a/71.mdl exp/chain/e2e_tdnn_1a/70.mdl exp/chain/e2e_tdnn_1a/69.mdl exp/chain/e2e_tdnn_1a/68.mdl exp/chain/e2e_tdnn_1a/67.mdl exp/chain/e2e_tdnn_1a/66.mdl exp/chain/e2e_tdnn_1a/65.mdl exp/chain/e2e_tdnn_1a/64.mdl exp/chain/e2e_tdnn_1a/63.mdl exp/chain/e2e_tdnn_1a/62.mdl "ark,bg:nnet3-chain-copy-egs ark:exp/chain/e2e_tdnn_1a/egs/combine.cegs ark:- | nnet3-chain-merge-egs --minibatch-size=150=128,64/300=100,64,32/600=50,32,16/1200=16,8 ark:- ark:- |" - \| nnet3-am-copy --set-raw-nnet=- exp/chain/e2e_tdnn_1a/81.mdl exp/chain/e2e_tdnn_1a/final.mdl


Here is the error from exp/chain/e2e_tdnn_1a/log/combine.log:

nnet3-chain-copy-egs ark:exp/chain/e2e_tdnn_1a/egs/combine.cegs ark:-
nnet3-chain-merge-egs --minibatch-size=150=128,64/300=100,64,32/600=50,32,16/1200=16,8 ark:- ark:-
LOG (nnet3-chain-copy-egs[5.5.707~2-c9d8b]:main():nnet3-chain-copy-egs.cc:395) Read 24 neural-network training examples, wrote 24
LOG (nnet3-chain-merge-egs[5.5.707~2-c9d8b]:PrintSpecificStats():nnet-example-utils.cc:1159) Merged specific eg types as follows [format: <eg-size1>={<mb-size1>-><num-minibatches1>,<mbsize2>-><num-minibatches2>.../d=<num-discarded>},<egs-size2>={...},... (note,egs-size == number of input frames including context).
LOG (nnet3-chain-merge-egs[5.5.707~2-c9d8b]:PrintSpecificStats():nnet-example-utils.cc:1189) 161={,d=1},170={,d=3},227={,d=2},242={,d=16},260={,d=2}
LOG (nnet3-chain-merge-egs[5.5.707~2-c9d8b]:PrintAggregateStats():nnet-example-utils.cc:1155) Processed 24 egs of avg. size 229.9 into 0 minibatches, discarding 100% of egs.  Avg minibatch size was -nan, #distinct types of egs/minibatches was 5/0
LOG (nnet3-chain-combine[5.5.707~2-c9d8b]:main():nnet3-chain-combine.cc:165) Read 0 examples.
ASSERTION_FAILED (nnet3-chain-combine[5.5.707~2-c9d8b]:main():nnet3-chain-combine.cc:166) Assertion failed: (!egs.empty())

[ Stack-Trace: ]
/root/usr/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb61) [0x7f6742bdf723]
/root/usr/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x6c) [0x7f6742be0415]
nnet3-chain-combine(main+0x8a8) [0x5648214758f4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f6741d03bf7]
nnet3-chain-combine(_start+0x2a) [0x564821474baa]

ERROR (nnet3-am-copy[5.5.707~2-c9d8b]:ExpectToken():io-funcs.cc:200) Failed to read token [started at file position -1], expected <Nnet3>


[ Stack-Trace: ]
/root/usr/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb61) [0x7f103a12c723]
/root/usr/kaldi/src/lib/libkaldi-nnet3.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x7f103ac08215]
/root/usr/kaldi/src/lib/libkaldi-base.so(kaldi::ExpectToken(std::istream&, bool, char const*)+0x160) [0x7f103a12e372]
/root/usr/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::Nnet::Read(std::istream&, bool)+0x69) [0x7f103aca0aef]
nnet3-am-copy(main+0xa1b) [0x5588f16fb68b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f10397b5bf7]
nnet3-am-copy(_start+0x2a) [0x5588f16fab6a]

kaldi::KaldiFatalError
# Accounting: time=11 threads=1
# Ended (code 255) at Fri Feb 26 06:29:22 UTC 2021, elapsed time 11 seconds

Daniel Povey

Feb 26, 2021, 9:14:15 AM
to kaldi-help
The problem is that there are too many distinct sizes of egs and too few egs to combine, so none of the sizes has enough egs to form even one minibatch.
Please see whether this PR resolves it.
You should be able to invoke steps/nnet3/chain/e2e/train_e2e.py with --stage=N, where N is the last iteration that ran, plus one.
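
The discard counts in the combine.log quoted earlier in this thread can be reproduced with a short Python sketch. It assumes, as a simplification of what nnet3-chain-merge-egs does, that each eg size is matched to the nearest eg-size rule in --minibatch-size and that a minibatch can only be formed when the number of available egs reaches one of the sizes listed for that rule. Range syntax in the rule string is not handled, and the function names are hypothetical, not part of Kaldi.

```python
def parse_rules(spec):
    """Parse 'eg_size=mb1,mb2/eg_size=...' into {eg_size: [minibatch sizes]}."""
    rules = {}
    for part in spec.split("/"):
        eg_size, mb_sizes = part.split("=")
        rules[int(eg_size)] = [int(s) for s in mb_sizes.split(",")]
    return rules


def minibatch_size(rules, eg_size, num_available):
    """Largest allowed minibatch size that fits num_available egs, else 0.

    Assumption: the rule whose eg size is closest to eg_size applies.
    """
    nearest = min(rules, key=lambda size: abs(size - eg_size))
    usable = [mb for mb in rules[nearest] if mb <= num_available]
    return max(usable) if usable else 0


# Rule string and per-size eg counts copied from the log in this thread.
spec = "150=128,64/300=100,64,32/600=50,32,16/1200=16,8"
eg_counts = {161: 1, 170: 3, 227: 2, 242: 16, 260: 2}

rules = parse_rules(spec)
# Sizes for which no minibatch can be formed, so all their egs are dropped.
discarded = {
    size: n for size, n in eg_counts.items()
    if minibatch_size(rules, size, n) == 0
}

for size, n in sorted(discarded.items()):
    print(f"eg size {size}: {n} egs, no minibatch possible")
```

With these inputs every size is discarded. For example, the 16 egs of size 242 fall under the 300 rule, whose smallest minibatch size is 32.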

Rafael Setyan

Apr 10, 2022, 9:26:05 AM
to kaldi-help
I also got stuck on this error, and https://github.com/kaldi-asr/kaldi/pull/4465 did not help. Can I use one of the *.mdl files as final.mdl?

Daniel Povey

Apr 10, 2022, 9:59:47 AM
to kaldi-help
Yes you can, e.g. just the last one.  Next time show the full error message/warning though.
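
Following the suggestion above, here is a minimal Python sketch of promoting the last iteration's model to final.mdl. It assumes the experiment directory holds models named <iteration>.mdl, as in the exp/chain/e2e_tdnn_1a paths earlier in the thread. The helper names are hypothetical and not part of Kaldi.

```python
import re
import shutil
from pathlib import Path


def last_iteration_model(exp_dir):
    """Return the highest-numbered <N>.mdl in exp_dir, or None if there is none."""
    numbered = [
        p for p in Path(exp_dir).glob("*.mdl")
        if re.fullmatch(r"\d+", p.stem)  # skips final.mdl and other names
    ]
    return max(numbered, key=lambda p: int(p.stem), default=None)


def use_last_model_as_final(exp_dir):
    """Copy the highest-numbered model to final.mdl and return both paths."""
    last = last_iteration_model(exp_dir)
    if last is None:
        raise FileNotFoundError(f"no numbered .mdl files in {exp_dir}")
    final = Path(exp_dir) / "final.mdl"
    shutil.copyfile(last, final)
    return last, final
```

Calling use_last_model_as_final("exp/chain/e2e_tdnn_1a") on a directory like the one in the first post would copy 81.mdl to final.mdl. This skips the combination stage entirely, so the result is a single iteration's model rather than a combination of the last several.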


Rafael Setyan

Apr 11, 2022, 1:32:47 AM
to kaldi-help
Thanks a lot.
I have attached the training logs and combine.log.

Rafayel

combine.log
train-chain.txt

Daniel Povey

Apr 11, 2022, 1:56:25 AM
to kaldi-help
You are using steps/nnet3/chain/train.py, not steps/nnet3/chain/e2e/train_e2e.py.
You could make the same change from that PR at line 589 of steps/nnet3/chain/train.py.

AK Project

Jun 13, 2024, 6:04:14 AM
to kaldi-help
I encountered an error when executing steps/libs/nnet3/train/chain_objf/acoustic_model.py. The details of the error are attached. Could you explain what's happening, Dan?

Thank you
combine.log
error log.log

Daniel Povey

Jun 14, 2024, 7:46:23 AM
to kaldi...@googlegroups.com
I think your validation set is too small and did not form at least one training example of sufficient size.
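
One way to check for this condition is to look for the merge statistics line in combine.log. The Python sketch below is only an illustration: the attached logs are not reproduced in this thread, so it assumes the line has the same format as the "Processed 24 egs ... into 0 minibatches" line quoted in the 2021 post, and the function name is hypothetical.

```python
import re

# Matches the aggregate statistics line printed by nnet3-chain-merge-egs,
# using the wording of the line quoted earlier in this thread.
MERGE_STATS = re.compile(
    r"Processed (\d+) egs of avg\. size ([\d.]+) into (\d+) minibatches"
)


def merge_stats(log_text):
    """Return (num_egs, avg_size, num_minibatches) or None if no line matches."""
    m = MERGE_STATS.search(log_text)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), int(m.group(3))


sample = (
    "LOG (nnet3-chain-merge-egs[5.5.707~2-c9d8b]:PrintAggregateStats():"
    "nnet-example-utils.cc:1155) Processed 24 egs of avg. size 229.9 into "
    "0 minibatches, discarding 100% of egs."
)

stats = merge_stats(sample)
if stats is not None and stats[2] == 0:
    print(f"all {stats[0]} egs were discarded; no minibatch could be formed")
```

A minibatch count of zero means the combination step received no examples, which is the situation that triggers the (!egs.empty()) assertion shown earlier.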

