LibriSpeech chain model recipe fails at stage 15


FG

Feb 7, 2018, 8:12:17 PM2/7/18
to kaldi-help

My training of the LibriSpeech chain model fails at stage 15, when running steps/nnet3/chain/train.py.

Here are the error messages:

2018-02-07 18:45:17,751 [steps/nnet3/chain/train.py:404 - train - INFO ] Copying the properties from exp/chain_cleaned/tdnn_1b_sp/egs to exp/chain_cleaned/tdnn_1b_sp
2018-02-07 18:45:17,779 [steps/nnet3/chain/train.py:409 - train - INFO ] Computing the preconditioning matrix for input features
2018-02-07 18:45:46,620 [steps/nnet3/chain/train.py:417 - train - INFO ] Preparing the initial acoustic model.
2018-02-07 18:45:48,047 [steps/nnet3/chain/train.py:451 - train - INFO ] Training will run for 4.0 epochs = 735 iterations
2018-02-07 18:45:48,097 [steps/nnet3/chain/train.py:493 - train - INFO ] Iter: 0/734    Epoch: 0.00/4.0 (0.0% complete)    lr: 0.003000   
2018-02-07 18:51:41,891 [steps/nnet3/chain/train.py:493 - train - INFO ] Iter: 1/734    Epoch: 0.00/4.0 (0.0% complete)    lr: 0.002997   
run.pl: job failed, log is in exp/chain_cleaned/tdnn_1b_sp/log/train.1.2.log
2018-02-07 18:51:48,389 [steps/libs/common.py:231 - background_command_waiter - ERROR ] Command exited with status 1: run.pl --gpu 1 exp/chain_cleaned/tdnn_1b_sp/log/train.1.2.log
    nnet3-chain-train
    --apply-deriv-weights=False
    --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1
    --read-cache=exp/chain_cleaned/tdnn_1b_sp/cache.1 --xent-regularize=0.1
    --print-interval=10 --momentum=0.0
    --max-param-change=2.0
    --backstitch-training-scale=0.0
    --backstitch-training-interval=1
    --l2-regularize-factor=0.333333333333
    --srand=1
    "nnet3-am-copy --raw=true --learning-rate=0.00299703421811 --scale=1.0 exp/chain_cleaned/tdnn_1b_sp/1.mdl - |" exp/chain_cleaned/tdnn_1b_sp/den.fst
    "ark,bg:nnet3-chain-copy-egs
        --frame-shift=2
        ark:exp/chain_cleaned/tdnn_1b_sp/egs/cegs.5.ark ark:- |
        nnet3-chain-shuffle-egs --buffer-size=5000
        --srand=1 ark:- ark:- | nnet3-chain-merge-egs
        --minibatch-size=128 ark:- ark:- |"
    exp/chain_cleaned/tdnn_1b_sp/2.2.raw


The log file (train.1.2.log) is attached.

I am running training on a single machine with an 8-core CPU and one GPU.  It looks to me like it runs out of memory.  If that is the case, how can I reduce the memory consumption by changing parameter settings?

Thanks

train.1.2.log

Daniel Povey

Feb 7, 2018, 8:14:38 PM2/7/18
to kaldi-help
Notice this line in the log:

WARNING (nnet3-chain-train[5.3.58~1-b8083]:SelectGpuId():cu-device.cc:183) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
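
As the warning says, you can put the card into compute-exclusive mode with nvidia-smi; -c 3 corresponds to EXCLUSIVE_PROCESS on recent drivers, setting it needs root, and the second command is just one way to confirm the current mode afterwards:

sudo nvidia-smi -c 3
nvidia-smi -q | grep "Compute Mode"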

All the GPU jobs are sharing the same GPU, and together they exhaust its memory.
You won't be able to train that exact model with just one GPU.  You could set --num-jobs-initial=1 and --num-jobs-final=1, and maybe halve the number of epochs, but the result won't be exactly the same.
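
A minimal sketch of that change, assuming the standard LibriSpeech tdnn run script: it comes down to the options you pass to steps/nnet3/chain/train.py (there they carry the trainer prefix; check how your copy of local/chain/run_tdnn.sh passes them):

steps/nnet3/chain/train.py \
  --trainer.num-epochs 2 \
  --trainer.optimization.num-jobs-initial 1 \
  --trainer.optimization.num-jobs-final 1 \
  ...   # all the other options exactly as in the run script

As far as I know the --l2-regularize-factor=0.333... in your log is just 1/num_jobs and is recomputed by the script, so you don't need to touch that.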


Dan


