Which DNN script must I use for multi-GPU training?


Ibsoftware Mannix54

Aug 2, 2017, 3:29:18 PM
to kaldi-help
Hello,

I use the DNN scripts located in kaldi/egs/tedlium/s5_r2/local/nnet3:

- run_tdnn.sh
- run_lstm.sh (tuning/run_lstm_1a.sh)

I use "ssh.pl" in order to use 3 PC ( each PC has 2 GPU ) over a RJ45 network, my data are on a NFS share,

I notice that these two scripts in the tedlium nnet3 directory (run_tdnn.sh and run_lstm.sh) are not designed to do the training across multiple PCs.

For example, in run_tdnn.sh the GPU training is done by the script "steps/nnet3/tdnn/train.sh", and inside this script the main loop for the GPU training uses only one machine:

 $cmd $train_queue_opt $dir/log/train.$x.$n.log \
          nnet3-train $parallel_train_opts \
          --max-param-change=$max_param_change "$raw" \
          "ark,bg:nnet3-copy-egs --frame=$frame $context_opts ark:$cur_egs_dir/egs.$archive.ark ark:- | nnet3-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-| nnet3-merge-egs --minibatch-size=$this_minibatch_size --discard-partial-minibatches=true ark:- ark:- |" \
          $dir/$[$x+1].$n.raw || touch $dir/.error &

The $cmd command here doesn't have the argument "JOB=1:$nj", which means the training is done on only a single machine :/
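
For comparison, the steps that do fan out over machines use the Kaldi job-array convention, which looks roughly like this (a generic sketch, not a line from train.sh; "some-kaldi-binary" and the data paths are placeholders):

  # queue.pl / ssh.pl expand JOB into 1..$nj and can send each of the $nj
  # jobs to a different machine; without JOB=1:$nj only one job is launched
  $cmd JOB=1:$nj $dir/log/do_something.JOB.log \
    some-kaldi-binary scp:$data/split$nj/JOB/feats.scp ark:$dir/output.JOB.ark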

Do you know of a tdnn script and an lstm script that are capable of doing the GPU training using all the PCs available on the RJ45 network? (using ssh.pl; I don't have GridEngine)

I want to reduce the GPU training time by using 3 PCs (each PC has 2 graphics cards, so 6 graphics cards can be used), but I need a Kaldi script which uses the option "JOB=1:$nj" for the GPU training, in order to speed up the process.

thanks

Daniel Povey

Aug 2, 2017, 3:36:11 PM
to kaldi-help
None of the currently recommended scripts will work with ssh.pl; you'd
have to install GridEngine for this to work properly.
It would have been possible for us to modify ssh.pl so that it would
support all the features we use in GridEngine, but we have to limit
the scope of the Kaldi project; after a certain point it gets
ridiculous. What next? Rewrite NFS because it has bugs?
I think you should just figure out how to install GridEngine. It's
not easy but it's not as hard as speech recognition; and we do provide
some guidance about how to set it up in
http://kaldi-asr.org/doc/queue.html.
If you are using Debian there are packages for it; if you are using Debian 8.6, you have to include
deb http://ftp.debian.org/debian jessie-backports main
in your /etc/apt/sources.list to see them.
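Roughly (an untested sketch; the package names are as they appear in jessie-backports):

  # add the jessie-backports repository (Debian 8 only)
  echo 'deb http://ftp.debian.org/debian jessie-backports main' | sudo tee -a /etc/apt/sources.list
  sudo apt-get update
  # GridEngine is split into master / exec / client packages in Debian
  sudo apt-get install gridengine-master gridengine-exec gridengine-client
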
For some other distributions such as Red Hat, IIRC it's a little trickier.



Dan

Ibsoftware Mannix54

Aug 2, 2017, 5:10:24 PM
to kaldi-help, dpo...@gmail.com
thanks Dan for your answer,

So if I install GridEngine and use queue.pl: what are the recommended scripts for GPU computation across multiple PCs?

Currently the scripts "run_tdnn.sh" and "run_lstm.sh" in tedlium/nnet3 don't seem to use all the PCs for the GPU training; because the argument "JOB=1:$nj" is missing, the GPU computation will be done on only a single PC.

Daniel Povey

Aug 2, 2017, 5:13:36 PM
to Ibsoftware Mannix54, kaldi-help
You are misunderstanding those scripts; some of those things are in
background threads. It does use multiple machines.
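
(A simplified sketch of what that shell script does, not the exact code: each per-job training command is launched in the background, so with queue.pl the jobs can land on different machines even without JOB=1:$nj.)

  for n in $(seq $this_num_jobs); do
    # each job runs in the background ("&"); queue.pl can dispatch each one
    # to a different machine, and the script waits for all of them below
    $cmd $dir/log/train.$x.$n.log \
      nnet3-train ... $dir/$[$x+1].$n.raw || touch $dir/.error &
  done
  wait
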
However, you have an out-of-date version of Kaldi; the script you
pointed to (kaldi/egs/tedlium/s5_r2/local/nnet3/run_tdnn.sh) in the
*current* version of Kaldi doesn't use that shell-based training script,
it uses a Python one.
In either case, though, it requires GridEngine and won't work
correctly with ssh.pl.

Dan

Ibsoftware Mannix54

Aug 2, 2017, 6:09:51 PM
to kaldi-help, ibsof...@free.fr, dpo...@gmail.com
I also tried the latest git version of Kaldi,

and I notice that even the Python versions of the tdnn/lstm scripts (train_rnn.py, train_dnn.py) in tedlium/nnet3 suffer from the same problem:

--> when the main GPU training step comes, only the GPU cards located on the first PC are used (I checked with nvidia-smi: all 915 iterations are done on the first PC; the GPUs on the other PCs are not used, nvidia-smi reports 0% GPU usage)

The other PCs are used when $cmd uses the argument "JOB=1:$nj", but only for these steps:

"getting preconditioning matrix for input features"
"preparing initial vector for FixedScaleComponent before softmax"
"Getting average posterior for purposes of adjusting the priors"

but the longest part of the script is when the iterations are done on the GPU, and this part (from what I see with my configuration) is done on a single PC.

I have tested with ssh.pl; I have carefully set some variables related to $nj, $num-threads and $num-process in the Kaldi scripts to match the memory/CPU resources of my 3 PCs, which is why the tdnn/lstm scripts run without errors (no out-of-memory errors) on my configuration with ssh.pl.
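
(For reference, this is roughly how I checked GPU usage on each machine; pc1/pc2/pc3 are placeholder host names for my 3 PCs:)

  # query GPU utilization and memory on every machine over ssh
  for host in pc1 pc2 pc3; do
    echo "== $host =="
    ssh $host nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
  done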

Daniel Povey

Aug 2, 2017, 6:12:18 PM
to Ibsoftware Mannix54, kaldi-help
It's a limitation of ssh.pl. We never claimed that ssh.pl is a general
replacement for GridEngine; it's a hack that works only in some
special cases.

Ibsoftware Mannix54

Aug 2, 2017, 6:42:08 PM
to kaldi-help, ibsof...@free.fr, dpo...@gmail.com
OK, thanks for the information,

now I understand the issue with ssh.pl better,

so I will install GridEngine, which will probably resolve my problem,

thanks again