Hello,
I am using the DNN scripts located in kaldi/egs/tedlium/s5_r2/local/nnet3:
- run_tdnn.sh
- run_lstm.sh (tuning/run_lstm_1a.sh)
I use "ssh.pl" in order to run jobs on 3 PCs (each PC has 2 GPUs) over an RJ45 network; my data are on an NFS share.
I noticed that these two scripts in the tedlium nnet3 directory (run_tdnn.sh and run_lstm.sh) are not designed to do the training on multiple PCs. For example, in run_tdnn.sh the GPU training is done by the script "steps/nnet3/tdnn/train.sh", and inside that script the main loop for the GPU training uses only one machine:
$cmd $train_queue_opt $dir/log/train.$x.$n.log \
nnet3-train $parallel_train_opts \
--max-param-change=$max_param_change "$raw" \
"ark,bg:nnet3-copy-egs --frame=$frame $context_opts ark:$cur_egs_dir/egs.$archive.ark ark:- | nnet3-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-| nnet3-merge-egs --minibatch-size=$this_minibatch_size --discard-partial-minibatches=true ark:- ark:- |" \
$dir/$[$x+1].$n.raw || touch $dir/.error &
The $cmd invocation here doesn't have the "JOB=1:$nj" argument, which means the training is done on only a single machine :/
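For comparison, this is roughly what "JOB=1:$nj" does in the wrappers (run.pl / ssh.pl / queue.pl): the wrapper substitutes JOB with each index from 1 to $nj and launches the command once per index in parallel. A toy sketch of that expansion (not Kaldi code, just an illustration):

```shell
#!/usr/bin/env bash
# Toy illustration of how "JOB=1:$nj" is expanded by Kaldi's
# parallelization wrappers: one background job per index 1..nj.
# "nnet3-train ..." here is only a placeholder echoed as text.
nj=4
for JOB in $(seq 1 $nj); do
  echo "job $JOB: would run nnet3-train, logging to exp/log/train.$JOB.log" &
done
wait
```

With ssh.pl as $cmd, each of those nj jobs would be dispatched to a different machine; without "JOB=1:$nj", the loop in train.sh only backgrounds processes on the local host.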
Do you know of a TDNN script and an LSTM script that can do the GPU training using all the PCs available on the RJ45 network? (via ssh.pl; I don't have GridEngine)
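For reference, my ssh.pl setup follows what the comments in ssh.pl itself describe: it reads the host list from ~/.queue/machines and round-robins jobs over those hosts via passwordless ssh. A minimal sketch (the hostnames pc1..pc3 are examples for my three machines):

```shell
# Sketch of the ssh.pl host configuration, assuming the behavior
# documented in ssh.pl: hosts are read from ~/.queue/machines and
# jobs are distributed over them round-robin via passwordless ssh.
mkdir -p ~/.queue
printf '%s\n' pc1 pc2 pc3 > ~/.queue/machines
# then, in cmd.sh, point train_cmd at ssh.pl instead of run.pl/queue.pl
```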
I want to reduce the GPU training time by using the 3 PCs (each PC has 2 graphics cards, so 6 GPUs can be used), but for that I need a Kaldi script that uses the "JOB=1:$nj" option for the GPU training, in order to speed up the process.
Thanks