Hi,
I have installed SGE for parallel acoustic model (AM) training.
We have two different servers with two GPUs on them, and I can see that all the hosts are connected to the queue.
All the steps in run.sh are executed in parallel on the servers, and each server gets jobs and submits them properly.
However, the NNET training step stalls in pass0 without any further progress.
Note: I can run the NNET training on each machine using run.pl. It can use CUDA and the GPUs for training without any problem.
Here is my queue configuration:
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
option gpu=0 -q all.q
Any help would be appreciated.