SGE and GPU

bh.a...@gmail.com

Nov 19, 2015, 10:30:00 AM
to kaldi-help
Hi,

I have installed SGE for parallel AM training.
We have two servers, each with two GPUs, and I can see all the hosts connected to the queue.
All the steps in run.sh are executed in parallel on the servers, and each server gets jobs and submits them properly.
However, the NNET training step is stuck in pass0 without any further progress.

Note: I can run NNET on each machine using run.pl. It can use CUDA and the GPUs for training without any problem.

This is the content of my queue.pl configuration file:

command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0          # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1  # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
option gpu=0 -q all.q


Any help?

Jan Trmal

Nov 19, 2015, 10:49:42 AM
to kaldi-help
Do you mean that the tasks finish and disappear from the qstat output, but no new tasks get submitted?
In that case, there might be some issue with the NFS synchronization, since queue.pl relies on the existence of ".done" files. You could try to track what's happening with those files.

A more probable cause might be that you still have some task in the queue, but it won't get executed because of some conflicting resource specification (qstat -j job-id might help you debug these issues).
y.
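
A small sketch of how the scheduling information can be pulled out of the qstat -j output. The helper name scheduling_info is made up for illustration, and the snippet assumes that the "scheduling info:" section is the last one printed, which may not hold for every SGE version.

```shell
# Print everything from the "scheduling info:" line to the end of the
# `qstat -j` output read on stdin (assumes that section comes last).
scheduling_info() {
  sed -n '/^scheduling info:/,$p'
}

# Usage on a cluster (job id 12345 is a placeholder):
#   qstat -j 12345 | scheduling_info
```

If nothing is printed, the job may already be running, or the scheduler may not be configured to record this information.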

--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

bh.a...@gmail.com

Nov 19, 2015, 10:55:29 AM
to kaldi-help
No.
qstat shows the job in the qw state, without any progress in the NNET pass.
I am not sure that the contents of my queue.conf are appropriate for using the GPUs on each node. How can I be sure about this? Is there a sample?

Thanks.

Jan Trmal

Nov 19, 2015, 11:05:19 AM
to kaldi-help
Did you read this?
http://kaldi-asr.org/doc/queue.html
There is a concrete example of how the queue.conf should look. Your config seems slightly incomplete.
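
For reference, a minimal sketch of what a GPU-aware configuration could look like. It is believed to mirror the default configuration shown in the Kaldi documentation linked above, so it should be checked against that page. The queue name g.q, the gpu resource name, and the output file name queue.conf.example are assumptions that do not come from this cluster, and a gpu resource would have to be defined in SGE for the last option line to work.

```shell
# Write an example config to a scratch file (hypothetical name) so that an
# existing conf/queue.conf is not overwritten.
cat > queue.conf.example <<'EOF'
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0          # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1  # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0 -q all.q
# Assumption: GPU jobs go to a queue named g.q and request a "gpu" resource.
option gpu=* -l gpu=$0 -q g.q
EOF
```

With a line like the last one, passing --gpu 1 to queue.pl should add -l gpu=1 -q g.q to the qsub command, whereas the config posted above has no rule at all for gpu values other than 0.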

As I said, qstat -j <jobid> could tell you why the job is not getting scheduled (in the "scheduling info" section).
y.

Behnam Asefisaray

Nov 19, 2015, 11:08:10 AM
to kaldi...@googlegroups.com
Yes, I think I read that carefully.
The scheduling info from qstat -j is:

scheduling info:            cannot run in PE "smp" because it only offers 0 slots

Thanks.

--
Behnam ASEFISARAY
Computer Engineering Department 
Ankara/Turkey 06800

Jan Trmal

Nov 19, 2015, 11:15:25 AM
to kaldi-help
What does the output of "qconf -sp smp" look like?

Behnam Asefisaray

Nov 19, 2015, 11:16:54 AM
to kaldi...@googlegroups.com
pe_name            smp
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
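
One thing that may be worth checking at this point: a parallel environment only offers slots in a queue if it is listed in that queue's pe_list. The sketch below tests the output of qconf -sq for that. The helper name queue_has_pe is hypothetical, and the parsing assumes that pe_list is printed as a single whitespace-separated line.

```shell
# Succeed (exit status 0) if the PE named in $1 appears on the pe_list line
# of `qconf -sq <queue>` output read on stdin; fail otherwise.
queue_has_pe() {
  awk -v pe="$1" '
    $1 == "pe_list" {
      for (i = 2; i <= NF; i++) if ($i == pe) found = 1
    }
    END { exit !found }'
}

# Usage on a cluster:
#   qconf -sq all.q | queue_has_pe smp && echo "smp is attached to all.q"
```

If the check fails, the PE can be added to the queue by editing its pe_list with qconf -mq <queue-name>.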

Jan Trmal

Nov 19, 2015, 11:25:44 AM
to bh.a...@gmail.com, Dan Povey
Taking this off the list.

I do not see your whole configs, but this error (0 slots available) can occur when you do not have machines in the given queue that would allow for so many slots.
Please check how many slots the given task is requesting (from the qstat -j <job-id> output) and have a look at the queue config (qconf -sq <queue-name>); there should be at least one machine in the "slots" section that exports that many slots.
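
A rough sketch of that check. The helper name max_queue_slots is made up, and the parsing assumes that qconf -sq prints the slots entry on a single line of the form 1,[host=N],...; long entries may be wrapped over several lines, which this does not handle.

```shell
# Print the largest slot count found on the "slots" line of
# `qconf -sq <queue>` output read on stdin.
max_queue_slots() {
  awk '
    $1 == "slots" {
      n = split($2, parts, ",")
      for (i = 1; i <= n; i++) {
        v = parts[i]
        gsub(/\[/, "", v)     # strip brackets: [host=8] -> host=8
        gsub(/\]/, "", v)
        sub(/^.*=/, "", v)    # keep only the number after "host="
        if (v + 0 > max) max = v + 0
      }
    }
    END { print max + 0 }'
}

# Usage on a cluster:
#   qconf -sq all.q | max_queue_slots
```

The printed number can then be compared with the number of slots the job requests through -pe smp.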

y.

bh.a...@gmail.com

Nov 20, 2015, 10:12:17 AM
to kaldi-help, bh.a...@gmail.com, dpo...@gmail.com
I have 2 GPU cards in each of my servers.
What should the value of slots be in the all.q config?
Is it the total number of CUDA cores on each node?
I have trouble assigning the number of slots, and I get the "cannot run in PE "smp" because it only offers 0 slots" error in the NNET step.
Also, I set gpu=4 as a complex value in my all.q because I have 4 GPUs in total.

Any help?