SGE and GPU

bh.a...@gmail.com

Nov 19, 2015, 10:30:00 AM

to kaldi-help

Hi,

I have installed SGE for parallel AM training.

We have two servers, each with two GPUs, and I can see all the hosts connected to the queue.

All the steps in run.sh are executed in parallel on the servers, and each server gets jobs and submits them properly.

However, the NNET training step stalls in pass0 without any further progress.

Note: I can run NNET on each machine using run.pl. It can use CUDA and the GPUs for training without any problem.

This is the configuration I use with queue.pl:

command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
option gpu=0 -q all.q

Any help?

Jan Trmal

Nov 19, 2015, 10:49:42 AM

to kaldi-help

Do you mean that the tasks finish and disappear from the qstat output, but no new tasks get submitted?

In that case, there might be an issue with NFS synchronization, since queue.pl relies on the existence of ".done" files. You could try to track what is happening with those files.

A more probable cause is that you still have a task in the queue that never gets executed because of a conflicting resource specification (qstat -j <job-id> might help you debug these issues).
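As a rough sketch of that debugging step (assuming a bash shell; the job id 12345 is a placeholder and scheduling_info is a helper name made up for this example), the relevant part of the qstat -j output can be isolated like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the "scheduling info" section of
# `qstat -j <job-id>` output, which is read from stdin.
# The section normally sits at the end of the output, so everything
# from the first matching line onwards is printed.
scheduling_info() {
  sed -n '/^scheduling info:/,$p'
}

# Usage on a machine with SGE installed (12345 is a placeholder job id):
#   qstat -j 12345 | scheduling_info
```

If nothing is printed, the scheduler may not be configured to report scheduling info for pending jobs.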

y.

--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

bh.a...@gmail.com

Nov 19, 2015, 10:55:29 AM

to kaldi-help

No.

qstat shows the job in the qw state, with no progress in the NNET pass.

I am not sure that the contents of my queue.conf are appropriate for using the GPUs on each node. How can I make sure? Is there a sample?

Thanks.

Jan Trmal

Nov 19, 2015, 11:05:19 AM

to kaldi-help

Did you read this?
http://kaldi-asr.org/doc/queue.html

There is a concrete example of what the queue.conf should look like. Your config seems slightly incomplete.
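For reference, here is a sketch of a fuller config, written from memory of the example on that documentation page, so please compare it with the page itself. Compared with the config posted earlier in the thread it adds a "default gpu=0" line and an "option gpu=*" line. The queue name g.q and the consumable resource named gpu are assumptions taken from that example; both have to exist in your own SGE setup. conf/queue.conf is the path queue.pl reads by default.

```shell
#!/usr/bin/env bash
# Write a sample conf/queue.conf for Kaldi's queue.pl.
# Assumptions: a queue called g.q exists for GPU jobs, and a consumable
# complex called "gpu" has been defined in SGE (e.g. via qconf -mc).
mkdir -p conf
cat > conf/queue.conf <<'EOF'
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0          # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1  # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0 -q all.q
option gpu=* -l gpu=$0 -q g.q
EOF
```

With the last line in place, a job submitted with --gpu 1 is translated into "-l gpu=1 -q g.q" on the qsub command line.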

As I said, qstat -j <jobid> could tell you why the job is not getting scheduled (see the "scheduling info" section).

y.

Behnam Asefisaray

Nov 19, 2015, 11:08:10 AM

to kaldi...@googlegroups.com

Yes, I think I read that carefully.

The scheduling info from qstat -j is:

scheduling info: cannot run in PE "smp" because it only offers 0 slots

Thanks.

--

Behnam ASEFISARAY

Computer Engineering Department

Hacettepe University

Ankara/Turkey 06800

E-mail: bh.a...@hacettepe.edu.tr

Jan Trmal

Nov 19, 2015, 11:15:25 AM

to kaldi-help

What does the output of "qconf -sp smp" look like?

Behnam Asefisaray

Nov 19, 2015, 11:16:54 AM

to kaldi...@googlegroups.com

pe_name            smp
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

Jan Trmal

Nov 19, 2015, 11:25:44 AM

to bh.a...@gmail.com, Dan Povey

Taking this off the list.

I cannot see your whole configuration, but this error (0 slots available) can occur when no machine in the given queue offers as many slots as the job requests.

Please check how many slots the task is requesting (from the qstat -j <job-id> output) and have a look at the queue config (qconf -sq <queue-name>). There should be at least one machine in the "slots" section that exports that many slots.
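A minimal sketch of that check, assuming bash and awk. The helper names max_queue_slots and queue_has_pe are made up for this example, and the parsing assumes the "slots" and "pe_list" entries each fit on one line of the qconf -sq output (long entries that qconf wraps with a trailing backslash are not handled). The second helper covers a related cause that is worth ruling out: a queue only runs jobs for a parallel environment that appears in its pe_list.

```shell
#!/usr/bin/env bash
# Both helpers read `qconf -sq <queue-name>` output from stdin.

# Print the largest slot count any host offers, e.g. for a line like
#   slots  1,[server1=8],[server2=4]
max_queue_slots() {
  awk '$1 == "slots" {
         n = split($2, items, ",")
         max = 0
         for (i = 1; i <= n; i++) {
           v = items[i]
           sub(/^.*=/, "", v)     # drop a leading "[host=" part
           gsub(/[^0-9]/, "", v)  # drop the closing bracket
           if (v + 0 > max) max = v + 0
         }
         print max
       }'
}

# Succeed (exit status 0) if the PE named in $1 is listed in pe_list.
queue_has_pe() {
  awk -v pe="$1" '$1 == "pe_list" {
                    for (i = 2; i <= NF; i++) if ($i == pe) found = 1
                  }
                  END { exit (found ? 0 : 1) }'
}

# Usage on a machine with SGE installed:
#   qconf -sq all.q | max_queue_slots
#   qconf -sq all.q | queue_has_pe smp && echo "smp is attached to all.q"
```

The number printed by max_queue_slots should be at least the value the job passes to "-pe smp", which queue.pl derives from the num_threads option.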

y.

bh.a...@gmail.com

Nov 20, 2015, 10:12:17 AM

to kaldi-help, bh.a...@gmail.com, dpo...@gmail.com

I have 2 GPU cards in each of my servers.

What should the value of slots be in the all.q config?

Is it the total number of CUDA cores on each node?

I am having trouble choosing the number of slots, and I get the "cannot run in PE "smp" because it only offers 0 slots" error in the NNET step.

Also, I set gpu=4 as a complex value in my all.q because I have 4 GPUs in total.

Any help?
