Running on a Multi-Node Multi-GPU setup using SLURM

Akshay Chandrashekaran

unread,
Jun 14, 2017, 1:36:13 PM6/14/17
to kaldi-help
Hi,
I am planning to run some of the nnet3 / chain model builds on a SLURM cluster I have access to. It has multiple machines, each with 4 GPUs. I was wondering whether anyone here has experience using utils/slurm.pl within Kaldi to leverage such a hardware setup for nnet3.

I am a relative newbie when it comes to SLURM, so any and all tips will be highly appreciated.


Thanks

Akshay

Xiang Li

unread,
Jun 15, 2017, 1:26:22 AM6/15/17
to kaldi-help
Once SLURM itself is set up, you first have to make sure there is a shared file system, such as NFS or Lustre, that is accessible from every node of the SLURM cluster.
Also, adjust the configuration lines in slurm.pl to match your cluster; for example, you have to change the argument of -p to the partition name used on your cluster.
Finally, I once ran into a situation where the line 'system("ls $qdir >/dev/null");' consumed too much time when there were too many *.log and *.sh files in the q directory, so I just commented that line out, and nothing bad has happened since.
Then I think you can replace run.pl with slurm.pl and try it.
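For reference, that swap normally happens in cmd.sh. A minimal sketch, assuming a cluster-specific config at conf/slurm.conf (see the next reply); the --mem / --gpu values here are typical defaults from the egs, not requirements:

export train_cmd="slurm.pl --config conf/slurm.conf"
export decode_cmd="slurm.pl --config conf/slurm.conf --mem 4G"
export cuda_cmd="slurm.pl --config conf/slurm.conf --gpu 1"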

On Thursday, June 15, 2017 at 1:36:13 AM UTC+8, Akshay Chandrashekaran wrote:

Jan Trmal

unread,
Jun 15, 2017, 1:34:51 AM6/15/17
to kaldi-help
You don't have to modify anything - just create conf/slurm.conf similarly to how you create conf/queue.conf for queue.pl (see kaldi docs for details). 
Let us know if there is anything that can't be done using the config. 
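For anyone looking for a starting point, a minimal conf/slurm.conf in the same format as the one quoted further down this thread might look like the sketch below. The partition names ("normal", "gpu") are placeholders that must match what sinfo reports on your cluster, and the GPU-job settings are just guesses:

command sbatch --export=PATH --ntasks-per-node=1 --partition=normal
option mem=* --mem-per-cpu=$0
option mem=0              # do not pass any memory option
option num_threads=* --cpus-per-task=$0
option num_threads=1 --cpus-per-task=1
option max_jobs_run=*     # nothing extra to pass to sbatch
option gpu=0
option gpu=* --partition=gpu --gres=gpu:$0 --cpus-per-task=6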

To the OP: I have used SLURM on a cluster with GPUs and didn't have any issues -- you will have to figure out the options (partitions and job limits), but then it should work reliably.
Y. 


David van Leeuwen

unread,
Jun 15, 2017, 7:14:10 AM6/15/17
to kaldi-help
Be aware that not all scripts have been rewritten in the queueing-system-agnostic form yet; watch out for options like "-tc" in the various "io-opts" options, which appear to be SGE-isms.
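As a concrete illustration of such an SGE-ism (the value 10 is hypothetical):

--io-opts "-tc 10"      # "-tc" is SGE's cap on concurrently running array tasks
--max-jobs-run 10       # queueing-system-agnostic form, mapped by queue.pl/slurm.pl through the config file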

Jan Trmal

unread,
Jun 15, 2017, 7:18:16 AM6/15/17
to kaldi-help
Yes, I agree, but those should be mostly gone (except for babel) in the Kaldi default egs. If there are some we omitted, please let us know. 
Y


hhamm...@gmail.com

unread,
Dec 1, 2020, 1:41:09 AM12/1/20
to kaldi-help
Can anyone please give me a working slurm.conf configuration file? SLURM seems to be allocating resources, but only one node is spiking in CPU load. The other nodes are doing nothing.

Any help is really appreciated!

For reference, here are the config files:
- I submit "run.sh" itself through sbatch (so it runs under SLURM). Here is the header of "run.sh":
#SBATCH --job-name=LDA_Kaldi
#SBATCH --account=ID
#SBATCH --output=lda.out

#SBATCH --partition=normal
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=60000
#SBATCH --time=1-00:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=EMAIL

- Here is my slurm.conf:
command sbatch --export=PATH --ntasks-per-node=1 --partition=normal --nodes=8 --mem=64000 --time=24:00:00 --job-name=KALDI --account=ID
option mem=* --mem-per-cpu=$0
option mem=0          # Do not add anything to qsub_opts
option num_threads=* --cpus-per-task=$0 --ntasks-per-node=1
option num_threads=1 --cpus-per-task=1  --ntasks-per-node=1 # Do not add anything to qsub_opts
option max_jobs_run=*     # Do nothing
option gpu=* -N1 -n1 -p gpu --mem=4GB --gres=gpu:$0 --cpus-per-task=6 --time=72:0:0  # in reality, we probably should have --cpus-per-task=$((6*$0))
option gpu=0

and here is my cmd.sh:
export train_cmd="slurm.pl --config conf/slurm.conf"

Where am I going wrong?

Thank you!

Jan Trmal

unread,
Dec 2, 2020, 5:17:03 AM12/2/20
to kaldi-help
The modifications in the first part (the #SBATCH header of run.sh) are not of any importance -- you should be able to remove them.

As for your issue -- it might be that you just don't have the right expectations about how Kaldi behaves -- there can be quite a few tasks that run on a single node / as a single job, and some things even run locally (without SLURM at all). As long as it is not giving any SLURM errors, you should be fine, I think.
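To illustrate (the log path below is made up): the parallelism in Kaldi comes from the JOB range that each script hands to $train_cmd, with each JOB index submitted to SLURM as its own task, so a stage that only splits into a few jobs will only ever keep a few nodes busy:

slurm.pl --config conf/slurm.conf JOB=1:4 exp/demo/log/test.JOB.log \
  echo "this is task JOB"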
y.


hhamm...@gmail.com

unread,
Dec 2, 2020, 5:23:23 AM12/2/20
to kaldi-help
Thank you, but it does not really make sense that a script (for example train_mono) would be faster using run.pl than running the same script across 7 additional nodes using slurm.pl. Does it? Unless I am mistaken, parallelization should make the computation faster, right?

Hussein.

Jan Trmal

unread,
Dec 2, 2020, 6:12:49 AM12/2/20
to kaldi-help
Well, that depends on many things -- how long the jobs stay in the queue before being scheduled (run), whether all the parallel tasks actually get run at the same time, and so on...
I think your local SLURM admins might be able to help more efficiently than I can, sorry.
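If you want to check this, the standard SLURM tools are usually enough; for example, while a Kaldi stage is running:

squeue -u $USER      (are the parallel tasks running concurrently, or still queued?)
sacct -u $USER --format=JobID,JobName,Start,End,NodeList,State      (start/end times and nodes, after the fact)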
y.
