[slurm-dev] Gres issue

Dmitrij S. Kryzhevich

Nov 16, 2016, 4:35:28 AM
to slurm-dev

Hi,

I have some issues with gres usage. I'm running Slurm version 16.05.4
on a small test stand with 4 nodes plus a master. The best way to describe
it is to paste the configs:
slurm.conf: http://paste.org.ru/?m8v7ca
gres.conf: http://paste.org.ru/?ouspnz
Both files are copied to every node.

The problem is the following:

[dkryzhevich@gpu ~]$ srun -N 1 --gres gpu:c2050 <whatever>
srun: error: Unable to allocate resources: Requested node configuration
is not available
[dkryzhevich@gpu ~]$

Relevant logs: http://paste.org.ru/?mj4dfs
Whatever I do with the --gres flag, the job just does not start. What am I
missing here?

I tried removing the Type column from gres.conf, and all nodes went
into the "drain" state. I then also removed all the details from the Gres
field in slurm.conf (i.e. "NodeName=node2 Gres=gpu:1 CoresPerSocket=2
ThreadsPerCore=2 State=UNKNOWN"), and the job was submitted, but I want the
ability to specify the type of card in case I really need it.

And two small unrelated questions.
1. Is it possible to submit a job from any node, or only from the master?
Maybe by starting a secondary slurmctld daemon on each node, I don't know.
2. Is it possible to start a job on two separate nodes with nvidia cards
with something like
$ srun --gres gpu:2
? The point is to use 2-3-4 cards installed on different nodes with some
MPI connection between threads.

BR,
Dmitrij

Michael Di Domenico

Nov 16, 2016, 8:04:22 AM
to slurm-dev

this might be nothing, but i usually call --gres with an equals

srun --gres=gpu:k10:8

i'm not sure if the equals is optional or not

Christopher Samuel

Nov 16, 2016, 7:31:53 PM
to slurm-dev

On 17/11/16 00:04, Michael Di Domenico wrote:

> this might be nothing, but i usually call --gres with an equals
>
> srun --gres=gpu:k10:8
>
> i'm not sure if the equals is optional or not

It depends on the library used to pass options, I'm used to it being
mandatory but apparently with Slurm it's not - just tested it out and using:

--gres mic

results in my job being scheduled on a Phi node with OFFLOAD_DEVICES=0
set in its environment.

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

Christopher Samuel

Nov 16, 2016, 7:35:01 PM
to slurm-dev

On 17/11/16 11:31, Christopher Samuel wrote:

> It depends on the library used to pass options,

Oops - that should be parse, not pass.

Need more caffeine..

Eliot Eshelman

Nov 21, 2016, 1:16:42 PM
to slurm-dev
I have experienced similar issues. I can assure you version 16.05 supports heterogeneous mixes of GPUs (even different GPUs within the same node).

Please check and double-check the following:
  • slurm.conf is the same across all nodes
  • gres.conf is correct for each node
  • The ordering of the GPUs matches in slurm.conf and gres.conf

Finally, be sure to do a full SLURM service restart on the Head Node and all Compute Nodes. While this may not always be necessary, I had some cases where a full restart was needed for jobs to be accepted.
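
The checks above can be sketched in shell roughly as follows. This is only a sketch under assumptions: the node names node1..node4, the path /etc/slurm/slurm.conf, passwordless ssh and the systemd unit names are not taken from this thread and will differ between installations. The helper all_checksums_equal is a hypothetical name introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: node names, config path and service names below are assumptions.

# Reads "checksum  label" lines on stdin (md5sum output format) and
# succeeds only if every line carries the same checksum.
all_checksums_equal() {
    awk '{ seen[$1] = 1 } END { n = 0; for (k in seen) n++; exit (n > 1) }'
}

# Hypothetical usage on the master:
#   for n in node1 node2 node3 node4; do
#       ssh "$n" md5sum /etc/slurm/slurm.conf
#   done | all_checksums_equal && echo "slurm.conf is identical everywhere"
#
# What the controller believes each node offers:
#   scontrol show node node2 | grep -i gres
#   sinfo -N -o "%N %G"
#
# Full restart (unit names depend on how Slurm was packaged):
#   systemctl restart slurmctld     # on the master
#   systemctl restart slurmd        # on every compute node
```

The comparison is kept in a small function so it can be fed from any source of checksum lines, not only from ssh.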

Best,
Eliot
--
Eliot Eshelman
Microway, Inc.

Dmitrij S. Kryzhevich

Nov 22, 2016, 12:28:05 AM
to slurm-dev
Thanks for the reply!

> I have experienced similar issues. I can assure you version 16.05 supports heterogeneous mixes of GPUs (even different GPUs within the same node).
>
> Please check and double-check the following:
>   • slurm.conf is the same across all nodes

Tested. I have a small script that does that for me, and it's just too simple to get wrong. Anyway, it was rechecked. Twice.

>   • gres.conf is correct for each node

Here... well... I just don't know how to verify it. In the slurmctld startup logs all the hardware is listed correctly. Is that enough?

>   • The ordering of the GPUs matches in slurm.conf and gres.conf

It's a test stand. Only one GPU on each node (with one exception). There is no room to go wrong. Anyway, checked.

> Finally, be sure to do a full SLURM service restart

Done. Nothing changed.

I believe I'm missing something very simple. But what is it?

BR,
Dmitrij

Dmitrij S. Kryzhevich

Nov 22, 2016, 12:48:45 AM
to slurm-dev

I set
SelectType=select/cons_res
and it started working. What could be wrong with select/linear here?
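
For reference, a minimal sketch of how a typed GPU setup with the cons_res selector might look. Only SelectType=select/cons_res, the node name node2, the c2050 type and the CoresPerSocket/ThreadsPerCore values come from this thread. SelectTypeParameters=CR_Core and the device path /dev/nvidia0 are assumptions, and the actual files are the ones in the pastes linked in the first message.

```
# slurm.conf (fragment, sketch only)
SelectType=select/cons_res
SelectTypeParameters=CR_Core      # assumption: not stated in the thread
GresTypes=gpu
NodeName=node2 Gres=gpu:c2050:1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN

# gres.conf on node2 (fragment, sketch only)
Name=gpu Type=c2050 File=/dev/nvidia0   # assumption: device path
```

With entries like these, a request such as `srun -N 1 --gres=gpu:c2050:1 <whatever>` names a type and count that appear in both files.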