Hi,
I have some issues with gres usage. I'm running Slurm version 16.05.4
on a small test setup with 4 nodes plus a master. The best way to
describe it is to paste the configs:
slurm.conf:
http://paste.org.ru/?m8v7ca
gres.conf:
http://paste.org.ru/?ouspnz
The same files are deployed on every node.
The problem is the following:
[dkryzhevich@gpu ~]$ srun -N 1 --gres gpu:c2050 <whatever>
srun: error: Unable to allocate resources: Requested node configuration is not available
[dkryzhevich@gpu ~]$
Relevant logs:
http://paste.org.ru/?mj4dfs
Whatever I do with the --gres flag, the job just does not start. What
am I missing here?
I tried removing the Type column from gres.conf, and all nodes went
into the "drain" state. I then also removed all details from the Gres
column in slurm.conf (i.e. "NodeName=node2 Gres=gpu:1 CoresPerSocket=2
ThreadsPerCore=2 State=UNKNOWN"), and the job was submitted, but I want
the ability to specify the type of card in case I really need it.
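To make the goal concrete, here is a minimal sketch of the typed setup
I am after for a single node. The device path /dev/nvidia0 and the
count of one card are assumptions for illustration only; the actual
files are in the pastes above.
slurm.conf (sketch):
GresTypes=gpu
NodeName=node2 Gres=gpu:c2050:1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
gres.conf (sketch, device path assumed):
Name=gpu Type=c2050 File=/dev/nvidia0
With something like that in place, I would expect a request such as
"--gres=gpu:c2050:1" to be accepted.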
And two small unrelated questions.
1. Is it possible to submit a job from any node, or only from the
master? Maybe by starting a secondary slurmctld daemon on each node, I
don't know.
2. Is it possible to start a job on two separate nodes with NVIDIA
cards, with something like
$ srun --gres gpu:2
? The point is to use 2 to 4 cards installed on different nodes, with
MPI communication between the processes.
BR,
Dmitrij