[slurm-users] Submitting to multiple partitions problem with gres specified


Bas van der Vlies

Mar 8, 2021, 11:30:11 AM
to Slurm User Community List
Hi,

On this cluster I have version 20.02.6 installed. We have different
partitions per CPU type and GPU type. We want to make it easy for users
who do not care where their job runs, while experienced users can
specify the GRES type: cpu_type or gpu.

I have defined 2 cpu partitions:
* cpu_e5_2650_v1
* cpu_e5_2650_v2

and 2 gres cpu_type:
* e5_2650_v1
* e5_2650_v2


When no partition is specified it will submit to both partitions:
* srun --exclusive --gres=cpu_type:e5_2650_v1 --pty /bin/bash --> runs on
r16n18, which has this gres defined and is in partition cpu_e5_2650_v1

Now I submit another job at the same time:
* srun --exclusive --gres=cpu_type:e5_2650_v1 --pty /bin/bash

This fails with: `srun: error: Unable to allocate resources: Requested
node configuration is not available`

I would expect it to get queued in the partition `cpu_e5_2650_v1`.


When I specify the partition on the command line:
* srun --exclusive -p cpu_e5_2650_v1_shared
--gres=cpu_type:e5_2650_v1 --pty /bin/bash

srun: job 1856 queued and waiting for resources


So the question is: can Slurm handle submitting to multiple partitions
when we specify GRES attributes?

Regards


--
Bas van der Vlies
| HPCV Supercomputing | Internal Services | SURF |
https://userinfo.surfsara.nl |
| Science Park 140 | 1098 XG Amsterdam | Phone: +31208001300 |
| bas.van...@surf.nl

Bas van der Vlies

Mar 8, 2021, 1:50:37 PM
to slurm...@lists.schedmd.com
Same problem with 20.11.4:
```
[2021-03-08T19:46:09.378] _pick_best_nodes: JobId=1861 never runnable in
partition cpu_e5_2650_v2
[2021-03-08T19:46:09.378] debug2: job_allocate: setting JobId=1861 to
"BadConstraints" due to a flaw in the job request (Requested node
configuration is not available)
[2021-03-08T19:46:09.378] _slurm_rpc_allocate_resources: Requested node
configuration is not available
```

Prentice Bisbal

Mar 8, 2021, 4:02:38 PM
to slurm...@lists.schedmd.com
Rather than specifying the processor types as GRES, I would recommend
defining them as features of the nodes and letting the users specify the
features as constraints on their jobs. Since the newer processors are
backwards compatible with the older processors, list the older
processors as features of the newer nodes, too.

For example, say you have some nodes that support AVX512, and some that
only support AVX2. node01 is older and supports only AVX2. Node02 is
newer and supports AVX512, but is backwards compatible and supports
AVX2. I would have something like this in my slurm.conf file:

NodeName=node01 Feature=avx2 ...
NodeName=node02 Feature=avx512,avx2 ...
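
A user who needs a particular instruction set then just asks for it as a
constraint, for example (job.sh here is only a placeholder for whatever
batch script they submit):

sbatch --constraint=avx512 job.sh
sbatch --constraint=avx2 job.sh

With the feature lists above, the avx512 request can only go to node02,
while the avx2 request can land on either node.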

I have a very heterogeneous cluster with several different generations of
AMD and Intel processors, and we use this method quite effectively.

If you want to continue down the road you've already started on, can you
provide more information, like the partition definitions and the gres
definitions? In general, Slurm should support submitting to multiple
partitions.

Prentice

Ward Poelmans

Mar 9, 2021, 3:17:01 AM
to slurm...@lists.schedmd.com
Hi Prentice,

On 8/03/2021 22:02, Prentice Bisbal wrote:

> I have a very heterogeneous cluster with several different generations of
> AMD and Intel processors, and we use this method quite effectively.

Could you elaborate a bit more on how you manage that? Do you force your
users to pick a feature? What if a user submits a multi-node job, can
you make sure it will not start on a mix of avx512 and avx2 nodes?

> If you want to continue down the road you've already started on, can you
> provide more information, like the partition definitions and the gres
> definitions? In general, Slurm should support submitting to multiple
> partitions.

As far as I understood it, you can give a comma-separated list of
partitions to sbatch, but it's not possible to do this by default?

Ward

Ewan Roche

Mar 9, 2021, 3:37:40 AM
to Slurm User Community List
Hello Ward,
as a variant on what has already been suggested, we also have the CPU type as a feature:

Feature=E5v1,AVX
Feature=E5v1,AVX
Feature=E5v3,AVX,AVX2
Feature=S6g1,AVX,AVX2,AVX512

This allows people that want the same architecture, and not just the same instruction set, for a multi-node job to say:

sbatch --constraint=E5v1

Apart from the multiple-partitions approach, another hack/workaround is to abuse the topology plugin to create fake switches with the nodes of each CPU type connected and no links between these switches.

Switchname=sw0 Nodes=node[01-02,06-07]
Switchname=sw1 Nodes=node[03-05,08-10]
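
For this to have any effect, the tree topology plugin has to be active;
if I remember correctly that is just the following line in slurm.conf:

TopologyPlugin=topology/tree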

As there is no link between these “switches” Slurm will never schedule a job on node01 and node03.

Ewan Roche

Division Calcul et Soutien à la Recherche
UNIL | Université de Lausanne

Bas van der Vlies

Mar 9, 2021, 3:45:53 AM
to Slurm User Community List, Prentice Bisbal
Hi Prentice,

Answers inline

On 08/03/2021 22:02, Prentice Bisbal wrote:
> Rather than specifying the processor types as GRES, I would recommend
> defining them as features of the nodes and letting the users specify the
> features as constraints on their jobs. Since the newer processors are
> backwards compatible with the older processors, list the older
> processors as features of the newer nodes, too.
>
We already do this with features on our other cluster. We assign the nodes
different features and users select these. I can add a new feature for the
CPU type. Sometimes you want avx512 and a specific processor.

On the other cluster we have 5 different GPUs and a lot of partitions. I
want to make it simple for our users, so we have a 'job_submit.lua'
script that submits to multiple partitions, and if the user specifies the
GRES type then Slurm selects the right partition(s).
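
As an illustration only (this is a stripped-down sketch, not our
production script, and the job_desc field names differ a bit between
Slurm versions), the core of such a job_submit.lua is roughly:

```
-- job_submit.lua sketch: jobs submitted without an explicit partition are
-- offered to all CPU partitions; the --gres request then decides which of
-- those partitions can actually run the job.
function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.partition == nil then
      job_desc.partition = "cpu_e5_2650_v1,cpu_e5_2650_v2"
      slurm.log_info("slurm_job_submit: no partition given, using %s",
                     job_desc.partition)
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end
```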

On this cluster we do not have GPUs, but I can test with another GRES
type, 'cpu_type'. I think the last partition in the list determines the
behavior. So if I use a GRES that is supported by the last partition, the
job gets queued:
* srun -N1 --gres=cpu_type:e5_2650_v2 --pty /bin/bash
* srun --exclusive --gres=cpu_type:e5_2650_v2 --pty /bin/bash

srun: job 1865 queued and waiting for resources

So to me it seems that one of the partitions is BUSY but could run the job.
I will test it on our GPU cluster but expect the same behaviour.


>
> If you want to continue down the road you've already started on, can you
> provide more information, like the partition definitions and the gres
> definitions? In general, Slurm should support submitting to multiple
> partitions.

slurm.conf:
```
PartitionName=cpu_e5_2650_v1 DefMemPerCPU=11000 Default=No DefaultTime=5
DisableRootJobs=YES MaxNodes=2 MaxTime=5-00 Nodes=r16n[18-20]
OverSubscribe=EXCLUSIVE QOS=normal State=UP

PartitionName=cpu_e5_2650_v2 DefMemPerCPU=11000 Default=No DefaultTime=5
DisableRootJobs=YES MaxNodes=2 MaxTime=5-00 Nodes=r16n[21-22]
OverSubscribe=EXCLUSIVE QOS=normal State=UP

NodeName=r16n18 CoresPerSocket=8 Features=sandybridge,sse4,avx
Gres=cpu_type:e5_2650_v1:no_consume:4T MemSpecLimit=1024
NodeHostname=r16n18.mona.surfsara.nl RealMemory=188000 Sockets=2
State=UNKNOWN ThreadsPerCore=1 Weight=10

NodeName=r16n21 CoresPerSocket=8 Features=sandybridge,sse4,avx
Gres=cpu_type:e5_2650_v2:no_consume:4T MemSpecLimit=1024
NodeHostname=r16n21.mona.surfsara.nl RealMemory=188000 Sockets=2
State=UNKNOWN ThreadsPerCore=1 Weight=10
```

gres.conf:
```
NodeName=r16n[18-20] Count=4T Flags=CountOnly Name=cpu_type Type=e5_2650_v1
NodeName=r16n[21-22] Count=4T Flags=CountOnly Name=cpu_type Type=e5_2650_v2
```

Bas van der Vlies

Mar 9, 2021, 8:21:47 AM
to Slurm User Community List, Prentice Bisbal
I have found the problem and will submit a patch. If we find a partition
where the job can run but all nodes are busy, save this state and return
it when all partitions have been checked and the job cannot run in any of
them.

I do not know if this is the right approach.

regards

Bas van der Vlies

Mar 9, 2021, 9:11:00 AM
to Slurm User Community List, Prentice Bisbal
For those who are interested:
* https://bugs.schedmd.com/show_bug.cgi?id=11044

Prentice Bisbal

Mar 12, 2021, 4:30:38 PM
to slurm...@lists.schedmd.com

On 3/9/21 3:16 AM, Ward Poelmans wrote:

> Hi Prentice,
>
> On 8/03/2021 22:02, Prentice Bisbal wrote:
>
>> I have a very heterogeneous cluster with several different generations of
>> AMD and Intel processors, and we use this method quite effectively.
>
> Could you elaborate a bit more on how you manage that? Do you force your
> users to pick a feature? What if a user submits a multi-node job, can
> you make sure it will not start on a mix of avx512 and avx2 nodes?

I don't force the users to pick a feature, and to make matters worse, I think our login nodes are newer than some of the compute nodes, so it's entirely possible that if someone really optimizes their code for one of the login nodes, their job could get assigned to a node that doesn't understand the instruction set, resulting in the dreaded "Illegal Instruction" error. Surprisingly, this has only happened a few times in the 5 years I've been at this job.

I assume most users would want to use the newest and fastest processors if given the choice, so I set the priority weighting of the nodes so that the newest nodes are highest priority, and the oldest nodes the lowest priority.

The only way to make sure the processors stick to a certain instruction set is if they specify the processor model rather than the instruction set family. For example:

-C 7281 will get you only AMD EPYC 7281 processors

and

-C 6376 will get you only AMD Opteron 6376 processors

Using your example, if you don't want to mix AVX2 and AVX512 processors in the same job ever, you can "lie" to Slurm in your topology file and come up with a topology where the two subsets of nodes can't talk to each other. That way, Slurm will not mix nodes of the different instruction sets. The problem with this is that it's a "permanent" solution - it's not flexible. I would imagine there are times when you would want to use both your AVX2 and AVX512 processors in a single job.

I do something like this because we have 10 nodes set aside for serial jobs that are connected only by 1 GbE. We obviously don't want internode jobs running there, so in my topology file, each of those nodes has its own switch that's not connected to any other switch.
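
As an illustrative sketch (node and switch names made up here, not my
real config), the topology.conf entries for that serial pool look
something like:

SwitchName=serial01 Nodes=serialnode01
SwitchName=serial02 Nodes=serialnode02
SwitchName=core Nodes=computenode[001-100]

Because the serial "switches" are not connected to anything else, Slurm
can never build a multi-node allocation that includes those nodes.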


>> If you want to continue down the road you've already started on, can you
>> provide more information, like the partition definitions and the gres
>> definitions? In general, Slurm should support submitting to multiple
>> partitions.
>
> As far as I understood it, you can give a comma-separated list of
> partitions to sbatch, but it's not possible to do this by default?


Incorrect. Giving a comma-separated list is possible and is the default behavior for Slurm. From the sbatch documentation (emphasis added to the relevant sentence):

-p, --partition=<partition_names>
Request a specific partition for the resource allocation. If not specified, the default behavior is to allow the slurm controller to select the default partition as designated by the system administrator. *If the job can use more than one partition, specify their names in a comma separated list and the one offering earliest initiation will be used with no regard given to the partition name ordering (although higher priority partitions will be considered first).* When the job is initiated, the name of the partition used will be placed first in the job record partition string.
Note that you can't have a job *span* multiple partitions, but I don't think that was ever your goal.
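
So, to reuse the names from earlier in this thread, something like

sbatch -p cpu_e5_2650_v1,cpu_e5_2650_v2 --gres=cpu_type:e5_2650_v1 job.sh

(job.sh being just a placeholder script) is accepted and should start in
whichever of the two partitions can satisfy the request first, which is
exactly the combination that was tripping over the bug Bas reported.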


Prentice
