[slurm-users] GPU shards not exclusive

Reed Dier via slurm-users

Feb 14, 2024, 5:42:30 PM
to Slurm User Community List
I seem to have run into an edge case where I’m able to oversubscribe a specific subset of GPUs on one host in particular.

Slurm 22.05.8
Ubuntu 20.04
cgroups v1 (ProctrackType=proctrack/cgroup)

It seems to be partly a corner case with a couple of caveats.
This host has 2 different GPU types in the same host.
All GPUs are configured with 6 shards/gpu.

NodeName=gpu03              CPUs=88             RealMemory=768000   Sockets=2           CoresPerSocket=22       ThreadsPerCore=2    State=UNKNOWN   Feature=avx512      Gres=gpu:rtx:2,gpu:p40:6,shard:48

My gres.conf is auto-nvml, with nothing else in the file.

AutoDetect=nvml
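
For reference, an explicit gres.conf pinning shards to devices would presumably look something like the sketch below; the device-to-type mapping and paths here are a guess, not pulled from this host:

NodeName=gpu03 Name=gpu Type=rtx File=/dev/nvidia[0-1]
NodeName=gpu03 Name=gpu Type=p40 File=/dev/nvidia[2-7]
# 48 shards total, split evenly across the 8 GPUs (6 per GPU)
NodeName=gpu03 Name=shard Count=48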

Schedule and select from slurm.conf
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
GresTypes=gpu,shard

Here’s a copy of slurmd’s log starting up with debug logging: https://pastebin.com/fbriTAZD
I checked there to make sure it was allocating an even 6 shards per GPU as intended, and it was.
The only thing odd about the rtx cards is that they have an NVLink bridge between them, but I wouldn’t expect that to have any impact with respect to shards.
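
A quicker way to eyeball the detected shard-to-GPU mapping than combing through the debug log should be to have slurmd print its GRES configuration and exit:

slurmd -G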

So far it only appears to manifest when a job requests the GPU by its named type (--gres gpu:rtx:1), not as a generic gres (--gres gpu:1).
It also doesn't occur with any count of the named p40 gres, only with the rtx gres.

The below was as “high” as I was able to capture the oversubscription in very quick testing.
You can see there are 2 rtx GPUs and 43 shards allocated, which should be an impossible combination: if an rtx correctly equals 6 shards, then with 48 shards total, allocating both rtx cards should reduce the available shard pool to 36.

NodeName=gpu03 Arch=x86_64 CoresPerSocket=22
   CPUAlloc=88 CPUEfctv=88 CPUTot=88 CPULoad=0.27
   AvailableFeatures=avx512
   ActiveFeatures=avx512
   Gres=gpu:p40:6(S:0),gpu:rtx:2(S:0),shard:p40:36(S:0),shard:rtx:12(S:0)
   NodeAddr=gpu03 NodeHostName=gpu03 Version=22.05.8
   OS=Linux 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023
   RealMemory=768000 AllocMem=45056 FreeMem=563310 Sockets=2 Boards=1
   State=ALLOCATED+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu,gpu-prod
   BootTime=2024-02-02T14:17:07 SlurmdStartTime=2024-02-14T16:28:17
   LastBusyTime=2024-02-14T16:38:03
   CfgTRES=cpu=88,mem=750G,billing=88,gres/gpu=8,gres/gpu:p40=6,gres/gpu:rtx=2,gres/shard=48
   AllocTRES=cpu=88,mem=44G,gres/gpu=2,gres/gpu:rtx=2,gres/shard=43
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

The above is one rtx:2 job and 43 shard:1 jobs, which is as many as I can seem to push through with the CPU cores available on the system; I expect that with 98-100 cores I could oversubscribe it completely.

If I instead submit (1) rtx:2 + (24) shard:2, it correctly limits the node to 36 shards, which is odd, because the case that led us to discover this was 2 rtx:1 jobs running for multiple days that somehow had shards scheduled onto the same GPUs.
   AllocTRES=cpu=40,mem=20G,gres/gpu=2,gres/gpu:rtx=2,gres/shard=36
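
A rough cross-check of the AllocTRES shard number is to sum the shard requests of the running jobs on the node, e.g. with something like the below (the exact formatting of squeue's %b gres field may vary by version, so treat it as a sketch):

# sum the shard counts requested by running jobs on gpu03
squeue -h -t RUNNING -w gpu03 -o '%b' | grep -o 'shard:[0-9]*' | cut -d: -f2 | paste -sd+ - | bc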

If I flip the order around and submit the shard jobs before the rtx job, then I'm unable to oversubscribe it.
If I add any P40 jobs before the rtx and shard jobs, I'm also unable to oversubscribe.

This sounded similar to Bug 16484, although they claimed it was resolved by using auto-nvml, which should rule this node out, since that's what it already uses.

Here’s my very simple bash script that just sruns a sleep 20, so the jobs run long enough to observe the behavior.
#!/bin/bash
# 1 job requesting both rtx GPUs
for ((i=1; i<=1; ++i)); do
    srun --gres gpu:rtx:2 --reservation=shardtest -J "RTX   $i" sleep 20 &
done
sleep 1
# 43 jobs requesting 1 shard each
for ((i=1; i<=43; ++i)); do
    srun --gres shard:1 --reservation=shardtest -J "shard $i" sleep 20 &
done
sleep 5 ; scontrol show node gpu03
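
To watch it happen live, the script can be run in one terminal with something like this in another:

watch -n1 'scontrol show node gpu03 | grep AllocTRES'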

Curious if anyone has any ideas beyond trying a newer release?

Thanks,
Reed

wdennis--- via slurm-users

Feb 28, 2024, 8:29:54 AM
to slurm...@lists.schedmd.com
Hi Reed,

Unfortunately, we had the same issue with 22.05.9; SchedMD's advice was to upgrade to 23.11.x, and this appears to have resolved the issue for us. SchedMD support told us, "We did a lot of work regarding shards in the 23.11 release."

HTH,
Will

Reed Dier via slurm-users

Feb 29, 2024, 12:31:32 PM
to wde...@nec-labs.com, slurm...@lists.schedmd.com
Hi Will,

I appreciate your corroboration.

After we upgraded to 23.02.$latest, it actually seemed easier to reproduce than before.
However, the issue appears to have subsided, and the only change I can potentially attribute that to was turning on
SlurmctldParameters=rl_enable
in slurm.conf.
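
For anyone wanting to confirm the parameter is active after the change, something like this should show it in the running config:

scontrol show config | grep -i SlurmctldParameters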

And here’s hoping that 23.11 will offer even more in the future.

Reed