But if I also configure the GPUs by name (type) in slurm.conf, like this:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:RTX5000:1,shard:88 Feature=gpu,ht
and run 7 jobs, each requesting 12 shards, it does NOT work. Slurm starts two jobs on each of the first two A5000s, two jobs on the RTX5000, and only one job on the last A5000. Strangely, it still knows that it should not
start more jobs - subsequent jobs remain queued.
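One way to double-check how slurmd itself pairs the shards with the typed GPUs is to have it print the GRES configuration it detects and exit (just a diagnostic sketch; run it on the node itself):

# Print the GRES configuration exactly as slurmd detects it on this node, then exit.
# It should list the typed GPUs and the shard count attached to each device.
sudo slurmd -G

For reference, this is the nvidia-smi process list while the seven jobs are running: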
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 904552 C ...ing_proj/.venv/bin/python 204MiB |
| 0 N/A N/A 1176564 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 0 N/A N/A 1176565 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 1 N/A N/A 1176562 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 1 N/A N/A 1176566 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 2 N/A N/A 1176560 C ...-2020-ubuntu20.04/bin/gmx 172MiB |
| 2 N/A N/A 1176561 C ...-2020-ubuntu20.04/bin/gmx 172MiB |
| 3 N/A N/A 1176563 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
+-----------------------------------------------------------------------------+
It is also strange that "scontrol show node" seems to list the shards correctly, even in this case:
NodeName=koala Arch=x86_64 CoresPerSocket=14
CPUAlloc=0 CPUEfctv=56 CPUTot=56 CPULoad=22.16
AvailableFeatures=gpu,ht
ActiveFeatures=gpu,ht
Gres=gpu:A5000:3(S:0-1),gpu:RTX5000:1(S:0-1),shard:A5000:72(S:0-1),shard:RTX5000:16(S:0-1)
NodeAddr=10.194.132.190 NodeHostName=koala Version=22.05.7
OS=Linux 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022
RealMemory=1 AllocMem=0 FreeMem=390036 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=urgent,high,medium,low
BootTime=2023-01-03T12:37:17 SlurmdStartTime=2023-01-05T16:24:53
LastBusyTime=2023-01-05T16:37:24
CfgTRES=cpu=56,mem=1M,billing=56,gres/gpu=4,gres/shard=88
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
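To see which physical GPU each job's shards were actually bound to, I assume the detailed job view reports the GRES device index for shards the same way it does for whole GPUs, so a check along these lines should show the placement:

# For every running job of mine, print its ID and the detailed GRES assignment;
# with -d, scontrol includes the index of the device(s) backing the allocation.
for j in $(squeue -h -u "$USER" -t R -o %i); do
    scontrol -d show job "$j" | grep -E 'JobId|GRES'
done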
In all cases, my jobs are submitted with commands like this:
sbatch --gres=shard:12 --wrap 'bash -c " ... (command goes here) ... "'
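As a self-contained reproducer (the sleep payload below is only a stand-in for my real command), submitting seven such jobs in a row is enough to trigger the placement shown above:

# Submit 7 jobs, each requesting 12 shards (84 of the 88 configured).
# "sleep 600" is a placeholder workload used purely for illustration.
for i in $(seq 1 7); do
    sbatch --gres=shard:12 --wrap 'bash -c "sleep 600"'
done

The resulting shard-to-GPU assignment can then be inspected with the scontrol check above, independent of what the payload actually does on the GPU.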
The behavior is very consistent. I have played around with adding CUDA_DEVICE_ORDER=PCI_BUS_ID to the environment of slurmd and slurmctld,
but it makes no difference.
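For reference, one way to inject that variable into both daemons is a systemd drop-in along these lines (the unit names here are assumed to be the stock slurmd.service and slurmctld.service):

# /etc/systemd/system/slurmd.service.d/cuda-order.conf
# (same drop-in under slurmctld.service.d/ on the controller node)
[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"

followed by "systemctl daemon-reload" and a restart of both daemons.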
Is this a bug or a feature?