[slurm-users] Slurm and MIG configuration help

Edoardo Arnaudo

Apr 12, 2023, 5:19:23 PM
to slurm...@lists.schedmd.com
Hi all! I've successfully configured Slurm with one head node and two different compute nodes: one using "old" consumer RTX cards, and a newer one using 4x A100 GPUs (the 80 GB version).
I am now trying to set up a hybrid MIG configuration, where devices 0,1 are kept as they are, while 2 and 3 are each split into 3g.40gb MIG instances.

MIG itself works well: I am able to keep MIG disabled on 0,1 and enabled on 2,3, each split into 2x 40gb instances.
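For reference, the MIG side was set up roughly along these lines (3g.40gb should be GPU instance profile ID 9 on the 80 GB A100, but double-check with nvidia-smi mig -lgip on your system):

```shell
# Enable MIG mode on devices 2 and 3 only (may require a GPU reset)
nvidia-smi -i 2,3 -mig 1
# Create two 3g.40gb GPU instances (profile ID 9) on each MIG-enabled
# device; -C also creates the default compute instance inside each
nvidia-smi mig -i 2,3 -cgi 9,9 -C
```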
Configuring Slurm on top of this has me lost: I've tried countless variations, but not a single one has worked so far.
Here's what I have at the moment:

- My gres.conf has gone from the full device list to literally just "AutoDetect=nvml"; slurmd -G returns a somewhat reasonable output:

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: Gres Name=gpu Type=a100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=24-31 CoreCnt=128 Links=-1,4,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=283 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283 Cores=56-63 CoreCnt=128 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=418 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap417,/dev/nvidia-caps/nvidia-cap418 Cores=40-47 CoreCnt=128 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=8-15 CoreCnt=128 Links=4,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=292 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 Cores=56-63 CoreCnt=128 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=427 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 Cores=40-47 CoreCnt=128 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML

And here I have my first doubt: the MIG profile is supposed to be called 3g.40gb, so why does it show up as 3g.39gb?
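In case it helps, here is the explicit gres.conf that I believe matches the slurmd -G output above (an alternative to AutoDetect=nvml; per the gres.conf man page, MultipleFiles is how a MIG device plus its cap devices are listed — paths and core ranges are copied from the output, so treat this as a sketch):

```
# Two full A100s plus four 3g MIG instances, reconstructed from slurmd -G
Name=gpu Type=a100         File=/dev/nvidia0 Cores=24-31
Name=gpu Type=a100         File=/dev/nvidia1 Cores=8-15
Name=gpu Type=a100_3g.39gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283 Cores=56-63
Name=gpu Type=a100_3g.39gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 Cores=56-63
Name=gpu Type=a100_3g.39gb MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap417,/dev/nvidia-caps/nvidia-cap418 Cores=40-47
Name=gpu Type=a100_3g.39gb MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 Cores=40-47
```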

- My slurm.conf is very similar to the documentation example, with: Gres=gpu:a100:2,gpu:a100_3g.39gb:4
- I restarted slurmctld and slurmd on the node, and everything appears to be working.
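For context, the relevant slurm.conf lines look roughly like this (the hostname and memory figure are placeholders for my setup; the CPU count matches the CoreCnt=128 above, and the Gres string is the part in question):

```
# Sketch of the compute node entry -- NodeName/RealMemory are placeholders
GresTypes=gpu
NodeName=gpunode01 CPUs=128 RealMemory=512000 Gres=gpu:a100:2,gpu:a100_3g.39gb:4 State=UNKNOWN
```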

When I run an srun command, weird stuff happens:
- srun --gres=gpu:a100:2 returns a non-MIG device AND a MIG device together
- sinfo shows only the two full GPUs as "gpu:a100:2(S:1)", or the node reports "gpu count too low (0 < 4)" for the MIG devices and stays in drain state
- the fully qualified name "gpu:a100_3g.39gb:1" returns "Unable to allocate resources: Requested node configuration is not available".
Where do I start to fix this mess?

Thank you for your patience!
Cheers,

Edoardo
