[slurm-users] Slurm and MIG configuration help

Edoardo Arnaudo

Apr 12, 2023, 5:19:23 PM
to slurm...@lists.schedmd.com
Hi all! I've successfully configured Slurm with one head node and two different compute nodes: one using "old" consumer RTX cards, and a newer one using 4x A100 GPUs (the 80 GB version).
I am now trying to set up a hybrid MIG configuration, where devices 0,1 are kept as they are, while 2 and 3 are each split into 3g.40gb MIG instances.

MIG itself works well: I am able to keep MIG disabled on 0,1 and enabled on 2,3, each split into 2x 40gb instances.
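For reference, the MIG side was set up roughly along these lines (3g.40gb should be GPU instance profile ID 9 on the 80 GB A100, but double-check with nvidia-smi mig -lgip on your system):

```shell
# Enable MIG mode on devices 2 and 3 only (may require a GPU reset)
nvidia-smi -i 2,3 -mig 1
# Create two 3g.40gb GPU instances (profile ID 9) on each MIG-enabled
# device; -C also creates the default compute instance inside each
nvidia-smi mig -i 2,3 -cgi 9,9 -C
```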
Configuring Slurm on top of this has me lost: I've tried countless variations, but not a single one has worked so far.
Here's what I have at the moment:

- My gres.conf has gone from the full device list to literally just "AutoDetect=nvml"; slurmd -G returns a somewhat reasonable output:

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: Gres Name=gpu Type=a100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=24-31 CoreCnt=128 Links=-1,4,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=283 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283 Cores=56-63 CoreCnt=128 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=418 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap417,/dev/nvidia-caps/nvidia-cap418 Cores=40-47 CoreCnt=128 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=8-15 CoreCnt=128 Links=4,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=292 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 Cores=56-63 CoreCnt=128 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=427 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 Cores=40-47 CoreCnt=128 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML

And here I have my first doubt: the MIG profile is supposed to be called 3g.40gb, so why does it show up as 3g.39gb?
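In case it helps, here is the explicit gres.conf that I believe matches the slurmd -G output above (an alternative to AutoDetect=nvml; per the gres.conf man page, MultipleFiles is how a MIG device plus its cap devices are listed — paths and core ranges are copied from the output, so treat this as a sketch):

```
# Two full A100s plus four 3g MIG instances, reconstructed from slurmd -G
Name=gpu Type=a100         File=/dev/nvidia0 Cores=24-31
Name=gpu Type=a100         File=/dev/nvidia1 Cores=8-15
Name=gpu Type=a100_3g.39gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283 Cores=56-63
Name=gpu Type=a100_3g.39gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 Cores=56-63
Name=gpu Type=a100_3g.39gb MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap417,/dev/nvidia-caps/nvidia-cap418 Cores=40-47
Name=gpu Type=a100_3g.39gb MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 Cores=40-47
```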

- My slurm.conf is very similar to the documentation example, with: Gres=gpu:a100:2,gpu:a100_3g.39gb:4
- I restarted slurmctld and slurmd on the node, and everything appears to be working.
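For context, the relevant slurm.conf lines look roughly like this (the hostname and memory figure are placeholders for my setup; the CPU count matches the CoreCnt=128 above, and the Gres string is the part in question):

```
# Sketch of the compute node entry -- NodeName/RealMemory are placeholders
GresTypes=gpu
NodeName=gpunode01 CPUs=128 RealMemory=512000 Gres=gpu:a100:2,gpu:a100_3g.39gb:4 State=UNKNOWN
```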

When I run an srun command, weird stuff happens:
- srun --gres=gpu:a100:2 returns a non-MIG device AND a MIG device together
- sinfo shows only the two full GPUs as "gpu:a100:2(S:1)", or the node reports "gpu count too low (0 < 4)" for the MIG devices and stays in drain state
- the fully qualified name "gpu:a100_3g.39gb:1" returns "Unable to allocate resources: Requested node configuration is not available".
Where do I start to fix this mess?

Thank you for your patience!
Cheers,

Edoardo
