[slurm-users] Dynamic MIG Question


Aaron Kollmann

Nov 22, 2023, 1:22:52 PM
to slurm...@schedmd.com

Hello All,

I am currently working on a research project, and we are trying to find out whether NVIDIA's Multi-Instance GPU (MIG) feature can be used dynamically in SLURM.

For instance:

- a user submits a job that requests a GPU, but none is available

- SLURM then reconfigures a MIG GPU to create a MIG partition (e.g. 1g.5gb), which becomes available and is allocated to the job immediately

I can already reconfigure MIG and SLURM within a few seconds and start jobs on the newly partitioned resources, but running jobs get killed when I restart slurmd on a node whose MIG config has changed (see the example script below).

Do you think it is possible to develop a plugin or change SLURM to the extent that dynamic MIG will be supported one day?

(The website says it is not supported)



Best

- Aaron




#!/usr/bin/bash

# Generate Start Config
killall slurmd
killall slurmctld
nvidia-smi mig -dci
nvidia-smi mig -dgi
nvidia-smi mig -cgi 19,14,5 -i 0 -C
nvidia-smi mig -cgi 0 -i 1 -C
cp -f ./slurm-19145-0.conf /etc/slurm/slurm.conf
slurmd -c
slurmctld -c
sleep 5

# Start two jobs: one runs, one stays pending (the running job gets killed by slurm after the restart below)
srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
sleep 5

# Simulate MIG Config Change
nvidia-smi mig -i 1 -dci
nvidia-smi mig -i 1 -dgi
nvidia-smi mig -cgi 19,14,5 -i 1 -C
cp -f ./slurm-2x19145.conf /etc/slurm/slurm.conf
killall slurmd
killall slurmctld
slurmd
slurmctld

Davide DelVento

Nov 22, 2023, 3:23:54 PM
to Slurm User Community List
I assume you mean the sentence about dynamic MIG at https://slurm.schedmd.com/gres.html#MIG_Management
Could it be supported? I think so, but only if one of SchedMD's paying customers (that could be you) asks for it.