[slurm-users] Usage of particular GPU out of 4 GPUs while submitting jobs to DGX Server

Ravi Konila

Nov 19, 2023, 11:37:13 PM
to slurm...@lists.schedmd.com
Hello Everyone
 
I am a beginner with Slurm and have started using it on our DGX server, which has four A100 80GB GPUs.
Everything works fine; jobs go to whichever GPUs are free.
My question is about submitting jobs to those GPUs. How does a student submit a job to a particular GPU out of the four? For example, studentA should submit the job to GPU ID 1 instead of GPU ID 0.
 
We are also planning to enable MIG on the server, and we would like a few students to submit jobs to a 20G partition and non-critical jobs to a 5G partition.
What should slurm.conf and gres.conf look like in this case?
 
Currently our configuration is as below:
 
gres.conf
Name=gpu    type=A100    file=/dev/nvidia[0-2,4]
 
------------
slurm.conf
.
.
.
GresTypes=gpu
NodeName=rl-dgxs-r21-l2 Gres=gpu:A100:4 CPUs=128 RealMemory=500000 State=UNKNOWN
PartitionName=LocalGPUQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
 
-------------
 
Any suggestions or help in this regard is highly appreciated.
 
With Warm Regards
Ravi Konila

Daniel Letai

Nov 20, 2023, 3:10:27 AM
to slurm...@lists.schedmd.com

Hi Ravi,

On 20/11/2023 6:36, Ravi Konila wrote:
Hello Everyone
 
<snipped>
My question is about submitting jobs to those GPUs. How does a student submit a job to a particular GPU out of the four? For example, studentA should submit the job to GPU ID 1 instead of GPU ID 0.

In classical HPC this is counterproductive: you don't want to pin jobs to specific resources, because jobs would then wait needlessly while other resources sit idle. Some background on this request would help me understand the need and the possible solutions.


That said, it might be possible by assigning a different artificial type to each GPU, e.g. in gres.conf: Name=gpu Type=gpu0 File=/dev/nvidia0, and so on for each device.
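Spelled out for this node, that idea could look roughly like the sketch below. It is untested and rests on a few assumptions: the type names gpu0, gpu1, gpu2 and gpu4 are arbitrary labels chosen here to mirror the device numbers, the device paths are taken from the file=/dev/nvidia[0-2,4] line in the original gres.conf, and the node line in slurm.conf has to advertise the same types with matching counts.

```
# gres.conf (sketch): one artificial type per physical GPU.
# Type names are illustrative labels, not anything Slurm requires.
Name=gpu Type=gpu0 File=/dev/nvidia0
Name=gpu Type=gpu1 File=/dev/nvidia1
Name=gpu Type=gpu2 File=/dev/nvidia2
Name=gpu Type=gpu4 File=/dev/nvidia4

# slurm.conf (sketch): the node advertises one GPU of each type.
GresTypes=gpu
NodeName=rl-dgxs-r21-l2 Gres=gpu:gpu0:1,gpu:gpu1:1,gpu:gpu2:1,gpu:gpu4:1 CPUs=128 RealMemory=500000 State=UNKNOWN
```

slurmctld and the node's slurmd would need to be restarted for a Gres change like this to take effect.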

Then submission would be of the form

sbatch --gpus=gpu0:1
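In a batch script the same request can go into an #SBATCH directive. The following is only a sketch: it assumes gres.conf defines an artificial type called gpu1 in the way described above, the job name is made up, and the partition name is the one from the original slurm.conf.

```shell
#!/bin/bash
#SBATCH --job-name=gpu1-job        # hypothetical job name
#SBATCH --partition=LocalGPUQ      # partition from the original slurm.conf
#SBATCH --gres=gpu:gpu1:1          # one GPU of the artificial type "gpu1" (assumed to exist)

# For a GPU allocation Slurm normally exports CUDA_VISIBLE_DEVICES;
# fall back to "unset" so the script also runs outside a job.
visible="${CUDA_VISIBLE_DEVICES:-unset}"
echo "CUDA_VISIBLE_DEVICES=${visible}"
```

Submitting it with sbatch and checking the printed value is a quick way to confirm that the job landed on the intended device.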


The issue would be with submitting in the general case, where you want any GPU. For that you might have to fall back to using --gres, as in

sbatch --gres=gpu:3


This is obviously cumbersome and less convenient, and I suspect this may be an XY problem.

 
We are also planning to enable MIG on the server, and we would like a few students to submit jobs to a 20G partition and non-critical jobs to a 5G partition.
What should slurm.conf and gres.conf look like in this case?

Can you elaborate on the use case? It's unclear to me whether the students are expected to decide for themselves when to submit to 20G and when to 5G, whether students with access to 20G should also share the 5G with the rest of the students, or whether all students should have access to both partitions with some other criterion determining placement.

 

Best regards,

--Dani_L.
