[slurm-users] GPU jobs not allocated correctly when requesting more than 1 CPU


Rohith Mohan

Oct 21, 2022, 1:14:45 PM
to slurm...@lists.schedmd.com
Hi,

I recently set up Slurm for the first time on our small cluster and got everything working well except for one issue. Jobs that request 1 GPU and 1 CPU are allocated across the nodes as expected, but jobs that request 1 GPU and 2 CPUs are not. I'm not sure what's causing this and was hoping someone might have some suggestions.

Slurm version: 22.05.3
OS: RedHat 7.9 (head node), and RedHat 7.4 (compute nodes)
Hardware config: 1 head node, 5 compute nodes each with 2 GPUs and 8 CPUs

Some example scenarios to explain the problem:
Submitting a job requesting 1 CPU and 1 GPU works fine:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=4GB
#SBATCH --cpus-per-task=1
#SBATCH --gpus=1

- Job A requests 1 CPU, 1 GPU and 4GB memory -> assigned to node1
- Job B requests 1 CPU, 1 GPU and 4GB memory -> assigned to node1
- Job C requests 1 CPU, 1 GPU and 4GB memory -> assigned to node2, as there are only 2 GPUs per node

Submitting the same script with 2 CPUs and 1 GPU causes issues:
#SBATCH --cpus-per-task=2

- Job A requests 2 CPUs, 1 GPU and 4GB memory -> assigned to node1
- Job B requests 2 CPUs, 1 GPU and 4GB memory -> assigned to node2, even though node1 should still have resources available

Including what might be relevant info from slurm.conf below in case it's helpful:
DefMemPerCPU=2048
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
DefCpuPerGPU=1
GresTypes=gpu
NodeName=computenodes[1-5] NodeAddr=computenodes[1-5] CPUs=8 RealMemory=64189 Gres=gpu:2 State=UNKNOWN
PartitionName=batch Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Appreciate any suggestions/ideas!

Thanks,
Rohith

Diego Zuccato

Oct 26, 2022, 5:03:57 AM
to slurm...@lists.schedmd.com
On 21/10/2022 19:14, Rohith Mohan wrote:

IIUC this could be the source of your problem:

> SelectTypeParameters=CR_CPU_Memory

Maybe try CR_Core_Memory. The CR_CPU* options have no notion of
sockets/cores/threads.
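
A minimal sketch of that change, assuming the rest of the slurm.conf
from the original post stays as it is. This is untested on the cluster
in question, and only the SelectTypeParameters line differs from what
was posted:

```
# slurm.conf (fragment)
# Assumption: every other line is unchanged from the original post.
SelectType=select/cons_tres
# Was: SelectTypeParameters=CR_CPU_Memory
SelectTypeParameters=CR_Core_Memory
```

A change to the select plugin parameters most likely needs the Slurm
daemons restarted rather than only reconfigured. It may also help to
compare the NodeName line against the output of "slurmd -C" on a
compute node, which prints the hardware layout that slurmd detects.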

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
