[slurm-users] Cannot enable Gang scheduling

Helder Daniel

Jan 12, 2023, 9:22:44 PM
to slurm...@schedmd.com, Programação de Sistemas
Hi,

I am trying to enable gang scheduling on a server with a 32-core CPU and 4 GPUs.

However, with gang scheduling enabled, the CPU jobs (or GPU jobs) are not being preempted after the time slice, which is set to 30 seconds.

Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores. The first 2 jobs launched are never preempted, and the 3rd starves forever (or at least until one of the other 2 ends):

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)
               311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov
               312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov

The same happens with GPU jobs: if I launch 5 jobs, each requiring one GPU, the 5th job never runs. Preemption is not happening at the specified time slice.

I tried several combinations:

SchedulerType=sched/builtin  and backfill
SelectType=select/cons_tres   and linear

I'd appreciate any help and suggestions.
The slurm.conf is below.
Thanks

ClusterName=asimov
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc # proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none # task/cgroup
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
#FastSchedule=1 #obsolete
SchedulerType=sched/builtin #backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core    #CR_Core_Memory lets only one job run at a time
PreemptType = preempt/partition_prio
PreemptMode = SUSPEND,GANG
SchedulerTimeSlice=30           #in seconds, default 30
#
# LOGGING AND ACCOUNTING
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageEnforce=associations
#ClusterName=bip-cluster
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
#NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
#PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# Partitions
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP

Kevin Broch

Jan 13, 2023, 6:16:50 AM
to Slurm User Community List, slurm...@schedmd.com
The problem might be that OverSubscribe is not enabled? Without it, I don't believe the time-slicing can be gang scheduled.

Can you do a "scontrol show partition" to verify that it is?
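
For example, one quick way to check (a sketch assuming the partition name asimov01 used above):

scontrol show partition asimov01 | grep OverSubscribe

If that reports OverSubscribe=NO, it can be enabled on the PartitionName line in slurm.conf, e.g. with OverSubscribe=FORCE (FORCE is just one possible value; see the slurm.conf man page for the alternatives).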

Helder Daniel

Jan 13, 2023, 7:09:18 AM
to Kevin Broch, slurm...@schedmd.com, Slurm User Community List
Hi Kevin

I did a "scontrol show partition".
OverSubscribe was not enabled.
I enabled it in slurm.conf with:

(...)
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 OverSubscribe=FORCE Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP

but now it is only working for CPU jobs; it does not preempt GPU jobs.
Launching 3 CPU-only jobs, each requiring 32 of the 64 cores, they are preempted after the time slice as expected:

sbatch --cpus-per-task=32 test-cpu.sh

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov
               353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov
               351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov
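
(test-cpu.sh itself is not shown in the thread; a hypothetical CPU-bound batch script along these lines is enough to reproduce the test:)

#!/bin/bash
#SBATCH --job-name=cpu-only
# Hypothetical stand-in for test-cpu.sh: spin one busy loop per allocated
# CPU for about 10 minutes, so gang time-slicing can be watched in squeue.
for i in $(seq "${SLURM_CPUS_PER_TASK:-1}"); do
    timeout 600 bash -c 'while :; do :; done' &
done
wait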

But launching 3 GPU jobs, each requiring 2 of the 4 GPUs, the first 2 that start running are never preempted.
The 3rd job is left pending on Resources.

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               356  asimov01      gpu  hdaniel PD       0:00      1 (Resources)
               354  asimov01      gpu  hdaniel  R       3:05      1 asimov
               355  asimov01      gpu  hdaniel  R       3:02      1 asimov

Do I need to change anything else in the configuration to also support GPU gang scheduling?
Thanks

============================================================================
scontrol show partition asimov01
PartitionName=asimov01
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=asimov
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=DefCpuPerGPU=2
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
--
com os melhores cumprimentos,

Helder Daniel
Universidade do Algarve
Faculdade de Ciências e Tecnologia
Departamento de Engenharia Electrónica e Informática
https://www.ualg.pt/pt/users/hdaniel

Helder Daniel

Jan 13, 2023, 7:30:03 AM
to Kevin Broch, slurm...@schedmd.com, Slurm User Community List
PS: I checked the resources while running the 3 GPU jobs, which were launched with:

sbatch --gpus-per-task=2 --cpus-per-task=1 cnn-multi.sh

The server has 64 cores (32 physical cores x 2 threads with hyperthreading):

cat /proc/cpuinfo | grep processor | tail -n1
processor : 63

128 GB main memory:

hdaniel@asimov:~/Works/Turbines/02-CNN$ cat /proc/meminfo
MemTotal:       131725276 kB
MemFree:        106773356 kB
MemAvailable:   109398780 kB
Buffers:          161012 kB
(...)

And 4 GPUs each with 16GB memory:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    On   | 00000000:41:00.0 Off |                  Off |
| 45%   63C    P2    47W / 140W |  15370MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    On   | 00000000:42:00.0 Off |                  Off |
| 44%   63C    P2    45W / 140W |  15370MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A4000    On   | 00000000:61:00.0 Off |                  Off |
| 50%   68C    P2    52W / 140W |  15370MiB / 16376MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A4000    On   | 00000000:62:00.0 Off |                  Off |
| 46%   64C    P2    47W / 140W |  15370MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      2472      G   /usr/bin/gnome-shell                4MiB |
|    0   N/A  N/A    524228      C   /bin/python                     15352MiB |
|    1   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    524228      C   /bin/python                     15362MiB |
|    2   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    524226      C   /bin/python                     15362MiB |
|    3   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    524226      C   /bin/python                     15362MiB |
+-----------------------------------------------------------------------------+
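
(Likewise, cnn-multi.sh is not shown in the thread; a hypothetical GPU batch script in the same spirit, with the training command as a placeholder only, would look roughly like:)

#!/bin/bash
#SBATCH --job-name=gpu
# Hypothetical stand-in for cnn-multi.sh: Slurm normally exports the
# allocated GPUs to the job via CUDA_VISIBLE_DEVICES.
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
python cnn-train.py   # placeholder for the actual multi-GPU CNN training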

Kevin Broch

Jan 13, 2023, 7:30:12 AM
to Helder Daniel, slurm...@schedmd.com, Slurm User Community List
My guess is that this isn't possible with GANG,SUSPEND. GPU memory isn't managed by Slurm, so the idea of suspending a job's GPU memory so that another job can use it simply isn't possible.

Helder Daniel

Jan 13, 2023, 7:51:29 AM
to Kevin Broch, slurm...@schedmd.com, Slurm User Community List
Oh, ok.
I guess I was expecting that a suspended GPU job would have its GPU memory copied out to RAM.

I also tried REQUEUE,GANG and CANCEL,GANG.

None of these options seems to be able to preempt GPU jobs.
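
For reference, those variants correspond to changing the PreemptMode line in the slurm.conf posted earlier, one at a time:

PreemptType = preempt/partition_prio
PreemptMode = SUSPEND,GANG     # original setting
#PreemptMode = REQUEUE,GANG    # also tried, no effect on the GPU jobs
#PreemptMode = CANCEL,GANG     # also tried, no effect on the GPU jobs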

Kevin Broch

Jan 13, 2023, 8:02:09 AM
to Helder Daniel, slurm...@schedmd.com, Slurm User Community List
Sorry to hear that. Hopefully others in the group have some ideas/explanations.  I haven't had to deal with GPU resources in Slurm.

Helder Daniel

Jan 13, 2023, 8:19:54 AM
to Kevin Broch, slurm...@schedmd.com, Slurm User Community List
Thanks for all your help, Kevin.
I really did miss the OverSubscribe option in the docs :-(
But now CPU job scheduling is working, and I have a clearer picture of the GPU job scheduling problem to dig into further :-)





