[slurm-users] Cannot enable Gang scheduling

Helder Daniel

Jan 12, 2023, 9:22:44 PM
to slurm...@schedmd.com, Programação de Sistemas
Hi,

I am trying to enable gang scheduling on a server with a 32-core CPU and 4 GPUs.

However, with gang scheduling enabled, the CPU jobs (or GPU jobs) are not being preempted after the time slice, which is set to 30 seconds.

Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores. The first 2 jobs launched are never preempted, and the 3rd starves forever (or at least until one of the other 2 ends):

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)
               311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov
               312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov

The same happens with GPU jobs: if I launch 5 jobs, each requiring one GPU, the 5th job never runs. Preemption is not happening at the specified time slice.

I tried several combinations:

SchedulerType=sched/builtin  and backfill
SelectType=select/cons_tres   and linear

I'd appreciate any help and suggestions.
The slurm.conf is below.
Thanks

ClusterName=asimov
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc # proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none # task/cgroup
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
#FastSchedule=1 #obsolete
SchedulerType=sched/builtin #backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core    #CR_Core_Memory lets only one job run at a time
PreemptType = preempt/partition_prio
PreemptMode = SUSPEND,GANG
SchedulerTimeSlice=30           #in seconds, default 30
#
# LOGGING AND ACCOUNTING
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageEnforce=associations
#ClusterName=bip-cluster
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
#NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
#PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# Partitions
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP

Kevin Broch

Jan 13, 2023, 6:16:50 AM
to Slurm User Community List, slurm...@schedmd.com
The problem might be that OverSubscribe is not enabled? Without it, I don't believe the time-slicing can be gang scheduled.

Can you do a "scontrol show partition" to verify that it is?
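
For example, one quick way to check (a sketch assuming the partition name asimov01 used above):

scontrol show partition asimov01 | grep OverSubscribe

If that reports OverSubscribe=NO, it can be enabled on the PartitionName line in slurm.conf, e.g. with OverSubscribe=FORCE (FORCE is just one possible value; see the slurm.conf man page for the alternatives).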

Helder Daniel

Jan 13, 2023, 7:09:18 AM
to Kevin Broch, slurm...@schedmd.com, Slurm User Community List
Hi Kevin

I did a "scontrol show partition".
OverSubscribe was not enabled.
I enabled it in slurm.conf with:

(...)
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 OverSubscribe=FORCE Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP

but now it is only working for CPU jobs; it does not preempt GPU jobs.
Launching 3 CPU-only jobs, each requiring 32 of the 64 cores, they are preempted after the time slice as expected:

sbatch --cpus-per-task=32 test-cpu.sh

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov
               353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov
               351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov
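
(test-cpu.sh itself is not shown in the thread; a hypothetical CPU-bound batch script along these lines is enough to reproduce the test:)

#!/bin/bash
#SBATCH --job-name=cpu-only
# Hypothetical stand-in for test-cpu.sh: spin one busy loop per allocated
# CPU for about 10 minutes, so gang time-slicing can be watched in squeue.
for i in $(seq "${SLURM_CPUS_PER_TASK:-1}"); do
    timeout 600 bash -c 'while :; do :; done' &
done
wait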

But launching 3 GPU jobs, each requiring 2 of the 4 GPUs, the first 2 that start running are never preempted.
The 3rd job is left pending on Resources.

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               356  asimov01      gpu  hdaniel PD       0:00      1 (Resources)
               354  asimov01      gpu  hdaniel  R       3:05      1 asimov
               355  asimov01      gpu  hdaniel  R       3:02      1 asimov

Do I need to change anything else in the configuration to also support GPU gang scheduling?
Thanks

============================================================================
scontrol show partition asimov01
PartitionName=asimov01
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=asimov
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=DefCpuPerGPU=2
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
--
com os melhores cumprimentos,

Helder Daniel
Universidade do Algarve
Faculdade de Ciências e Tecnologia
Departamento de Engenharia Electrónica e Informática
https://www.ualg.pt/pt/users/hdaniel

Helder Daniel

Jan 13, 2023, 7:30:03 AM
to Kevin Broch, slurm...@schedmd.com, Slurm User Community List
PS: I checked the resources while running the 3 GPU jobs, which were launched with:

sbatch --gpus-per-task=2 --cpus-per-task=1 cnn-multi.sh

The server has 64 cores (32 physical cores x 2 threads with hyperthreading):

cat /proc/cpuinfo | grep processor | tail -n1
processor : 63

128 GB main memory:

hdaniel@asimov:~/Works/Turbines/02-CNN$ cat /proc/meminfo
MemTotal:       131725276 kB
MemFree:        106773356 kB
MemAvailable:   109398780 kB
Buffers:          161012 kB
(...)

And 4 GPUs each with 16GB memory:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    On   | 00000000:41:00.0 Off |                  Off |
| 45%   63C    P2    47W / 140W |  15370MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    On   | 00000000:42:00.0 Off |                  Off |
| 44%   63C    P2    45W / 140W |  15370MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A4000    On   | 00000000:61:00.0 Off |                  Off |
| 50%   68C    P2    52W / 140W |  15370MiB / 16376MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A4000    On   | 00000000:62:00.0 Off |                  Off |
| 46%   64C    P2    47W / 140W |  15370MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      2472      G   /usr/bin/gnome-shell                4MiB |
|    0   N/A  N/A    524228      C   /bin/python                     15352MiB |
|    1   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    524228      C   /bin/python                     15362MiB |
|    2   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    524226      C   /bin/python                     15362MiB |
|    3   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    524226      C   /bin/python                     15362MiB |
+-----------------------------------------------------------------------------+
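
(Likewise, cnn-multi.sh is not shown in the thread; a hypothetical GPU batch script in the same spirit, with the training command as a placeholder only, would look roughly like:)

#!/bin/bash
#SBATCH --job-name=gpu
# Hypothetical stand-in for cnn-multi.sh: Slurm normally exports the
# allocated GPUs to the job via CUDA_VISIBLE_DEVICES.
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
python cnn-train.py   # placeholder for the actual multi-GPU CNN training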

Kevin Broch

Jan 13, 2023, 7:30:12 AM
to Helder Daniel, slurm...@schedmd.com, Slurm User Community List
My guess is that this isn't possible with GANG,SUSPEND. GPU memory isn't managed by Slurm, so the idea of suspending a job's GPU memory so that another job can use it simply isn't possible.

Helder Daniel

Jan 13, 2023, 7:51:29 AM
to Kevin Broch, slurm...@schedmd.com, Slurm User Community List
Oh, ok.
I guess I was expecting that a suspended GPU job would have its GPU memory copied out to RAM.

I also tried REQUEUE,GANG and CANCEL,GANG.

None of these options seems to be able to preempt GPU jobs.
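
For reference, those variants correspond to changing the PreemptMode line in the slurm.conf posted earlier, one at a time:

PreemptType = preempt/partition_prio
PreemptMode = SUSPEND,GANG     # original setting
#PreemptMode = REQUEUE,GANG    # also tried, no effect on the GPU jobs
#PreemptMode = CANCEL,GANG     # also tried, no effect on the GPU jobs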

Kevin Broch

Jan 13, 2023, 8:02:09 AM
to Helder Daniel, slurm...@schedmd.com, Slurm User Community List
Sorry to hear that. Hopefully others in the group have some ideas/explanations.  I haven't had to deal with GPU resources in Slurm.

Helder Daniel

Jan 13, 2023, 8:19:54 AM
to Kevin Broch, slurm...@schedmd.com, Slurm User Community List
Thanks for all your help, Kevin.
I really did miss the OverSubscribe option in the docs :-(
But now CPU job scheduling is working, and I have a clearer picture of the GPU job scheduling problem to dig into further :-)





