Hello ~
Please help me.
Total GPU : 4
Large qos : 3 (max 3 gpus)
Base qos : 2 (max 2 gpus)
I have a total of four GPUs,
and when a job with a large QoS is using three GPUs and a job with a base QoS is created,
I want the large QoS job to wait for a certain period before the base QoS job starts.
However, as soon as the base QoS job is created, the large QoS job is immediately canceled without any waiting time.
But in the slurmctld log, there is a grace time log.
[2023-11-02T11:37:36.589] debug: setting 3600 sec preemption grace time for JobId=153 to reclaim resources for JobId=154
Could you help me understand what might be going wrong?
Here's my Slurm configuration details.
If you need any more information, please feel free to reply at any time.
### /etc/slurm/slurm.conf ###
# cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
# Global Configuration
ClusterName=cluster
SlurmctldHost=master01
SlurmUser=slurm
GresTypes=gpu
JobRequeue=1
ProctrackType=proctrack/cgroup
ReturnToService=2
StateSaveLocation=/NFS/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/cgroup,task/affinity
# SLRUMCTLD
SlurmctldPidFile=/var/spool/slurm/slurmctld.pid
SlurmctldLogFile=/var/log/slurm//slurmctld.log
SlurmctldTimeout=30
SlurmctldDebug=debug5
# SLURMD
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdPidFile=/var/spool/slurm/slurmd.pid
SlurmdSpoolDir=/var/spool/slurm/
SlurmdTimeout=30
SlurmdDebug=debug5
# SCHEDULING
SchedulerType=sched/backfill
# JOB PRIORITY
PriorityType=priority/multifactor
PriorityWeightQOS=10000
# Select Resource
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
# Job
JobAcctGatherType=jobacct_gather/cgroup
JobCompUser=slurm
JobCompType=jobcomp/filetxt
JobCompLoc=/NFS/slurm/job-comp/slurm_jobcomp.log
MinJobAge=3600
# Account
AccountingStoreFlags=job_comment
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master01
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageUser=slurm
AccountingStorageTRES=gres/gpu
AccountingStorageEnforce=limits,qos
# COMPUTE NODES
NodeName=compute01 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=15731 State=UNKNOWN
NodeName=compute02 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=7679 State=UNKNOWN
NodeName=compute03 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=7679 State=UNKNOWN
PartitionName=cpu Nodes=compute0[1-3] Default=NO MaxTime=INFINITE State=UP
NodeName=gpu01 Gres=gpu:2 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=15731
NodeName=gpu02 Gres=gpu:1 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=15731
NodeName=gpu03 Gres=gpu:1 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=15731
PartitionName=gpu Nodes=gpu0[1-3] Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:4
# Preemption
PreemptMode=CANCEL
PreemptType=preempt/qos
### Slurmdbd ###
# sacctmgr show qos
Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
normal 0 00:00:00 cluster 1.000000
base 1000 00:00:00 large cluster 1.000000 gres/gpu=2
large 100 01:00:00 cluster 1.000000 gres/gpu=3
small 500 00:00:00 cluster 1.000000 gres/gpu=2
# sacctmgr show assoc
Cluster Account User Partition Share Priority GrpJobs GrpTRES GrpSubmit GrpWall GrpTRESMins MaxJobs MaxTRES MaxTRESPerNode MaxSubmit MaxWall MaxTRESMins QOS Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
cluster root 1
cluster root root 1
cluster suser01 1
cluster suser01 suser01 1 base,large,small base
cluster suser02 1
cluster suser02 suser02 1 base,large base
cluster suser03 1
cluster suser03 suser03 1 base,large base
cluster suser04 1
cluster suser04 suser04 1 base,large base
cluster susol 1
cluster susol susol 1
### Sample Job ###
suser01 $ cat 4-suser01-large-qos-srun_gpu-burn.sh
#!/bin/bash -l
#SBATCH -J 4-suser01-large-qos-srun_gpu-burn.sh
#SBATCH -G 3
#SBATCH -q large
cd /NFS/gpu-burn
srun ./gpu_burn -d 120
suser01 $ cat 4-suser01-base-qos-srun_gpu-burn.sh
#!/bin/bash -l
#SBATCH -J 4-suser01-base-qos-srun_gpu-burn
#SBATCH -G 2
cd /NFS/gpu-burn
srun ./gpu_burn -d 120
Thank you for your response. Thanks to your explanation, I was able to understand.
After writing and running a new test program that only logs on SIGTERM, I could confirm that the GraceTime was applied.
Thank you once again.
Below is a sample code for reference for others:
$ cat run-gpu.cu
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <cuda_runtime.h>
void sigterm_handler(int signum) {
printf("Received SIGTERM, but not terminating\n");
}
__global__ void dummy_kernel(int *data) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = idx;
}
int main() {
signal(SIGTERM, sigterm_handler);
int *device_data;
cudaMalloc((void **)&device_data, 1024 * sizeof(int));
dummy_kernel<<<1, 1024>>>(device_data);
cudaDeviceSynchronize();
while(1) {
sleep(1);
printf("Working with GPU...\n");
}
cudaFree(device_data);
return 0;
}