[slurm-users] GraceTime is not working, But there is log.

김형진


Nov 7, 2023, 8:29:19 PM

to slurm...@schedmd.com

Hello ~

Please help me.

Total GPU : 4

Large qos : 3 (max 3 gpus)

Base qos : 2 (max 2 gpus)

I have a total of four GPUs,

and when a job with a large QoS is using three GPUs and a job with a base QoS is created,

I want the large QoS job to keep running for a grace period before it is preempted and the base QoS job starts.

However, as soon as the base QoS job is created, the large QoS job is immediately canceled without any waiting time.

But in the slurmctld log, there is a grace time log.

[2023-11-02T11:37:36.589] debug: setting 3600 sec preemption grace time for JobId=153 to reclaim resources for JobId=154

Could you help me understand what might be going wrong?

Here's my Slurm configuration details.

If you need any more information, please feel free to reply at any time.

### /etc/slurm/slurm.conf ###

# cat /etc/slurm/slurm.conf

# slurm.conf file generated by configurator.html.

# Put this file on all nodes of your cluster.

# See the slurm.conf man page for more information.

#

# Global Configuration

ClusterName=cluster

SlurmctldHost=master01

SlurmUser=slurm

GresTypes=gpu

JobRequeue=1

ProctrackType=proctrack/cgroup

ReturnToService=2

StateSaveLocation=/NFS/slurm/ctld

SwitchType=switch/none

TaskPlugin=task/cgroup,task/affinity

# SLURMCTLD

SlurmctldPidFile=/var/spool/slurm/slurmctld.pid

SlurmctldLogFile=/var/log/slurm//slurmctld.log

SlurmctldTimeout=30

SlurmctldDebug=debug5

# SLURMD

SlurmdLogFile=/var/log/slurm/slurmd.log

SlurmdPidFile=/var/spool/slurm/slurmd.pid

SlurmdSpoolDir=/var/spool/slurm/

SlurmdTimeout=30

SlurmdDebug=debug5

# SCHEDULING

SchedulerType=sched/backfill

# JOB PRIORITY

PriorityType=priority/multifactor

PriorityWeightQOS=10000

# Select Resource

SelectType=select/cons_tres

SelectTypeParameters=CR_CPU

# Job

JobAcctGatherType=jobacct_gather/cgroup

JobCompUser=slurm

JobCompType=jobcomp/filetxt

JobCompLoc=/NFS/slurm/job-comp/slurm_jobcomp.log

MinJobAge=3600

# Account

AccountingStoreFlags=job_comment

AccountingStorageType=accounting_storage/slurmdbd

AccountingStorageHost=master01

AccountingStoragePass=/var/run/munge/munge.socket.2

AccountingStorageUser=slurm

AccountingStorageTRES=gres/gpu

AccountingStorageEnforce=limits,qos

# COMPUTE NODES

NodeName=compute01 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=15731 State=UNKNOWN

NodeName=compute02 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=7679 State=UNKNOWN

NodeName=compute03 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=7679 State=UNKNOWN

PartitionName=cpu Nodes=compute0[1-3] Default=NO MaxTime=INFINITE State=UP

NodeName=gpu01 Gres=gpu:2 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=15731

NodeName=gpu02 Gres=gpu:1 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=15731

NodeName=gpu03 Gres=gpu:1 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=15731

PartitionName=gpu Nodes=gpu0[1-3] Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:4

# Preemption

PreemptMode=CANCEL

PreemptType=preempt/qos

### Slurmdbd ###

# sacctmgr show qos

Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES

---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------

normal 0 00:00:00 cluster 1.000000

base 1000 00:00:00 large cluster 1.000000 gres/gpu=2

large 100 01:00:00 cluster 1.000000 gres/gpu=3

small 500 00:00:00 cluster 1.000000 gres/gpu=2

# sacctmgr show assoc

Cluster Account User Partition Share Priority GrpJobs GrpTRES GrpSubmit GrpWall GrpTRESMins MaxJobs MaxTRES MaxTRESPerNode MaxSubmit MaxWall MaxTRESMins QOS Def QOS GrpTRESRunMin

---------- ---------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------

cluster root 1

cluster root root 1

cluster suser01 1

cluster suser01 suser01 1 base,large,small base

cluster suser02 1

cluster suser02 suser02 1 base,large base

cluster suser03 1

cluster suser03 suser03 1 base,large base

cluster suser04 1

cluster suser04 suser04 1 base,large base

cluster susol 1

cluster susol susol 1

### Sample Job ###

suser01 $ cat 4-suser01-large-qos-srun_gpu-burn.sh

#!/bin/bash -l

#SBATCH -J 4-suser01-large-qos-srun_gpu-burn.sh

#SBATCH -G 3

#SBATCH -q large

cd /NFS/gpu-burn

srun ./gpu_burn -d 120

suser01 $ cat 4-suser01-base-qos-srun_gpu-burn.sh

#!/bin/bash -l

#SBATCH -J 4-suser01-base-qos-srun_gpu-burn

#SBATCH -G 2

cd /NFS/gpu-burn

srun ./gpu_burn -d 120

Rémi Palancher


Nov 8, 2023, 2:59:43 AM

to slurm...@lists.schedmd.com

On 08/11/2023 at 02:28, 김형진 wrote:
> Hello ~
>
> …
>
> However, as soon as the base QoS job is created, the large QoS job is
> immediately canceled without any waiting time.
>
> But in the slurmctld log, there is a grace time log.
>
> [2023-11-02T11:37:36.589] debug: setting 3600 sec preemption grace time
> for JobId=153 to reclaim resources for JobId=154
>
> Could you help me understand what might be going wrong?

Note that, by default, Slurm sends SIGTERM to the immediate children of
slurmstepd (which might be gpu_burn in your case) at _the beginning_ of
the GraceTime, to notify them of the approaching termination.

If the processes react to SIGTERM by terminating, which is generally the
case, you may get the impression that GraceTime is not honored.

To benefit from the GraceTime, either your program must trap SIGTERM
with a signal handler, or you must enable the send_user_signal flag in
PreemptParameters and submit your job with --signal and a different signal.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/

김형진


Nov 8, 2023, 8:33:52 PM

to Slurm User Community List

Thank you for your response. Thanks to your explanation, I was able to understand.

After writing and running a new test program that only logs on SIGTERM, I could confirm that the GraceTime was applied.

Thank you once again.

Below is the sample code, for others' reference:

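$ cat run-gpu.cu
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Log SIGTERM and keep running, so the process is still alive during GraceTime. */
void sigterm_handler(int signum) {
    (void)signum;
    /* write() is async-signal-safe; printf() is not. */
    const char msg[] = "Received SIGTERM, but not terminating\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

__global__ void dummy_kernel(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = idx;
}

int main() {
    signal(SIGTERM, sigterm_handler);

    /* Allocate device memory and run one kernel so the GPU is actually used. */
    int *device_data;
    if (cudaMalloc((void **)&device_data, 1024 * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }

    dummy_kernel<<<1, 1024>>>(device_data);
    cudaDeviceSynchronize();

    /* Loop forever; SIGTERM is only logged, so the process runs until it is killed. */
    while (1) {
        sleep(1);
        printf("Working with GPU...\n");
        fflush(stdout);  /* flush so the line shows up in the job output right away */
    }

    /* Not reached: the loop above never exits on its own. */
    cudaFree(device_data);
    return 0;
}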

