We recently upgraded from Slurm 19.05.8 to 20.11.3. In our
configuration, we have an interruptible partition named
'interruptible' for long-running, low-priority jobs that use
checkpoint/restart. Jobs that are preempted would be killed and
requeued rather than suspended. This configuration has been
working without issue for 2+ years.
After the upgrade, this has stopped working. Preempted jobs are killed and not requeued. My slurm.conf file is configured to requeue preempted jobs:
$ grep -i requeue /etc/slurm/slurm.conf
#JobRequeue=1
PreemptMode=Requeue
And the user's sbatch script included the --requeue option.
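Because the grep above only matches lines containing "requeue", a wider look at the active preemption settings may be useful when comparing before and after the upgrade. This is a minimal sketch, not a definitive check: the function name active_preempt_settings is hypothetical, and it only reads the file, so it says nothing about what slurmctld has actually loaded.

```shell
# Hypothetical helper: list uncommented lines in a slurm.conf-style file
# that mention preemption or requeue settings (case-insensitive).
active_preempt_settings() {
    grep -Ei 'preempt|requeue' "$1" | grep -Ev '^[[:space:]]*#'
}

# Usage on a live system (path taken from the grep above):
#   active_preempt_settings /etc/slurm/slurm.conf
```

On a running cluster, `scontrol show config | grep -i preempt` reports the values the controller is currently using, which can differ from the file if the daemons have not been reconfigured.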
The user reports the err output from his preempted jobs now says
slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***
In the past, it would say PREEMPTED instead of CANCELLED.
Any ideas what would
cause this? I've reported this to Slurm support, and haven't
gotten anything back yet, so I figured I'd ask here, too. If
this is a bug, I can't be the only one who has experienced
this.
-- Prentice
We saw something that sounds similar to this. See this bug report: https://bugs.schedmd.com/show_bug.cgi?id=10196
SchedMD never found the root cause. They thought it might have something to do with a timing problem on Prolog scripts, but the thing that fixed it for us was to set GraceTime=0 on our preemptable QoS.
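For anyone wanting to try the same workaround, the change can be sketched as below. The QoS name my_preemptable_qos is a placeholder (an assumption, not a name from our site), and the helper grace_zero_cmd is hypothetical; it only prints the sacctmgr command so it can be reviewed before being run by an administrator.

```shell
# Hypothetical helper: print (do not run) the sacctmgr command that
# would set GraceTime to zero on the named QoS.
grace_zero_cmd() {
    printf 'sacctmgr modify qos %s set GraceTime=0\n' "$1"
}

# my_preemptable_qos is a placeholder QoS name.
grace_zero_cmd my_preemptable_qos
```

The current value can be checked afterwards with `sacctmgr show qos` and a `format=Name,GraceTime` argument.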
Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrob...@mines.edu
Our values: Trust | Integrity | Respect | Responsibility
From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Prentice Bisbal <pbi...@pppl.gov>
Reply-To: Slurm User Community List <slurm...@lists.schedmd.com>
Date: Friday, February 26, 2021 at 12:38
To: "slurm...@lists.schedmd.com" <slurm...@lists.schedmd.com>
Subject: [External] [slurm-users] Preemption not working in 20.11
Thanks for the info and link to your bug report. Unfortunately,
my GraceTime is already set to zero for that QOS:
$ sacctmgr show qos interruptible format=Name,gracetime
      Name  GraceTime
---------- ----------
interrupt+   00:00:00
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov