[slurm-users] Aborting a job from inside the prolog


Alexander Grund

Jun 14, 2023, 8:56:47 AM
to slurm...@lists.schedmd.com
Hi,

We run some checks on the user's job inside the prolog script, and
if those checks fail the job should be canceled.

Our first approach with `scancel $SLURM_JOB_ID; exit 1` doesn't seem to
work as the (sbatch) job still gets re-queued.

Is this possible at all (i.e. preventing jobs from running if some check
fails), and what would be the correct way to do it?

Thanks,
Alex

Gerhard Strangar

Jun 19, 2023, 11:32:54 AM
to slurm...@lists.schedmd.com
Alexander Grund wrote:

> Our first approach with `scancel $SLURM_JOB_ID; exit 1` doesn't seem to
> work as the (sbatch) job still gets re-queued.

Try to exit with 0, because it's not your prolog that failed.

Alexander Grund

Jun 20, 2023, 4:56:18 AM
to slurm...@lists.schedmd.com
On 19.06.23 at 17:32, Gerhard Strangar wrote:
> Try to exit with 0, because it's not your prolog that failed.

That seems to work.
I do see value in exiting with 1, though, since it drains the node so we
can investigate what exactly failed and why.

Although it may be better not to drain it, I'm a bit nervous about "exit
0" because it is very important that the job does not start/continue, i.e.
that the user code (sbatch script/srun) is never executed in that case.
So I want to be sure that an `scancel` on the job in its prolog
always actually prevents the job from running.

Gerhard Strangar

Jun 20, 2023, 11:44:07 AM
to slurm...@lists.schedmd.com
Alexander Grund wrote:
> Although it may be better to not drain it, I'm a bit nervous with "exit
> 0" as it is very important that the job does not start/continue, i.e.
> the user code (sbatch script/srun) is never executed in that case.
> So I want to be sure that an `scancel` on the job in its prolog is
> actually always preventing the job from running.

Just return the exit code of scancel, then. If it failed, the prolog
failed and the job gets re-queued. If it didn't, the job was cancelled.
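
A minimal sketch of that suggestion, assuming the prolog runs with
`scancel` on its PATH and with SLURM_JOB_ID set. The function names
`job_checks_pass` and `cancel_if_checks_fail` and the REQUIRED_DIR
variable are hypothetical placeholders, not anything from this thread
or from Slurm itself:

```shell
#!/bin/bash
# Hypothetical site-specific check; replace with the real validation.
# Must return non-zero when the job should not be allowed to run.
job_checks_pass() {
    [ -d "${REQUIRED_DIR:-/tmp}" ]
}

# Returns 0 if the checks pass, otherwise returns scancel's exit status.
cancel_if_checks_fail() {
    if job_checks_pass; then
        return 0    # checks passed: let the job start
    fi
    # Checks failed: cancel the job. The function's status is scancel's
    # status, so a failed scancel makes the prolog itself fail.
    scancel "$SLURM_JOB_ID"
}

# Only act when invoked with a job context (SLURM_JOB_ID is set).
if [ -n "${SLURM_JOB_ID:-}" ]; then
    cancel_if_checks_fail
    exit $?
fi
```

With this layout the prolog exits 0 when the checks pass or when
`scancel` succeeds, and exits non-zero only when `scancel` itself fails,
which is the case where the prolog counts as failed.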
