[slurm-users] Jobs canceling when nodes become unreachable – need guidance


Pharthiphan Asokan via slurm-users

May 4, 2026, 12:05:08 PM
to Ole Holm Nielsen via slurm-users
Hi,
We’re seeing an issue where jobs submitted via salloc are automatically cancelled when a compute node becomes temporarily unreachable.
Our goal is to keep jobs pending or requeued, instead of being cancelled outright, when a node drops offline briefly.
Slurm sometimes cancels the job rather than requeuing it when the node is marked DOWN, DRAIN, or DRAINING*.

Is there a recommended configuration or additional parameter that ensures jobs remain pending/requeued until the node returns, rather than being cancelled?

Any insights or examples from similar setups would be greatly appreciated.

Regards,
Pharthiphan

Ole Holm Nielsen via slurm-users

May 5, 2026, 2:50:34 PM
to slurm...@lists.schedmd.com
On 5/4/2026 5:33 PM, Pharthiphan Asokan via slurm-users wrote:
> We’re seeing an issue where jobs submitted via salloc are
> automatically cancelled when a compute node becomes temporarily unreachable.
> Our goal is to keep jobs pending or requeued instead of being cancelled
> outright when a node drops offline briefly
> Slurm sometimes cancels the job rather than requeuing it when the node
> is marked DOWN/DRAIN/DRAINING*.
>
> Is there a recommended configuration or additional parameter that
> ensures jobs remain pending/requeued until the node returns, rather than
> being cancelled?

The DOWN state is very likely caused (rightly so, IMHO) by the
SlurmdTimeout parameter in slurm.conf:
> The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN.
See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout

The JobRequeue parameter controls job requeue.
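
As a rough illustration of the two parameters mentioned above, a slurm.conf excerpt might look like the sketch below. The values are assumptions chosen only for the example, not recommendations for any particular cluster. Note also that the slurm.conf documentation describes JobRequeue in terms of batch jobs, so it may not cover interactive salloc allocations.

```
# slurm.conf (excerpt); values are illustrative assumptions only
SlurmdTimeout=600   # seconds the controller waits for slurmd before setting the node DOWN (default is 300)
JobRequeue=1        # batch jobs may be requeued unless the user disables it (default is 1)
```

A longer SlurmdTimeout gives a briefly unreachable node more time to respond before it is set DOWN, at the cost of detecting real node failures later.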

IHTH,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

John Hearns via slurm-users

May 5, 2026, 3:14:49 PM
to Ole Holm Nielsen, Slurm User Community List
I would suggest making very sure that all compute nodes are time synced properly.
Then look at the logs from the Slurm controller and a compute node side by side in two windows.


Why are these nodes not in contact with the slurm controller?