[slurm-users] Jobs canceling when nodes become unreachable – need guidance


Pharthiphan Asokan via slurm-users

May 4, 2026, 12:05:08 PM
to Ole Holm Nielsen via slurm-users
Hi,
We’re seeing an issue where jobs submitted via salloc are automatically cancelled when a compute node becomes temporarily unreachable.
Our goal is for these jobs to stay pending or be requeued, rather than cancelled outright, when a node drops offline briefly.
Instead, Slurm sometimes cancels the job rather than requeuing it when the node is marked DOWN, DRAIN, or DRAINING*.

Is there a recommended configuration or additional parameter that ensures jobs remain pending/requeued until the node returns, rather than being cancelled?

Any insights or examples from similar setups would be greatly appreciated.

Regards,
Pharthiphan