Hi,
We’re seeing an issue where jobs submitted via salloc are automatically cancelled when a compute node becomes temporarily unreachable.
Our goal is for jobs to stay pending or be requeued, rather than being cancelled outright, when a node drops offline briefly.
Slurm sometimes cancels the job rather than requeuing it when the node is marked DOWN, DRAIN, or DRAINING*.
Is there a recommended configuration or additional parameter that ensures jobs remain pending/requeued until the node returns, rather than being cancelled?
Any insights or examples from similar setups would be greatly appreciated.
Regards,
Pharthiphan