[slurm-users] How do you handle GPU node failures during long jobs?


Antonio Jose Alonso-Stepanov via slurm-users

Mar 14, 2026, 11:49:13 PM
to slurm...@lists.schedmd.com
Hi all,

I'm a Stanford CS student looking into how sites handle GPU node failures during long-running jobs. A couple questions:

When a GPU node goes down mid-job, do most sites use Slurm's requeue or --no-kill to handle it, or is it mostly manual drain and resubmit?

Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors via DCGM), or do you handle GPU health monitoring outside of Slurm?

Curious what's worked and what hasn't. Thanks.

Antonio

Tina Friedrich via slurm-users

Mar 26, 2026, 6:58:42 AM
to slurm...@lists.schedmd.com
Hello,

at least for NVIDIA GPUs, we have Node Health Check (NHC) examine the
dcgmi health output - so we have health watchers set on the GPUs, and if
dcgmi reports errors, that drains the node. We're trying to do something
similar for our AMD GPUs, but there doesn't seem to be a comparable 'live'
health check, so on those we periodically run a diagnostics script and
check its output as part of NHC.
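A minimal sketch of what such an NHC-style check might look like - the
"Overall Health" string, the group id, and the drain reason are all
assumptions to verify against your own DCGM version's output:

```shell
#!/bin/bash
# Hypothetical NHC-style helper: run "dcgmi health --check" and drain the
# node if the report looks unhealthy. The "Overall Health: Healthy" string
# is an assumption about dcgmi's report format.

# Returns 0 if the dcgmi report text on stdin looks healthy, 1 otherwise.
dcgm_report_ok() {
    grep -q "Overall Health: Healthy"
}

check_gpu_health() {
    # "-g 0" assumes a default all-GPUs group; adjust for your site.
    if ! dcgmi health -g 0 --check | dcgm_report_ok; then
        # NHC would normally handle the drain itself; shown explicitly here.
        scontrol update NodeName="$(hostname -s)" State=DRAIN \
            Reason="dcgmi health check failed"
        return 1
    fi
    return 0
}
```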

We've also found failure conditions on some of our GPU nodes that dcgmi
health watchers don't pick up on, and have implemented separate checks
for those (again, they've been added to the NHC script).

My opinion is that it's always better to have the HealthCheckProgram
pick up on errors, rather than rely on 'manual' discovery.
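For reference, wiring such a script into Slurm is a couple of slurm.conf
settings - a sketch with illustrative path and interval values:

```
# slurm.conf fragment (path and interval are illustrative)
HealthCheckProgram=/usr/sbin/nhc      # e.g. LBNL Node Health Check
HealthCheckInterval=300               # run every 5 minutes
HealthCheckNodeState=ANY              # check nodes in any state
```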

We don't do anything about the jobs on the nodes - if a GPU dies
mid-job, the job(s) using it will likely fail anyway, and the node goes
into drain state, so...

Tina
--
Tina Friedrich, Snr HPC Systems Administrator,
Advanced Research Computing (ARC), The University of Oxford
https://www.arc.ox.ac.uk/


--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Christopher Samuel via slurm-users

Mar 26, 2026, 5:56:48 PM
to slurm...@lists.schedmd.com
On 3/14/26 11:46 pm, Antonio Jose Alonso-Stepanov via slurm-users wrote:

> When a GPU node goes down mid-job, do most sites use Slurm's requeue or
> --no-kill to handle it, or is it mostly manual drain and resubmit?

That we leave to our users; they decide how best to deal with it.
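For anyone following along, the per-job knobs Antonio mentions look like
this in a batch script (a sketch; "my_app" is a placeholder, and whether
requeue is permitted also depends on the site's JobRequeue setting in
slurm.conf):

```
#!/bin/bash
#SBATCH --requeue     # let Slurm requeue the job after a node failure
## or, for multi-node jobs that can survive losing a node:
##SBATCH --no-kill    # don't kill the job if one allocated node fails
srun ./my_app
```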

> Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors
> via DCGM), or do you handle GPU health monitoring outside of Slurm?

We run non-intrusive checks via the health check script (dumping the XML,
for instance, and parsing that for problems), and if we find any we'll
either drain the node (if it's a hardware issue that needs attention) or
queue it for a reboot with "scontrol reboot" if it's just a remap issue.
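A sketch of that kind of non-intrusive check - the XML tag names
(remapped_rows/pending) are assumptions about nvidia-smi's "-q -x" output
and should be checked against your driver version:

```shell
#!/bin/bash
# Hypothetical check: dump nvidia-smi's XML and look for a pending row
# remap, which only needs a reboot rather than hardware attention.

# Returns 0 if the XML on stdin shows a pending row remap.
remap_pending() {
    grep -A3 "<remapped_rows>" | grep -q "<pending>Yes</pending>"
}

handle_gpu_state() {
    if nvidia-smi -q -x | remap_pending; then
        # Queue a reboot once the node is free of jobs.
        scontrol reboot ASAP nextstate=resume "$(hostname -s)"
    fi
}
```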

In the job epilog we run GPU tests that actually use resources on the GPU
(e.g. "dcgmi diag -r 1"), and if we find a problem we fail the epilog to
drain the node.
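A sketch of such an epilog check, assuming a "Fail" string in the dcgmi
diag output signals a problem (that match pattern, and the logger tag, are
assumptions, not dcgmi's documented format):

```shell
#!/bin/bash
# Hypothetical epilog fragment: run DCGM's quick level-1 diagnostic and
# fail the epilog if it reports an error. A non-zero epilog exit code
# causes Slurm to drain the node.

# Returns 0 if the diag output on stdin contains a failure indication.
diag_failed() {
    grep -qi "fail"
}

run_epilog_check() {
    if dcgmi diag -r 1 | diag_failed; then
        logger -t gpu-epilog "dcgmi diag -r 1 reported a failure"
        return 1
    fi
    return 0
}

# In the real epilog you would end with:
#   run_epilog_check || exit 1
```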

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA