[slurm-users] How do you handle GPU node failures during long jobs?


Antonio Jose Alonso-Stepanov via slurm-users

Mar 14, 2026, 11:49:13 PM
to slurm...@lists.schedmd.com
Hi all,

I'm a Stanford CS student looking into how sites handle GPU node failures during long-running jobs. A couple of questions:

When a GPU node goes down mid-job, do most sites rely on Slurm's --requeue or --no-kill options to handle it automatically, or is it mostly a manual drain and resubmit?
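For concreteness, here's roughly the batch-script pattern I have in mind; the checkpoint command and application flags are placeholders, not a working setup:

    #!/bin/bash
    #SBATCH --job-name=train
    #SBATCH --gres=gpu:4
    #SBATCH --time=72:00:00
    #SBATCH --requeue              # let Slurm requeue the job after a node failure
    #SBATCH --open-mode=append     # don't truncate stdout/stderr on requeue
    #SBATCH --signal=B:USR1@300    # signal the batch shell 5 min before the time limit

    # Placeholder checkpoint hook; the real command is application-specific.
    trap 'checkpoint_my_app; scontrol requeue "$SLURM_JOB_ID"' USR1

    # Background srun and wait, so the trap can fire while the step runs.
    srun ./my_app --resume-from-last-checkpoint &
    wait

(As I understand it, a node failure requeues the job without any warning signal, so the application still needs periodic checkpointing; --no-kill would instead keep the allocation alive minus the dead node, which only helps if the application can shrink. That trade-off is part of what I'm asking about.)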

Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors via DCGM), or do you handle GPU health monitoring outside of Slurm?
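On the HealthCheckProgram side, something like the following is what I'm picturing; the interval, threshold, and drain policy are made up for illustration:

    # slurm.conf (illustrative values)
    HealthCheckProgram=/etc/slurm/gpu_healthcheck.sh
    HealthCheckInterval=300
    HealthCheckNodeState=ANY

    #!/bin/bash
    # /etc/slurm/gpu_healthcheck.sh -- runs on each compute node via slurmd.
    # Threshold and Reason string are illustrative; a real check might run
    # "dcgmi diag -r 1" instead of querying nvidia-smi.

    # Sum volatile uncorrected ECC errors across all GPUs on this node.
    errs=$(nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total \
                      --format=csv,noheader,nounits | awk '{s+=$1} END {print s+0}')

    if [ "$errs" -gt 0 ]; then
        scontrol update NodeName="$(hostname -s)" State=DRAIN \
            Reason="gpu_healthcheck: $errs uncorrected ECC errors"
    fi
    exit 0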

Curious what's worked and what hasn't. Thanks.

Antonio