Hello,
at least for NVIDIA GPUs, we have Node Health Check (NHC) check the dcgmi
health output - so we have health watchers set on the GPUs, and if dcgmi
reports errors, that drains the node. We're trying to do something
similar for our AMD GPUs, but there doesn't seem to be a 'live' health
check like that, so on those nodes we periodically run a diagnostics
script and check its output as part of NHC.
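For what it's worth, the check itself is basically just pattern-matching on whatever report the diagnostics tool writes. A rough sketch - the function name, keywords, and report format here are made up for illustration, not actual dcgmi or AMD tool output:

```shell
# Hypothetical helper: flag a node as unhealthy if the diagnostic
# report contains failure keywords. Keywords/format are assumptions.
check_gpu_diag_output() {
    # $1: file holding the diagnostic tool's output
    if grep -Eqi 'error|fail' "$1"; then
        return 1    # unhealthy - NHC would then drain the node
    fi
    return 0        # healthy
}

# Canned output standing in for a real report:
printf 'GPU 0: Pass\nGPU 1: Fail\n' > /tmp/diag_report.txt
check_gpu_diag_output /tmp/diag_report.txt && echo healthy || echo unhealthy
# prints "unhealthy"
```

The real thing obviously needs to match the exact strings your tool emits, or you'll drain nodes on harmless log lines.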
We've also found failure conditions on some of our GPU nodes that the
dcgmi health watchers don't pick up on, and have implemented separate
checks for those (again, added to the NHC script).
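In case it's useful, the shape of those extra checks is roughly this: a cron job runs the diagnostics and writes a status file, and the NHC check only verifies that the file is fresh and clean. All names and paths below are made up for illustration, and the die stand-in is just so the snippet runs on its own - real NHC provides die itself:

```shell
# Hypothetical NHC-style check; a periodic job is assumed to write
# the status file, so the check itself stays cheap.
STATUS_FILE=/tmp/gpu_diag_status   # assumed path
MAX_AGE_MIN=30                     # assumed freshness window

die() { echo "ERROR: $2"; }        # stand-in; real NHC defines die

check_gpu_diag_status() {
    # Fail if the status file is missing or stale...
    if [ -z "$(find "$STATUS_FILE" -mmin "-$MAX_AGE_MIN" 2>/dev/null)" ]; then
        die 1 "GPU diag status missing or older than ${MAX_AGE_MIN}m"
        return 1
    fi
    # ...or if the last diagnostics run reported problems.
    if ! grep -q '^OK$' "$STATUS_FILE"; then
        die 1 "GPU diagnostics reported a failure"
        return 1
    fi
    return 0
}

# Simulate a fresh, healthy report:
echo OK > "$STATUS_FILE"
check_gpu_diag_status && echo node healthy
# prints "node healthy"
```

The staleness check matters: without it, a dead cron job would leave an old 'OK' file behind and the node would look healthy forever.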
My opinion is that it's always better to have the HealthCheckProgram
pick up on errors, rather than rely on 'manual' discovery.
We don't do anything about jobs on the nodes - if a GPU dies mid-job,
the job(s) using that GPU will likely fail anyway, and the node goes
into drain state.
Tina
--
Tina Friedrich, Snr HPC Systems Administrator,
Advanced Research Computing (ARC), The University of Oxford
https://www.arc.ox.ac.uk/
--
slurm-users mailing list --
slurm...@lists.schedmd.com
To unsubscribe send an email to
slurm-us...@lists.schedmd.com