Hello,
I don't have a solution to offer, and this might be a bit off-topic, but I just wanted to say that
I'm glad to know we are not the only ones dealing with NVIDIA issues. We are seeing this
"fallen off the bus" error on ~20% of our brand-new nodes (especially the H200s provided by Lenovo).
Finding an easy workaround using Slurm would be great, but it's a shame that NVIDIA is experiencing
these issues. I've never seen such a high failure rate on hardware this expensive.
</offtopic>
Regards,
--
Bartomeu Miró Mateu
Centre de Supercomputació i Intel·ligència Artificial de les Illes Balears
Àrea de Suport Experimental i Serveis Cientificotècnics
Universitat de les Illes Balears
https://bsai.uib.cat/