[slurm-users] GPU fallen off the bus

Ratnasamy, Fritz via slurm-users

May 27, 2026, 12:51:09 AM
to Slurm User Community List
Hello, 


We are noticing that some of the GPUs on a specific node have "fallen off the bus". We would like to remove such a GPU from the Slurm scheduler. For example, if GPU0 has fallen off the bus, we need the remaining GPUs (GPU1-8) to stay available while GPU0 can no longer be allocated. How can we achieve that? I have read about blacklisting on the Slurm forum, but there seems to be no satisfactory solution.
Best,

Fritz Ratnasamy
Data Scientist
Information Technology


Ward Poelmans via slurm-users

May 27, 2026, 5:29:47 AM
to slurm...@lists.schedmd.com
Hi Fritz,

On 27/05/2026 06:46, Ratnasamy, Fritz via slurm-users wrote:
> We are noticing that some of the GPUs on a specific node have "fallen off the bus". We would like to remove such a GPU from the Slurm scheduler. For example, if GPU0 has fallen off the bus, we need the remaining GPUs (GPU1-8) to stay available while GPU0 can no longer be allocated. How can we achieve that? I have read about blacklisting on the Slurm forum, but there seems to be no satisfactory solution.

We asked the same question recently. See: https://support.schedmd.com/show_bug.cgi?id=25180 and https://support.schedmd.com/show_bug.cgi?id=25181.

Our current, cumbersome method is to drop the 'broken' GPU off the PCIe bus and reconfigure Slurm with one fewer GPU.
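
For anyone who wants to script that workaround, here is a rough Python sketch of the bookkeeping it involves. It is not the exact procedure used above, only one way it might look. The node name, the device files and the PCI address are made-up examples, and it assumes an explicit gres.conf with one File= entry per GPU rather than autodetection. Device numbering can change once a GPU is detached and the driver reloads, so check the remaining device files (for example with nvidia-smi) before trusting the generated lines. The script only prints what would be written; it does not touch sysfs or any Slurm configuration file.

```python
# Sketch only: node name, device paths and PCI address are hypothetical examples.

def healthy_devices(all_devices, failed):
    """Return the device files not marked as failed, keeping their order."""
    failed_set = set(failed)
    return [dev for dev in all_devices if dev not in failed_set]

def gres_conf_lines(node, devices):
    """Build one gres.conf line per remaining GPU device file."""
    return [f"NodeName={node} Name=gpu File={dev}" for dev in devices]

def node_gres(devices):
    """Build the Gres= value for the node's entry in slurm.conf."""
    return f"Gres=gpu:{len(devices)}"

def pci_remove_path(pci_address):
    """Return the sysfs file that detaches a PCI device when root writes '1' to it."""
    return f"/sys/bus/pci/devices/{pci_address}/remove"

node = "gpunode01"                                   # hypothetical node name
all_devices = [f"/dev/nvidia{i}" for i in range(8)]  # assumed 8-GPU node
remaining = healthy_devices(all_devices, ["/dev/nvidia0"])

print(pci_remove_path("0000:3b:00.0"))               # hypothetical PCI address
print("\n".join(gres_conf_lines(node, remaining)))
print(node_gres(remaining))
```

After updating gres.conf and the node's Gres= count, the change still has to be picked up by slurmd and slurmctld, using whatever restart or reconfigure step your Slurm version requires.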

Ward

Bartomeu Miró Mateu via slurm-users

May 27, 2026, 6:31:48 AM
to slurm...@lists.schedmd.com
Hello,

I don't have a solution to offer, and this might be a bit off-topic, but I just wanted to say that
I'm glad to know we are not the only ones dealing with NVIDIA issues. We are experiencing this
"fallen off the bus" error on ~20% of our brand-new nodes (especially the H200 provided by Lenovo).

Finding an easy workaround using Slurm would be great, but it's a shame that NVIDIA is experiencing
these issues. I've never seen such a high failure rate on hardware this expensive.

</offtopic>

Regards,

--
Bartomeu Miró Mateu

Centre de Supercomputació i Intel·ligència Artificial de les Illes Balears
Àrea de Suport Experimental i Serveis Cientificotècnics
Universitat de les Illes Balears

https://bsai.uib.cat/
