[slurm-users] How to tell SLURM to ignore specific GPUs


Paul Raines

unread,
Jan 30, 2022, 10:43:02 AM1/30/22
to slurm...@lists.schedmd.com

I have a large compute node with 10 RTX8000 cards at a remote colo.
One of the cards on it is acting up, "falling off the bus" about once a day
and requiring a full power cycle to reset.

I want jobs to avoid that card as well as the card it is NVLINK'ed to.

So I modified gres.conf on that node as follows:


# cat /etc/slurm/gres.conf
AutoDetect=nvml
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9

and in slurm.conf I changed the node definition from Gres=gpu:quadro_rtx_8000:10
to Gres=gpu:quadro_rtx_8000:8. I restarted slurmctld and slurmd
after this.
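
For concreteness, the node definition change was along these lines (the node
name and the other parameters here are placeholders; only the Gres count is
the real change):

# slurm.conf node definition, before and after
#NodeName=mlgpu01 Gres=gpu:quadro_rtx_8000:10 CPUs=80 RealMemory=768000
NodeName=mlgpu01 Gres=gpu:quadro_rtx_8000:8 CPUs=80 RealMemory=768000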

I then put the node back from drain to idle. Jobs were submitted and
started on the node, but they are using the GPUs I told it to avoid:

+--------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|====================================================================|
| 0 N/A N/A 63426 C python 11293MiB |
| 1 N/A N/A 63425 C python 11293MiB |
| 2 N/A N/A 63425 C python 10869MiB |
| 2 N/A N/A 63426 C python 10869MiB |
| 4 N/A N/A 63425 C python 10849MiB |
| 4 N/A N/A 63426 C python 10849MiB |
+--------------------------------------------------------------------+

How can I make SLURM not use GPU 2 and 4?

---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA




Timony, Mick

unread,
Jan 31, 2022, 9:46:36 AM1/31/22
to slurm...@lists.schedmd.com

You can use the nvidia-smi command to 'drain' the GPUs, which will power them down so that no applications will use them.

This answer on Unix & Linux Stack Exchange explains how to do that:

https://unix.stackexchange.com/a/654089/94412

You can create a script to run at boot and 'drain' the cards.
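
The gist is something like the following (the bus ID below is just an
example, and I'd double-check the exact syntax with 'nvidia-smi drain -h'):

# find the PCI bus ID of the card you want to take out of service
nvidia-smi --query-gpu=index,pci.bus_id --format=csv
# enable the drain state for that card (example bus ID, needs root)
nvidia-smi drain -p 0000:3B:00.0 -m 1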

Regards
--Mick


Stephan Roth

unread,
Jan 31, 2022, 3:55:32 PM1/31/22
to slurm...@lists.schedmd.com

Not a solution, but some ideas & experiences concerning the same topic:

A few of our older GPUs used to show the error message "has fallen off
the bus" which was only resolved by a full power cycle as well.

Something changed; nowadays the error message is "GPU lost" and a
normal reboot resolves the problem. This might be a result of an update
of the Nvidia drivers (currently 60.73.01), but I can't be sure.

The current behaviour allowed us to write a script checking GPU state
every 10 minutes and setting a node to drain&reboot state when such a
"lost GPU" is detected.
This has been working well for a couple of months now and saves us time.
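
The core of the check is roughly the following (a simplified sketch, not our
script verbatim; the exact error string we match on is an assumption):

#!/bin/bash
# Run from cron every 10 minutes on each GPU node.
# Assumption: a lost GPU makes nvidia-smi fail or print "GPU is lost".
node=$(hostname -s)
if ! out=$(nvidia-smi 2>&1) || grep -qi "GPU is lost" <<< "$out"; then
    # "reboot ASAP" drains the node and reboots it once running jobs finish
    scontrol reboot ASAP nextstate=RESUME reason="GPU lost" "$node"
fi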

It might help as well to re-seat all GPUs and PCI risers; this seemed
to help in one of our GPU nodes. Again, I can't be sure, we'd
need to try this with other - still failing - GPUs.

The problem is to identify the cards physically from the information we
have, like what's reported by nvidia-smi or available in
/proc/driver/nvidia/gpus/*/information.
The serial number isn't shown for every type of GPU, and I'm not sure the
ones shown match the stickers on the GPUs.
If anybody knows of a practical solution for this, I'd be happy
to read it.
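
A starting point, at least, is to dump the mapping between nvidia-smi index,
PCI bus ID, UUID and serial (where one is exposed at all) in one go:

nvidia-smi --query-gpu=index,pci.bus_id,uuid,serial --format=csv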

Eventually I'd like to pull out all cards which repeatedly get "lost"
and maybe move them all to a node for short debug jobs or throw them
away (they're all beyond warranty anyway).

Stephan

EPF (Esben Peter Friis)

unread,
Feb 1, 2022, 3:10:08 AM2/1/22
to Slurm User Community List
The numbering seen from nvidia-smi is not necessarily the same as the order of /dev/nvidiaXX.
There is a way to force that, though, using CUDA_DEVICE_ORDER.
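
For example, setting it in the job's environment makes the CUDA device
numbering follow the PCI bus order:

# make CUDA enumerate GPUs in PCI bus order rather than "fastest first"
export CUDA_DEVICE_ORDER=PCI_BUS_ID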



Cheers,

Esben

From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Timony, Mick <Michael...@hms.harvard.edu>
Sent: Monday, January 31, 2022 15:45
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] How to tell SLURM to ignore specific GPUs

Paul Raines

unread,
Feb 1, 2022, 9:42:19 AM2/1/22
to Slurm User Community List

First, thanks Tim for the nvidia-smi 'drain' pointer. That works,
but I'm still confused why what I did did not work.

But Esben's reference explains it, though I think the default
behavior is very weird in this case. I would think SLURM itself
should default things to CUDA_DEVICE_ORDER=PCI_BUS_ID.

For this to work, I guess we have to make sure that
CUDA_DEVICE_ORDER=PCI_BUS_ID is set consistently for every process (slurmd,
prolog, epilog, and the job itself), and how to do that
easily is not completely evident.

Would just having a /etc/profile.d/cudaorder.sh guarantee it, or
are there instances where it would be ignored?
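
For slurmd itself, I suppose a systemd drop-in might cover the daemon side
(untested on my part, and assuming slurmd runs under systemd):

# /etc/systemd/system/slurmd.service.d/cudaorder.conf  (untested sketch)
[Service]
Environment=CUDA_DEVICE_ORDER=PCI_BUS_ID

followed by 'systemctl daemon-reload' and a slurmd restart.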

-- Paul Raines (http://help.nmr.mgh.harvard.edu)
> Paul Raines http://help.nmr.mgh.harvard.edu
> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
> 149 (2301) 13th Street Charlestown, MA 02129 USA
>
>
> You can use the nvidia-smi command to 'drain' the GPU's which will power-down the GPU's and no applications will use them.
>
> This thread on stack overflow explains how to do that:
>
> https://unix.stackexchange.com/a/654089/94412

Michael Di Domenico

unread,
Feb 2, 2022, 12:33:15 PM2/2/22
to Slurm User Community List
On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth <stepha...@ee.ethz.ch> wrote:
> The problem is to identify the cards physically from the information we
> have, like what's reported with nvidia-smi or available in
> /proc/driver/nvidia/gpus/*/information
> The serial number isn't shown for every type of GPU and I'm not sure the
> ones shown match the stickers on the GPUs.
> If anybody were to know of a practical solution for this, I'd be happy
> to read it.

I hadn't seen this /proc driver reference before. Checking a few of my
A100s and V100s and some spare Quadro cards, I don't see the
serial number for any of them in /proc. That's a pity, since it would be
pretty handy. Does anyone know which cards do support this? I wonder
if there's some obscure setting that needs to be turned on
to expose the serial number in /proc instead of having to run nvidia-smi.

Stephan Roth

unread,
Feb 3, 2022, 1:31:27 AM2/3/22
to slurm...@lists.schedmd.com
Sorry, I didn't state clearly what I was referring to.
I never saw the serial number in /proc/driver/nvidia/gpus/*/information,
only by using nvidia-smi. Even there it was sometimes empty:

nvidia-smi -q |grep -E '^\s+Serial Number\s+:'
Serial Number : N/A

Stephan

Paul Raines

unread,
Feb 3, 2022, 10:13:47 AM2/3/22
to Stephan Roth, slurm...@lists.schedmd.com
That works fine on my boxes:

[root@rtx-04 ~]# nvidia-smi -q -i 0 | grep Serial
Serial Number : 1321720....

