[slurm-users] nvml autodetect is ignoring gpus


Benjamin Nacar

Nov 30, 2021, 10:13:36 AM11/30/21
to slurm...@lists.schedmd.com
Hi,

We're trying to use Slurm's built-in Nvidia GPU detection mechanism to avoid having to specify GPUs explicitly in slurm.conf and gres.conf. We're running Debian 11, and the version of Slurm available for Debian 11 is 20.11. However, the Slurm packages in the standard Debian repositories were apparently not compiled on a system with the necessary Nvidia library installed, so we recompiled Slurm 20.11 from the Debian source package with no modifications.

With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what we see on a 4-GPU host after restarting slurmd:
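For reference, that configuration amounts to the following two lines (any node-scoping and surrounding site-specific settings omitted):

```
# gres.conf
AutoDetect=nvml

# slurm.conf
GresTypes=gpu
```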

[2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
[2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
[2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11 Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11 Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11 Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-11-29T15:50:02.614] slurmd version 20.11.4 started
[2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
[2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
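The Type:geforce_gtx_1080 strings in the log above come from the device name NVML reports ("GeForce GTX 1080"), which Slurm turns into a GRES type token. A rough sketch of that normalization (lowercasing and underscore substitution are inferred from the observed names, not taken from the actual Slurm source):

```python
def normalize_gpu_type(nvml_name: str) -> str:
    """Approximate Slurm's GPU type normalization: lowercase the
    NVML device name and replace spaces (and slashes) with underscores."""
    out = nvml_name.lower()
    for ch in (" ", "/"):
        out = out.replace(ch, "_")
    return out

print(normalize_gpu_type("GeForce GTX 1080"))  # → geforce_gtx_1080
```

This is why the type string that appears in the slurmd log, not the marketing name, is what has to be matched in any explicit Gres= configuration.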

Doing an "scontrol show node" for this host displays "Gres=(null)", and any attempt to submit a job with --gpus=1 results in "srun: error: Unable to allocate resources: Requested node configuration is not available".

Any idea what might be wrong?

Thanks,
~~ bnacar

--
Benjamin Nacar
Systems Programmer
Computer Science Department
Brown University
401.863.7621

Diego Zuccato

Dec 1, 2021, 1:36:49 AM12/1/21
to slurm...@lists.schedmd.com
On 30/11/2021 16:12, Benjamin Nacar wrote:
> However, the version of Slurm in the standard debian repositories was apparently not compiled on a system with the necessary Nvidia library installed,
That's not good news :( I have a GPU node arriving by the end of the
year. Does it only impact autodetection (so it "just" requires manual
config), or will GPU jobs not be able to start at all?

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Fernando Guillén Camba

Dec 1, 2021, 3:34:28 AM12/1/21
to slurm...@lists.schedmd.com

I also compiled Slurm 20.11.8 to have GPU support in AlmaLinux 8.4, but I don't have any problem with NVML detecting our A100s.

Maybe the NVML library version used to compile Slurm has to match the library version on the compute node where the GPU is?

Also, I see that you're using a GeForce GTX. Could it be that NVML only supports Tesla GPUs?

This is my relevant Slurm configuration:

slurm.conf:
GresTypes=gpu,mps
NodeName=hpc-gpu[3-4].... Gres=gpu:A100:1
gres.conf:
NodeName=hpc-gpu[1-4] AutoDetect=nvml


and the NVIDIA part:

NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5

and this is what I see in the log:

[2021-12-01T09:29:45.675] debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32 ThreadsPerCore:1
[2021-12-01T09:29:45.675] debug:  gres/gpu: init: loaded
[2021-12-01T09:29:45.675] debug:  gres/mps: init: loaded
[2021-12-01T09:29:45.676] debug:  gpu/nvml: init: init: GPU NVML plugin loaded
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _nvml_init: Successfully initialized NVML
[2021-12-01T09:29:46.298] debug:  gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 495.29.05
[2021-12-01T09:29:46.298] debug:  gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.495.29.05
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 64
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: nvidia_a100-pcie-40gb
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-4cbb41e9-296b-ba72-d345-aa41fd7a8842
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:33:0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:21:00.0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 16-23
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 16-23
[2021-12-01T09:29:46.365] debug2: Possible GPU Memory Frequencies (1):
[2021-12-01T09:29:46.365] debug2: -------------------------------
[2021-12-01T09:29:46.365] debug2:     *1215 MHz [0]
[2021-12-01T09:29:46.365] debug2:         Possible GPU Graphics Frequencies (81):
[2021-12-01T09:29:46.365] debug2:         ---------------------------------
[2021-12-01T09:29:46.365] debug2:           *1410 MHz [0]
[2021-12-01T09:29:46.365] debug2:           *1395 MHz [1]
[2021-12-01T09:29:46.365] debug2:           ...
[2021-12-01T09:29:46.365] debug2:           *810 MHz [40]
[2021-12-01T09:29:46.365] debug2:           ...
[2021-12-01T09:29:46.365] debug2:           *225 MHz [79]
[2021-12-01T09:29:46.365] debug2:           *210 MHz [80]
[2021-12-01T09:29:46.555] debug2: gpu/nvml: _nvml_shutdown: Successfully shut down NVML
[2021-12-01T09:29:46.555] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2021-12-01T09:29:46.555] debug:  Gres GPU plugin: Normalizing gres.conf with system GPUs
[2021-12-01T09:29:46.555] debug2: gres/gpu: _normalize_gres_conf: gres_list_conf:
[2021-12-01T09:29:46.555] debug2:     GRES[gpu] Type:A100 Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE File:(null)
[2021-12-01T09:29:46.556] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug2: gres/gpu: _normalize_gres_conf: gres_list_gpu
[2021-12-01T09:29:46.556] debug2:     GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug:  Gres GPU plugin: Final normalized gres.conf list:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug:  Gres MPS plugin: Initalized gres.conf list:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug:  Gres MPS plugin: Final gres.conf list:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] Gres Name=gpu Type=A100 Count=1



Hope it helps.


On 30/11/21 at 16:12, Benjamin Nacar wrote:
--
CiTIUS Fernando Guillén Camba
Unidade de Xestión de Infraestruturas TIC
E-mail: fernando...@usc.es · Phone: +34 881816409
Website: citius.usc.es · Twitter: citiususc

Quirin Lohr

Dec 1, 2021, 8:06:26 AM12/1/21
to slurm...@lists.schedmd.com
Hi,

You still need to specify the GPUs in the node definition in slurm.conf.
At least the count, and perhaps also the type reported by NVML, must match
the node definition (Gres=gpu:geforce_gtx_1080:4).

I think the error messages can be ignored; the GTX 1080 simply does not
support that feature.
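Concretely, for the four-GTX-1080 node described earlier in the thread, that would look something like this (hostname hypothetical; the type token must match what NVML reports in the slurmd log):

```
# slurm.conf
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:geforce_gtx_1080:4 ...

# gres.conf
AutoDetect=nvml
```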
Quirin Lohr
Systemadministration
Technische Universität München
Fakultät für Informatik
Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz

Boltzmannstrasse 3
85748 Garching

Tel. +49 89 289 17769
Fax +49 89 289 17757

quiri...@in.tum.de
www.vision.in.tum.de

Benjamin Nacar

Dec 1, 2021, 8:42:39 AM12/1/21
to Slurm User Community List
Confirmed that adding just the "Gres=" bit in slurm.conf works. That's what I get for reading the documentation too fast... thanks all!

~~ bnacar