I also compiled Slurm 20.11.8 with GPU support on AlmaLinux 8.4, and NVML detects our A100s without any problem.
Maybe the NVML library version used when compiling Slurm has to match the library version on the compute node where the GPU is?
Also, I see that you're using Geforce_GTX. Could it be that NVML only supports Tesla GPUs?
This is my relevant Slurm configuration:
slurm.conf:
GresTypes=gpu,mps
NodeName=hpc-gpu[3-4].... Gres=gpu:A100:1
gres.conf:
NodeName=hpc-gpu[1-4] AutoDetect=nvml
and the NVIDIA part:
NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5
and the slurmd log on that node:
[2021-12-01T09:29:45.675] debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32 ThreadsPerCore:1
[2021-12-01T09:29:45.675] debug: gres/gpu: init: loaded
[2021-12-01T09:29:45.675] debug: gres/mps: init: loaded
[2021-12-01T09:29:45.676] debug: gpu/nvml: init: init: GPU NVML plugin loaded
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _nvml_init: Successfully initialized NVML
[2021-12-01T09:29:46.298] debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 495.29.05
[2021-12-01T09:29:46.298] debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.495.29.05
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 64
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-pcie-40gb
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-4cbb41e9-296b-ba72-d345-aa41fd7a8842
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:33:0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:21:00.0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 16-23
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 16-23
[2021-12-01T09:29:46.365] debug2: Possible GPU Memory Frequencies (1):
[2021-12-01T09:29:46.365] debug2: -------------------------------
[2021-12-01T09:29:46.365] debug2: *1215 MHz [0]
[2021-12-01T09:29:46.365] debug2: Possible GPU Graphics Frequencies (81):
[2021-12-01T09:29:46.365] debug2: ---------------------------------
[2021-12-01T09:29:46.365] debug2: *1410 MHz [0]
[2021-12-01T09:29:46.365] debug2: *1395 MHz [1]
[2021-12-01T09:29:46.365] debug2: ...
[2021-12-01T09:29:46.365] debug2: *810 MHz [40]
[2021-12-01T09:29:46.365] debug2: ...
[2021-12-01T09:29:46.365] debug2: *225 MHz [79]
[2021-12-01T09:29:46.365] debug2: *210 MHz [80]
[2021-12-01T09:29:46.555] debug2: gpu/nvml: _nvml_shutdown: Successfully shut down NVML
[2021-12-01T09:29:46.555] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2021-12-01T09:29:46.555] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2021-12-01T09:29:46.555] debug2: gres/gpu: _normalize_gres_conf: gres_list_conf:
[2021-12-01T09:29:46.555] debug2: GRES[gpu] Type:A100 Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE File:(null)
[2021-12-01T09:29:46.556] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-12-01T09:29:46.556] debug: GRES[gpu] Type:A100 Count:1 Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug2: gres/gpu: _normalize_gres_conf: gres_list_gpu
[2021-12-01T09:29:46.556] debug2: GRES[gpu] Type:A100 Count:1 Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug: Gres GPU plugin: Final normalized gres.conf list:
[2021-12-01T09:29:46.556] debug: GRES[gpu] Type:A100 Count:1 Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug: Gres MPS plugin: Initalized gres.conf list:
[2021-12-01T09:29:46.556] debug: GRES[gpu] Type:A100 Count:1 Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug: Gres MPS plugin: Final gres.conf list:
[2021-12-01T09:29:46.556] debug: GRES[gpu] Type:A100 Count:1 Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] Gres Name=gpu Type=A100 Count=1
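In case it helps with comparing a failing node against this one, here is a minimal Python sketch that pulls the driver version, the NVML library version and the detected GPU name out of slurmd log lines like the ones above. The helper names (`log_field`, `nvml_matches_driver`, `type_matches_name`) are made up for this example. It rests on two assumptions rather than on Slurm's actual code: that the NVML version string is the driver version with a major-number prefix (as in `11.495.29.05` vs `495.29.05` here), and that the `Type` given in `Gres=` only has to be a case-insensitive substring of the detected name.

```python
import re
from typing import Optional

# Three lines copied from the slurmd log above (timestamps trimmed).
LOG = """\
debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 495.29.05
debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.495.29.05
debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-pcie-40gb
"""


def log_field(log: str, label: str) -> Optional[str]:
    """Return the first whitespace-free token after 'label:' in the log."""
    match = re.search(re.escape(label) + r":\s*(\S+)", log)
    return match.group(1) if match else None


def nvml_matches_driver(log: str) -> bool:
    """Check that the NVML library version agrees with the driver version.

    Assumption: the NVML version is '<major>.<driver version>', which is
    what the log above shows, not a rule taken from the NVML docs.
    """
    driver = log_field(log, "Systems Graphics Driver Version")
    nvml = log_field(log, "NVML Library Version")
    if driver is None or nvml is None:
        return False
    return nvml == driver or nvml.endswith("." + driver)


def type_matches_name(gres_type: str, detected_name: str) -> bool:
    """Assumption: Type must be a case-insensitive substring of the name."""
    return gres_type.lower() in detected_name.lower()


if __name__ == "__main__":
    name = log_field(LOG, "Name")
    print("NVML matches driver:", nvml_matches_driver(LOG))
    print("Type A100 matches", name, ":", type_matches_name("A100", name or ""))
```

Running the same checks against the log of the node where detection fails should show quickly whether the versions disagree or whether the configured `Type` simply does not appear in the name NVML reports.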
Fernando Guillén Camba
Unidade de Xestión de Infraestruturas TIC