[slurm-users] Autodetect of nvml is not working in gres.conf


Ravi Konila

Nov 30, 2023, 9:07:00 AM
to slurm...@lists.schedmd.com
Hello,
 
My gres.conf has AutoDetect=nvml.
When I restart the slurmd service I get:
 
fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
 
I went through a few links and the slurm-users email archives to solve this, but could not understand much.
 
Can someone help me with this one? I am using a DGX A100 server which has four A100 80GB GPUs.
 
With Warm Regards
Ravi Konila

Josef Dvoracek

Nov 30, 2023, 9:12:35 AM
to slurm...@lists.schedmd.com

Could it be that the library "cuda-nvml-devel" was not installed when you were building Slurm?

cheers

josef

Groner, Rob

Nov 30, 2023, 9:17:18 AM
to Ravi Konila, Slurm User Community List
Did you have --with-nvml as part of your configuration?  Go back to your config.log and verify that it ever said it found nvml.h.

If not, then you'll need to make sure you have the right nvidia/cuda packages installed on the host you're building slurm on, and you might have to specify --with-nvml=<path to nvml install> if it's not in a standard location.

Rob
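
The config.log check suggested above can be sketched as a small shell helper. This is only a minimal sketch: the function name nvml_seen_in_config_log and the default file name are assumptions for illustration, and the exact wording configure writes into config.log may differ between Slurm versions, so the helper simply prints every line that mentions nvml for inspection by eye.

```shell
#!/bin/sh
# Sketch (hypothetical helper): show every line of a Slurm build's config.log
# that mentions nvml, so you can see whether configure found nvml.h.
# Returns 0 if at least one line matched, 1 if none did, 2 if the log is unreadable.
nvml_seen_in_config_log() {
    log="${1:-config.log}"    # assumed default: config.log in the current directory
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
    # -n prints line numbers, -i ignores case (NVML / nvml).
    grep -n -i 'nvml' "$log"
}
```

Run it as nvml_seen_in_config_log <path to your build directory>/config.log. If nothing is printed, or the printed lines say the header was not found, re-run configure with --with-nvml (or --with-nvml=<path to nvml install>) as described above.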


Shunran Zhang

Nov 30, 2023, 9:54:48 AM
to Ravi Konila, Slurm User Community List
Hi all,

If you could share a few more details about your OS and Slurm version, that might shed some light.

There is an interesting detail about the NVML package if you are using a RHEL-like OS.
The NVML detection part of the Slurm library (/usr/lib64/slurm/gpu_nvml.so) is linked against /lib64/libnvidia-ml.so.1 to do the actual detection.
If you do a simple NVIDIA driver installation that pulls in nvidia-driver-NVML from the cuda-rhel8-x86_64 repository,
this package installs /lib64/libnvidia-ml.so.1 as a symlink to /lib64/libnvidia-ml.so.<your driver version>.
In this setup, as the linked library is present, the code does not crash.

However, the package mentioned above lacks another symlink: /lib64/libnvidia-ml.so pointing to /lib64/libnvidia-ml.so.<your driver version>.
Take a look at the following line of the Slurm source code (I just used the master branch, but git blame says it has been there for a long time):

"""
if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))
"""

So even though nvidia-driver-NVML is installed, and the system can find the library the plugin was linked against (libnvidia-ml.so.1),
the unversioned libnvidia-ml.so link is not provided, so the dlopen fails with file not found, and the error message you posted follows.

In our case, I just manually created the missing symlink with ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so, and NVML worked as expected.

I wonder whether this is a packaging issue on the NVIDIA side, or whether it should be filed as a bug against Slurm for only checking
for the .so library without any version suffix.

Your case might be different, but as I think the error message is a direct result of Slurm being unable to find /lib64/libnvidia-ml.so, you should take
a look at your setup to see whether that .so file is installed: if the library itself is missing, install the package; if only the unversioned link is missing, create the symlink.

Sincerely,
S. Zhang
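
To see whether a host is in the symlink situation described above, a quick check along these lines may help. It is a sketch only: check_nvml_links is a hypothetical helper name, /lib64 is the directory from this post and is just the default, and the helper looks in that one directory rather than following the full search path that dlopen uses.

```shell
#!/bin/sh
# Sketch (hypothetical helper): report which libnvidia-ml names exist in one
# library directory. [ -e ] follows symlinks, so a dangling link counts as missing.
# Returns 0 if both names resolve, 1 if either one is missing.
check_nvml_links() {
    dir="${1:-/lib64}"        # assumed default, taken from the paths in this post
    missing=0
    for name in libnvidia-ml.so libnvidia-ml.so.1; do
        if [ -e "$dir/$name" ]; then
            echo "found:   $dir/$name"
        else
            echo "missing: $dir/$name"
            missing=1
        fi
    done
    return "$missing"
}
```

If check_nvml_links reports libnvidia-ml.so.1 as found but libnvidia-ml.so as missing, that matches the case where creating the unversioned symlink by hand made NVML work for us.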

On Thu, Nov 30, 2023 at 23:23, Ravi Konila <ravi...@gmail.com> wrote:

Shunran Zhang

Nov 30, 2023, 10:03:33 AM
to Slurm User Community List, rug...@psu.edu, Ravi Konila
Hi all,

Apologies for writing something misleading in the last mail. I missed your error message.

Rob was correct: your slurmd appears to have been built without NVML support.
You need to set up NVML and turn the --with-nvml flag on when configuring Slurm to fix the issue if you are compiling it yourself, or find a binary package that was compiled with that flag on.

Credit to Rob - WE ARE
S. Zhang
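
One rough way to tell whether an installed Slurm was built with NVML support is to look for the NVML GPU plugin in the plugin directory. This is a sketch under assumptions: has_nvml_plugin is a hypothetical helper, /usr/lib64/slurm is the plugin path quoted earlier in this thread and may differ on your system, and it assumes a build configured without NVML does not install gpu_nvml.so.

```shell
#!/bin/sh
# Sketch (hypothetical helper): check whether a Slurm plugin directory
# contains gpu_nvml.so. Returns 0 if the file is present, 1 otherwise.
has_nvml_plugin() {
    plugindir="${1:-/usr/lib64/slurm}"   # assumed default, from the path quoted in the thread
    if [ -f "$plugindir/gpu_nvml.so" ]; then
        echo "present: $plugindir/gpu_nvml.so"
    else
        echo "absent:  $plugindir/gpu_nvml.so"
        return 1
    fi
}
```

If the plugin is absent, rebuilding with --with-nvml (or installing a package built with it) is the fix discussed above; if it is present, the problem is more likely on the library side.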

On Thu, Nov 30, 2023 at 23:30, Groner, Rob <rug...@psu.edu> wrote: