Hi all,
If you could offer a few more details on your OS and Slurm version, that might shed some light.
There is an interesting detail about the NVML package if you are using a RHEL-like OS.
The NVML detection part of the Slurm library (/usr/lib64/slurm/gpu_nvml.so) is linked against /lib64/libnvidia-ml.so.1 to do the actual detection.
If you do a simple NVIDIA driver installation that pulls in nvidia-driver-NVML from the cuda-rhel8-x86_64 repository,
this package installs /lib64/libnvidia-ml.so.1 as a symlink to /lib64/libnvidia-ml.so.<your driver version>.
In this setup, since the linked library is present, the code does not crash.
However, interestingly, the package mentioned above misses another symlink: /lib64/libnvidia-ml.so pointing to /lib64/libnvidia-ml.so.<your driver version>.
Take a look at the following line of the Slurm source code (I checked the master branch, but git blame says it goes back a long way):
"""
if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))
"""
So even though nvidia-driver-NVML is installed, and the system can find the library the plugin was linked against (libnvidia-ml.so.1),
the libnvidia-ml.so name is not provided, so the dlopen fails with file not found, and the error message you posted follows.
In our case, I manually created the missing symlink with ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so, and NVML detection worked as expected.
I wonder whether this arose from a packaging issue on the NVIDIA side, or whether it should be filed as a Slurm bug, since the code only checks
for the .so name without any versioning suffix.
Your case might be different, but since the error message is a direct result of Slurm being unable to find /lib64/libnvidia-ml.so, you should take
a look at your setup to see whether that file exists - if not, install the package that provides it, or create the missing symlink as above.
Sincerely,
S. Zhang