Hi all,
If you could offer a few more details on your OS and Slurm version, that might shed some light.
There is an interesting detail about the NVML package if you are using a RHEL-like OS.
The NVML detection part of the Slurm library (/usr/lib64/slurm/gpu_nvml.so) is linked against /lib64/libnvidia-ml.so.1 to do the actual detection.
If you do a simple NVIDIA driver installation that pulls in nvidia-driver-NVML from the cuda-rhel8-x86_64 repository,
this package installs /lib64/libnvidia-ml.so.1 as a symlink to /lib64/libnvidia-ml.so.<your driver version>.
In this setup, since the linked library is present, the code does not crash.
However, interestingly, the package mentioned above misses another symlink: /lib64/libnvidia-ml.so pointing to /lib64/libnvidia-ml.so.<your driver version>.
Take a look at the following line of the Slurm source code (I checked the master branch, but git blame says it goes back a long way):
"""
if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))
"""
So even though nvidia-driver-NVML is installed, and the system can find the library the plugin was linked against (libnvidia-ml.so.1),
the libnvidia-ml.so name is not provided, so the dlopen fails with file not found, and the error message you posted follows.
In our case, I manually created the missing symlink with ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so, and NVML detection worked as expected.
I wonder whether this arose from a packaging issue on the NVIDIA side, or whether it should be filed as a Slurm bug, since the code only checks
for the .so name without any versioning suffix.
Your case might be different, but since the error message is a direct result of Slurm being unable to find /lib64/libnvidia-ml.so, you should take
a look at your setup to see whether that file exists - if not, install the package that provides it, or create the missing symlink as above.
Sincerely,
S. Zhang