Help Needed with NVIDIA Driver Configuration on HPC Cluster, IsoNet

Robin

Aug 18, 2024, 6:41:44 AM
to IsoNet

Hi everyone,

Context: I am trying to train IsoNet on 5 of 25 tomograms on our HPC infrastructure. Training is currently not feasible within the job's time limit: the job is terminated before it completes the first iteration.

I'm encountering an issue while trying to run nvidia-smi on our HPC cluster. When executing the following command:

srun --pty --overlap --jobid my_job_ID nvidia-smi

I receive the following warning and error message:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Linked to libnvidia-ml library at wrong path : /storage/software/broadwell.9/software/CUDA/12.2.0/stubs/lib64/libnvidia-ml.so

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

It seems like there's a configuration issue where libnvidia-ml.so is being linked to the wrong path (/storage/software/broadwell.9/software/CUDA/12.2.0/stubs/lib64/libnvidia-ml.so) instead of the expected default locations (/usr/lib or /usr/lib64).

This is causing nvidia-smi to fail because it can't communicate with the NVIDIA driver. The job also exits with code 9.
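One workaround I am considering is sketched below. It assumes the stub directory reaches the dynamic loader through LD_LIBRARY_PATH (for example, set by the CUDA module), which I have not confirmed; if the path comes from an rpath or another variable, this will not help. The function name strip_stub_dirs is my own and purely illustrative.

```shell
# Hypothetical helper (not part of CUDA, SLURM, or IsoNet): print a
# colon-separated search path with every CUDA "stubs" entry removed.
# The body runs in a subshell so the IFS and globbing changes do not leak.
strip_stub_dirs() (
    # $1: colon-separated path list, e.g. the value of LD_LIBRARY_PATH
    result=""
    IFS=':'
    set -f                      # no filename expansion while splitting
    for entry in $1; do
        case "$entry" in
            ""|*/stubs|*/stubs/*) continue ;;   # skip empty and stub entries
        esac
        result="${result:+$result:}$entry"
    done
    printf '%s\n' "$result"
)

# Assumed usage, before re-running the srun command above:
#   export LD_LIBRARY_PATH="$(strip_stub_dirs "$LD_LIBRARY_PATH")"
```

If nvidia-smi then links against the libnvidia-ml.so in /usr/lib or /usr/lib64, the stub path was the cause; if it still fails, the driver may be missing or not running on the node.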

Could anyone provide guidance on how to resolve this issue? Does this indicate a configuration problem with the NVIDIA driver or CUDA installation on the cluster? And does this mean my jobs are running on the CPU rather than the GPU, even though SLURM should downgrade GPU requests?
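To narrow down the CPU-versus-GPU question without relying on nvidia-smi, a rough check I could run is sketched below. It assumes our SLURM setup exports CUDA_VISIBLE_DEVICES for jobs that were actually granted GPUs, which depends on how GPUs are configured on the cluster; count_visible_gpus is a hypothetical name of my own.

```shell
# Hypothetical helper: count the device IDs listed in a
# CUDA_VISIBLE_DEVICES-style value (comma-separated). Prints 0 if empty.
# This only counts entries; it does not verify that the GPUs are usable.
count_visible_gpus() (
    # $1: value of CUDA_VISIBLE_DEVICES, e.g. "0,1"
    [ -n "${1-}" ] || { echo 0; exit 0; }   # exit leaves only the subshell
    IFS=','
    set -f                                  # no filename expansion
    set -- $1                               # split the list into arguments
    echo "$#"
)

# Assumed usage inside the running job:
#   srun --overlap --jobid my_job_ID bash -c 'echo "$CUDA_VISIBLE_DEVICES"'
# and then pass the printed value to count_visible_gpus.
```

A result of 0 would suggest that no GPU was exposed to the job at all, whereas a non-zero count alongside the nvidia-smi failure would point at the library path rather than the allocation.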

Any help would be greatly appreciated!

Thanks in advance!

