[slurm-users] How to use Autodetect=nvml in gres.conf


Dean Schulze

Feb 5, 2020, 3:07:28 PM
to Slurm User Community List
I need to dynamically configure gpus on my nodes.  The gres.conf doc says to use

Autodetect=nvml

in gres.conf instead of adding configuration details to each gpu in gres.conf.  The docs aren't really clear about this because they show an example with the details for each gpu:

AutoDetect=nvml
Name=gpu Type=gp100  File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gp100  File=/dev/nvidia1 Cores=0,1
Name=gpu Type=p6000  File=/dev/nvidia2 Cores=2,3
Name=gpu Type=p6000  File=/dev/nvidia3 Cores=2,3
Name=mps Count=200  File=/dev/nvidia0
Name=mps Count=200  File=/dev/nvidia1
Name=mps Count=100  File=/dev/nvidia2
Name=mps Count=100  File=/dev/nvidia3
Name=bandwidth Type=lustre Count=4G
First Question:  If I use Autodetect=nvml, do I also need to specify File= and Cores= for each gpu in gres.conf?  I'm hoping that with Autodetect=nvml all I need is the Name= and Type= for each gpu.  Otherwise it's not clear what the purpose of setting Autodetect=nvml would be.

Second Question:  I installed the CUDA tools from the binary cuda_10.2.89_440.33.01_linux.run.  When I restart slurmd with Autodetect=nvml in gres.conf I get this error:

fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.

Is there something else I need to configure to tell slurmd how to use nvml?


Stephan Roth

Feb 7, 2020, 4:23:14 AM
to slurm...@lists.schedmd.com
On 05.02.20 21:06, Dean Schulze wrote:
> I need to dynamically configure gpus on my nodes. The gres.conf doc
> says to use
>
> Autodetect=nvml

That's all you need in gres.conf provided you don't configure any
Gres=... entries for your nodes in your slurm.conf.
If you do, make sure the string matches what NVML discovers, i.e.
lowercase and underscores instead of spaces or dashes.
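
To see whether that proviso applies on a given node, one option is to list the node definitions in slurm.conf that carry a Gres= entry. This is only a minimal sketch, not a Slurm tool: the function name list_node_gres is made up for illustration, the path to slurm.conf is an assumption, and node definitions continued over several lines with a trailing backslash are not handled.

```shell
# Hypothetical helper (not part of Slurm): print every NodeName= line in a
# slurm.conf-style file that also sets Gres=. Slurm keywords are
# case-insensitive, hence grep -i. Commented-out lines do not match because
# the pattern only allows whitespace before NodeName=.
list_node_gres() {
    grep -i '^[[:space:]]*NodeName=.*[[:space:]]Gres=' "$1"
}

# Example call; the path is an assumption and depends on how Slurm was installed.
# list_node_gres /etc/slurm/slurm.conf || echo "no Gres= on any node line"
```

If the function prints nothing, no node line sets Gres= and, going by the reply above, a gres.conf containing only the AutoDetect line should be enough.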

The upside of configuring everything is you will be informed in case the
automatically detected GPUs in a node don't match what you configured.

I guess the version of slurm you're using was linked against a version
of NVML which has been overwritten by your installation of Cuda 10.2.

If that's the case there are various ways to solve that problem, but
that depends on your reason to install Cuda 10.2.

My recommendation is to use the Cuda version of your system matching
your system's slurm package and to install Cuda 10.2 in a non-default
location, provided you need to make it available on a cluster node.

If people using your cluster ask for Cuda 10.2 they have the option of
using a virtual conda environment and installing Cuda 10.2 there.


Cheers,
Stephan

dean.w....@gmail.com

Feb 7, 2020, 10:17:28 AM
to Slurm User Community List
I didn't know that slurm had nvml linked into it. I built slurm from source and didn't notice that nvml was part of the build. I'll check on that again.

dean.w....@gmail.com

Feb 7, 2020, 11:04:20 AM
to Slurm User Community List
I just checked the .deb package that I built from source and there is nothing in it that has nv or cuda in its name.

Are you sure that slurm distributes nvidia binaries?

Stephan Roth

Feb 7, 2020, 11:58:10 AM
to slurm...@lists.schedmd.com
gpu_nvml.so links to libnvidia-ml.so:

$ ldd lib/slurm/gpu_nvml.so
...
libnvidia-ml.so.1 => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 (0x00007f2d2bac8000)
...

When you run configure you'll see something along these lines:
-------------------------------------------------------------------
Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59 | ETF D 104 | Sternwartstrasse 7 | 8092 Zurich
-------------------------------------------------------------------
GPG Fingerprint: E2B9 1B4F 4D35 F233 BE12 1BE9 B423 4018 FBC0 EA17

Stephan Roth

Feb 7, 2020, 12:03:19 PM
to slurm...@lists.schedmd.com
I didn't say slurm distributes nvidia binaries. But slurm's gpu_nvml.so
links to libnvidia-ml.so if it was found at build time:

$ ldd lib/slurm/gpu_nvml.so
...
libnvidia-ml.so.1 => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 (0x00007f2d2bac8000)
...
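
The same ldd check can be wrapped in a small function that also covers the case where the plugin was never built. This is a sketch under assumptions: check_nvml_plugin is a made-up name, and the plugin path in the example call depends on the install prefix of your build.

```shell
# Hypothetical helper (not part of Slurm): report whether a gpu_nvml.so
# plugin exists and whether ldd shows it linked against libnvidia-ml.
# Return codes: 0 = linked, 1 = plugin file missing, 2 = present but not linked.
check_nvml_plugin() {
    plugin="$1"
    if [ ! -f "$plugin" ]; then
        echo "missing: $plugin"
        return 1
    fi
    # Capture ldd output first; an ldd failure is treated as "no dependencies".
    deps=$(ldd "$plugin" 2>/dev/null) || deps=""
    case "$deps" in
        *libnvidia-ml.so*)
            echo "ok: $plugin links against libnvidia-ml"
            return 0
            ;;
    esac
    echo "not linked: $plugin does not reference libnvidia-ml"
    return 2
}

# Example call; the path is an assumption based on the relative path above.
# check_nvml_plugin /usr/lib/slurm/gpu_nvml.so
```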


When I run configure without any options I see

...
checking nvml.h usability... yes
checking nvml.h presence... yes
checking for nvml.h... yes
checking for nvmlInit in -lnvidia-ml... no
configure: WARNING: unable to locate libnvidia-ml.so and/or nvml.h
...
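
The warning above means the linker did not find libnvidia-ml.so on its default search path. One way to locate the directory to pass via LDFLAGS is to search the usual library roots. This is a minimal sketch: find_nvml_dirs is a made-up name and the roots in the example call are guesses that vary by distro and driver package.

```shell
# Hypothetical helper: print each directory below the given roots that
# contains a file or symlink named libnvidia-ml.so*, one directory per line.
find_nvml_dirs() {
    for root in "$@"; do
        [ -d "$root" ] || continue
        find "$root" -name 'libnvidia-ml.so*' 2>/dev/null
    done | while IFS= read -r path; do
        dirname "$path"
    done | sort -u
}

# Example call; these roots are assumptions, adjust them for your system.
# find_nvml_dirs /usr/lib /usr/lib64 /usr/local
```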


With LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current it works:

checking nvml.h usability... yes
checking nvml.h presence... yes
checking for nvml.h... yes
checking for nvmlInit in -lnvidia-ml... yes
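
To make this check repeatable, the configure output can be saved to a file and tested for the nvmlInit line. This is a sketch only: nvml_detected and the file name configure.out are made up for illustration, and it parses the terminal output of configure as shown above, not config.log, which uses a different line format.

```shell
# Hypothetical helper: succeed if the saved stdout of ./configure shows
# that nvmlInit was found in libnvidia-ml.
nvml_detected() {
    grep -q 'checking for nvmlInit in -lnvidia-ml\.\.\. yes' "$1"
}

# Example use; configure.out is an arbitrary file name chosen here.
# ./configure LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current | tee configure.out
# nvml_detected configure.out && echo "NVML found at configure time"
```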

Cheers,
Stephan


Christopher Samuel

Feb 7, 2020, 12:08:17 PM
to slurm...@lists.schedmd.com
Hi Dean,

On 2/7/20 8:03 AM, dean.w....@gmail.com wrote:

> I just checked the .deb package that I build from source and there is nothing in it that has nv or cuda in its name.
>
> Are you sure that slurm distributes nvidia binaries?

SchedMD only distributes sources; it's up to distros how they package it.

I suspect you'll need to build it yourself if you want NVML support. I
doubt many distros will want to be distributing builds linked against
non-free nvidia libraries.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Dean Schulze

Feb 7, 2020, 1:49:53 PM
to Slurm User Community List
So this is related to the gpu/nvml plugin in the source code tree.  That didn't get built because I didn't have the nvidia driver (really the library libnvidia-ml.so) installed when I built the code.  I see in config.log where it tries to find -lnvidia-ml, and it skips building the gpu/nvml plugin if it doesn't find it.

So in order to use Autodetect=nvml in gres.conf you have to install the nvidia driver before building the source code.

I wish they would document some of these things.
