On 05.02.20 21:06, Dean Schulze wrote:
> I need to dynamically configure gpus on my nodes. The gres.conf doc
> says to use
>
> Autodetect=nvml
That's all you need in gres.conf provided you don't configure any
Gres=... entries for your nodes in your slurm.conf.
If you do, make sure the string matches what NVML discovers, i.e.
lowercase and underscores instead of spaces or dashes.
The upside of configuring everything explicitly is that you will be
informed if the automatically detected GPUs in a node don't match what
you configured.
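As a rough sketch of how the two files fit together when you do
configure Gres explicitly: the node name (node01), the GPU count and the
type string (tesla_t4, i.e. what I'd expect for a card NVML reports as
"Tesla T4") are made-up placeholders, and all other node parameters are
left out.

```
# gres.conf on the GPU node
Autodetect=nvml

# slurm.conf (hypothetical node name, GPU type and count;
# other NodeName parameters omitted)
GresTypes=gpu
NodeName=node01 Gres=gpu:tesla_t4:2
```

If in doubt about the exact type string, running "slurmd -G" on the node
should print the GRES configuration slurmd ends up with, assuming your
Slurm version supports that option.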
I guess the version of Slurm you're using was linked against a version
of NVML which has been overwritten by your installation of CUDA 10.2.
If that's the case, there are various ways to solve the problem; which
one fits depends on why you installed CUDA 10.2.
My recommendation is to keep the CUDA version that matches your
system's Slurm package as the system default, and to install CUDA 10.2
in a non-default location if you need to make it available on a
cluster node.
If people using your cluster ask for CUDA 10.2, they have the option of
creating a conda environment and installing CUDA 10.2 there.
Cheers,
Stephan