On Jan 23, 2021, at 15:44, Philip Kovacs <pkd...@yahoo.com> wrote:
Several things to keep in mind...
Andy Riebs
Also, a plug for support contracts. I have been doing Slurm for a very
long while, but I always encourage my clients to get a support contract.
That is how SchedMD stays alive and how we are able to have such a good
piece of software. I see the cloud providers starting to build tools
that will eventually make Slurm obsolete for the cloud, and I worry that
there won't be enough paying customers for Tim to keep things running as
well as he has. I'm pretty sure most folks who use Slurm for any period
of time have received more value from it than a small support contract
would cost.
You should check your jobs that allocated GPUs and make sure
CUDA_VISIBLE_DEVICES is being set in the environment. If it is not set,
that is a sign your GPU support is not really there and Slurm is just
doing "generic" resource assignment.
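One quick way to check (the GRES request below is just an example) is to allocate a GPU and print the variable:

$ srun --gres=gpu:1 env | grep CUDA_VISIBLE_DEVICES

If GPU support is wired up, this typically prints a device index such as CUDA_VISIBLE_DEVICES=0; if nothing is printed, Slurm is only handing out the GPUs as opaque generic resources.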
I have both GPU and non-GPU nodes, so I build the Slurm RPMs twice: once on a
non-GPU node, and those RPMs get installed on the non-GPU nodes; then again on
the GPU node, where CUDA is installed via the NVIDIA CUDA YUM repo RPMs, so the
NVML lib is at /lib64/libnvidia-ml.so.1 (from rpm
nvidia-driver-NVML-455.45.01-1.el8.x86_64) and no special mods to the default
RPM SPEC are needed. I just run
rpmbuild -tb slurm-20.11.3.tar.bz2
You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml' and see
that /usr/lib64/slurm/gpu_nvml.so only exists on the one built on the
GPU node.
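Once the NVML-enabled package is installed on a GPU node, one extra sanity check (run on that node) is to ask slurmd to print the GRES it sees:

$ slurmd -G

With the NVML plugin present and autodetection enabled, this lists the GPUs found via the driver; on the non-NVML build it only reports whatever gres.conf declares by hand.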
On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:
> In another thread, on 26-01-2021 17:44, Prentice Bisbal wrote:
>> Personally, I think it's good that Slurm RPMs are now available through
>> EPEL, although I won't be able to use them, and I'm sure many people on
>> the list won't be able to either: licensing issues prevent EPEL from
>> providing support for NVIDIA drivers, so those of us with GPUs on our
>> clusters will still have to compile Slurm from source to include NVIDIA
>> GPU support.
>
> We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes.
> The Slurm GPU documentation seems to be https://slurm.schedmd.com/gres.html.
> We don't seem to have any problems scheduling jobs on GPUs, even though our
> Slurm RPM build host doesn't have any NVIDIA software installed, as shown by
> the command:
> $ ldconfig -p | grep libnvidia-ml
>
> I'm curious about Prentice's statement that NVIDIA libraries need to be
> installed when building Slurm RPMs, and I have read the discussion in bug 9525.
You can include GPUs as GRES in Slurm without compiling specifically against NVML. You only really need to do that if you want to use the autodetection features that have been built into Slurm. We don't really use any of those features at our site; we only started building against NVML to future-proof ourselves for when/if those features become relevant to us.
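For illustration, here is a minimal sketch of the two approaches (the node name, device paths, and GPU count are made up):

# slurm.conf: declare the GRES type and per-node count (needed in both cases)
GresTypes=gpu
NodeName=gpunode01 CPUs=32 Gres=gpu:4 State=UNKNOWN

# gres.conf, option 1: enumerate the devices by hand; no NVML needed at build time
NodeName=gpunode01 Name=gpu File=/dev/nvidia[0-3]

# gres.conf, option 2: let slurmd autodetect the GPUs; requires a build linked against NVML
AutoDetect=nvml

With option 2, slurmd also picks up GPU type and CPU affinity from the driver, which is the kind of future-proofing mentioned above.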
To me at least it would be nicer if there were a less hacky way of
getting it to do that. Arguably Slurm should dynamically link
against the libs it needs, or not, depending on the node. We hit
this issue with Lustre/IB as well, where you have to roll a
separate Slurm build for each type of node you have if you want
these features, which is hardly ideal.
-Paul Edmon-
I've definitely been there with the minimum-cost issue. One thing I have done personally is start attending SLUG, so now I can give back and learn more in the process. That may be an option to pitch, reiterating the value you receive from open source software as part of the ROI.
Interestingly, I have been able to deploy completely to the cloud using only Slurm. It has the ability to integrate with any cloud CLI, so nothing else has been needed. Just for the heck of it, I am thinking of integrating it with Terraform, although that isn't necessary.
Brian Andrus
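As a concrete illustration of the cloud integration Brian describes, here is a minimal sketch of Slurm's power-save hooks driving a cloud CLI (the node names, sizes, and script paths are hypothetical):

# slurm.conf excerpt: elastic/cloud nodes managed through power saving
ResumeProgram=/usr/local/sbin/cloud_resume.sh    # receives a hostlist; wraps the cloud CLI to create instances
SuspendProgram=/usr/local/sbin/cloud_suspend.sh  # receives a hostlist; wraps the cloud CLI to delete instances
SuspendTime=600        # seconds a node sits idle before it is suspended
ResumeTimeout=900      # seconds allowed for a new instance to boot and register
NodeName=cloud[001-010] CPUs=8 RealMemory=30000 State=CLOUD
PartitionName=cloud Nodes=cloud[001-010] MaxTime=INFINITE State=UP

The two scripts are where any provider-specific CLI (or Terraform, as mentioned) would be invoked; Slurm itself only cares that the named nodes boot and register within the timeout.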