[slurm-users] How to reinstall / reconfigure Slurm?


Shooktija S N via slurm-users

Apr 3, 2024, 7:03:06 AM
to slurm...@lists.schedmd.com
Hi,

I am setting up Slurm on our lab's 3-node cluster and have run into a problem while adding GPUs (each node has an NVIDIA 4070 Ti) as a GRES. There is an error at the 'debug' log level in slurmd.log saying that the GPU is file-less and is being removed from the final GRES list. According to some older posts on this forum, this error might be fixed by reinstalling / reconfiguring Slurm with the right flag ('--with-nvml', according to this post).

Line in /var/log/slurmd.log:
[2024-04-03T15:42:02.695] debug:  Removing file-less GPU gpu:rtx4070 from final GRES list

Does this error require me to reinstall or reconfigure Slurm? What does 'reconfigure Slurm' actually mean?
I'm about as clueless as a caveman with a smartphone when it comes to Slurm administration and Linux system administration in general, so if you could, please explain it to me as simply as possible.

slurm.conf without comment lines:
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

gres.conf (only one line):
AutoDetect=nvml

While installing CUDA, I know that NVML was installed because of this line in /var/log/cuda-installer.log:
[INFO]: Installing: cuda-nvml-dev
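
As a side note, I gather slurmd can print the GRES it detects and then exit, which seems like a way to check whether NVML autodetection is actually working; a minimal sketch, assuming the slurmd binary is on root's PATH:

sudo slurmd -G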

Thanks!

PS: I could've added this as a continuation of this post, but for some reason I do not have permission to post to that group, so here I am starting a new thread.

Williams, Jenny Avis via slurm-users

Apr 3, 2024, 11:50:26 AM
to Shooktija S N, slurm...@lists.schedmd.com

The Slurm source code should be downloaded and recompiled with the configuration flag --with-nvml.

As an example, this is our current method, using the rpmbuild mechanism to recompile Slurm and generate RPMs. Be aware that the compile enables a given option only if it finds that option's prerequisites on the host (e.g. to recompile with --with-nvml, do so on a functioning GPU host).

========

export VERSION=23.11.5

wget https://download.schedmd.com/slurm/slurm-$VERSION.tar.bz2

rpmbuild --define="_with_nvml --with-nvml=/usr" \
         --define="_with_pam --with-pam=/usr" \
         --define="_with_pmix --with-pmix=/usr" \
         --define="_with_hdf5 --without-hdf5" \
         --define="_with_ofed --without-ofed" \
         --define="_with_http_parser --with-http-parser=/usr/lib64" \
         --define="_with_yaml --with-yaml" \
         --define="_with_jwt --with-jwt" \
         --define="_with_slurmrestd --with-slurmrestd=1" \
         -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date +%F` 2>&1
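
Once the build completes, the resulting RPMs are typically left under ~/rpmbuild/RPMS/<arch>/ (rpmbuild's default output tree for a non-root user) and can be installed from there; a rough sketch:

# adjust the arch directory to match your build host
ls ~/rpmbuild/RPMS/x86_64/slurm-*.rpm
dnf install ~/rpmbuild/RPMS/x86_64/slurm-*.rpm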

This is a list of packages we ensure are installed on a given node when running this compile:

    - pkgs:
      - bzip2
      - cuda-nvml-devel-12-2
      - dbus-devel
      - freeipmi
      - freeipmi-devel
      - gcc
      - gtk2-devel
      - hwloc-devel
      - libjwt-devel
      - libssh2-devel
      - libyaml-devel
      - lua-devel
      - make
      - mariadb-devel
      - munge-devel
      - munge-libs
      - ncurses-devel
      - numactl-devel
      - openssl-devel
      - pam-devel
      - perl
      - perl-ExtUtils-MakeMaker
      - readline-devel
      - rpm-build
      - rpmdevtools
      - rrdtool-devel
      - http-parser-devel
      - json-c-devel

Shooktija S N via slurm-users

Apr 4, 2024, 5:14:33 AM
to Williams, Jenny Avis, slurm...@lists.schedmd.com
Thank you for the response; it certainly clears up a few things, and the list of required packages is super helpful (where are these listed in the docs?).

Here are a few follow up questions:

I had installed Slurm (version 22.05) using apt by running 'apt install slurm-wlm'. Is it necessary to remove that installation first, with a command like 'apt-get autoremove slurm-wlm', before compiling the Slurm source code from scratch as you've described?

You have given this command as an example:
rpmbuild --define="_with_nvml --with-nvml=/usr" \
         --define="_with_pam --with-pam=/usr" \
         --define="_with_pmix --with-pmix=/usr" \
         --define="_with_hdf5 --without-hdf5" \
         --define="_with_ofed --without-ofed" \
         --define="_with_http_parser --with-http-parser=/usr/lib64" \
         --define="_with_yaml --with-yaml" \
         --define="_with_jwt --with-jwt" \
         --define="_with_slurmrestd --with-slurmrestd=1" \
         -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date +%F` 2>&1

Are the options you've used in this example command fairly standard for a 'general' installation of Slurm? Where can I learn more about these options, so that I don't miss any that might be necessary for my cluster's specs?

Would I have to add the paths to the compiled binaries to the PATH or LD_LIBRARY_PATH environment variables?

My nodes are running an OS based on Debian 12 (Proxmox VE). What is the 'rpmbuild' equivalent for my OS? Would the syntax used in your example command be the same for any build tool?
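
For what it's worth, the only non-rpm route I can picture is the classic configure / make build from the same tarball, roughly like the sketch below, though I don't know whether that is the recommended equivalent; the --with-nvml path here is just my guess at where NVML lives on my nodes:

wget https://download.schedmd.com/slurm/slurm-23.11.5.tar.bz2
tar -xaf slurm-23.11.5.tar.bz2
cd slurm-23.11.5
# point --with-nvml at the directory containing include/nvml.h and the libnvidia-ml library
./configure --prefix=/usr --sysconfdir=/etc/slurm --with-nvml=/usr/local/cuda
make -j$(nproc)
sudo make install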

Thanks!

Shooktija S N via slurm-users

Apr 8, 2024, 4:52:57 AM
to slurm...@lists.schedmd.com
Follow-up:
I was able to fix my problem by following the advice in this post, which said that the GPU GRES could be configured manually (without autodetect) by adding a line like 'NodeName=slurmnode Name=gpu File=/dev/nvidia0' to gres.conf.
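
For a cluster like mine, that presumably means a gres.conf entry roughly like the following (assuming the single RTX 4070 Ti on each node shows up as /dev/nvidia0):

# manual GRES definition matching Gres=gpu:rtx4070:1 in slurm.conf
NodeName=server[1-3] Name=gpu Type=rtx4070 File=/dev/nvidia0

After restarting slurmd and slurmctld, something like 'srun --gres=gpu:1 nvidia-smi -L' should confirm the GPU is usable through Slurm.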