Hi all,
About to embark on my first Slurm upgrade (building from source now, into a versioned path /opt/slurm/<vernum>/ which is then symlinked to /opt/slurm/current/ for the “in-use” one…). This is a new cluster running 20.11.5 (which we now know has a CVE that was fixed in 20.11.7), but I have researchers running jobs on it currently. As I’m still building out the cluster, I found today that all Slurm source tarballs before 20.11.7 have been withdrawn by SchedMD, so I need to upgrade at least the -ctld and -dbd nodes before I can roll any new nodes out on 20.11.7…
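For anyone following along, the rough shape of what I'm doing is something like this (the version number, prefix, and configure flags below are just what I happen to be using locally, so adjust for your site):

    # build into a versioned prefix, then point the "current" symlink at it
    tar xjf slurm-20.11.7.tar.bz2 && cd slurm-20.11.7
    ./configure --prefix=/opt/slurm/20.11.7 --sysconfdir=/etc/slurm
    make -j"$(nproc)"
    sudo make install
    # flip the symlink so everything referencing /opt/slurm/current picks up the new build
    sudo ln -sfn /opt/slurm/20.11.7 /opt/slurm/current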
As I have at least one researcher running some long multi-day jobs: can I take the -dbd and -ctld nodes down, upgrade them, and bring them back online running the new (latest) release without munging the jobs on the running worker nodes?
Thanks!
Will
Yup, in our case, it would be 20.11.5 -> 20.11.7.
On Wednesday, May 26, 2021 at 2:49 PM, Ole Holm Nielsen said:
> I recommend strongly to read the SchedMD presentations in the
> [snipped] page, especially the "Field
> notes" documents. The latest one is "Field Notes 4: From The Frontlines
> of Slurm Support", Jason Booth, SchedMD.
Yes, thanks for the reminder.
> We upgrade Slurm continuously while the nodes are in production mode.
> There's a required order of upgrading: first slurmdbd, then slurmctld,
> then slurmd nodes, and finally login nodes, see
> [snipped]
> The detailed upgrading commands for CentOS are in [snipped]
Yes, in our case it’s Ubuntu; as there is no (recent) official packaging, and maintaining a PPA is a lot of work, we are just compiling from source locally now, which SchedMD (who we get support from) prefers anyhow.
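So my rough plan for the controller side, following that ordering, is something like the below (assuming the stock systemd units are in place; the accounting DB name, state save directory, and install paths are just our local defaults, so treat this as a sketch rather than gospel):

    # on the slurmdbd host: stop the daemon, back up the accounting DB, switch to the new build, restart
    sudo systemctl stop slurmdbd
    mysqldump -p slurm_acct_db > slurm_acct_db.pre-upgrade.sql
    sudo ln -sfn /opt/slurm/20.11.7 /opt/slurm/current
    sudo systemctl start slurmdbd

    # then on the slurmctld host: same idea, plus a cheap backup of the state save location
    sudo systemctl stop slurmctld
    sudo tar czf slurmctld-state.pre-upgrade.tar.gz /var/spool/slurmctld
    sudo ln -sfn /opt/slurm/20.11.7 /opt/slurm/current
    sudo systemctl start slurmctld

    # slurmd on the compute nodes and the login nodes can then follow, a node at a time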