[slurm-users] Upgrading slurm - can I do it while jobs running?

Will Dennis

May 26, 2021, 2:24 PM
to slurm...@lists.schedmd.com

Hi all,
About to embark on my first Slurm upgrade. I'm building from source into a versioned path /opt/slurm/<vernum>/, which is then symlinked to /opt/slurm/current/ as the "in-use" version. This is a new cluster running 20.11.5 (which we now know has a CVE that was fixed in 20.11.7), but I have researchers running jobs on it currently. While building out the cluster, I found today that SchedMD has withdrawn all Slurm source tarballs before 20.11.7, so I need to upgrade at least the -ctld and -dbd nodes before I can roll any new nodes out on 20.11.7.
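A minimal sketch of that versioned layout and symlink switch (the configure flags are a typical, assumed invocation, and a scratch directory stands in for /opt/slurm so the snippet is safe to run):

```shell
#!/bin/sh
# Sketch of a versioned install with a "current" symlink. On a real cluster
# SLURM_ROOT would be /opt/slurm; here a scratch directory stands in for it.
set -e
SLURM_ROOT=$(mktemp -d)
VER=20.11.7

# Build step, commented out -- a typical (assumed) invocation:
#   ./configure --prefix="$SLURM_ROOT/$VER" --sysconfdir=/etc/slurm
#   make -j"$(nproc)" && make install
mkdir -p "$SLURM_ROOT/$VER"

# Repoint "current" at the new version; -n keeps ln from descending into an
# existing symlinked directory and creating a nested link.
ln -sfn "$SLURM_ROOT/$VER" "$SLURM_ROOT/current"
readlink "$SLURM_ROOT/current"   # prints the .../20.11.7 path
```

Switching versions (or rolling back) is then just re-running the `ln -sfn` with a different `$VER`.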
As I have at least one researcher that is running some long multi-day jobs, can I down the -dbd and -ctld nodes and upgrade them, then put them back online running the new (latest) release, without munging the jobs on the running worker nodes?
Thanks!

Will

Antony Cleave

May 26, 2021, 2:44 PM
to Slurm User Community List
Short answer: yes.

It's not risk-free, but as long as you increase all the timeouts to 4x your worst-case estimate, make sure you understand the upgrades section of this link, and keep it open for reference, you should be fine.
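Antony doesn't say which timeouts; as an assumption, the slurm.conf parameters most often raised before upgrades are SlurmctldTimeout and SlurmdTimeout. An illustrative fragment (values are placeholders; apply the 4x rule to your own worst-case estimate):

```
# slurm.conf fragment -- illustrative values only, not from the thread.
# Nodes are not marked DOWN while the daemons are briefly stopped, as long
# as the restart completes within these windows.
SlurmctldTimeout=600
SlurmdTimeout=600
```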

Antony

Ole Holm Nielsen

May 26, 2021, 2:49 PM
to slurm...@lists.schedmd.com
I strongly recommend reading the SchedMD presentations on the
https://slurm.schedmd.com/publications.html page, especially the "Field
Notes" documents. The latest one is "Field Notes 4: From The Frontlines
of Slurm Support" by Jason Booth, SchedMD.

We upgrade Slurm continuously while the nodes are in production mode.
There's a required order of upgrading: first slurmdbd, then slurmctld,
then slurmd nodes, and finally login nodes, see
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
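The order above might be sketched like this, as a dry-run under assumptions (systemd units named slurmdbd/slurmctld/slurmd, and the versioned-prefix layout from the first message); it is not Ole's exact procedure:

```shell
#!/bin/sh
# Dry-run sketch of the upgrade order: slurmdbd -> slurmctld -> slurmd -> login.
# run() prints each command instead of executing it while DRY_RUN=1, so the
# sequence can be previewed safely; unset DRY_RUN on a real cluster.
DRY_RUN=1
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1) slurmdbd first: it performs any database schema conversion on startup.
run systemctl stop slurmdbd
run ln -sfn /opt/slurm/20.11.7 /opt/slurm/current   # repoint the install
run systemctl start slurmdbd

# 2) slurmctld next, once slurmdbd is back up.
run systemctl stop slurmctld
run systemctl start slurmctld

# 3) slurmd on the compute nodes (e.g. fanned out with pdsh), then login nodes.
run systemctl restart slurmd
```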

The detailed upgrading commands for CentOS are in
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-on-centos-7

We don't have any problems with running jobs across upgrades, but
perhaps others can share their experiences?

/Ole

Paul Edmon

May 26, 2021, 2:58 PM
to slurm...@lists.schedmd.com
We generally pause scheduling during upgrades, out of paranoia more than
anything. What that means is that we set all our partitions to DOWN and
suspend all the jobs. Then we do the upgrade. That said, I know of
people who do it live without much trouble.

The risk is more substantial for major version upgrades than minor ones, so
if you are doing a minor version upgrade it's likely fine to do live.
For a major version I would recommend at least pausing all the jobs.
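That pause-everything sequence might look like the following dry-run sketch (the scontrol subcommands are standard Slurm, but the partition name and job id are hypothetical, and the sequence is one reading of the approach, not Paul's script):

```shell
#!/bin/sh
# Dry-run sketch of "set partitions DOWN, suspend jobs, upgrade, undo".
# run() prints each command instead of executing it while DRY_RUN=1.
DRY_RUN=1
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Stop new jobs from starting on a (hypothetical) partition "batch":
run scontrol update PartitionName=batch State=DOWN

# Suspend running jobs; 1234 stands in for real job ids, which could be
# collected with something like: squeue -h -t R -o %A
run scontrol suspend 1234

# ... perform the upgrade here ...

# Resume jobs and reopen the partition:
run scontrol resume 1234
run scontrol update PartitionName=batch State=UP
```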

-Paul Edmon-

Will Dennis

May 26, 2021, 3:11 PM
to Slurm User Community List

Yup, in our case, it would be 20.11.5 -> 20.11.7.

Will Dennis

May 26, 2021, 3:20 PM
to Ole.H....@fysik.dtu.dk, Slurm User Community List

On Wednesday, May 26, 2021 at 2:49 PM, Ole Holm Nielsen said:

> I recommend strongly to read the SchedMD presentations in the
> [snipped] page, especially the "Field notes" documents. The latest one
> is "Field Notes 4: From The Frontlines of Slurm Support", Jason Booth,
> SchedMD.

Yes, thanks for the reminder.


> We upgrade Slurm continuously while the nodes are in production mode.
> There's a required order of upgrading: first slurmdbd, then slurmctld,
> then slurmd nodes, and finally login nodes, see
> [snipped]
>
> The detailed upgrading commands for CentOS are in [snipped]

Yes, in our case it's Ubuntu; since there is no (recent) official packaging, and keeping a PPA up is a lot of work, we are just compiling from source locally now, which SchedMD (who we get support from) prefers anyhow.
