Hello,
I wondered if I could compare notes with other community members who have upgraded slurm on their cluster. We are currently running slurm v17.02 and I understand that the rpm mix/structure changed at v17.11. We are, by the way, planning to upgrade to v18.08.
I gather that I should upgrade slurmdbd first and ideally the slurmctld should not be touched/killed at that point. On our cluster we run both the slurmdbd and the slurmctld on a single host. I have been advised to just upgrade slurmdbd initially -- upgrading the rpms in a piecemeal fashion. That is, just installing the slurm and slurmdbd rpms in the first instance. Due to the rpm structure changes at v17.11 that strategy doesn't work -- the yum updates fail with dependency issues.
I guess that the only solution is to upgrade all the slurm rpms at once. That means that the slurmctld will be killed (unless it has been stopped first). Is there anyone who has done an upgrade who would be willing to share their experiences, please? In other words, is it valid to kill both the slurmdbd and slurmctld processes at the start of the upgrade and, if so, how does the loss of the slurmctld affect the cluster with regard to running jobs, user activity, etc.?
Best regards,
David
Thank you for your comments. I could potentially force the upgrade of the slurm and slurm-slurmdbd rpms using something like:
I use rpms for our installs here. I usually pause all the jobs prior to the upgrade, then I follow the guide here:
https://slurm.schedmd.com/quickstart_admin.html
I haven't done the upgrade to 18.08 yet, though, and so I haven't had to contend with the automatic restart that seems to be the case with the new rpm spec script (we went to 17.11 prior to the rpm spec reorg). Frankly I wish that they didn't do the automatic restart, as I like to manage that myself.
As Chris said though, you definitely want to do the slurmdbd upgrade from the command line. I've had it where, when just restarting the service, it times out and the database only gets partially updated. In that case I had to restore from the mysqldump I had made and try again. I also highly recommend doing mysqldumps prior to major version updates.
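For illustration, the backup-then-foreground approach might look roughly like the sketch below. It is not a tested procedure: it assumes the accounting database uses Slurm's default name slurm_acct_db, that MySQL credentials come from ~/.my.cnf, and that slurmdbd is managed by systemd. The helper names (run, upgrade_slurmdbd) and the backup path are made up for the example. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: dump the accounting database, then run slurmdbd in the
# foreground so the schema conversion cannot be cut short by a service timeout.
# Assumed (not from this thread): database name, backup path, systemd unit name.
DB_NAME="${DB_NAME:-slurm_acct_db}"
BACKUP="${BACKUP:-/root/${DB_NAME}-pre-upgrade.sql}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = actually run them

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_slurmdbd() {
    # Stop the old daemon so nothing writes to the database during the dump.
    run systemctl stop slurmdbd || return 1

    # Take the dump that can be restored if the conversion goes wrong.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ mysqldump --single-transaction $DB_NAME > $BACKUP"
    else
        mysqldump --single-transaction "$DB_NAME" > "$BACKUP" || return 1
    fi

    # (Install the new rpms here, by whatever method works at your site.)

    # -D keeps slurmdbd in the foreground, -vvv raises verbosity so the
    # conversion progress is visible. Stop it once the conversion has
    # finished, then start the service normally.
    run slurmdbd -D -vvv
}
```

Calling upgrade_slurmdbd as-is prints the three steps; setting DRY_RUN=0 would execute them, so check the printed commands against your own setup first.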
-Paul Edmon-
Thank you for your reply. You're correct, the systemd commands aren't invoked, however upgrading the slurm rpm effectively pulls the rug from under /usr/sbin/slurmctld. The v17.02 slurm rpm provides /usr/sbin/slurmctld, but from v17.11 that executable is provided by the slurm-slurmctld rpm.
In other words, doing a minimal install of just the slurm and the slurmdbd rpms deletes the slurmctld executable. I haven't explicitly tested this, however I tested the upgrade on a compute node and experimented with the slurmd -- the logic should be the same.
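One way to confirm which installed package owns a given binary before and after the upgrade is rpm -qf. The small wrapper below is only a sketch; the function name owners is hypothetical and it simply reports on whatever paths it is given.

```shell
#!/bin/sh
# Sketch: for each path given, print the installed rpm that owns it,
# or note that the file is absent (e.g. after the package split removed it).
owners() {
    for bin in "$@"; do
        if [ -e "$bin" ]; then
            printf '%s: %s\n' "$bin" "$(rpm -qf "$bin")"
        else
            printf '%s: not installed\n' "$bin"
        fi
    done
}
```

For example, running owners /usr/sbin/slurmctld before the upgrade and again afterwards would show whether the executable is still present and which package now provides it.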
I guess the question that comes to mind is: is it really a big deal if the slurmctld process is down whilst the slurmdbd is being upgraded? Bearing in mind that I will probably opt to suspend all running jobs and stop the partitions during the upgrade.
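As a rough sketch of what "suspend all running jobs and stop the partitions" could look like with the standard client commands (the function name pause_cluster is made up, and this has to run while slurmctld is still up, since scontrol talks to it):

```shell
#!/bin/sh
# Sketch: mark every partition DOWN so no new jobs start, then suspend
# each running job. Must be run while slurmctld is still responding.
pause_cluster() {
    # %R prints the partition name; sort -u guards against repeats.
    for part in $(sinfo -h -o '%R' | sort -u); do
        scontrol update PartitionName="$part" State=DOWN
    done

    # %A prints the job id; -t RUNNING limits the list to running jobs.
    for job in $(squeue -h -t RUNNING -o '%A'); do
        scontrol suspend "$job"
    done
}
```

After the upgrade the reverse would be scontrol resume for each suspended job and State=UP for each partition.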
Best regards,
David