Hi all,
I am looking for a clean way to set up Slurm's native high-availability
feature. I am managing a Slurm cluster with one control node (hosting
both slurmctld and slurmdbd), one login node, and a few dozen compute
nodes. I have a virtual machine that I want to set up as a backup
control node.
The Slurm documentation says the following about the StateSaveLocation
directory:
> The directory used should be on a low-latency local disk to prevent file system delays from affecting Slurm performance. If using a backup host, the StateSaveLocation should reside on a file system shared by the two hosts. We do not recommend using NFS to make the directory accessible to both hosts, but do recommend a shared mount that is accessible to the two controllers and allows low-latency reads and writes to the disk. If a controller comes up without access to the state information, queued and running jobs will be cancelled. [1]
My question: How do I implement the shared file system for the
StateSaveLocation?
I do not want to introduce a single point of failure by having a single
node host the StateSaveLocation, nor do I want to put that directory on
the cluster's NFS storage, since outages/downtime of the storage system
will happen at some point and I do not want them to cause an outage of
the Slurm controller.
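For reference, the relevant part of my planned slurm.conf would look
roughly like this (hostnames are placeholders, the path is just an
example):

```
# slurm.conf (excerpt) -- primary controller listed first, backup second
SlurmctldHost=ctl-primary
SlurmctldHost=ctl-backup

# Must live on a file system shared by both controllers,
# which is exactly the part I am unsure how to implement.
StateSaveLocation=/var/spool/slurm/statesave

# Seconds the backup waits before taking over from an
# unresponsive primary controller.
SlurmctldTimeout=120
```

So the Slurm side seems straightforward; my question is only about how
to provide the shared storage that StateSaveLocation requires.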
Any help or ideas would be appreciated.
Best,
Pierre
[1]
https://slurm.schedmd.com/quickstart_admin.html#Config
--
Pierre Abele, M.Sc.
HPC Administrator
Max-Planck-Institute for Evolutionary Anthropology
Department of Primate Behavior and Evolution
Deutscher Platz 6
04103 Leipzig
Room: U2.80
E-Mail:
pierre...@eva.mpg.de
Phone:
+49 (0) 341 3550 245