Hi all,
We are building out a new Slurm cluster for a research group here; unfortunately this has taken place over a long period of time, and there have been some architectural changes along the way, most importantly to the host OS on the Slurm nodes (we were using CentOS 7.x originally; now the compute nodes are on Ubuntu 16.04). Currently, we have a single controller (slurmctld) node, an accounting db node (slurmdbd), and 10 compute/worker nodes (slurmd).
The problem is that the controller is still running CentOS 7 with our older NFS-mounted /home scheme, but the compute nodes are now all Ubuntu 16.04 with local /home filesystems. Currently (still in testing mode here), the users log into the controller node to submit jobs, but of course that is now a completely different OS/library environment than on the compute nodes. (They cannot log into the compute nodes unless they have a job currently running on them, as we have implemented the ‘pam_slurm.so’ PAM module on the compute nodes.)
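(For reference, the pam_slurm setup on the compute nodes is just the stock PAM account check; assuming the Debian/Ubuntu-style sshd PAM stack, it amounts to something like the following, with the file path being an assumption for our Ubuntu nodes:)

```
# /etc/pam.d/sshd on each compute node
# Deny SSH logins unless the user has a job currently running on this node.
account    required    pam_slurm.so
```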
My questions are these:
1) Can we leave the current controller machine on C7 OS, and just have the users log into other machines (that have the same config as the compute nodes) to submit jobs? Or should the controller node really be on the same OS as the compute nodes for some reason?
2) Can I add a backup controller node that runs a different environment (i.e. like the compute node environment) than the primary controller node? Or should (must) it be the same as the primary controller node?
3) What are the steps to replace a primary controller, given that a backup controller exists? (Hopefully this is already documented somewhere that I haven’t found yet.)
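For context on question 2: my reading of the slurm.conf docs is that a backup controller is just two extra lines plus a StateSaveLocation that both controllers can reach; the hostnames and path below are placeholders, not our actual setup:

```
# slurm.conf (identical copy on both controllers and all nodes)
ControlMachine=ctl-primary              # hypothetical primary hostname
BackupController=ctl-backup             # hypothetical backup hostname
StateSaveLocation=/shared/slurm/state   # must be on storage both controllers mount
```

And for question 3, I believe `scontrol takeover` run on the backup forces it to assume control from the primary, though I’d appreciate confirmation of the full replacement procedure.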
Thanks,
Will
No cluster manager/framework in use... We custom-compiled and packaged the Slurm 16.05.4 release into .rpm/.deb files, and used those to install the different nodes.
Although the homedirs are no longer shared, the nodes do have access to shared storage, one share being mounted as a subdir of the home directory. A system we designed can symlink entries from that share up to the homedir level “auto-magically” via a conf file, so shared dotfiles, subdirs/files in the homedir, etc. are all possible.
Have not investigated a containerized Slurm setup; will have to put that on the exploration list. If the workloads were Dockerized, I’d probably run them via Kubernetes rather than Slurm...
-Will