1. Part of Slurm's communication is hierarchical. Thus nodes need to know about the other nodes so they can talk to each other and forward messages up to the slurmctld.
2. Yes, this is what we do. We have our slurm.conf shared via NFS from our Slurm master, and we just update that single conf. After that update we use salt to issue a global restart to all the slurmds and the slurmctld to pick up the new config. scontrol reconfigure is not enough when adding new nodes; you have to issue a global restart (see the sketch below).
3. It's pretty straightforward, all told. You just need to update the slurm.conf and do a restart. You need to be careful that the names you enter into the slurm.conf are resolvable by DNS, else slurmctld may barf on restart. Sadly, no built-in sanity checker exists that I am aware of aside from actually running slurmctld. We got around this by putting together a GitLab runner which screens our slurm.conf by running a synthetic slurmctld as a sanity check.
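Roughly, that workflow looks like the following (the paths, the Salt target, and the systemd unit names are placeholders; adapt them to your site):

    # after editing the NFS-shared slurm.conf on the master:
    systemctl restart slurmctld
    salt '*' cmd.run 'systemctl restart slurmd'

    # CI-style sanity check: run a throwaway slurmctld in the foreground
    # against the candidate config and watch whether it starts cleanly
    slurmctld -D -f /path/to/candidate/slurm.conf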
-Paul Edmon-
I agree that people are making updating slurm.conf out to be a
bigger issue than it really is. However, there are certain
config changes that do require restarting the daemons rather
than just doing 'scontrol reconfigure.' These options are
documented in the slurm.conf documentation (just search for
"restart").
I believe it's often only the slurmctld that needs to be
restarted, which is one daemon on one system, rather than
restarting slurmd on all the compute nodes, but there are a
few changes that require restarting all Slurm daemons. Adding
nodes to a cluster is one of them:
Changes in node configuration (e.g. adding nodes, changing their processor count, etc.) require restarting both the slurmctld daemon and the slurmd daemons. All slurmd daemons must know each node in the system to forward messages in support of hierarchical communications.
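To make the distinction concrete, a rough sketch (assuming systemd-managed daemons; check the "restart" notes in slurm.conf(5) for the specific parameter you changed):

    # most parameter changes: re-read slurm.conf on the fly
    scontrol reconfigure

    # parameters documented as needing a restart of the controller only
    systemctl restart slurmctld

    # node definition changes (adding nodes, changing CPU counts, ...)
    # additionally require restarting slurmd on every compute node,
    # e.g. via salt/pdsh as in the reply above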
But to avoid this, you can use the FUTURE state to define
"future" nodes (see the example after the quoted documentation
below):
- FUTURE
- Indicates the node is defined for future use and need not exist when the Slurm daemons are started. These nodes can be made available for use simply by updating the node state using the scontrol command rather than restarting the slurmctld daemon. After these nodes are made available, change their State in the slurm.conf file. Until these nodes are made available, they will not be seen using any Slurm commands, nor will any attempt be made to contact them.
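As a concrete (made-up) example, you might pre-define not-yet-existing nodes in slurm.conf and later bring one into service with scontrol, along the lines of:

    # slurm.conf -- names and hardware values here are placeholders
    NodeName=node[101-120] CPUs=32 RealMemory=128000 State=FUTURE

    # once node101 physically exists and its slurmd is running:
    scontrol update NodeName=node101 State=RESUME

    # then change its State in slurm.conf from FUTURE to a normal value
    # so the definition stays correct across the next slurmctld restart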
--
Prentice