[slurm-users] cluster reconfigure

Steve Brasier

Jun 9, 2020, 6:12:55 AM
to slurm...@schedmd.com
Hi all, looking for some advice on the process to follow when doing one of the reconfigurations which requires a Slurm daemon restart (as listed in the docs for "scontrol reconfigure").

In this situation, is there any difference in terms of preservation of slurm's state etc between using "scontrol shutdown" or running "service slurmd/slurmctld stop" on each node?

Is there a recommended order in which to shut down and restart daemons?

many thanks

Steve

Please note I work Tuesday to Friday.

Ole Holm Nielsen

Jun 9, 2020, 8:29:26 AM
to slurm...@lists.schedmd.com
On 6/9/20 12:12 PM, Steve Brasier wrote:
> Hi all, looking for some advice on the process to follow when doing one
> of the reconfigurations which requires a Slurm daemon restart (as listed
> in the docs for "scontrol reconfigure").

When reconfiguring slurm.conf, make sure to propagate that file to all
nodes first!

The scontrol manual page explains when a restart of the daemons (and not
just "scontrol reconfig") is required:

reconfigure
Instruct all Slurm daemons to re-read the configuration file. This
command does not restart the daemons. This mechanism would be used to
modify configuration parameters (Epilog, Prolog, SlurmctldLogFile,
SlurmdLogFile, etc.). The Slurm controller (slurmctld) forwards the
request to all other daemons (slurmd daemon on each compute node). Running
jobs continue execution. Most configuration parameters can be changed by
just running this command; however, Slurm daemons should be shut down and
restarted if any of these parameters are to be changed: AuthType,
ControlMach, PluginDir, StateSaveLocation, SlurmctldHost, SlurmctldPort,
or SlurmdPort. The slurmctld daemon and all slurmd daemons must be
restarted if nodes are added to or removed from the cluster.
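
To decide quickly whether an edit falls into the "restart required" group, one option is to compare the old and new slurm.conf for just the parameters named above. The following is only a rough sketch: the function names (conf_lines, needs_restart) are made up for illustration, the key list is copied from the man page text quoted above, and any change on a NodeName line is conservatively treated as a node being added or removed.

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm: report whether the differences
# between two slurm.conf versions touch a parameter that the scontrol man
# page excerpt above says needs a daemon restart.
# NodeName is included as a conservative stand-in for "nodes added/removed".
RESTART_KEYS='AuthType ControlMach PluginDir StateSaveLocation SlurmctldHost SlurmctldPort SlurmdPort NodeName'

# Print the lines that set key $1 in file $2 (case-insensitive match),
# with comments and surrounding whitespace stripped, in sorted order.
conf_lines() {
    sed 's/#.*//' "$2" |
        grep -i -E "^[[:space:]]*$1=" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
        sort
}

# Return 0 and print the changed keys if a restart looks necessary,
# return 1 (printing nothing) otherwise.
needs_restart() {
    old=$1
    new=$2
    changed=''
    for key in $RESTART_KEYS; do
        if [ "$(conf_lines "$key" "$old")" != "$(conf_lines "$key" "$new")" ]; then
            changed="$changed $key"
        fi
    done
    if [ -n "$changed" ]; then
        echo "restart required:$changed"
        return 0
    fi
    return 1
}

# Example (file names are placeholders):
#   needs_restart /etc/slurm/slurm.conf.bak /etc/slurm/slurm.conf \
#       || scontrol reconfigure
```

If needs_restart prints nothing, "scontrol reconfigure" should be enough according to the man page text; it does not follow Include directives, so treat its answer as a hint only.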


> In this situation, is there any difference in terms of preservation of
> slurm's state etc between using "scontrol shutdown" or running "service
> slurmd/slurmctld stop" on each node?

The slurmctld state is preserved in the server's StateSaveLocation:

# scontrol show config | grep StateSaveLocation
StateSaveLocation = /var/spool/slurmctld

It is essential not to disturb that folder! Make a backup after stopping
slurmctld, just in case...
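
A minimal sketch of such a backup, assuming a systemd-based controller (use "service slurmctld stop" instead if that is what your distribution provides). The function name backup_state and the destination directory are made up for illustration; the state directory is whatever "scontrol show config" reports on your system.

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm: archive the StateSaveLocation
# directory. Run it only after slurmctld has been stopped, so the state
# files are not being rewritten while tar reads them.
backup_state() {
    state_dir=$1    # e.g. /var/spool/slurmctld
    dest_dir=$2     # directory that receives the archive (your choice)
    archive="$dest_dir/slurmctld-state-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C makes the paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$state_dir")" "$(basename "$state_dir")" &&
        echo "$archive"
}

# Example, as root on the controller (systemctl is an assumption):
#   systemctl stop slurmctld
#   backup_state /var/spool/slurmctld /root
#   systemctl start slurmctld
```

The function prints the path of the archive it wrote, so it can be logged or copied elsewhere.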

> Is there a recommended order in which to shut down and restart daemons?

Why do you want to shutdown/restart in the first place? I think you can
restart any daemon if necessary, but you have to consider Slurm's timeout
parameters SlurmctldTimeout and SlurmdTimeout:

# scontrol show config | grep Timeout

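If a daemon is down for longer than the corresponding timeout, things will
start failing! For example, slurmctld marks a node DOWN once its slurmd has
not responded for SlurmdTimeout seconds.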
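
As a rough sketch, the planned downtime can be checked against these values before stopping anything. The function names below are made up for illustration, and the parsing assumes that "scontrol show config" prints lines of the form "SlurmdTimeout = 300 sec"; adjust it if your version formats them differently.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Slurm.

# Read "scontrol show config" output on stdin and print the numeric value
# (seconds) of the parameter named in $1.
# Assumes lines shaped like "SlurmdTimeout           = 300 sec".
timeout_seconds() {
    awk -v key="$1" '$1 == key && $2 == "=" { print $3; exit }'
}

# Succeed if a planned outage of $2 seconds is shorter than the timeout
# named in $1; fail if it is not, or if the parameter was not found.
fits_timeout() {
    limit=$(timeout_seconds "$1")
    [ -n "$limit" ] && [ "$2" -lt "$limit" ]
}

# Example: will a 60-second slurmd restart stay under SlurmdTimeout?
#   scontrol show config | fits_timeout SlurmdTimeout 60 \
#       || echo "planned downtime exceeds SlurmdTimeout"
```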

Best regards,
Ole
