[slurm-users] SLURM: reconfig

Steven Varga

May 4, 2022, 10:27:17 PM

to slurm...@lists.schedmd.com

Hello,

I am wondering what is the best way to update node changes, such as addition and removal of nodes to SLURM. The excerpts below suggest a full restart; can someone confirm this? Or perhaps `scontrol reconfigure` or `kill -s SIGHUP` does it?

best wishes: steven

// src/slurmctld/read_config.c line #2819
static int _compare_hostnames(node_record_t *old_node_table, int old_node_count,
                              node_record_t *node_table, int node_count) {
    [...]
    if (old_node_count != node_count) {
        error("%s: node count has changed before reconfiguration "
              "from %d to %d. You have to restart slurmctld.",
              __func__, old_node_count, node_count);
        return -1;
    }
    [...]
    if (xstrcmp(old_ranged, ranged) != 0) {
        error("%s: node names changed before reconfiguration. "
              "You have to restart slurmctld.", __func__);
        cc = -1;
    }
    [...]
    return cc;
}
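
For anyone who wants to check a planned change before applying it, here is a minimal Python sketch of the same decision. It mirrors only the two checks quoted above (node count and node names). The function name `needs_restart` and the plain lists of hostnames are assumptions for illustration and are not part of Slurm; comparing sorted lists stands in for the ranged-hostlist string comparison in the C code.

```python
def needs_restart(old_nodes, new_nodes):
    """Return (restart_required, reason) for a proposed node-list change.

    Mirrors the two checks in the excerpt above: a changed node count or
    changed node names mean `scontrol reconfigure` is not enough.
    Hypothetical helper, not part of Slurm.
    """
    if len(old_nodes) != len(new_nodes):
        return True, f"node count changed from {len(old_nodes)} to {len(new_nodes)}"
    # Assumption: order does not matter, only the set of names.
    if sorted(old_nodes) != sorted(new_nodes):
        return True, "node names changed"
    return False, "node set unchanged"


if __name__ == "__main__":
    print(needs_restart(["n1", "n2"], ["n1", "n2", "n3"]))
```

If the first element of the result is true, the excerpt suggests that slurmctld has to be restarted rather than reconfigured.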

Christopher Samuel

May 5, 2022, 12:27:17 AM

to slurm...@lists.schedmd.com

On 5/4/22 7:26 pm, Steven Varga wrote:

> I am wondering what is the best way to update node changes, such as
> addition and removal of nodes to SLURM. The excerpts below suggest a
> full restart, can someone confirm this?

You are correct, you need to restart slurmctld and slurmd daemons at
present. See https://slurm.schedmd.com/faq.html#add_nodes

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Steven Varga

May 5, 2022, 8:18:29 AM

to Slurm User Community List

Thank you for the quick reply! I know I am pushing my luck here: is it possible to modify Slurm (src/common/[read_conf.c, node_conf.c], src/slurmctld/[read_config.c, ...]) such that the state can be maintained dynamically? Or would it be cheaper to write a job manager with fewer features but supporting dynamic nodes from the ground up?

best wishes: steve

Tina Friedrich

May 5, 2022, 8:55:23 AM

to slurm...@lists.schedmd.com

Hi List,

out of curiosity - I would assume that if running configless, one
doesn't manually need to restart slurmd on the nodes if the config changes?

Hi Steven,

I have no idea if you want to do it every couple of minutes and what the
implications are of that (although I've certainly managed to restart them
every 5 minutes by accident with no real problems caused), but -
generally, restarting the daemons (slurmctld, slurmd) is a non-issue, as
it's a safe operation. There's no risk to running jobs or anything. I
have the config management restart them if any files change. It also
doesn't seem to matter if the restarts of the controller & the node
daemons are splayed a bit (i.e. don't happen at the same time), or what
order they happen in.
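
A config-management hook of the kind described above can be sketched in a few lines of Python. This is only an illustration: the helper names `config_digest` and `config_changed`, and the idea of keeping the last digest in a state file, are assumptions and not something any particular config-management tool provides.

```python
import hashlib
from pathlib import Path


def config_digest(paths):
    """Return one SHA-256 digest covering the names and contents of all files."""
    h = hashlib.sha256()
    for p in sorted(Path(x) for x in paths):
        h.update(p.name.encode())
        h.update(b"\0")
        h.update(p.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def config_changed(paths, state_file):
    """Compare the current digest with the one saved in state_file.

    Returns True (and records the new digest) when the files differ from
    the last run, or when no digest has been saved yet.
    """
    state = Path(state_file)
    current = config_digest(paths)
    previous = state.read_text().strip() if state.exists() else None
    if current == previous:
        return False
    state.write_text(current + "\n")
    return True
```

A caller would restart slurmctld and slurmd only when `config_changed(...)` returns true; how the restart is issued (for example through the init system) is left to the site.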

Tina

--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk

Steven Varga

May 5, 2022, 9:25:30 AM

to Slurm User Community List

Hi Tina,

Thank you for sharing. This matches my observations when I checked whether Slurm could do what I am up to: managing dynamic (spot) AWS EC2 instances.

After replacing MySQL with Redis, I now wonder what it would take to make Slurm node addition and removal dynamic. I've been looking at the source code for many months now, trying to decide whether it can be done.

I am using configless mode, 3 controllers, and 2 slurmdbd instances with a robust Redis Sentinel based backend.

Steven

Brian Andrus

May 5, 2022, 9:48:49 AM

to slurm...@lists.schedmd.com

@Tina,

Figure slurmd reads the config in once and runs with it. You would need to have it recheck regularly to see if there are any changes. This is exactly what 'scontrol reconfig' does: it tells all the Slurm nodes to recheck the config.

@Steven,

It seems to me you could just have a monitor daemon that keeps things up-to-date.
It could watch for the alert that AWS sends (2 minute warning, IIRC) and take appropriate action: drain the node and cancel or checkpoint the job.
In addition, it could keep an eye on things in the event a warning wasn't received and a node 'vanishes'. I suspect Nagios even has the hooks to make that work. You could also email the user to let them know their job was ended due to spot being pulled.
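
The core of such a monitor might look roughly like the Python sketch below. It assumes the interruption notice is the small JSON document that EC2 exposes through the instance metadata service (with `action` and `time` fields), and that draining is done with `scontrol update ... State=DRAIN`. The function names are hypothetical, and fetching the notice (including any metadata token handling) and polling are deliberately left out.

```python
import json
import shlex

# EC2 instance metadata path for spot interruption notices. It returns 404
# until a notice is issued; fetching it is left out of this sketch.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_spot_notice(body):
    """Parse the JSON body of a spot interruption notice.

    Returns (action, time), or None if the body is not a usable notice.
    """
    try:
        notice = json.loads(body)
    except (TypeError, ValueError):
        return None
    if not isinstance(notice, dict) or "action" not in notice:
        return None
    return notice["action"], notice.get("time")


def drain_command(node, notice):
    """Build the scontrol argv that drains `node` and records the reason."""
    action, when = notice
    reason = f"spot {action} at {when}"
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


if __name__ == "__main__":
    sample = '{"action": "terminate", "time": "2022-05-05T14:00:00Z"}'
    notice = parse_spot_notice(sample)
    if notice:
        # Print instead of executing, so the sketch is safe to run anywhere.
        print(shlex.join(drain_command("node001", notice)))
```

Building the command as a list and only printing it keeps the sketch side-effect free; a real daemon would hand the list to `subprocess.run` and then deal with the running job.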

Just some ideas,

Brian Andrus

Ole Holm Nielsen

May 5, 2022, 9:51:51 AM

to slurm...@lists.schedmd.com

Hi Tina,

On 5/5/22 14:54, Tina Friedrich wrote:
> Hi List,
>
> out of curiosity - I would assume that if running configless, one doesn't
> manually need to restart slurmd on the nodes if the config changes?

That is correct. Just do "scontrol reconfig" on the slurmctld server. If
all your slurmd's are truly running Configless[1], they will pick up the
new config and reconfigure without restarting.

Details are summarized in
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#reconfiguration-of-slurm-conf.
Beware that you can't add or remove nodes without restarting. Also,
changing certain slurm.conf parameters requires restarting.

/Ole

[1] https://slurm.schedmd.com/configless_slurm.html

Ward Poelmans

May 5, 2022, 9:54:31 AM

to slurm...@lists.schedmd.com

Hi Steven,

I think truly dynamic adding and removing of nodes is something that's on the roadmap for slurm 23.02?

Ward

Mark Dixon

May 5, 2022, 10:09:26 AM

to Slurm User Community List

On Thu, 5 May 2022, Ole Holm Nielsen wrote:
...

> That is correct. Just do "scontrol reconfig" on the slurmctld server. If
> all your slurmd's are truly running Configless[1], they will pick up the
> new config and reconfigure without restarting.
>
> Details are summarized in
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#reconfiguration-of-slurm-conf.
> Beware that you can't add or remove nodes without restarting. Also,
> changing certain slurm.conf parameters requires restarting.

...

However...

Given that the normal recommendation for adding/removing nodes safely is
to:

* stop slurmctld
* edit slurm.conf etc.
* restart the slurmd nodes to pick up new slurm.conf
* start slurmctld

I'm confused how this is supposed to be achieved in a configless setting,
as slurmctld isn't running to distribute the updated files to slurmd.

Best,

Mark

Ole Holm Nielsen

May 5, 2022, 10:38:08 AM

to slurm...@lists.schedmd.com

On 5/5/22 15:53, Ward Poelmans wrote:
> Hi Steven,
>
> I think truly dynamic adding and removing of nodes is something that's on
> the roadmap for slurm 23.02?

Yes, see slide 37 in https://slurm.schedmd.com/SLUG21/Roadmap.pdf from the
Slurm publications site https://slurm.schedmd.com/publications.html

/Ole

Ole Holm Nielsen

May 5, 2022, 10:44:39 AM

to slurm...@lists.schedmd.com

You're right, probably the correct order for Configless must be:

* stop slurmctld
* edit slurm.conf etc.
* start slurmctld
* restart the slurmd nodes to pick up new slurm.conf

See also slides 29-34 in
https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf from the Slurm
publications site https://slurm.schedmd.com/publications.html

Less-Safe, but usually okay, procedure:
1. Change configs
2. Restart slurmctld
3. Restart all slurmd processes really quickly

/Ole

Christopher Samuel

May 5, 2022, 3:15:42 PM

to slurm...@lists.schedmd.com

On 5/5/22 5:17 am, Steven Varga wrote:

> Thank you for the quick reply! I know I am pushing my luck here: is it
> possible to modify slurm: src/common/[read_conf.c, node_conf.c]
> src/slurmctld/[read_config.c, ...] such that the state can be maintained
> dynamically? -- or cheaper to write a job manager with fewer features but
> supporting dynamic nodes from ground up?

I had said currently, because it looks like you will be in luck with the
next release (though it sounds like it needs a little config):

From https://github.com/SchedMD/slurm/blob/master/RELEASE_NOTES:

-- Allow nodes to be dynamically added and removed from the system. Configure
   MaxNodeCount to accommodate nodes created with dynamic node registrations
   (slurmd -Z<feature> --conf="") and scontrol.
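
As a rough illustration of what a dynamic registration could look like from a provisioning script, the Python sketch below builds a `slurmd -Z` command line. Only `-Z`, `--conf` and `MaxNodeCount` come from the release note above. The assumption that `--conf` carries node parameters such as `Feature=...`, the helper name `dynamic_slurmd_argv`, and the example values are mine and should be checked against the slurmd documentation for the release in use.

```python
def dynamic_slurmd_argv(features=(), extra_conf=()):
    """Build a `slurmd -Z` command line for a self-registering node.

    `features` and `extra_conf` are hypothetical inputs; they are joined
    into the single string passed to --conf. The controller side is
    assumed to have MaxNodeCount set in slurm.conf.
    """
    params = list(extra_conf)
    if features:
        params.append("Feature=" + ",".join(features))
    argv = ["slurmd", "-Z"]
    if params:
        argv += ["--conf", " ".join(params)]
    return argv


if __name__ == "__main__":
    # "spot" is an example feature name, not something Slurm defines.
    print(dynamic_slurmd_argv(features=["spot"]))
```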

Christopher Samuel

May 5, 2022, 5:11:00 PM

to slurm...@lists.schedmd.com

On 5/5/22 7:08 am, Mark Dixon wrote:

> I'm confused how this is supposed to be achieved in a configless
> setting, as slurmctld isn't running to distribute the updated files to
> slurmd.

That's exactly what happens with configless mode, slurmd's retrieve
their config from the slurmctld, and will grab it again on an "scontrol
reconfigure". There's no reason to stop slurmctld for this.

So your slurm.conf should only exist on the slurmctld node - this is how
we operate on our latest system.

All the best,
Chris
--

Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Mark Dixon

May 6, 2022, 4:27:23 AM

to Slurm User Community List

On Thu, 5 May 2022, Ole Holm Nielsen wrote:
...

> You're right, probably the correct order for Configless must be:
>
> * stop slurmctld
> * edit slurm.conf etc.
> * start slurmctld
> * restart the slurmd nodes to pick up new slurm.conf
>
> See also slides 29-34 in
> https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf from the Slurm
> publications site https://slurm.schedmd.com/publications.html
>
> Less-Safe, but usually okay, procedure:
> 1. Change configs
> 2. Restart slurmctld
> 3. Restart all slurmd processes really quickly

Sure, I'd seen that: but it's not exactly ideal, is it?

Roll on dynamic node adding and removing...

Mark

Ole Holm Nielsen

May 6, 2022, 4:40:15 AM

to slurm...@lists.schedmd.com

Not ideal, but there are Slurm design reasons (node bitmaps) that
necessitate restarting slurmd's when nodes are added or removed.
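
To see why position-indexed bitmaps make this awkward, consider the toy Python model below. It is a simplified illustration and not Slurm's actual bitstring code: a bitmap is just an integer whose bit i refers to entry i of the node table, and the helper names are made up.

```python
def node_bitmap(node_table, allocated):
    """Return an int whose bit i is set if node_table[i] is allocated."""
    bits = 0
    for i, name in enumerate(node_table):
        if name in allocated:
            bits |= 1 << i
    return bits


def decode(node_table, bits):
    """Map a bitmap back to node names using the given node table."""
    return [name for i, name in enumerate(node_table) if (bits >> i) & 1]


if __name__ == "__main__":
    old_table = ["n1", "n3", "n4"]
    job_bits = node_bitmap(old_table, {"n3", "n4"})

    # A daemon still holding the old table decodes the bitmap correctly.
    print(decode(old_table, job_bits))

    # After n2 is added, every later index shifts by one, so a daemon using
    # the new table reads the same bits as different nodes.
    new_table = ["n1", "n2", "n3", "n4"]
    print(decode(new_table, job_bits))
```

In this model, any daemon that keeps the old table while others use the new one will disagree about which nodes a bitmap refers to, which is one way to read the requirement that all daemons restart together.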