One of the simple ways I have dealt with different configs is to symlink /etc/slurm/slurm.conf to the appropriate file (e.g. slurm-dev.conf or slurm-prod.conf).
In fact, I use the symlink for dev and nothing (configless) for prod. Then I can switch a running node between dev and prod by merely creating or deleting the symlink and restarting slurmd.
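For what it's worth, a rough sketch of that switch in Python (the paths, the dev config name, the script name, and the systemd unit are assumptions for my setup; adjust for yours):

    #!/usr/bin/env python3
    # Toggle a node between a dev slurm.conf (via symlink) and configless prod.
    # Assumes /etc/slurm/slurm-dev.conf exists and slurmd runs under systemd.
    import subprocess
    import sys
    from pathlib import Path

    CONF = Path("/etc/slurm/slurm.conf")
    DEV_CONF = Path("/etc/slurm/slurm-dev.conf")

    def main(mode: str) -> None:
        if mode == "dev":
            if CONF.is_symlink() or CONF.exists():
                CONF.unlink()
            CONF.symlink_to(DEV_CONF)   # point slurmd at the dev config
        elif mode == "prod":
            if CONF.is_symlink():
                CONF.unlink()           # no slurm.conf -> configless prod
        else:
            sys.exit("usage: switch_conf.py dev|prod")
        subprocess.run(["systemctl", "restart", "slurmd"], check=True)

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else "")

Run it as "switch_conf.py dev" or "switch_conf.py prod" on the node you want to move.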
Just an option that may work for you.
I also use separate repos for prod/dev when I am working on packages/testing. I prefer that separation so someone doesn't accidentally update to a package that is not production-ready.
Brian Andrus
The symlink method for slurm.conf is what we do as well. We have an NFS mount from the Slurm master that hosts the slurm.conf, and each node symlinks its slurm.conf to that NFS share.
-Paul Edmon-
Run a secondary controller.
Do 'scontrol takeover' before any changes, make your changes and restart slurmctld on the primary.
If it fails, no harm/no foul, because the secondary is still running happily. If it succeeds, it takes control back and you can then restart the secondary with the new (known good) config.
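Roughly, the order of operations looks like this (only a sketch, not a drop-in script – the "primary"/"backup" SSH aliases, the deployment step, and the health check are placeholders):

    #!/usr/bin/env python3
    # Sketch of the takeover workflow: the backup takes control, the primary is
    # restarted with the new config, and the backup is only restarted once the
    # primary is confirmed healthy. 'primary'/'backup' are placeholder SSH aliases.
    import subprocess

    def run(host: str, *cmd: str) -> None:
        subprocess.run(["ssh", host, *cmd], check=True)

    # 1. Hand control to the backup controller before touching anything.
    run("backup", "scontrol", "takeover")

    # 2. Deploy the new slurm.conf on the primary (site-specific, not shown),
    #    then restart it so it tries to take control back.
    run("primary", "systemctl", "restart", "slurmctld")

    # 3. Confirm the controllers respond (in practice, eyeball the output too),
    #    then restart the backup so it picks up the new (known good) config.
    subprocess.run(["scontrol", "ping"], check=True)
    run("backup", "systemctl", "restart", "slurmctld")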
Brian Andrus
Hi Rob,
Slurm doesn’t have a “validate” parameter, hence one must know ahead of time whether the configuration will work or not.
In answer to your question – yes – on our site the Slurm configuration is altered outside of a maintenance window.
Depending upon the potential impact of the change, it will either be made silently (no announcement) or users are notified on Slack that there may be a brief outage.
Slurm is quite resilient – if slurmctld is down, new jobs cannot be launched and user commands will fail, but all existing jobs will keep running.
Our users are quite tolerant as well – letting them know when a potential change may impact their overall experience of the cluster seems to be appreciated.
On our site the configuration files are not changed directly; rather, a template engine is used – our Slurm configuration data is in YAML files, which are then validated and processed to generate slurm.conf / nodes.conf / partitions.conf / topology.conf.
This provides some surety that adding or removing nodes etc. won’t result in an inadvertent configuration issue.
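As a toy illustration of the idea (not our actual tooling – the YAML layout and the jinja2 template here are invented for the example):

    #!/usr/bin/env python3
    # Illustrative only: render a nodes.conf fragment from YAML node definitions.
    import yaml                    # pip install pyyaml
    from jinja2 import Template    # pip install jinja2

    TEMPLATE = Template(
        "{% for n in nodes %}"
        "NodeName={{ n.name }} CPUs={{ n.cpus }} RealMemory={{ n.memory }}\n"
        "{% endfor %}"
    )

    with open("nodes.yaml") as fh:
        data = yaml.safe_load(fh)

    with open("nodes.conf", "w") as fh:
        fh.write(TEMPLATE.render(nodes=data["nodes"]))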
We have three clusters (one production, and two test) – all are managed the same way.
Finally, using configuration templating it’s possible to spin up new clusters quite quickly . . . The longest time is spent picking a new cluster name.
-Greg
On 17/01/2023, 23:42, "slurm-users" <slurm-use...@lists.schedmd.com> wrote:
So, you have two equal sized clusters, one for test and one for production? Our test cluster is a small handful of machines compared to our production.
We have a test slurm control node on a test cluster with a test slurmdbd host and test nodes, all named specifically for test. We don't want a situation where our "test" slurm controller node is named the same as our "prod" slurm controller node, because the possibility of mistake is too great. ("I THOUGHT I was on the test network....")
Here's the ultimate question I'm trying to get answered.... Does anyone update their slurm.conf file on production outside of an outage? If so, how do you KNOW the slurmctld won't barf on some problem in the file you didn't see (even a mistaken character in there would do it)? We're trying to move to a model where we don't have downtimes as often, so I need to determine a reliable way to continue to add features to slurm without having to wait for the next outage. There's no way I know of to prove the slurm.conf file is good, except by feeding it to slurmctld and crossing my fingers.
Rob
Hi Rob,
> Are you just creating those files and then including them in slurm.conf?
Yes.
We’re using Puppet, but you could get the same results using jinja2.
The workflow we use is a little convoluted – the original YAML files are validated, then JSON-formatted data is written to intermediate files.
The schema of the JSON files is rather verbose to match the capability of the template engine (this simplifies the template definition significantly).
When Puppet runs, it loads the JSON and then renders the Slurm files.
(The same YAML files are used to configure Warewulf / dnsmasq (DHCP) / bind / iPXE . . .)
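A toy version of that YAML-to-JSON expansion step might look like the following (the input layout and the expanded shape are invented for illustration; the real schema is site-specific):

    #!/usr/bin/env python3
    # Toy version of the intermediate step: compact YAML node groups are expanded
    # into a verbose JSON document so the template only loops over flat records.
    # The input layout and output shape here are invented for illustration.
    import json
    import yaml

    with open("cluster.yaml") as fh:
        compact = yaml.safe_load(fh)

    expanded = []
    for group in compact["node_groups"]:
        # One explicit record per node keeps the template trivially simple.
        for i in range(group["first"], group["last"] + 1):
            expanded.append({
                "name": f"{group['prefix']}{i:03d}",
                "cpus": group["cpus"],
                "memory": group["memory"],
            })

    with open("nodes.json", "w") as fh:
        json.dump({"nodes": expanded}, fh, indent=2)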
Finally, it’s worth mentioning that the YAML files are managed by git, with a GitLab runner completing the validation phase before any files are published to production.
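For example, a runner job could call a small check script along these lines (the file name and required keys are assumptions; the point is just to exit non-zero so the pipeline blocks publishing):

    #!/usr/bin/env python3
    # Minimal pre-publish check a CI job could run: the YAML must parse, every
    # node needs the keys the templates expect, and node names must be unique.
    # File name and required keys are assumptions for the example.
    import sys
    import yaml

    REQUIRED = ("name", "cpus", "memory")

    def main() -> int:
        with open("nodes.yaml") as fh:
            nodes = yaml.safe_load(fh).get("nodes", [])
        errors = []
        for n in nodes:
            for k in REQUIRED:
                if k not in n:
                    errors.append(f"node {n.get('name', n)}: missing '{k}'")
        names = [n.get("name") for n in nodes]
        if len(names) != len(set(names)):
            errors.append("duplicate node names found")
        for err in errors:
            print(err, file=sys.stderr)
        return 1 if errors else 0

    if __name__ == "__main__":
        sys.exit(main())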
-greg