[slurm-users] Maintaining slurm config files for test and production clusters


Groner, Rob

Jan 4, 2023, 12:23:26 PM
to slurm...@schedmd.com
We currently have a test cluster and a production cluster, both on the same network.  We try things out on the test cluster, and then roll those changes out to the production cluster.  We're doing that through two different repos, but we'd like to have a single repo to make the transition from testing configs to publishing them more seamless.  The problem, of course, is that the test and production clusters have different cluster names, as well as different nodes within them.

Using the Include directive, I can pull all of the NodeName lines out of slurm.conf and put them into %c-nodes.conf files, one for production and one for test (rough sketch below).  That still leaves me with two problems:
  • The ClusterName itself will still be a problem.  I WANT the same slurm.conf file for test and production...but the ClusterName line will differ between them.  Can I use an environment variable in that value, so that production and test can each set it differently?
  • The gres.conf file.  I tried the same Include trick that works in slurm.conf, but it failed because it did not know what the ClusterName was.  I think that means either that it doesn't work for anything other than slurm.conf, or that ClusterName will have to be defined in gres.conf as well?
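Roughly what I have in mind (file names and node definitions here are purely illustrative):

  # slurm.conf, intended to be identical on both clusters
  ClusterName=prod              # <-- the line I want to avoid hard-coding
  Include %c-nodes.conf         # %c expands to the ClusterName

  # prod-nodes.conf
  NodeName=prod[001-100] CPUs=48 RealMemory=192000 State=UNKNOWN

  # test-nodes.conf
  NodeName=test[01-04] CPUs=8 RealMemory=32000 State=UNKNOWN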
Any other suggestions for how to keep our Slurm files in a single source control repo, but still have the flexibility to have them run elegantly on either the test or production systems?

Thanks.

Brian Andrus

Jan 4, 2023, 1:53:41 PM
to slurm...@lists.schedmd.com

One of the simple ways I have dealt with different configs is to symlink /etc/slurm/slurm.conf to the appropriate file (e.g., slurm-dev.conf or slurm-prod.conf).


In fact, I use the symlink for dev and nothing at all (configless) for prod. Then I can switch a running node between dev and prod merely by creating or deleting the symlink and restarting slurmd.
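A rough sketch of the toggle (paths assume the usual /etc/slurm layout, and configless assumes slurmd already knows the prod controller via --conf-server or a DNS SRV record):

  # switch a node to dev: point slurm.conf at the dev file and restart
  ln -sf /etc/slurm/slurm-dev.conf /etc/slurm/slurm.conf
  systemctl restart slurmd

  # switch it back to prod (configless): drop the symlink so slurmd
  # pulls its config from the prod slurmctld, then restart
  rm -f /etc/slurm/slurm.conf
  systemctl restart slurmd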


Just an option that may work for you.

I also use separate repos for prod/dev when I am working on packages/testing. I rather prefer that separation so I don't have someone accidentally update to a package that is not production-ready.


Brian Andrus

Fulcomer, Samuel

Jan 4, 2023, 1:54:49 PM
to Slurm User Community List
Just make the cluster names the same, with different NodeName and PartitionName lines. The rest of slurm.conf can be the same. Having two cluster names is only necessary if you're running production in a multi-cluster configuration.
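For example (node and partition names purely illustrative):

  # identical in both slurm.conf files
  ClusterName=cluster

  # production only
  NodeName=prod[001-100] ...
  PartitionName=batch Nodes=prod[001-100] Default=YES

  # test only
  NodeName=test[01-04] ...
  PartitionName=batch Nodes=test[01-04] Default=YES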

Our model has been to have a production cluster and a test cluster which becomes the production cluster at yearly upgrade time (for us, next week). The test cluster is also used for rebuilding MPI prior to the upgrade, when the PMI changes. We force users to resubmit jobs at upgrade time (after the maintenance reservation) to ensure that MPI runs correctly.


Paul Edmon

Jan 4, 2023, 1:59:18 PM
to slurm...@lists.schedmd.com

The symlink method for slurm.conf is what we do as well. We host slurm.conf on an NFS share exported from the slurm master, and symlink slurm.conf on each node to that share.


-Paul Edmon-

Fulcomer, Samuel

Jan 4, 2023, 2:00:56 PM
to Slurm User Community List
...and... using the same cluster name is important in our scenario for the seamless slurmdbd upgrade transition.

In thinking about it a bit more, I'm not sure I'd want to fold together production and test/dev configs in the same revision control tree. We keep them separate. There's no reason to baroquify it.

Groner, Rob

Jan 17, 2023, 3:37:22 PM
to Slurm User Community List
So, you have two equal-sized clusters, one for test and one for production?  Our test cluster is a small handful of machines compared to our production cluster.

We have a test slurm control node on a test cluster with a test slurmdbd host and test nodes, all named specifically for test.  We don't want a situation where our "test" slurm controller node is named the same as our "prod" slurm controller node, because the possibility of mistake is too great.  ("I THOUGHT I was on the test network....")

Here's the ultimate question I'm trying to get answered....  Does anyone update their slurm.conf file on production outside of an outage?  If so, how do you KNOW the slurmctld won't barf on some problem in the file you didn't see (even a mistaken character in there would do it)?  We're trying to move to a model where we don't have downtimes as often, so I need to determine a reliable way to continue to add features to slurm without having to wait for the next outage.  There's no way I know of to prove the slurm.conf file is good, except by feeding it to slurmctld and crossing my fingers.

Rob


From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Fulcomer, Samuel <samuel_...@brown.edu>
Sent: Wednesday, January 4, 2023 1:54 PM
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Maintaining slurm config files for test and production clusters

Brian Andrus

Jan 17, 2023, 5:55:01 PM
to slurm...@lists.schedmd.com


Run a secondary controller.

Do 'scontrol takeover' before any changes, make your changes and restart slurmctld on the primary.

If it fails, no harm/no foul, because the secondary is still running happily. If it succeeds, it takes control back and you can then restart the secondary with the new (known good) config.
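Roughly (assuming slurm.conf already lists both controllers as SlurmctldHost entries):

  # on the backup controller, take control before touching anything
  scontrol takeover

  # edit/distribute the new slurm.conf, then on the primary:
  systemctl restart slurmctld

  # if the primary starts clean it reclaims control; then restart the
  # backup so it also picks up the known-good config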


Brian Andrus

Greg Wickham

Jan 18, 2023, 1:38:45 AM
to Slurm User Community List

Hi Rob,

Slurm doesn’t have a “validate” parameter, so one must know ahead of time whether the configuration will work or not.

In answer to your question – yes – on our site the Slurm configuration is altered outside of a maintenance window.

Depending upon the potential impact of the change, it will either be made silently (no announcement) or users are notified on Slack that there may be a brief outage.

Slurm is quite resilient – if slurmctld is down, new jobs will not launch and user commands will fail, but all existing jobs will keep running.

Our users are quite tolerant as well – letting them know when a potential change may impact their overall experience of the cluster seems to be appreciated.

On our site the configuration files are not changed directly; rather, a template engine is used – our Slurm configuration data is in YAML files, which are validated and processed to generate slurm.conf / nodes.conf / partitions.conf / topology.conf.

This provides some assurance that adding or removing nodes, etc., won't result in an inadvertent configuration issue.

We have three clusters (one production, and two test) – all are managed the same way.

Finally, using configuration templating it’s possible to spin up new clusters quite quickly . . . The longest time is spent picking a new cluster name.

   -Greg

Groner, Rob

Jan 18, 2023, 9:11:35 AM
to slurm...@lists.schedmd.com
This sounds like a great idea.  My org has been strangely resistant to setting up HA for Slurm; this might be a good enough reason.  Thanks.

Rob

From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Brian Andrus <toom...@gmail.com>
Sent: Tuesday, January 17, 2023 5:54 PM
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>

Groner, Rob

Jan 18, 2023, 9:20:22 AM
to Slurm User Community List
Generating the *.conf files from parseable/testable sources is an interesting idea.  You mention nodes.conf and partitions.conf.  I can't find any documentation on those.  Are you just creating those files and then including them in slurm.conf?

Rob


From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Greg Wickham <greg.w...@kaust.edu.sa>
Sent: Wednesday, January 18, 2023 1:38 AM

To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Maintaining slurm config files for test and production clusters
 

Greg Wickham

Jan 18, 2023, 12:14:52 PM
to Slurm User Community List

Hi Rob,


> Are you just creating those files and then including them in slurm.conf?


Yes.

We’re using puppet, but you could get the same results using jinja2.


The workflow we use is a little convoluted – the original YAML files are validated, then JSON-formatted data is written to intermediate files.

The schema of the JSON-formatted files is rather verbose to match the capability of the template engine (this simplifies the template definition significantly).

When puppet runs, it loads the JSON and then renders the Slurm files.
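As a rough illustration only (not our actual templates), a nodes.conf template in that style might look like:

  {# nodes.conf.j2 -- rendered against per-cluster data, so test and prod
     share the template and differ only in their data files #}
  {% for node in nodes %}
  NodeName={{ node.name }} CPUs={{ node.cpus }} RealMemory={{ node.mem }} State=UNKNOWN
  {% endfor %}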

(The same YAML files are used to configure warewulf/ dnsmasq (DHCP) / bind / iPXE . . .)


Finally, it’s worth mentioning that the YAML files are managed in git, with a GitLab runner completing the validation phase before any files are published to production.

   -greg
