[slurm-users] Slurm Multi-cluster implementation

navin srivastava

Oct 28, 2021, 5:55:28 AM
to Slurm User Community List
Hi ,

I am looking for a step-by-step guide to setting up a multi-cluster implementation.
We want to set up 3 clusters and one login node, and run jobs using the -M cluster option.
Does anybody have such a setup and can share some insight into how it works, and whether it is really a stable solution?


Regards
Navin.

Tina Friedrich

Oct 28, 2021, 9:33:47 AM
to slurm...@lists.schedmd.com
Hi Navin,

well, I have two clusters & login nodes that allow access to both. Will
that do? I don't think a third would make any difference to the setup.

They need to share a database. As long as they share a database, the
clusters have 'knowledge' of each other.

So if you set up one database server (running slurmdbd), and then a
SLURM controller for each cluster (running slurmctld) using that one
central database, the '-M' option should work.
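
For example, once both controllers register with that one database, you
should be able to do things like this from a login node (cluster names
made up, but these are just the standard client tools):

sacctmgr show cluster format=Cluster,ControlHost,ControlPort
sbatch -M cluster2 job.sh
squeue -M cluster1,cluster2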

Tina
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk

navin srivastava

Oct 28, 2021, 10:00:00 AM
to Slurm User Community List
Thank you Tina.

so if I understood correctly: the database is global to both clusters - is it running on the login node?
or is the database running on one of the master nodes and shared with the other master node?

but as far as I have read, the slurm database can also be separate on each master, just using the parameter AccountingStorageExternalHost so that both databases are aware of each other.

Also, which slurmctld does the slurm.conf file on the login node point to?
is it possible to share a sample slurm.conf file of a login node?

Regards
Navin.

Tina Friedrich

Oct 28, 2021, 12:29:10 PM
to slurm...@lists.schedmd.com
Hello,

I have the database on a separate server (it runs the database and the
database only). The login nodes run nothing SLURM-related; they simply
have the binaries installed & a SLURM config.

I've never looked into having multiple databases & using
AccountingStorageExternalHost (in fact I'd forgotten you could do that),
so I can't comment on that (maybe someone else can); I think it works,
yes, but as I said, I've never tested it (didn't see much point in
running multiple databases if one would do the job).
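
(For reference, from the slurm.conf man page I believe that variant
means each cluster keeps its own slurmdbd and lists the other in its
config, something like this - untested, hostname made up:

AccountingStorageExternalHost=cluster2-dbd:6819

in cluster1's slurm.conf, and the mirror image in cluster2's.)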

I actually have specific login nodes for both of my clusters, to make it
easier for users (especially those without much experience using the
HPC environment); so I have one login node connecting to cluster 1 and
one connecting to cluster 2.

I think the relevant bits of slurm.conf on the login nodes (if I'm not
mistaken) - i.e. the differences between the two clusters' config files
that haven't got to do with topology & nodes & scheduler tuning - are:

ClusterName=cluster1
ControlMachine=cluster1-slurm
ControlAddr=/IP_OF_SLURM_CONTROLLER/

ClusterName=cluster2
ControlMachine=cluster2-slurm
ControlAddr=/IP_OF_SLURM_CONTROLLER/

(where IP_OF_SLURM_CONTROLLER is the IP address of host cluster1-slurm,
same for cluster2)

And then they have common entries for the accounting storage:

AccountingStorageHost=slurm-db-prod
AccountingStorageBackupHost=slurm-db-prod
AccountingStoragePort=7030
AccountingStorageType=accounting_storage/slurmdbd

(slurm-db-prod is simply the hostname of the SLURM database server)
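
So a minimal login-node slurm.conf for cluster1 would be roughly those
two pieces combined (a sketch, not my literal file - the IP is made up,
and a real file also carries the node/partition definitions matching
the cluster):

ClusterName=cluster1
ControlMachine=cluster1-slurm
ControlAddr=192.0.2.10
AccountingStorageHost=slurm-db-prod
AccountingStorageBackupHost=slurm-db-prod
AccountingStoragePort=7030
AccountingStorageType=accounting_storage/slurmdbd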

Does that help?

Tina

navin srivastava

Oct 28, 2021, 12:35:35 PM
to Slurm User Community List
Thank you Tina.
That will really help.

Regards 
Navin 

Yair Yarom

Oct 31, 2021, 4:09:23 AM
to Slurm User Community List
Hi,

If it helps, this is our setup:
6 clusters (actually a bit more)
1 mysql + slurmdbd on the same host
6 primary slurmctlds on 3 hosts (need to make sure each has a distinct SlurmctldPort - see the sketch below)
6 secondary slurmctlds on an arbitrary node within each cluster
1 login node per cluster (a very small VM, where users are limited both in cpu time (with ulimit) and memory (with systemd))
The slurm.conf's are shared over nfs to everyone at /path/to/nfs/<cluster name>/slurm.conf, with a symlink from /etc to the relevant cluster's file on each node.
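
For the ports and symlinks, roughly (hostnames, paths and port numbers
made up):

# /path/to/nfs/clusterA/slurm.conf
ClusterName=clusterA
SlurmctldHost=ctl-host-1
SlurmctldPort=6817

# /path/to/nfs/clusterB/slurm.conf (same controller host, different port)
ClusterName=clusterB
SlurmctldHost=ctl-host-1
SlurmctldPort=6827

# on each node of clusterA:
ln -s /path/to/nfs/clusterA/slurm.conf /etc/slurm/slurm.conf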

The -M generally works; we can submit/query jobs from a login node of one cluster to another. But there's a caveat to notice when upgrading: slurmdbd must be upgraded first, but usually we have a not-so-small gap between upgrading the different clusters. This causes -M to stop working, because binaries of one version won't work against the other (I don't remember in which direction).
We solved this by using an lmod module per cluster, which sets both the SLURM_CONF environment variable and the PATH to the correct slurm binaries (which we install under /usr/local/slurm/<version>/ so that they co-exist). So when -M won't work, users can use:
module load slurm/clusterA
squeue
module load slurm/clusterB
squeue
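
Each module effectively just does the equivalent of (version and paths
made up):

export SLURM_CONF=/path/to/nfs/clusterA/slurm.conf
export PATH=/usr/local/slurm/21.08.2/bin:$PATH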

BR,

--
  /|       |
  \/       | Yair Yarom | System Group (DevOps)
  []       | The Rachel and Selim Benin School
  [] /\    | of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //    \  | ir...@cs.huji.ac.il
 //        |

Brian Andrus

Oct 31, 2021, 12:31:07 PM
to slurm...@lists.schedmd.com

That is interesting to me.

How do you use ulimit and systemd to limit user usage on the login nodes? This sounds like something very useful.

Brian Andrus

Yair Yarom

Nov 1, 2021, 6:36:25 AM
to Slurm User Community List

CPU limiting via ulimit is pretty straightforward with pam_limits and /etc/security/limits.conf. On some of the login nodes we have a cpu limit of 10 minutes, so heavy processes will fail.
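
The limits.conf entry for that looks something like this (group name
made up; the cpu item is in minutes):

@login-users    hard    cpu    10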

The memory was a bit more complicated (i.e. not pretty). We wanted a user not to be able to use more than e.g. 1G for all their processes combined. Using systemd, we added the file /etc/systemd/system/user-.slice.d/20-memory.conf, which contains:
[Slice]
MemoryLimit=1024M
MemoryAccounting=true
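
(A systemctl daemon-reload is needed for the drop-in to take effect
for new sessions; on cgroup v2, I believe the directive would be
MemoryMax= rather than MemoryLimit=.)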

But we also wanted to restrict swap usage, and we're still on cgroupv1, so systemd didn't help there. The ugly part comes with a pam_exec call to a script that updates the memsw limit of the cgroup for the above slice. The script does more things, but the swap section is more or less:

if [ "x$PAM_TYPE" = 'xopen_session' ]; then
    _id=`id -u $PAM_USER`
    if [ -z "$_id" ]; then
        exit 1
    fi
    if [[ -e /sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes ]]; then
        swap=$((1126 * 1024 * 1024))
        echo $swap > /sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes
    fi
fi
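
(It's hooked in with pam_exec, via something like this line in
/etc/pam.d/sshd - the script path here is made up:

session    optional    pam_exec.so /usr/local/sbin/user-memsw-limit.sh
)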

Tina Friedrich

Nov 3, 2021, 1:32:03 PM
to slurm...@lists.schedmd.com
Thank you for that - I'm restricting things via limits.conf on the login
nodes at the moment, but have been considering using cgroups instead for
a while. So this is very useful :)

If we're sharing details, our setup currently is:

2x2 cluster not-quite-federations - prod and dev, each with a
'capability' and a 'throughput' cluster.

The 'capability' system is homogeneous, and all nodes have low latency
interconnect; it prefers big jobs. The 'throughput' system is more
heterogeneous (it's where all the GPUs live, as well as some CPU only
nodes); it prefers small jobs.

(dev system has the same cluster config, but not the same resources,
obviously - it's for configuration testing, mostly, and for testing
upgrades)

Both 'federations' have one database server (a server running mariadb
and slurmdbd) - so there's a 'prod' and a 'dev' database server.

All clusters have the same partitions & users & projects etc.

Each cluster has a dedicated host running slurmctld.

SLURM is installed locally - I build RPMs.
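
(Built the usual way, straight from the release tarball, e.g. - the
version here is just an example:

rpmbuild -ta slurm-21.08.2.tar.bz2
)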

2 login nodes per prod cluster (only one overall for dev).

Apart from the actual worker nodes, all of these are VMs. I have no
secondary slurmctlds or anything, as I'm more or less relying on the
VMware cluster to handle this. (And we've done live migrations on the
VMware end without problems.) The login nodes get rebooted regularly
for security updates, so we double those up to ensure people can
always log in.

Login nodes are not (!) on the same OS release as cluster nodes - they
run the latest for security reasons. Software building etc happens on
'interactive' nodes (...a partition that oversubscribes by default).

Nodes mount application shares by hardware architecture (skylake,
broadwell, ...) - using autofs variables to pick up the correct share,
so applications have the same path on all nodes, but what's mounted is
the correct build for the local architecture. (Using Easybuild & lmod
for applications.)
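
The autofs side is roughly this (map name, server and variable made up):

# /etc/auto.master on a skylake node:
/apps   /etc/auto.apps   -DHWARCH=skylake

# /etc/auto.apps:
software   nfs-server:/export/software/${HWARCH}

so /apps/software resolves to the skylake build on skylake nodes, the
broadwell build on broadwell nodes, and so on.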

Updates always need to be database first, of course (don't they always);
however, I can't quite confirm the binaries not working after upgrading
the database - we ran with a 20.11 database server & 20.02 everything
else for a fair while (the change in MPI behaviour had us downgrade
everything apart from the database), so I've only ever seen '-M'
(and all accounting) not work during DB restarts.

They were meant to actually be federated, but it confused our users, so
I broke the federation again (but left the rest of the setup in place).

Tina