Thank you for that - I'm restricting things via limits.conf on the login
nodes at the moment, but have been considering using cgroups instead for
a while. So this is very useful :)
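In case a comparison is useful: on a systemd-based distro, the cgroup route for login nodes can be as simple as a drop-in that applies to every per-user slice. A minimal sketch (all the limit values here are made up, adjust to taste):

```
# /etc/systemd/system/user-.slice.d/50-login-limits.conf
# Applies to every user-<UID>.slice on the login node (systemd >= 239).
[Slice]
CPUQuota=200%      # at most two cores' worth of CPU per user
MemoryMax=8G       # hard memory cap per user
TasksMax=256       # cap on processes/threads per user
```

Unlike limits.conf, these are per-user rather than per-process caps, which is usually what you actually want on a shared login node.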
If we're sharing details, our setup currently is:
2x2 not-quite-federations - prod and dev, each with a 'capability' and
a 'throughput' cluster.
The 'capability' system is homogeneous, and all nodes have a low-latency
interconnect; it prefers big jobs. The 'throughput' system is more
heterogeneous (it's where all the GPUs live, as well as some CPU-only
nodes); it prefers small jobs.
(The dev system has the same cluster config, but not the same resources,
obviously - it's mostly for configuration testing and for testing
upgrades.)
Both 'federations' have one database server (a server running MariaDB
and slurmdbd) - so there's a 'prod' and a 'dev' database server.
All clusters have the same partitions & users & projects etc.
Each cluster has a dedicated host running slurmctld.
SLURM is installed locally - I build RPMs.
2 login nodes per prod cluster (only one overall for dev).
Apart from the actual worker nodes, all of these are VMs. I have no
secondary slurmctlds or anything, as I'm more or less relying on the
VMware cluster to handle failover. (And we've done live migrations on
the VMware end without problems.) The login nodes are doubled up
because they get rebooted regularly for security updates - that way
people can always log in.
Login nodes are not (!) on the same OS release as cluster nodes - they
run the latest for security reasons. Software building etc happens on
'interactive' nodes (...a partition that oversubscribes by default).
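(For anyone curious, an oversubscribing interactive partition is just a slurm.conf line along these lines - the node names and limits here are placeholders, not our real config:)

```
# slurm.conf fragment: let interactive jobs share nodes by default
PartitionName=interactive Nodes=int[01-04] OverSubscribe=FORCE:4 DefMemPerCPU=2048 MaxTime=12:00:00
```

FORCE:4 makes sharing unconditional, with up to 4 jobs per resource - sensible for editing/compiling, not for benchmarking.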
Nodes mount application shares by hardware architecture (skylake,
broadwell, ...) - using autofs variables to pick up the correct share,
so applications have the same path on all nodes, but what's mounted is
the correct build for the local architecture. (Using EasyBuild & Lmod
for applications.)
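The autofs side of that is roughly the following sketch - the map name, NFS server, and the HWARCH variable are illustrative; the variable would be set per node, e.g. via a `-DHWARCH=skylake` option to automount:

```
# /etc/auto.master
/apps  /etc/auto.apps

# /etc/auto.apps - a wildcard key plus a per-node variable means
# /apps/<name> resolves to the build for the local architecture
*  -fstype=nfs,ro  nfs-server:/export/apps-${HWARCH}/&
```

Same path everywhere, different bits underneath.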
Updates always need to be database first, of course (don't they
always); however, I can't quite confirm binaries not working after
upgrading the database - we ran with a 20.11 database server & 20.02
everything else for a fair while (the change in MPI behaviour had us
downgrade everything apart from the database), so I've only ever seen
'-M' (and all accounting) not work during DB restarts.
They were meant to actually be federated, but it confused our users, so
I broke the federation again (but left the rest of the setup in place).
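(For reference, joining and breaking a federation is just slurmdbd state managed via sacctmgr - a sketch, with made-up federation and cluster names:)

```
# create a federation containing both prod clusters
sacctmgr add federation prodfed clusters=capability,throughput

# ...and dissolve it again, leaving the clusters standalone
sacctmgr delete federation prodfed
```

The rest of the shared setup (common accounting DB, same users/partitions) survives the federation being deleted, which is why breaking it again was low-drama.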
Tina
>> --
>> Yair Yarom | System Group (DevOps)
>> The Rachel and Selim Benin School of Computer Science and Engineering
>> The Hebrew University of Jerusalem
>> T +972-2-5494522 | F +972-2-5494522
>> ir...@cs.huji.ac.il