[slurm-users] Recommended Stable Slurm Version for >100P Scale Clusters


KK via slurm-users

Nov 16, 2025, 9:39:35 AM
to slurm...@lists.schedmd.com

We are planning to deploy a new HPC system with a total compute capacity exceeding 100 PF. As part of our preparation, we would like to understand which Slurm versions are considered stable and widely used at this scale.

Could you please share your recommendations or experience regarding:

1. Which Slurm version is currently running reliably on very large-scale clusters (>100 PF or >10k nodes)?

2. Whether there are any versions we should avoid due to known issues at large scale.

3. Any best practices or configuration considerations for Slurm deployments of this size.

John Hearns via slurm-users

Nov 16, 2025, 10:35:59 AM
to KK, Slurm User Community List
I would take a step back and ask how you intend to install and manage this cluster.

CPU only or GPUs?
OS?
Interconnect fabric?
Storage?

Power per rack? Cooling?
Monitoring?

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Paul Edmon via slurm-users

Nov 18, 2025, 9:53:00 AM
to slurm...@lists.schedmd.com

We run at about 50 PF and 1.5k nodes, with about 100,000 jobs per day, on 25.05.4. We tend to upgrade to the latest available release, so we will be moving to 25.11.* soon (once the .1 release comes out). If you are interested, I'm happy to share our slurm.conf.

In my experience the recent releases have been stable, though you want to avoid .0 releases unless you want to be on the bleeding edge or need a specific feature. Most of the kinks are worked out by the .1 release, and definitely by .2, of any major release. There may still be odd edge cases, but in general it is stable.

-Paul Edmon-
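
[Editor's note: for readers wondering what "configuration considerations for this size" look like in practice, below is a minimal, illustrative slurm.conf fragment touching parameters commonly tuned on large clusters. The parameter names are real Slurm options; the values are assumptions for illustration only, not Paul's actual configuration, and should be sized against your own workload and the Slurm documentation.]

```
# Illustrative large-cluster tuning fragment (values are assumptions, not a recommendation)

# Allow the controller to ride out brief outages and slow RPC round-trips
SlurmctldTimeout=120
MessageTimeout=30

# Widen the fanout of the controller's communication tree for >10k nodes
TreeWidth=128

# Raise the job table ceiling for high-throughput workloads (~100k jobs/day)
MaxJobCount=500000

# Bound backfill cost and RPC load on the controller
SchedulerParameters=bf_max_job_test=1000,bf_interval=60,max_rpc_cnt=150,defer
```

Each of these trades scheduler responsiveness against controller load, so they are usually iterated on after the system is in production rather than fixed up front.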
