[slurm-users] A Slurm topological scheduling question

David Baker

Dec 7, 2021, 11:06:01 AM
to Slurm User Community List
Hello,

We have now enabled topology-aware scheduling on our Slurm cluster. One part of the cluster consists of two racks of AMD compute nodes, which Slurm now treats as separate entities. Soon we may add another set of AMD nodes with slightly different CPU specs from the existing nodes. We'll aim to balance the new nodes across the racks with respect to cooling/heating requirements. The new nodes will be controlled by a new partition.

Does anyone know if it is possible to regard the two racks as a single entity (by connecting the InfiniBand switches together), and so schedule jobs across the resources in both racks with no loss of efficiency? I would be grateful for your comments and ideas, please. The alternative is to put all the new nodes in a completely new rack, but that means we'll have to purchase some new Ethernet and IB switches. We would, by the way, prefer to avoid node/switch connections that span racks.

Best regards,
David

Paul Edmon

Dec 7, 2021, 11:29:45 AM
to slurm...@lists.schedmd.com

This should be fine, assuming you don't mind the mismatch in CPU speeds. Unless the codes are super sensitive to topology, things should be okay, as modern IB is wicked fast.


In our environment here we have a variety of different hardware types all networked together on the same IB fabric. That said, we create partitions for different hardware types and we don't have a queue that schedules across both, though we do have a backfill serial queue that underlies everything. All of that, though, is scheduled via a single scheduler with a single topology.conf governing it all. We also have all our internode IP comms going over our IB fabric, and it works fine.
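
As a rough illustration (not our actual config; the switch and node names here are invented), a single topology.conf covering mixed hardware on one fabric might look like:

SwitchName=core Switches=leaf[1-2]
SwitchName=leaf1 Nodes=intel[001-032]
SwitchName=leaf2 Nodes=amd[001-032]

Separate partitions can then carve up the node types while the scheduler still sees one topology.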


-Paul Edmon-

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Dec 7, 2021, 11:36:10 AM
to Slurm User Community List
You can schedule jobs across the two racks, with any given job only using one rack, by specifying
#SBATCH --partition rack1,rack2
It'll only use one partition, chosen in order of partition priority (not list order).
I never found a way to get topology to do that - all I could get it to do was prefer to keep things within a single switch, not require it.
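
A minimal job script along those lines (the partition names rack1/rack2 and the application are placeholders for whatever your site defines):

#!/bin/bash
#SBATCH --partition=rack1,rack2   # Slurm picks one of these partitions, by priority
#SBATCH --nodes=4
srun ./my_mpi_app

Whichever partition is chosen, all nodes for the job come from that one partition, i.e. from a single rack.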

Noam

Ole Holm Nielsen

Dec 7, 2021, 4:05:19 PM
to slurm...@lists.schedmd.com
Hi David,

The topology.conf file groups nodes into sets such that parallel jobs
will not be scheduled by Slurm across disjoint sets. Even though the
topology.conf man-page refers to network switches, it's really about
topology rather than the network.

You may use fake (non-existing) switch names to describe the topology.
For example, we have a small IB sub-cluster with two IB switches defined by:

SwitchName=mell023 Switches=mell0[2-3]
SwitchName=mell02 Nodes=i[004-028]
SwitchName=mell03 Nodes=i[029-050]

If you comment out the first line mell023, you create two disjoint node
groups ("islands") i[004-028] and i[029-050] where jobs won't be
scheduled across node groups.
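
Concretely, the commented-out version would be:

#SwitchName=mell023 Switches=mell0[2-3]
SwitchName=mell02 Nodes=i[004-028]
SwitchName=mell03 Nodes=i[029-050]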

Physical switches and racks are irrelevant here. In your example, you
could add the new AMD nodes with a fake switch name in order to create a
new "island" of nodes. The IB fabric subnet manager of course keeps
track of the real fabric topology, independently of Slurm.
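
A sketch of that (the switch and node names for the new AMD nodes are invented):

SwitchName=mell023 Switches=mell0[2-3]
SwitchName=mell02 Nodes=i[004-028]
SwitchName=mell03 Nodes=i[029-050]
SwitchName=amdfake Nodes=amd[001-024]

Since amdfake is not connected to any higher-level switch, the new AMD
nodes form their own island and jobs won't span the old and new nodes.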

BTW, let me remind you of my Infiniband topology tool slurmibtopology.sh
for Slurm:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology
which generates an initial topology.conf file for IB networks according
to the physical links in the fabric.
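
Assuming it's run on a node where the InfiniBand tools (e.g. ibnetdiscover) are available, usage is roughly:

./slurmibtopology.sh > topology.conf

and you can then review and edit the generated file before deploying it.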

I hope this helps.

/Ole