[slurm-users] Topology configuration questions:


Prentice Bisbal

Jan 17, 2019, 4:49:54 PM
to Slurm User Community List
From https://slurm.schedmd.com/topology.html:

> Note that compute nodes on switches that lack a common parent switch
> can be used, but no job will span leaf switches without a common
> parent (unless the TopologyParam=TopoOptional option is used). For
> example, it is legal to remove the line "SwitchName=s4
> Switches=s[0-3]" from the above topology.conf file. In that case, no
> job will span more than four compute nodes on any single leaf switch.
> This configuration can be useful if one wants to schedule multiple
> physical clusters as a single logical cluster under the control of a
> single slurmctld daemon.

My current environment falls into the category of multiple physical
clusters being treated as a single logical cluster under the control of
a single slurmctld daemon. At least, that's my goal.

In my environment, I have 2 "clusters" connected by their own separate
IB fabrics, and one "cluster" connected with 10 GbE. I have a fourth
cluster connected with only 1 GbE. For this 4th cluster, we don't want
jobs to span nodes, due to the slow performance of 1 GbE. (This cluster
is intended for serial and low-core-count parallel jobs.) If I just leave
those nodes out of the topology.conf file, will that have the desired
effect of not allocating multi-node jobs to those nodes, or will it
result in an error of some sort?
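
To make the question concrete, here is a minimal sketch of the layout described above. All switch and node names are hypothetical placeholders; only the SwitchName=, Nodes=, and Switches= keywords are standard topology.conf syntax. The three fabrics are deliberately given no common parent switch, and the 1 GbE nodes are simply not listed, which is the case being asked about.

```
# topology.conf (sketch; every name below is a placeholder)

# IB fabric 1: two leaf switches under one spine
SwitchName=ib1leaf1 Nodes=ibA[001-016]
SwitchName=ib1leaf2 Nodes=ibA[017-032]
SwitchName=ib1spine Switches=ib1leaf[1-2]

# IB fabric 2: a single switch
SwitchName=ib2leaf1 Nodes=ibB[001-016]

# 10 GbE cluster
SwitchName=eth10g Nodes=tenG[001-024]

# No "SwitchName=top Switches=..." line joins the fabrics, so per the
# documentation quoted above no job should span them.
# The 1 GbE nodes (say, oneG[001-020]) are intentionally left out.
```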

--
Prentice


Prentice Bisbal

Jan 17, 2019, 4:53:17 PM
to Slurm User Community List
And a follow-up question: Does topology.conf need to be on all the
nodes, or just the slurm controller? It's not clear from that web page.
I would assume only the controller needs it.

Prentice

Ryan Novosielski

Jan 17, 2019, 6:34:52 PM
to Slurm User Community List
It will print a warning:

[2019-01-10T12:41:32.457] TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology

…which sort of makes it sound like it’s going to ignore the topology plugin, but I believe it works (and the documentation sure indicates it does).


--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'


Ryan Novosielski

Jan 17, 2019, 6:37:14 PM
to Slurm User Community List
I don’t actually know the answer to this one, but we have it provisioned to all nodes.

Note that if you care about node weights (e.g., NodeName=whatever001 Weight=2 in slurm.conf), using the topology function will disable them. I believe I was promised a warning about that in the future in a conversation with SchedMD.

Fulcomer, Samuel

Jan 17, 2019, 7:57:06 PM
to Slurm User Community List
We use topology.conf to segregate architectures (Sandy->Skylake), and also to isolate individual nodes with 1Gb/s Ethernet rather than IB (older GPU nodes with deprecated IB cards). In the latter case, topology.conf had a switch entry for each node. 
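
As a sketch of the second case (one switch entry per isolated node), assuming hypothetical node names gpu01 through gpu03 and made-up switch names, the entries could look something like this:

```
# topology.conf (sketch; switch and node names are placeholders)
# Each 1 Gb/s node gets its own leaf "switch" with no parent switch.
# Leaf switches without a common parent are not spanned by a job
# (unless TopologyParam=TopoOptional is set), so each of these nodes
# can only run single-node jobs.
SwitchName=swgpu01 Nodes=gpu01
SwitchName=swgpu02 Nodes=gpu02
SwitchName=swgpu03 Nodes=gpu03
```

The switches here are purely logical; they do not need to correspond to physical hardware.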

It used to be the case that SLURM was unhappy with nodes defined in slurm.conf not appearing in topology.conf. This may have changed....

Nicholas McCollum

Jan 17, 2019, 8:06:01 PM
to Slurm User Community List
I recommend putting each heterogeneous node type into its own partition to keep jobs from spanning multiple node types. You can also set QoSes for different partitions so that a job in a given QoS can only be scheduled on one node. You could also accomplish this with a partition setting in your slurm.conf, or use the job_submit.lua plugin to capture jobs submitted to that partition and set their maximum node count to 1.
There are a lot of easy ways to skin that cat.
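
A minimal sketch of the partition-config variant, assuming made-up partition and node names. MaxNodes is the slurm.conf partition parameter that caps how many nodes a single job may be allocated:

```
# slurm.conf (sketch; partition and node names are placeholders)
# Normal partition for the IB-connected nodes
PartitionName=ib     Nodes=ibA[001-032]  Default=YES State=UP
# 1 GbE nodes: jobs in this partition are limited to a single node
PartitionName=serial Nodes=oneG[001-020] MaxNodes=1  State=UP
```

With this approach the 1 GbE nodes need no special treatment in topology.conf to stay single-node.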

I personally like using the submit-all-jobs-to-all-partitions plugin (job_submit/all_partitions) and having users constrain jobs to specific types of nodes using the --constraint=whatever flag.


Nicholas McCollum
Alabama Supercomputer Authority

From: "Fulcomer, Samuel" <samuel_...@brown.edu>
Sent: Thursday, January 17, 2019 5:58 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Topology configuration questions:

Fulcomer, Samuel

Jan 17, 2019, 8:13:16 PM
to Slurm User Community List
Yes, well, the trivial cat-skinning method is to use topology.conf to describe multiple switch topologies, confining each architecture to its own meta-fabric. We use GPFS as a parallel filesystem, and all nodes are connected, but topology.conf keeps jobs on uniform-architecture collectives.

Prentice Bisbal

Jan 18, 2019, 9:28:29 AM
to slurm...@lists.schedmd.com


On 01/17/2019 06:36 PM, Ryan Novosielski wrote:
> I don’t actually know the answer to this one, but we have it provisioned to all nodes.
>
> Note that if you care about node weights (eg. NodeName=whatever001 Weight=2, etc. in slurm.conf), using the topology function will disable it. I believe I was promised a warning about that in the future in a conversation with SchedMD.

Well, that's going to be a big problem for me. One of the goals of me
overhauling our Slurm config is to take advantage of the node weighting
function to prioritize certain hardware over others in our very
heterogeneous cluster.

I may have to provide a larger description of my hardware/situation to
the list and ask for suggestions on how to best handle the problem.

Prentice

Prentice Bisbal

Jan 18, 2019, 9:29:25 AM
to slurm...@lists.schedmd.com

On 01/17/2019 07:55 PM, Fulcomer, Samuel wrote:
> We use topology.conf to segregate architectures (Sandy->Skylake), and also to isolate individual nodes with 1Gb/s Ethernet rather than IB (older GPU nodes with deprecated IB cards). In the latter case, topology.conf had a switch entry for each node.

So Slurm thinks each node has its own switch that is not shared with any other node?

Kilian Cavalotti

Jan 18, 2019, 11:55:24 AM
to Slurm User Community List
On Fri, Jan 18, 2019 at 6:31 AM Prentice Bisbal <pbi...@pppl.gov> wrote:
> > Note that if you care about node weights (eg. NodeName=whatever001 Weight=2, etc. in slurm.conf), using the topology function will disable it. I believe I was promised a warning about that in the future in a conversation with SchedMD.
>
> Well, that's going to be a big problem for me. One of the goals of me
> overhauling our Slurm config is to take advantage of the node weighting
> function to prioritize certain hardware over others in our very
> heterogeneous cluster.

I've heard that too (that enabling the Topology plugin would disable
node weighting), but I don't think it's accurate, both from the
documentation and from observation.

The doc actually says (https://slurm.schedmd.com/topology.html)

"""
NOTE: Slurm first identifies the network switches which provide the
best fit for pending jobs and then selects the nodes with the lowest
"weight" within those switches. If optimizing resource selection by
node weight is more important than optimizing network topology then do
NOT use the topology/tree plugin.
"""

So the Topology plugin does take precedence over the weighting
algorithm, but it doesn't disable it, AFAIK. And for sites using
disjoint networks, as we do, this is a sane behavior.
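
A rough sketch of how the two settings sit together in slurm.conf, assuming hypothetical node names and weight values (other node attributes are omitted for brevity):

```
# slurm.conf (sketch; node names and weights are placeholders)
TopologyPlugin=topology/tree

# Nodes with the lowest Weight are selected first, but per the note
# quoted above this choice is made only among the nodes on the
# switch(es) that the topology plugin has already picked for the job.
NodeName=old[001-016] Weight=10
NodeName=new[001-016] Weight=20
```

So if old[001-016] and new[001-016] hang off different leaf switches, the switch fit is decided first and the weights only break ties within it.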

Cheers,
--
Kilian

Ryan Novosielski

Jan 18, 2019, 1:15:06 PM
to Slurm User Community List
I’m not sure if that’s a change, or whether that was always the behavior, but as a practical matter, it still really defeats the node weight. We have a fully defined topology for two different clusters, and it happens that the switch with the smallest number of connected nodes has the most specialized equipment (usually the login nodes, a couple of high memory nodes, and a few CUDA nodes). If someone runs a single node job, the job will favor that switch. I can think of a few ways to work around that, I guess, but by default, the behavior seems to be roughly the inverse of the node weights.

Ryan Novosielski

Jan 18, 2019, 3:00:13 PM
to Slurm User Community List
The documentation indicates you need it everywhere:

https://slurm.schedmd.com/topology.conf.html

"Changes to the configuration file take effect upon restart of Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the command "scontrol reconfigure" unless otherwise noted."

I have vague memories of not being able to schedule any jobs if it’s missing, but it’s been a while now.

> On Jan 17, 2019, at 4:52 PM, Prentice Bisbal <pbi...@pppl.gov> wrote:
>

Prentice Bisbal

Jan 22, 2019, 1:06:57 PM
to slurm...@lists.schedmd.com
Ryan,

Thanks for looking into this. I hadn't had a chance to revisit the
documentation since posing my question. Thanks for doing that for me.

Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov

Prentice Bisbal

Jan 22, 2019, 1:15:51 PM
to slurm...@lists.schedmd.com
Kilian,

Thanks for the input. Unfortunately, all of this information from you,
Ryan, and others is really ruining my plans, since it looks like the
problem with my cluster will not be as easy to fix as I'd hoped. One of
the issues with my "Frankencluster" is that I'd like to assign jobs to
different nodes based on the network they're on (1 GbE, 10 GbE, IB),
along with other criteria, such as features requested.

I think it might be best if I write a longer e-mail to this list
describing my cluster architecture, the problems I'm trying to address,
and different possible approaches, and then get this list's feedback.

Prentice

Ryan Novosielski

Jan 22, 2019, 3:47:12 PM
to Slurm User Community List
Prentice (and others) — if the NodeWeight/topology plugin interaction bothers you, feel free to tack onto bug 6384.

https://bugs.schedmd.com/show_bug.cgi?id=6384