[slurm-users] Interfaces of topology/tree and Topology Awareness


nico.derl--- via slurm-users

Mar 20, 2024, 2:43:26 PM
to slurm...@lists.schedmd.com
Hello everyone,

I'm trying to improve topology awareness on a local Slurm-managed HPC system. It uses the default hierarchical 3-level topology with the topology/tree plugin. However, it does not always confine jobs to the most tightly packed group of nodes, seems to over-provision switches for smaller jobs, and becomes slow or overwhelmed with jobs that have a high node count.
I'd like to implement something that follows best-fit more literally, but I'm having trouble understanding the relevant interfaces for hooking into Slurm's topology model. I would like a high-level explanation of how the tree and common topology components work, how they integrate into the higher-level scheduling logic, and what the internal topology model looks like, or some pointers to relevant docs discussing this.

I have read the topology guide and its dev-doc, which does note some of the caveats I mentioned. However, it only talks about providing a set of weights to the upper logic levels in the form of a node ranking. I can't see how this ranking reflects the topology or how it is being used. From looking at the signatures and the C code I can tell this much:

topology-tree consumes topology.conf and generates a ranking of some kind that is passed to topology-common.

topology-common consumes a ranking and uses its own gres-sched to figure out what nodes can fit a job (possibly pulling info from the gres-select-plugin to determine node capabilities).

It's then supposed to apply a best-fit algorithm to efficiently fill up vacant cluster capacity, but I can't follow this part in the code because the logic is scattered across separate files that I can't connect in my head.
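
To make "best-fit" concrete, here is a minimal Python sketch of the selection behaviour I have in mind. It is an illustration only and not how Slurm's topology plugins are implemented: the `Switch` class, the `best_fit` function, and the switch and node names are all made up for the example. It assumes a plain switch tree and picks the switch with the fewest idle nodes that can still hold the whole job, preferring the deeper switch on a tie.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Switch:
    """Hypothetical switch-tree model; not a Slurm data structure."""
    name: str
    nodes: list[str] = field(default_factory=list)          # nodes attached directly
    children: list["Switch"] = field(default_factory=list)  # lower-level switches

    def all_nodes(self) -> list[str]:
        """Every node reachable below this switch."""
        out = list(self.nodes)
        for child in self.children:
            out.extend(child.all_nodes())
        return out


def best_fit(root: Switch, idle: set[str], want: int) -> Optional[tuple[str, list[str]]]:
    """Return (switch name, chosen nodes) for the tightest subtree that fits.

    "Tightest" means the fewest idle nodes that still cover the request;
    on a tie the switch visited later (deeper in the tree) wins.
    """
    best: Optional[tuple[str, list[str]]] = None
    stack = [root]
    while stack:
        sw = stack.pop()
        free = [n for n in sw.all_nodes() if n in idle]
        if len(free) < want:
            continue  # nothing below this switch can fit the job either
        if best is None or len(free) <= len(best[1]):
            best = (sw.name, free)
        stack.extend(sw.children)
    if best is None:
        return None
    return best[0], best[1][:want]


# Made-up 3-level example: top -> mid switches -> leaf switches -> nodes.
leaf0 = Switch("leaf0", nodes=[f"n{i}" for i in range(0, 4)])
leaf1 = Switch("leaf1", nodes=[f"n{i}" for i in range(4, 8)])
leaf2 = Switch("leaf2", nodes=[f"n{i}" for i in range(8, 12)])
mid0 = Switch("mid0", children=[leaf0, leaf1])
mid1 = Switch("mid1", children=[leaf2])
top = Switch("top", children=[mid0, mid1])

idle = {"n1", "n2", "n3", "n4", "n5", "n8", "n9", "n10", "n11"}

small = best_fit(top, idle, 2)    # fits under a single leaf switch
medium = best_fit(top, idle, 5)   # needs a mid-level switch
too_big = best_fit(top, idle, 10) # more nodes than are idle
```

With these inputs, the 2-node job lands on `leaf1` (exactly two idle nodes) rather than on the emptier `leaf2`, the 5-node job is confined to `mid0`, and the 10-node job gets no placement. This recomputes idle counts per switch on every call, so it only shows the intended selection rule and says nothing about how to do it efficiently at scale.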

Thanks in advance.

referenced docs:
<https://slurm.schedmd.com/topology.html>
<https://hpc.rz.rptu.de/documentation/topology_plugin.html>
<https://github.com/SchedMD/slurm/tree/master/src/plugins/topology/common>


Nico Derl via slurm-users

Apr 15, 2024, 6:07:47 AM
to slurm...@lists.schedmd.com
I know this isn't a developer forum, but I don't really know where else to ask. I've had no luck on Stack Overflow. Does anyone have any input on this?