Here are various options that might help reduce job fragmentation.
Turn up the debug level on slurmctld and add DebugFlags such as TraceJobs, SelectType, and Steps. With debugging set high enough, you can see a good deal of the node selection logic.
CR_LLN
Schedule resources to jobs on the least loaded nodes
(based upon the number of idle CPUs). This is generally
only recommended for an environment with serial jobs as
idle resources will tend to be highly fragmented,
resulting in parallel jobs being distributed across many
nodes. Note that node Weight takes precedence over how
many idle resources are on each node. Also see the
partition configuration parameter LLN, which uses the
least loaded nodes in selected partitions.
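
As a rough illustration of why CR_LLN tends to fragment idle resources, here is a toy Python model. It is not Slurm's actual selection code. The `Node` class and the `pick_lln` and `place_serial_jobs` helpers are hypothetical names made up for this sketch, and it assumes that lower Weight values are preferred and that ties go to the first node listed.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str       # hypothetical node name
    weight: int     # assumption: lower weight is selected first
    idle_cpus: int  # CPUs currently free on the node


def pick_lln(nodes, cpus_needed=1):
    """Pick a node the way the CR_LLN description reads:
    Weight takes precedence, then the node with the most idle CPUs."""
    candidates = [n for n in nodes if n.idle_cpus >= cpus_needed]
    if not candidates:
        return None
    # Sort key: weight ascending, then idle CPUs descending.
    return min(candidates, key=lambda n: (n.weight, -n.idle_cpus))


def place_serial_jobs(nodes, count):
    """Place `count` one-CPU jobs, one at a time, and return the
    names of the nodes chosen (jobs that do not fit are skipped)."""
    placed = []
    for _ in range(count):
        node = pick_lln(nodes)
        if node is None:
            break
        node.idle_cpus -= 1
        placed.append(node.name)
    return placed


if __name__ == "__main__":
    cluster = [Node(f"n{i}", weight=1, idle_cpus=8) for i in range(4)]
    print(place_serial_jobs(cluster, 4))
    print([n.idle_cpus for n in cluster])
```

In this toy run, four serial jobs land on four different nodes, so no node is left completely idle for a whole-node parallel job. That is the fragmentation the man page text warns about.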
Explore node weights. If your nodes are not identical, apply node weights to sort them into the order in which you want them selected. Even for homogeneous nodes, you might try sets of weights so that, within a given scheduling cycle, the scheduler considers a smaller group of nodes at one weight before moving on to the group at the next weight. Each weight set should probably contain no fewer than 1/3 or 1/4 of the nodes in the partition. YMMV depending on, for instance, the ratio of serial jobs to MPI jobs, job length, etc. I have seen evidence that node allocation progresses roughly this way.
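
A minimal sketch of the weight-set idea for homogeneous nodes, in Python. The helper name `assign_weight_tiers` and the node names are hypothetical. The only Slurm-specific assumption is that a node's Weight is an integer and that lower weights are chosen first. The resulting values would still have to be copied into the Weight= setting of your own node definitions by hand.

```python
def assign_weight_tiers(node_names, tiers=3):
    """Split an ordered list of node names into `tiers` contiguous
    groups of roughly equal size and give each group a weight,
    starting at 1 for the group that should be selected first."""
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    # Ceiling division so that at most `tiers` groups are produced.
    size = -(-len(node_names) // tiers)
    return {name: index // size + 1 for index, name in enumerate(node_names)}


if __name__ == "__main__":
    names = [f"node{i:02d}" for i in range(1, 13)]  # hypothetical names
    for name, weight in assign_weight_tiers(names, tiers=3).items():
        print(f"{name} -> Weight={weight}")
```

With twelve nodes and three tiers, each set holds four nodes, i.e. 1/3 of the partition, which is in the range suggested above.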
Turn on backfill and educate users to request resources and runtimes that closely fit what their jobs actually need. This will allow backfill to work more efficiently. Note that backfill choices are made within a given set of jobs within a partition.
CR_Pack_Nodes
If a job allocation contains more resources than will be
used for launching tasks (e.g. if whole nodes are
allocated to a job), then rather than distributing a job's
tasks evenly across its allocated nodes, pack them as
tightly as possible on these nodes. For example, consider
a job allocation containing two entire nodes with eight
CPUs each. If the job starts ten tasks across those two
nodes without this option, it will start five tasks on
each of the two nodes. With this option, eight tasks will
be started on the first node and two tasks on the second
node. This can be superseded by "NoPack" in srun's
"--distribution" option. CR_Pack_Nodes only applies when
the "block" task distribution method is used.
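
The ten-task example from the man page text can be reproduced with a small Python sketch. This only models the arithmetic described above, not Slurm's real task layout code. `distribute_tasks` is a hypothetical helper, and the even spread is modeled as a simple round-robin over the allocated nodes.

```python
def distribute_tasks(ntasks, cpus_per_node, pack=False):
    """Return the number of tasks started on each allocated node.

    pack=False: spread tasks evenly (round-robin, one task per CPU).
    pack=True:  fill each node in order before using the next one,
                as described for CR_Pack_Nodes.
    """
    if ntasks > sum(cpus_per_node):
        raise ValueError("more tasks than allocated CPUs")
    counts = [0] * len(cpus_per_node)
    remaining = ntasks
    if pack:
        for i, cpus in enumerate(cpus_per_node):
            counts[i] = min(cpus, remaining)
            remaining -= counts[i]
        return counts
    # Even spread: hand out one task per node per pass until done.
    while remaining > 0:
        for i, cpus in enumerate(cpus_per_node):
            if remaining > 0 and counts[i] < cpus:
                counts[i] += 1
                remaining -= 1
    return counts


if __name__ == "__main__":
    print(distribute_tasks(10, [8, 8]))             # [5, 5]
    print(distribute_tasks(10, [8, 8], pack=True))  # [8, 2]
```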
pack_serial_at_end
If used with the select/cons_res or select/cons_tres
plugin, then put serial jobs at the end of the available nodes
rather than using a best fit algorithm. This may reduce
resource fragmentation for some workloads.
reduce_completing_frag
This option is used to control how scheduling of resources
is performed when jobs are in the COMPLETING state, which
influences potential fragmentation. If this option is not
set then no jobs will be started in any partition when any
job is in the COMPLETING state for less than CompleteWait
seconds. If this option is set then no jobs will be
started in any individual partition that has a job in
COMPLETING state for less than CompleteWait seconds. In
addition, no jobs will be started in any partition with
nodes that overlap with any nodes in the partition of the
completing job. This option is to be used in conjunction
with CompleteWait.
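
To make the difference concrete, here is a toy Python model of the rule as it is worded above. It is a sketch built only from that description, not from the scheduler source. `blocked_partitions`, the partition names, and the node names are all hypothetical, and `completing` stands for the partitions that currently have a job in COMPLETING state for less than CompleteWait seconds.

```python
def blocked_partitions(partitions, completing, reduce_completing_frag):
    """Return the set of partitions in which no jobs would be started.

    partitions: dict mapping partition name -> set of node names
    completing: set of partition names with a recently completing job
    reduce_completing_frag: whether the option is set
    """
    if not completing:
        return set()
    if not reduce_completing_frag:
        # Option not set: one completing job holds up every partition.
        return set(partitions)
    blocked = set()
    for name, nodes in partitions.items():
        for busy in completing:
            # Blocked if it is the completing job's partition, or if
            # it shares at least one node with that partition.
            if name == busy or nodes & partitions[busy]:
                blocked.add(name)
                break
    return blocked


if __name__ == "__main__":
    parts = {
        "batch": {"n1", "n2"},
        "debug": {"n2", "n3"},  # overlaps batch on n2
        "gpu": {"g1"},          # no overlap
    }
    print(sorted(blocked_partitions(parts, {"batch"}, False)))
    print(sorted(blocked_partitions(parts, {"batch"}, True)))
```

In this model, setting the option leaves the non-overlapping "gpu" partition free to start jobs while "batch" and "debug" wait.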