[slurm-users] Backfill Scheduling

Reed Dier

Jun 26, 2023, 6:49:01 PM
to Slurm User Community List
Hoping this will be an easy one for the community.

The priority schema was recently reworked for our cluster, with only PriorityWeightQOS and PriorityWeightAge contributing to the priority value, while PriorityWeightAssoc, PriorityWeightFairshare, PriorityWeightJobSize, and PriorityWeightPartition are now set to 0, and PriorityFavorSmall set to NO.
The cluster is fairly loaded right now, with a big backlog of work (~250 running jobs, ~40K pending jobs).
The majority of these jobs are arrays, which runs the pending job count up quickly.

What I’m trying to figure out is:
The next highest priority job array in the queue is waiting on resources, everything else on priority, which makes sense.
However, there is a good portion of the cluster sitting unused, seemingly dammed up by the next job in line being large, while there are much smaller jobs behind it that could easily fit within the footprint of the available resources.

Is this an issue with the effectively FIFO nature of the priority scheduling now that all of the other factors are disabled,
or, since my queue is fairly deep, is this because bf_max_job_test is at the default of 100 and the scheduler can’t look deep enough into the queue to find a job that will fit into what is unoccupied?
PriorityType=priority/multifactor
SchedulerType=sched/backfill
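
For reference, the weight settings look roughly like this (the QOS/Age numbers here are just representative; the zeros and PriorityFavorSmall are exact, and SchedulerParameters is unset, so bf_max_job_test is at its default of 100):

PriorityWeightQOS=10000
PriorityWeightAge=1000
PriorityWeightAssoc=0
PriorityWeightFairshare=0
PriorityWeightJobSize=0
PriorityWeightPartition=0
PriorityFavorSmall=NO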

Hoping to know where I might want to swing my hammer next, without whacking the wrong setting

Appreciate any advice,
Reed

Brian Andrus

Jun 26, 2023, 10:11:06 PM
to slurm...@lists.schedmd.com
Reed,

You may want to look at the timelimit aspect of the job(s).

For one to 'squeeze in', it needs to be able to finish before the
resources in use are expected to become available.

Consider:
Job A is running on 2 nodes of a 3 node cluster. It will finish in 1 hour.
Pending job B will run for 2 hours and needs 2 nodes, but only 1 is free,
so it waits.
Pending job C (with a lower priority) needs 1 node for 2 hours. Hmm,
well it won't finish before the time job B is expected to start, so it
waits.
Pending job D (with even lower priority) needs 1 node for 30 minutes.
That can squeeze in before the additional node for Job B is expected to
be available, so it runs on the idle node.
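
A quick way to see how that plays out on your queue is to look at the
time limits and the start times slurm is projecting for pending jobs,
something like:

squeue --state=PENDING -o "%.12i %.10l %.20S %.10Q %r"

(%l is the time limit, %S the projected start time, %Q the priority and
%r the reason). 'squeue --start' gives a similar view.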

Brian Andrus

Loris Bennett

Jun 27, 2023, 2:11:12 AM
to Slurm User Community List
Hi Reed,

Reed Dier <reed...@focusvq.com> writes:

> Hoping this will be an easy one for the community.
>
> The priority schema was recently reworked for our cluster, with only
> PriorityWeightQOS and PriorityWeightAge contributing to the priority
> value, while PriorityWeightAssoc, PriorityWeightFairshare,
> PriorityWeightJobSize, and PriorityWeightPartition are now set to 0,
> and PriorityFavorSmall set to NO.
> The cluster is fairly loaded right now, with a big backlog of work (~250 running jobs, ~40K pending jobs).
> The majority of these jobs are arrays, which runs the pending job count up quickly.
>
> What I’m trying to figure out is:
> The next highest priority job array in the queue is waiting on resources, everything else on priority, which makes sense.
> However, there is a good portion of the cluster unused, seemingly
> dammed by the next up job being large, while there are much smaller
> jobs behind it that could easily fit into the available resources
> footprint.
>
> Is this an issue with the effectively FIFO nature of the priority scheduling now that all of the other factors are disabled,
> or, since my queue is fairly deep, is this because bf_max_job_test is at
> the default of 100 and the scheduler can’t look deep enough into the queue
> to find a job that will fit into what is unoccupied?

It could be that bf_max_job_test is too low. On our system some users
think it is a good idea to submit lots of jobs with identical resource
requirements by writing a loop around sbatch. Such jobs will exhaust
the bf_max_job_test very quickly. Thus we increased the limit to 1000
and try to persuade users to use job arrays instead of home-grown loops.
This seems to work OK[1].
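
If you go that route, it is just an entry in SchedulerParameters followed
by an 'scontrol reconfigure', along these lines (bf_continue is optional,
but it lets the backfill cycle carry on through a deep queue after it has
to release its locks):

SchedulerParameters=bf_max_job_test=1000,bf_continue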

Cheers,

Loris


> PriorityType=priority/multifactor
> SchedulerType=sched/backfill
>
> Hoping to know where I might want to swing my hammer next, without whacking the wrong setting
>
> Appreciate any advice,
> Reed
>

Footnotes:

[1] One problem we still have to address is that we don't have an
array-enabled version of the 'subgXX' script for the quantum
chemistry program Gaussian. This is a Perl script which parses the
input for the program, generates a job script and submits it. An
array-enabled version would have to stipulate a specific mapping
between the array task ID and the way the input files are
organised. We are currently not sure about the best way to do this
in a suitably generic way.
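
A generic shape would be something like the following (hypothetical
file names and a bare 'g16' call, i.e. not what subgXX actually does):

#!/bin/bash
#SBATCH --array=1-100
# inputs.txt lists one Gaussian input file per line;
# the array task ID picks the corresponding line.
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
g16 "$INPUT"

The part we have not settled is how a list like inputs.txt would be
generated for arbitrary users' file layouts.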

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin

Reed Dier

Jun 27, 2023, 10:39:52 AM
to Slurm User Community List
On Jun 27, 2023, at 1:10 AM, Loris Bennett <loris....@fu-berlin.de> wrote:

Hi Reed,

Reed Dier <reed...@focusvq.com> writes:

Is this an issue with the effectively FIFO nature of the priority scheduling now that all of the other factors are disabled,
or, since my queue is fairly deep, is this because bf_max_job_test is at
the default of 100 and the scheduler can’t look deep enough into the queue
to find a job that will fit into what is unoccupied?

It could be that bf_max_job_test is too low.  On our system some users
think it is a good idea to submit lots of jobs with identical resource
requirements by writing a loop around sbatch.  Such jobs will exhaust
the bf_max_job_test very quickly.  Thus we increased the limit to 1000
and try to persuade users to use job arrays instead of home-grown loops.
This seems to work OK[1].

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin

Thanks Loris,
I think this will be the next knob to turn, and it gives me a bit more confidence in that direction, as we too have many such identical jobs.


On Jun 26, 2023, at 9:10 PM, Brian Andrus <toom...@gmail.com> wrote:

Reed,

You may want to look at the timelimit aspect of the job(s).

For one to 'squeeze in', it needs to be able to finish before the resources in use are expected to become available.

Consider:
Job A is running on 2 nodes of a 3 node cluster. It will finish in 1 hour.
Pending job B will run for 2 hours and needs 2 nodes, but only 1 is free, so it waits.
Pending job C (with a lower priority) needs 1 node for 2 hours. Hmm, well it won't finish before the time job B is expected to start, so it waits.
Pending job D (with even lower priority) needs 1 node for 30 minutes. That can squeeze in before the additional node for Job B is expected to be available, so it runs on the idle node.

Brian Andrus

Thanks Brian,

Our layout is a bit less exciting, in that none of these are >1 node per job.
So blocking out nodes for job:node Tetris isn’t really at play here.
The timing however is something I may turn an eye towards.
Most jobs have a “sanity” time limit applied, in that it is not so much an expected time limit, but rather an “if it goes this long, something obviously went awry and we shouldn’t keep holding on to resources” limit.
So it’s a bit hard to quantify the timing portion, but I haven’t yet looked into Slurm’s guesses of when it thinks the next job will start, etc.

The pretty simplistic example at play here is that there are nodes that are ~50-60% loaded for CPU and memory.
The next job up is a “whale” job that wants a ton of resources, CPU and/or memory, but down the line there is a job with 2 CPUs and 2 GB of memory that can easily slot into the unused resources.

So my thinking was that the bf_max_job_test list may be too short to actually get far enough down the queue to see that it could shove that job into some of those holes.
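
One other thing I plan to check is the backfill section of sdiag (the “Last depth cycle” and “Depth Mean” figures), which should show whether the backfill pass really is stopping before it reaches those small jobs, e.g.:

sdiag | grep -i -A 25 backfill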

I’ll report back any findings after testing Loris’s suggestions.

Appreciate everyone’s help and suggestions,
Reed

Loris Bennett

Jun 28, 2023, 1:44:24 AM
to Slurm User Community List
You might also want to look at increasing bf_window to the maximum time
limit, as suggested in 'man slurm.conf'. If backfill is not looking far
enough into the future to know whether starting a job early will
negatively impact a 'whale', then that 'whale' could potentially wait
indefinitely. This is what happened on our system when we had a maximum
runtime of 14 days but the 1 day default for bf_window. With both set
to 14 days the problem was solved.
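
In slurm.conf that ends up looking something like this (bf_window is in
minutes, so 14 days is 20160; the other values are just examples):

SchedulerParameters=bf_window=20160,bf_resolution=600,bf_max_job_test=1000

The slurm.conf man page also recommends raising bf_resolution when
bf_window is increased, to keep the amount of data the backfill scheduler
has to manage under control.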

Cheers,

Loris

> I’ll report back any findings after testing Loris’s suggestions.
>
> Appreciate everyone’s help and suggestions,
> Reed
>