[slurm-dev] backfill scheduler look ahead?

Bill Wichser

Feb 19, 2014, 6:21:53 PM

to slurm-dev

Just a question on expected behavior of the backfill scheduler. This is
an SMP machine if that matters. Scheduler is backfill with no preemption.

I have a number of jobs queued. There are three which matter, ordered
by priority. In the current state I have 60 free cores.

job 201 needs 200 cores and will start in 1 hour requiring 24 hours of
runtime
job 202 needs 250 cores and will start in 5 hours requiring 24 hours of
runtime
...
job 300 needs 30 cores and will start in 300 hours requiring 2 hours of
runtime

The job completing in 1 hour will free 252 cores.

Clearly, starting job 300 will not impact job 201's start time in any
way. Yet it will not start since the time overlaps the expected 1 hour
start time of job 201. Is this the expected behavior? I haven't yet
checked the source code to verify that this just looks at the trivial
impact on the next job but I'd expect the scheduler to be able to look a
little deeper than this.
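
The arithmetic behind that claim can be checked in a few lines. This is only a sketch using the figures quoted above; it assumes a single pool of interchangeable cores and ignores memory, licenses, reservations, and any other queued or running jobs.

```python
# Figures taken from the post; one pool of interchangeable cores is assumed.
FREE_NOW = 60         # cores idle right now
FREED_IN_1H = 252     # cores released by the job that ends in 1 hour
JOB_201_CORES = 200   # highest-priority pending job, expected start at t = 1 h
JOB_300_CORES = 30    # backfill candidate, would run from t = 0 h to t = 2 h


def free_cores_at_1h(job_300_started: bool) -> int:
    """Idle cores at t = 1 h, just before job 201 is due to start."""
    held_by_300 = JOB_300_CORES if job_300_started else 0
    return FREE_NOW - held_by_300 + FREED_IN_1H


without_300 = free_cores_at_1h(False)
with_300 = free_cores_at_1h(True)
# Job 201 is unaffected as long as enough cores remain at its start time.
print(without_300, with_300, with_300 >= JOB_201_CORES)  # 312 282 True
```

Under these assumptions job 201 still finds 282 idle cores at its start time, well above the 200 it needs.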

Bill

Moe Jette

Feb 20, 2014, 1:21:52 PM

to slurm-dev

Slurm uses what is known as a conservative backfill scheduling
algorithm. No job will be started that adversely impacts the expected
start time of _any_ higher priority job. The scheduling can also be
affected by a job's requirements for memory, generic resources,
licenses, and resource limits.
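
A minimal sketch of the conservative backfill idea, not Slurm's actual implementation: it models cores only, in whole hours, and gives every pending job a reservation in priority order, so a lower priority job can only use what is left over. The `Job` class, the `plan` function, and the 60-hour horizon are assumptions made for illustration. The free-core profile uses only the figures from the original post, so job 202's start here will not match the 5 hours quoted in the thread, which depends on running jobs that were not listed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    cores: int
    hours: int  # requested time limit, in whole hours


def earliest_start(free: List[int], job: Job) -> Optional[int]:
    """First hour at which job.cores stay free for job.hours in a row."""
    for t in range(len(free) - job.hours + 1):
        if all(free[h] >= job.cores for h in range(t, t + job.hours)):
            return t
    return None  # does not fit inside the planning horizon


def plan(free_per_hour: List[int], queue: List[Job]) -> Dict[str, Optional[int]]:
    """Reserve resources for every job in priority order (conservative backfill)."""
    free = list(free_per_hour)  # work on a copy
    starts: Dict[str, Optional[int]] = {}
    for job in queue:  # highest priority first
        t = earliest_start(free, job)
        if t is not None:
            for h in range(t, t + job.hours):
                free[h] -= job.cores  # later jobs cannot take these cores
        starts[job.name] = t
    return starts


# 60 cores free now, 252 more from hour 1 onward (figures from the post).
free_per_hour = [60] + [312] * 59
queue = [Job("201", 200, 24), Job("202", 250, 24), Job("300", 30, 2)]
print(plan(free_per_hour, queue))  # {'201': 1, '202': 25, '300': 0}
```

In this toy model job 300 gets start hour 0, meaning it would be backfilled immediately without moving either higher priority reservation.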

Moe Jette
SchedMD LLC

Bill Wichser

Feb 20, 2014, 6:46:36 PM

to slurm-dev

Moe,

That's quite an obfuscated answer! I was looking for a "yes, this is
the expected behavior" or "no, something is amiss."

In the case presented, I'll say again that job 300 clearly can run. It
has free cores now, and job 201 will have plenty of cores available
when the job it is waiting on finishes. Yet job 300 does not start,
simply because its runtime overlaps the expected start time of the
currently waiting job, #201.

But the assertion that job 201 would be held up by starting job 300 is
completely incorrect in this case.

Now if this is the way the scheduler works, by being simple-minded about
time constraints, then it is what it is. I'm asking only if this
behavior is the expected behavior. I think you are trying to say that
indeed this is the case.

Sincerely,
Bill

Christopher Samuel

Feb 20, 2014, 9:41:52 PM

to slurm-dev

On 21/02/14 10:46, Bill Wichser wrote:

> In the case presented, again I'll say, it is clearly evident that
> the job waiting, number 300, can run. It has free cores, the job
> currently waiting will have plenty of cores available when the job
> it is waiting on finishes, yet it does not start simply because the
> time it requires would interfere with the current start time of the
> currently waiting job, #201.
>
> But the assertion that job 201 would be held up by starting job 300
> is completely incorrect in this case.

So if I'm interpreting you correctly, you are saying that Slurm is not
taking into account the cores that will be released when running jobs
finish.

I wonder if that's because it's not a guarantee as the job may get
extended by an administrator, or on a multi-node system the node
itself may fail?

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci


Alejandro Lucero Palau

Feb 21, 2014, 5:58:00 AM

to slurm-dev

Hi Bill,

I think Moe gives you the right answer but it was so concise it can be
easily misunderstood.

If we analyze the situation you describe purely from the backfill
algorithm's point of view, the answer is that job 300 should be
scheduled, since it has no impact on jobs 201 and 202. However, I think
what Moe was trying to say is that there are other details to take into
account, not just the total number of free cores. Those cores could be
truly free but still unusable, for example because of per-node memory
requirements. Or you may have reservations holding some cores, which you
cannot see just by looking at free cores. Or you have license or
partition limits. Or your system does not allow node sharing, so free
cores do not necessarily mean usable cores. All this assumes you have no
other pending jobs between job 201 and job 300. There is a backfill
parameter, max_job_bf, which limits the number of jobs processed by the
algorithm. The default is 50. Also, because backfilling is so demanding,
it is suspended after some time. When it resumes, if anything has
changed in the system, the backfill algorithm starts again from scratch.
You can avoid this with the bf_continue parameter.
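
To illustrate why a cap on the number of jobs examined matters, here is a small sketch. It is not Slurm code: `jobs_examined` is a hypothetical helper, and the queue below assumes, purely for illustration, that jobs 201 through 300 are all pending in priority order, which the original post does not state.

```python
from typing import List


def jobs_examined(pending_job_ids: List[int], max_jobs_tested: int = 50) -> List[int]:
    """Jobs one backfill pass would look at if it stops after a fixed count."""
    return pending_job_ids[:max_jobs_tested]


# Hypothetical queue: jobs 201..300, highest priority first.
pending = list(range(201, 301))
examined = jobs_examined(pending)

print(examined[-1])     # 250, the last job the pass reaches
print(300 in examined)  # False, job 300 is never considered
```

With 100 pending jobs and a limit of 50, a job at the back of the queue is never evaluated for backfill, however small it is.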

As you can see, there are a lot of details which could have an impact. We
have run into this situation in the past, and it is not always trivial to
see the reason behind scheduling decisions. I added extra debug
information to the backfill algorithm to see how resources were being
reserved by pending jobs, and it was helpful. It might be worthwhile to
have some way of finding out why a job cannot be scheduled. Other
resource managers provide this detailed information, but it would have a
cost, of course.


Eckert, Phil

Feb 21, 2014, 10:44:04 AM

to slurm-dev

Bill,

In addition to what Alejandro said, there is another consideration.

You indicated the top two high priority jobs and the 30 core job, I'm
assuming that the "..." indicated a number of other queued jobs ahead of
the 30 core job. Also, you didn't state it, but I'm also assuming there
were other jobs running at the time.

If both of these assumptions are true, then you would need to consider the
completion time of all the running jobs in relation to the needs of the
jobs ahead of the 30 core job in the queue. The 60 cores may be needed by
a higher priority job that is waiting for a currently running job, or
jobs, that will complete in less than two hours and provide the number of
cores it needs.

We have been using backfill batch systems, including SLURM, here at LLNL
for over 20 years, and trying to answer this question for our users is
never easy. A conclusive way of determining when a job will either start
or be backfilled is to run squeue and sinfo, then plot the results on X-Y
axes of time and nodes to show the blocks that jobs will use. This is a
bit painful, but it provides a lot of insight into backfill.
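
The time-versus-resources map can be roughed out in text form. The sketch below is an assumption-laden illustration: it uses cores rather than nodes, whole hours, a hypothetical machine size of 312 cores (the 60 free plus the 252 about to be released), and hand-entered blocks instead of parsed squeue and sinfo output.

```python
from typing import Dict, List, Tuple

Block = Tuple[int, int, int]  # (start_hour, end_hour, cores)


def free_core_timeline(total_cores: int, blocks: List[Block], horizon: int) -> List[int]:
    """Free cores for each hour in [0, horizon) after subtracting all blocks."""
    free = [total_cores] * horizon
    for start, end, cores in blocks:
        for h in range(max(start, 0), min(end, horizon)):
            free[h] -= cores
    return free


def render(blocks_by_name: Dict[str, Block], horizon: int) -> str:
    """One row per job, one column per hour; '#' marks hours the job occupies."""
    rows = []
    for name, (start, end, cores) in blocks_by_name.items():
        bar = "".join("#" if start <= h < end else "." for h in range(horizon))
        rows.append(f"{name:>8} {cores:>4}c |{bar}|")
    return "\n".join(rows)


# Hypothetical blocks based on the figures in the thread.
blocks_by_name = {
    "running": (0, 1, 252),   # job that ends in 1 hour
    "job 201": (1, 25, 200),  # reservation for the top-priority job
    "job 300": (0, 2, 30),    # backfill candidate
}
print(render(blocks_by_name, horizon=6))
free = free_core_timeline(312, list(blocks_by_name.values()), horizon=6)
print(free)  # [30, 82, 112, 112, 112, 112]
```

If the free-core count never drops below zero, the blocks fit together; a negative value would mean the candidate collides with a reservation.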

I hope this is helpful.

Phil Eckert
LLNL


Bill Wichser

Feb 21, 2014, 12:18:51 PM

to slurm-dev

Thanks for the insights.

Yuri D'Elia

Feb 25, 2014, 10:23:53 AM

to slurm-dev

On 02/20/2014 07:21 PM, Moe Jette wrote:
>
> Slurm uses what is known as a conservative backfill scheduling
> algorithm. No job will be started that adversely impacts the expected
> start time of _any_ higher priority job. The scheduling can also be
> affected by a job's requirements for memory, generic resources,
> licenses, and resource limits.

I'm curious whether this could be changed with a setting to disregard
the expected start time of higher priority jobs.

Given that giving/estimating completion times of jobs is akin to sorcery
in many cases, it would be beneficial in my case to always
under-estimate the time limit.

I'm wondering if anybody is running with an overly conservative TimeLimit
for jobs, and abusing OverTimeLimit [very high value] to achieve this.

I know I would definitely use an EstimatedTimeLimit parameter for
improved backfilling and give an absolute ceiling with TimeLimit (if I
could).

Moe Jette

Feb 25, 2014, 10:34:53 AM

to slurm-dev

I haven't had time to work on this, but one idea would be to estimate a
job's run time based upon historic data and use that as a basis for
backfill scheduling. I suspect the results would be better
responsiveness and higher utilization than when basing scheduling
decisions upon the user's time limit.

Moe Jette
SchedMD

Ralph Castain

Feb 25, 2014, 10:42:51 AM

to slurm-dev

FWIW: that has worked very poorly in the past. The problem is that the workload depends heavily upon the data set, and so past performance is a very poor indicator of future behavior except in rare circumstances (e.g., a nightly weather forecast where the data is consistent night after night).


Yuri D'Elia

Feb 25, 2014, 11:47:10 AM

to slurm-dev

On 02/25/2014 04:42 PM, Ralph Castain wrote:
>>> I'm curious whether this could be changed with a setting to
>>> disregard the expected start time of higher priority jobs.
>>>
>>> Given that giving/estimating completion times of jobs is akin to
>>> sorcery in many cases, it would be beneficial in my case to
>>> always under-estimate the time limit.
>>>
>>> I'm wondering if anybody is running with a overly-conservative
>>> TimeLimit for jobs, and abusing OverTimeLimit [very high value]
>>> to achieve this.
>>>
>>> I know I would definitely use a EstimatedTimeLimit parameter for
>>> improved backfilling and give an absolute ceiling with TimeLimit
>>> (if I could).
>>
>> I haven't had time to work on this, but one idea would be estimate
>> a job's run time based upon historic data and use that as a basis
>> for backfill scheduling. I suspect the results would be better
>> responsiveness and higher utilization than when basing scheduling
>> decisions upon the user's time limit.

I can give pretty accurate estimates most of the time.

What I cannot do, however, is set a conservative timelimit, because that
would kill the job prematurely. As such, I need to give a timelimit that
is within a 5-10x ballpark of the actual figure.

If you think that some --estimated-time would be used by people who are
able to do this, then you could use estimated-time for backfilling, and
use timelimit as a hard limit. You could still default to an
estimatedtime=timelimit when not specified, and get the current behavior.
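
A sketch of how that proposal could behave, for illustration only. No such option exists in Slurm as far as this thread shows; `JobRequest`, `estimated_time_h`, and `planning_time_h` are hypothetical names, and capping the estimate at the hard limit is an added assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRequest:
    time_limit_h: float                       # hard ceiling, job is killed after this
    estimated_time_h: Optional[float] = None  # hypothetical planning hint

    def planning_time_h(self) -> float:
        """Duration a backfill planner would reserve for this job."""
        if self.estimated_time_h is None:
            return self.time_limit_h  # no estimate given: current behavior
        # Assumption: an estimate above the hard limit is capped at the limit.
        return min(self.estimated_time_h, self.time_limit_h)


print(JobRequest(time_limit_h=24).planning_time_h())                      # 24
print(JobRequest(time_limit_h=24, estimated_time_h=3).planning_time_h())  # 3
```

The hard limit still decides when the job is killed; only the window reserved during backfill planning changes.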

> FWIW: that has worked very poorly in the past. The problem is that
> the workload depends heavily upon the data set, and so past
> performance is a very poor indicator of future behavior except in
> rare circumstances (e.g., a nightly weather forecast where the data
> is consistent night after night).

I can confirm that. Run time is dataset/parameter dependent.
