[slurm-users] Scheduler does not reserve resources

Jérémy Lapierre

Jan 12, 2022, 11:00:04 AM
to slurm...@lists.schedmd.com
Hi to all Slurm users,

We have the following issue: the jobs with the highest priority stay pending
forever with the "Resources" reason. More specifically, these jobs ask for
2 full nodes, while all other jobs from other users (running or pending) need
only 1/4 of a node. The pending jobs asking for 1/4 of a node always get
allocated, and the jobs asking for 2 nodes remain pending forever, even though
their priority is higher than that of the jobs asking for fewer resources.
I hope this is clear enough; if not, please look at page 17 of
https://slurm.schedmd.com/SUG14/sched_tutorial.pdf. In our situation, an
endless stream of small jobs fits in before what is job4 in that scheme, so
job4 is never launched. Here are our scheduler options:

FastSchedule = 0
SchedulerParameters =
bf_max_job_test=500,max_job_bf=100,bf_interval=60,bf_continue
SchedulerTimeSlice = 60 sec
SchedulerType = sched/backfill
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0

Thanks for your help,
Best regards,
Jeremy

Rodrigo Santibáñez

Jan 12, 2022, 11:27:57 AM
to Slurm User Community List
Hi Jeremy,

I ran into similar behavior a long time ago, and I decided to set SchedulerType=sched/builtin to drain X nodes of jobs and execute the high-priority job requesting more than one node. It is not ideal, but the cluster has a low load, so a user who requests more than one node doesn't delay other users' jobs too much.
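In slurm.conf that is just the following line (as far as I remember, slurmctld has to be restarted for a scheduler change to take effect):

SchedulerType=sched/builtin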

Hope others can help you with a better idea than mine.

Best!

Rémi Palancher

Jan 17, 2022, 3:05:01 AM
to Slurm User Community List
Hi Jérémy,

On Wednesday, January 12, 2022 at 16:59, Jérémy Lapierre <jeremy....@uni-saarland.de> wrote:

> Hi to all Slurm users,
>
> We have the following issue: the jobs with the highest priority stay
> pending forever with the "Resources" reason. More specifically, these
> jobs ask for 2 full nodes, while all other jobs from other users
> (running or pending) need only 1/4 of a node. The pending jobs asking
> for 1/4 of a node always get allocated, and the jobs asking for 2 nodes
> remain pending forever, even though their priority is higher than that
> of the jobs asking for fewer resources. I hope this is clear enough;
> if not, please look at page 17 of
> https://slurm.schedmd.com/SUG14/sched_tutorial.pdf. In our situation,
> an endless stream of small jobs fits in before what is job4 in that
> scheme, so job4 is never launched.

Backfilling doesn't delay the scheduled start time of higher-priority jobs,
but those jobs must at least have a scheduled start time.

Did you check the start time of your jobs pending with the Resources reason?
E.g. with `scontrol show job <id> | grep StartTime`.
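Or, to list the expected start times of all pending jobs at once, something along these lines should do (the column layout may differ a bit between versions):

squeue --state=PENDING --start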

Sometimes Slurm is unable to determine the start time of a pending job. One
typical reason is the absence of a time limit on the running jobs.

In this case Slurm cannot determine when the running jobs will be over, hence
when the next highest-priority job can start, and therefore whether
lower-priority jobs would actually delay higher-priority jobs.
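If that is the cause, one common mitigation is to give the partition a default and a maximum time limit in slurm.conf, so every job always ends up with a wall time. A sketch with placeholder partition and node names:

PartitionName=compute Nodes=node[01-16] DefaultTime=24:00:00 MaxTime=2-00:00:00 State=UP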

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io

Jérémy Lapierre

Jan 17, 2022, 12:00:47 PM
to Slurm User Community List

Hi Rodrigo and Rémi,

>I ran into similar behavior a long time ago, and I decided to set SchedulerType=sched/builtin to drain X
>nodes of jobs and execute the high-priority job requesting more than one node. It is not ideal, but the
>cluster has a low load, so a user who requests more than one node doesn't delay other users' jobs too
>much.

I don't think this would be ideal in our case, as we have a heavy load. Also, I'm not sure whether you mean that we should switch to SchedulerType=sched/builtin permanently, or only for the time needed for the problematic jobs to be allocated. From our experience on another cluster, we think Slurm should normally reserve resources for such jobs.

>Backfilling doesn't delay the scheduled start time of higher-priority jobs,
>but those jobs must at least have a scheduled start time.
>
>Did you check the start time of your jobs pending with the Resources reason?
>E.g. with `scontrol show job <id> | grep StartTime`.

Yes, the scheduled start time has been checked as well, and it keeps being pushed back over time, such that jobs asking for 1/4 of a node can run on a freshly freed quarter node. This is why I'm saying that the jobs asking for several nodes (tested with 2 nodes here) are pending forever. It is as if Slurm never wants to leave resources unused (which also makes sense, but how can we then satisfy "heavy" resource requests?). On another cluster running Slurm, I know that Slurm reserves nodes and the state of those reserved nodes becomes "PLANNED" (or plnd); this way, jobs requesting more resources than are available at submission time can be satisfied later. This never happens on the cluster that is causing issues.
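For illustration, on that other cluster something like this shows the reserved nodes (the state name is only reported by recent Slurm releases, and the exact spelling may vary):

sinfo -N -o "%N %T" | grep -i plan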

>Sometimes Slurm is unable to determine the start time of a pending job. One
>typical reason is the absence of a time limit on the running jobs.
>In this case Slurm cannot determine when the running jobs will be over, hence
>when the next highest-priority job can start, and therefore whether
>lower-priority jobs would actually delay higher-priority jobs.

Yes, we always set the time limit of our jobs to the maximum time limit allowed by the partition.

Thanks for your help,

Jeremy

Rodrigo Santibáñez

Jan 18, 2022, 7:47:19 PM
to Slurm User Community List
Hi Jeremy,

If all jobs have the same time limit, backfill is impossible. The documentation says: "Effectiveness of backfill scheduling is dependent upon users specifying job time limits, otherwise all jobs will have the same time limit and backfilling is impossible". I don't know how to overcome that...

However, without changing SchedulerType, you could hold all pending jobs except the job you want to execute, then release them once the desired job is allocated (see the sketch below). Alternatively, you could restrict the other jobs to a list of nodes that excludes the nodes needed by the job of interest, and remove that configuration once the latter is allocated. I preferred the second approach because both the "heavy" job and the "light" jobs get allocated, and I don't have to watch the queue outside office hours (again, easier to do on a lightly utilized cluster).
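As a rough sketch of the first approach (the job ID 1234 is only a placeholder for your high-priority 2-node job):

# remember which jobs are currently pending, apart from the 2-node job
squeue -h -t PD -o %i | grep -v '^1234$' > held_jobs.txt
# hold them all
xargs -n1 scontrol hold < held_jobs.txt
# ... once job 1234 is allocated ...
xargs -n1 scontrol release < held_jobs.txt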

About "PLANNED", I wasn't aware, and it is a feature of SLURM 21.08. Could be that why you don't see it in your cluster?

Best,

Jérémy Lapierre

Jan 19, 2022, 4:05:58 AM
to Slurm User Community List

Hi Rodrigo,

We had indeed overlooked this. The problem is that our jobs generally need more than 2 days of resources, which is why we set the wall time in the batch scripts to the maximum allowed by the partition. One thing we could try is to set the wall time to ~46 h for the "light" jobs and keep 48 h for the "heavy" jobs, so that not all jobs have the same time limit.
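Concretely, that would just mean something like this in the two kinds of batch scripts (the values are only examples):

#SBATCH --time=46:00:00    # "light" 1/4-node jobs
#SBATCH --time=48:00:00    # "heavy" 2-node jobs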

Configuring node lists for "light" and "heavy" jobs could do the trick. Two things that could then be a problem are that (i) even "heavy" jobs with a very low priority would get access to resources at the expense of "light" jobs with higher priority, and (ii) regular manual intervention would be needed. But maybe there is no other solution.

Thank you very much for your input!

Best,

Jeremy
