Hi and thanks for reading this!
I am trying to estimate the queue time of a job of a given size and walltime limit. Our project considers multiple HPC resources and needs estimated queue times to decide where to actually submit each job.
From the man page of ‘sbatch’, I found that the ‘--test-only’ option can be used to “validate the batch script and return an estimate of when a job would be scheduled to run given the current job queue and all the other arguments specifying the job requirements”. This looked very promising to us.
I tried several submissions on the IU BigRed3 and TACC Stampede2 HPC systems; the recorded results are shown below (the last two columns are the estimated and the actual queue time). From the results, the estimate appears to be quite inaccurate, and it can be either an over-estimate or an under-estimate:
-----start of output
| site | slurm version | partition | JobID | node | np | walltime_mins | timestamp_estimate | estimated_start | submit_time | actual_start | estimated_wait | actual_wait |
| stampede2 | 18.08.5-2 | skx-normal | 8436162 | 1 | 48 | 10 | 9/9/2021 16:05 | 9/11/2021 23:29 | 9/9/2021 16:08 | 9/9/2021 16:11 | 55:23:56 | 0:02:49 |
| Stampede2 | 18.08.5-2 | skx-normal | 8436369 | 1 | 48 | 10 | 9/9/2021 16:51 | 9/12/2021 0:04 | 9/9/2021 16:51 | 9/9/2021 16:52 | 55:13:00 | 0:00:58 |
| Stampede2 | 18.08.5-2 | normal | 8436193 | 1 | 48 | 10 | 9/9/2021 16:17 | 9/9/2021 18:02 | 9/9/2021 16:19 | 9/9/2021 16:19 | 1:45:26 | 0:00:02 |
| Stampede2 | 18.08.5-2 | normal | 8436308 | 2 | 48 | 10 | 9/9/2021 16:40 | 9/9/2021 18:25 | 9/9/2021 16:41 | 9/9/2021 16:41 | 1:45:00 | 0:00:04 |
| Bigred3 | 20.11.7 | general | 1727144 | 1 | 24 | 10 | 9/9/2021 17:57 | 9/10/2021 12:39 | 9/9/2021 17:59 | 9/9/2021 17:59 | 18:42:00 | 0:00:00 |
| Bigred3 | 20.11.7 | general | 1734075 | 1 | 24 | 60 | 9/15/2021 14:54 | 9/15/2021 14:54 | 9/15/2021 14:54 | 9/15/2021 15:01 | 0:00:00 | 0:07:11 |
| Bigred3 | 20.11.7 | general | 1734079 | 1 | 24 | 20 | 9/15/2021 15:09 | 9/15/2021 15:09 | 9/15/2021 15:09 | 9/15/2021 15:09 | 0:00:00 | 0:00:01 |
| Bigred3 | 20.11.7 | general | 1734081 | 4 | 24 | 60 | 9/15/2021 15:11 | 9/15/2021 15:11 | 9/15/2021 15:11 | 9/15/2021 15:34 | 0:00:00 | 0:22:15 |
-----end of output
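To be explicit about how the two wait columns were derived from the recorded timestamps, here is a minimal sketch in Python (the `parse` helper is mine; timestamps in the table are truncated to whole minutes, so the seconds differ slightly from the recorded values):

```python
from datetime import datetime

def parse(ts):
    """Parse the m/d/yyyy hh:mm timestamps used in the table above."""
    return datetime.strptime(ts, "%m/%d/%Y %H:%M")

# Job 8436162: estimated_wait = estimated_start - timestamp_estimate,
# actual_wait = actual_start - submit_time.  Because the table keeps
# only whole minutes, these come out as 2 days 7:24 (= 55:24) and 0:03
# rather than the recorded 55:23:56 and 0:02:49.
est_wait = parse("9/11/2021 23:29") - parse("9/9/2021 16:05")
act_wait = parse("9/9/2021 16:11") - parse("9/9/2021 16:08")
print(est_wait, act_wait)  # 2 days, 7:24:00 0:03:00
```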
Could you suggest better ways to estimate the queue time? Or are there specific configurations/situations on those systems that might affect the queue time estimation (e.g. fair-share or site-specific QoS settings)?
Below is an example of my measurement, for your information:
-----begin of example
lifen@elogin1(:):~$date && sbatch --test-only -n 24 -N 4 -p general -t 00:60:00 --wrap "hostname"
Wed Sep 15 15:11:49 EDT 2021
sbatch: Job 1734080 to start at 2021-09-15T15:11:49 using 24 processors on nodes nid00[935-938] in partition general
lifen@elogin1(:):~$date && sbatch -n 24 -N 4 -p general -t 00:60:00 --wrap "hostname"
Wed Sep 15 15:11:58 EDT 2021
Submitted batch job 1734081
lifen@elogin1(:):~$sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist -j 1734081
User JobID JobName Partition State Timelimit Start End Elapsed MaxRSS MaxVMSize NNodes NCPUS NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
lifen 1734081 wrap general COMPLETED 01:00:00 2021-09-15T15:34:13 2021-09-15T15:34:13 00:00:00 4 24 nid00[169,883,+
1734081.bat+ batch COMPLETED 2021-09-15T15:34:13 2021-09-15T15:34:13 00:00:00 2136K 226420K 1 18 nid00169
1734081.ext+ extern COMPLETED 2021-09-15T15:34:13 2021-09-15T15:34:13 00:00:00 4K 4K 4 24 nid00[169,883,+
-----end of example
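For automation, the `sbatch: Job … to start at …` message can be parsed into an estimated wait. Here is a sketch in Python, assuming the message format shown in the session above (the `estimated_wait` function and regex are mine, not part of Slurm):

```python
import re
from datetime import datetime

# Example `sbatch --test-only` message, copied from the session above.
MSG = ("sbatch: Job 1734080 to start at 2021-09-15T15:11:49 "
       "using 24 processors on nodes nid00[935-938] in partition general")

def estimated_wait(test_only_msg, now):
    """Extract the estimated start time from an `sbatch --test-only`
    message and return the wait relative to `now` as a timedelta."""
    m = re.search(r"to start at (\S+)", test_only_msg)
    if m is None:
        raise ValueError("unrecognized sbatch --test-only output")
    start = datetime.fromisoformat(m.group(1))
    return start - now

# The command above was issued at 15:11:49, so the estimated wait is zero.
print(estimated_wait(MSG, datetime(2021, 9, 15, 15, 11, 49)))  # 0:00:00
```

Since we run this immediately before the real submission, the gap between this estimate and the start time later reported by `sacct` is the estimation error we tabulated.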
Thanks,
Feng Li
I can imagine at least the following causing differences between the estimated time and the actual start time:
I haven't looked at the code to see whether the test-only parameter goes through a complete scheduling cycle before returning the estimate, but I can guarantee that the first two items above happen all the time on my much simpler cluster here.
Hi Michael,
Thanks for your quick response. All the factors you mention look valid to me. Regarding the first one, a similar situation can also arise when a large queued job has just failed.
It could be quite bad if the backfill scheduler is enabled by default but the ‘--test-only’ estimate does not account for backfill. (I have not checked the details in the source code yet.)
I am still wondering what a better way to do such queue estimation would be. I searched the mailing list archive and saw many suggestions to use the ‘--test-only’ option, but not much discussion of how reliable it actually is.
Best,
Feng