[slurm-users] backfill scheduler does not work for heterogeneous jobs (version 17.11)


Ana Jokanović

Nov 29, 2018, 7:29:43 AM
to slurm...@lists.schedmd.com


Hello,

I ran a simple test, submitting a workload of three heterogeneous jobs (see below) on a 5-node cluster:

sbatch --cpus-per-task=2 --ntasks=6 --time=15 : --cpus-per-task=2 --ntasks=6 --time=15 : --cpus-per-task=2 --ntasks=6 --time=15

sbatch --cpus-per-task=2 --ntasks=6 --time=15 : --cpus-per-task=2 --ntasks=6 --time=15 : --cpus-per-task=2 --ntasks=6 --time=15

sleep 5;

sbatch --ntasks=1 --time=2 : --ntasks=1 --time=1


I would expect the third job to be backfilled, but that does not happen.
Here is the job completion log:

JobId=2 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774 NodeList=s19r2b09 NodeCnt=1 ProcCnt=48
JobId=3 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774 NodeList=s19r2b10 NodeCnt=1 ProcCnt=48
JobId=4 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774 NodeList=s19r2b12 NodeCnt=1 ProcCnt=48
JobId=8 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:02:00 SubmitTime=1543317699 StartTime=1543317804 EndTime=1543317824 NodeList=s19r2b14 NodeCnt=1 ProcCnt=48
JobId=9 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:01:00 SubmitTime=1543317699 StartTime=1543317804 EndTime=1543317824 NodeList=s19r2b16 NodeCnt=1 ProcCnt=48
JobId=5 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864 NodeList=s19r2b09 NodeCnt=1 ProcCnt=48
JobId=6 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864 NodeList=s19r2b10 NodeCnt=1 ProcCnt=48
JobId=7 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864 NodeList=s19r2b12 NodeCnt=1 ProcCnt=48


Would you expect this behavior?


Thanks.


Best regards,

Ana

--
Ana Jokanovic, PhD
Barcelona Supercomputing Center
c/ Jordi Girona 1-3, K2M Building, 1st floor
08034 Barcelona - SPAIN
e-mail: ana...@gmail.com or ana.jo...@bsc.es
tel: +34 93 4137246




Kenneth Roberts

Nov 30, 2018, 11:44:36 AM
to slurm...@lists.schedmd.com

The heterogeneous job support page has a Limitations section that mentions backfill:

https://slurm.schedmd.com/heterogeneous_jobs.html#limitations

Maybe there is some information there that helps?

Ken

Ana Jokanović

Dec 3, 2018, 3:40:44 AM
to slurm...@lists.schedmd.com
Hi Ken,

I have read that page, and my understanding is that in my example the third job should be backfilled. The second job may have to wait up to 15 minutes (the time limit of the first job), whereas the third job needs only two nodes for at most 2 minutes, so it could start immediately. That does not happen.

In the page that you referred to, they give an example:

For example, consider a heterogeneous job with three components. When considered as independent jobs, the components could be initiated at times now (component 0), now plus 2 hour (component 1), and now plus 1 hours (component 2). When the backfill scheduler runs in the first mode:

  1. Component 0 will be noted to possible to start now, but not initiated due to the additional components to be initiated
  2. Component 1 will be noted to be possible to start in 2 hours
  3. Component 2 will not be considered for scheduling until 2 hours in the future, which leave some additional resources available for scheduling to other jobs

When the backfill scheduler executes next, it will use the second mode and (assuming no other state changes) all three job components will be considered available for scheduling no earlier than 2 hours in the future, which may allow other jobs to be allocated resources before heterogeneous job component 0 could be initiated.
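The rule in the quoted example can be restated as: a heterogeneous job cannot start before its slowest component can. A minimal Python sketch of that arithmetic follows; `hetjob_earliest_start` is a hypothetical helper name for illustration and is not part of Slurm.

```python
from datetime import timedelta

def hetjob_earliest_start(component_starts):
    """Earliest time at which all components could start together.

    Illustrative only: the latest per-component start time bounds
    the start of the whole heterogeneous job.
    """
    return max(component_starts)

# Offsets from "now", taken from the documentation example quoted above.
components = {
    0: timedelta(hours=0),  # could start now
    1: timedelta(hours=2),  # could start in 2 hours
    2: timedelta(hours=1),  # could start in 1 hour
}

start = hetjob_earliest_start(components.values())
print(start)  # 2:00:00

# Time during which each component's resources stay free for other jobs.
idle_window = {i: start - t for i, t in components.items()}
```

Under this reading, the resources that component 0 could use now stay idle for two hours, which is the window other jobs could be backfilled into.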

From this example, I conclude that the third job in my experiment should be backfilled, for the reason given above, yet it is not.

This looks like a bug. I also tried version 18.03, and it does not work there either.

Ana 

Kenneth Roberts

Dec 3, 2018, 10:57:36 PM
to slurm...@lists.schedmd.com

Hi –

The timestamps show that the three components of your 1st sbatch all start at the same time and then run for 1 minute.

Thirty seconds after all three components of the 1st sbatch end simultaneously, the two components of the 3rd sbatch and the three components of the 2nd sbatch all start. The two components of the 3rd sbatch each run for 20 seconds, and the three components of the 2nd sbatch each run for 1 minute.

The 3rd sbatch was submitted 5 seconds later because of the sleep, so its components did not start with the 1st sbatch.
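These timings can be checked directly against the job completion log. A minimal Python sketch follows, using only the JobId and timestamp fields copied from the log earlier in the thread. The mapping of job IDs to submissions (2-4 first, 5-7 second, 8-9 third) is inferred from the SubmitTime values and is an assumption.

```python
import re

# JobId and timestamps copied from the completion log; other fields omitted.
LOG = """\
JobId=2 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774
JobId=3 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774
JobId=4 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774
JobId=8 SubmitTime=1543317699 StartTime=1543317804 EndTime=1543317824
JobId=9 SubmitTime=1543317699 StartTime=1543317804 EndTime=1543317824
JobId=5 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864
JobId=6 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864
JobId=7 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864
"""

def parse(log):
    """Map each JobId to a dict of its integer key=value fields."""
    jobs = {}
    for line in log.splitlines():
        fields = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", line)}
        jobs[fields["JobId"]] = fields
    return jobs

jobs = parse(LOG)

# Per job: seconds spent waiting in the queue and seconds spent running.
for job_id in sorted(jobs):
    j = jobs[job_id]
    wait = j["StartTime"] - j["SubmitTime"]
    run = j["EndTime"] - j["StartTime"]
    print(f"JobId={job_id} wait={wait}s run={run}s")

# Gap between the end of the first submission and the start of the other two.
gap = jobs[8]["StartTime"] - jobs[2]["EndTime"]
print(f"gap={gap}s")
```

The components of the third submission (JobId 8 and 9) waited 105 seconds and started at the same instant as the second submission, so they were not started ahead of it. The gap after the first submission ends is 30 seconds.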

 

Can you share the other parameters of your setup, such as the SelectType and the node specs? These affect scheduling. Note that I'm wading into deep waters here; I'm still learning Slurm. (slurming? ;-)

 


Ana Jokanović

Dec 10, 2018, 2:28:15 AM
to slurm...@lists.schedmd.com
Hi Ken,

Here is my slurm.conf:

ControlMachine=s19r2b08

AuthType=auth/none
CryptoType=crypto/openssl
JobCredentialPrivateKey=/home/bsc33/bsc33882/slurm_over_slurm/etc/slurm.key
JobCredentialPublicCertificate=/home/bsc33/bsc33882/slurm_over_slurm/etc/slurm.cert

MpiDefault=none

ProctrackType=proctrack/linuxproc
ReturnToService=1

SlurmctldPidFile=/home/bsc33/bsc33882/slurm_over_slurm/var/run/slurmctld.pid
SlurmctldPort=7001
SlurmdPidFile=/home/bsc33/bsc33882/slurm_over_slurm/var/run/slurmd.%n.pid
SlurmdPort=8009
SlurmdSpoolDir=/home/bsc33/bsc33882/slurm_over_slurm/var/spool/slurmd.%n
SlurmUser=bsc33882
SlurmdUser=bsc33882
StateSaveLocation=/home/bsc33/bsc33882/slurm_over_slurm/var/state
SwitchType=switch/none

TaskPlugin=task/none
TaskPluginParam=autobind=cores

# TIMERS
InactiveLimit=1800
KillWait=60
MinJobAge=300
OverTimeLimit=1
SlurmctldTimeout=300
SlurmdTimeout=300

# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
SchedulerParameters=bf_interval=30,default_queue_depth=50

# LOGGING AND ACCOUNTING
ClusterName=cluster
JobCompType=jobcomp/script
JobCompLoc=/home/bsc33/bsc33882/slurm_over_slurm/script/trace.sh
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=7
SlurmctldLogFile=/home/bsc33/bsc33882/slurm_over_slurm/var/slurmctld.log
SlurmdDebug=7
SlurmdLogFile=/home/bsc33/bsc33882/slurm_over_slurm/var/slurmd.%n.log
DebugFlags=Backfill,SelectType

# COMPUTE NODES
NodeName=s19r2b[09-10,12,14,16] CPUs=48 Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 State=IDLE Port=7009
PartitionName=debug Nodes=s19r2b[09-10,12,14,16] Default=YES MaxTime=INFINITE State=UP
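
With SelectType=select/linear, whole nodes are allocated to jobs, which matches the NodeCnt=1 ProcCnt=48 entries in the completion log: each component occupied one full node. The expectation stated earlier in the thread can be written as a small whole-node check. This is a simplified Python sketch of that expectation, not Slurm's actual backfill algorithm; `can_backfill` is a hypothetical helper, and using the first job's 15-minute time limit as the wait of the second job is an assumption.

```python
TOTAL_NODES = 5  # NodeName=s19r2b[09-10,12,14,16] in the slurm.conf above

def can_backfill(free_nodes, nodes_needed, time_limit_min, blocker_wait_min):
    """Simplified whole-node check (illustrative, not Slurm's algorithm).

    A candidate qualifies if it fits on the idle nodes and its time limit
    ends before the blocked higher-priority job could start.
    """
    return nodes_needed <= free_nodes and time_limit_min <= blocker_wait_min

# First job: 3 components, one whole node each.
free_nodes = TOTAL_NODES - 3

# Second job: 3 components, so it does not fit on the 2 idle nodes.
second_job_fits = 3 <= free_nodes

# Third job: 2 components, longest time limit 2 minutes; the second job
# is assumed to wait up to the first job's 15-minute time limit.
third_job_ok = can_backfill(free_nodes, nodes_needed=2,
                            time_limit_min=2, blocker_wait_min=15)

print(second_job_fits, third_job_ok)  # False True
```

Under this simplified model the third job qualifies for backfill, which is the behavior expected in the original question. The observed schedule differs, so whatever prevents it lies in details this model leaves out.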
