[slurm-users] Slurm Fairshare / Multifactor Priority

Julius, Chad

May 29, 2019, 10:17:56 AM
to slurm...@lists.schedmd.com

All,

 

We rushed our Slurm install due to a short timeframe and missed some important items.  We are now looking to implement a better system than the first-in, first-out scheduling we have now.  My question: are the defaults listed in the slurm.conf file a good start?  Would anyone be willing to share the Scheduling section of their .conf?  Also, we are looking to increase the maximum array size, but I don’t see that option in the slurm.conf in version 17.  Am I looking at an upgrade of Slurm in the near future, or can I just add MaxArraySize=somenumber?

 

The defaults as of 17.11.8 are:

 

# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0

 

Chad Julius

Cyberinfrastructure Engineer Specialist

 

Division of Technology & Security

SOHO 207, Box 2231

Brookings, SD 57007

Phone: 605-688-5767

 

www.sdstate.edu


 

Paul Edmon

May 29, 2019, 10:40:14 AM
to slurm...@lists.schedmd.com

Sure.  Here is what we have:

########################## Scheduling #####################################
### This section is specific to scheduling

### Tells the scheduler to enforce limits for all partitions
### that a job submits to.
EnforcePartLimits=ALL

### Lets Slurm know that we have a job_submit.lua script
JobSubmitPlugins=lua

### When a job is launched this has slurmctld send the user information
### instead of having AD do the lookup on the node itself.
LaunchParameters=send_gids

### Maximum sizes for Jobs.
MaxJobCount=200000
MaxArraySize=10000
DefMemPerCPU=100

### Job Timers
CompleteWait=0

### We set EpilogMsgTime high so that Epilog messages don't pile up all
### at one time due to forced exit, which can cause problems for the master.
EpilogMsgTime=3000000
InactiveLimit=0
KillWait=30

### This only applies to the reservation time limit; the job must still obey
### the partition time limit.
ResvOverRun=UNLIMITED
MinJobAge=600
Waittime=0

### Scheduling parameters
### FastSchedule=2 tells Slurm not to auto-detect the node config
### but rather to follow our definition.  We also use setting 2 because, due to our
### geographic size, nodes may drop out of Slurm and then reconnect.  If we used 1 they
### would be set to drain when they reconnect.  Setting it to 2 allows them to rejoin
### without issue.
FastSchedule=2
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

### Governs default preemption behavior
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

### default_queue_depth should be some multiple of the partition_job_depth,
### ideally number_of_partitions * partition_job_depth, but typically the main
### loop exits prematurely if you go over about 400. A partition_job_depth of
### 10 seems to work well.
SchedulerParameters=\
default_queue_depth=1150,\
partition_job_depth=10,\
max_sched_time=50,\
bf_continue,\
bf_interval=30,\
bf_resolution=600,\
bf_window=11520,\
bf_max_job_part=0,\
bf_max_job_user=10,\
bf_max_job_test=10000,\
bf_max_job_start=1000,\
bf_ignore_newly_avail_nodes,\
kill_invalid_depend,\
pack_serial_at_end,\
nohold_on_prolog_fail,\
preempt_strict_order,\
preempt_youngest_first,\
max_rpc_cnt=8

################################ Fairshare ################################
### This section sets the fairshare calculations

PriorityType=priority/multifactor

### Settings for fairshare calculation frequency and shape.
FairShareDampeningFactor=1
PriorityDecayHalfLife=28-0
PriorityCalcPeriod=1

### Settings for fairshare weighting.
PriorityMaxAge=7-0
PriorityWeightAge=10000000
PriorityWeightFairshare=20000000
PriorityWeightJobSize=0
PriorityWeightPartition=0
PriorityWeightQOS=1000000000
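
### If you want to see how these settings translate into actual numbers on a
### running cluster, sprio and sshare are the usual tools (assuming the
### multifactor plugin is active, as it is with the PriorityType above):
###
###   sprio -w        # the configured factor weights
###   sprio -l        # each pending job's priority broken down by factor
###   sshare -a -l    # the fairshare tree with per-account usage and FairShare values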

I'm happy to chat about any of the settings if you want, or share our full config.

-Paul Edmon-

Paul Edmon

May 29, 2019, 10:41:10 AM
to slurm...@lists.schedmd.com

For reference, we are running 18.08.7.

-Paul Edmon-

Kilian Cavalotti

May 29, 2019, 11:06:00 AM
to Slurm User Community List
Hi Paul, 

I'm wondering about this part in your SchedulerParameters:


### default_queue_depth should be some multiple of the partition_job_depth,
### ideally number_of_partitions * partition_job_depth, but typically the main
### loop exits prematurely if you go over about 400. A partition_job_depth of
### 10 seems to work well.

Do you remember if that's still the case, or if it was in relation to a reported issue? That sure sounds like something that would need to be fixed if it hasn't been already.

Cheers,
-- 
Kilian

Christoph Brüning

May 29, 2019, 11:07:58 AM
to slurm...@lists.schedmd.com
Hi Chad,

For us (also running Slurm 17.11), the crucial point was the balance
between PriorityWeightFairshare, PriorityWeightAge and PriorityMaxAge.

We set the PriorityWeightAge high (higher than PriorityWeightFairshare,
in fact), so that even a job by some power user will eventually be the
first in the queue and can't be sort of DDoS-ed by jobs from little-used
accounts.
The question then is: How long must that job have already been waiting
in the queue?

Consider the following simplified account tree:

      root
     /    \
    A      B
   / \     |
  X   Y    Z

When the cluster is basically occupied by X, this also has an impact on
Y's fair share value. This can lead to a situation where the difference
between X's and Y's fair share value is pretty small, even though Y has
hardly used any resources.
With a low value of PriorityMaxAge, the situation is basically FIFO
between X and Y, as X's jobs only need a couple of hours (or even less)
in the queue to compensate for the difference in fair share priority.

We're currently running with the following settings, and since we
increased PriorityMaxAge to three weeks it has worked fine:

PriorityMaxAge=21-0
PriorityWeightAge=1500000
PriorityWeightFairshare=1000000
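
As a rough worked example of how these numbers interact (using the multifactor
plugin's weighted sum, where every factor is normalized to 0..1 and the age
factor saturates at 1.0 once a job has waited PriorityMaxAge):

  priority ~ PriorityWeightAge * age_factor
           + PriorityWeightFairshare * fairshare_factor
           + (other weighted factors)

With the settings above, the largest possible fairshare gap between two jobs is
worth 1000000 points, while a job that has waited the full 21 days gets 1500000
points from the age term alone, so even a job from the most heavily used account
is guaranteed to come out on top eventually. A more typical gap of 0.05 in the
fairshare factor (50000 points) is earned back by the age term after about
21 * 50000/1500000 = 0.7 days; with PriorityMaxAge=1-0 and the same weights it
would be erased in under an hour, which is the FIFO-like behaviour described above.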


For the array jobs, you can set MaxArraySize. But remember to increase
MaxJobCount as well!
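
Something along these lines in slurm.conf should do; the numbers are just
placeholders to adjust for your site (the default MaxArraySize is 1001, i.e.
indices 0-1000), and if I remember the man page correctly MaxJobCount only takes
effect after a slurmctld restart, not a plain "scontrol reconfigure":

MaxArraySize=10001
MaxJobCount=100000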

Best,
Christoph


--
Dr. Christoph Brüning
Universität Würzburg
Rechenzentrum
Am Hubland
D-97074 Würzburg
Tel.: +49 931 31-80499

Paul Edmon

May 29, 2019, 11:16:44 AM
to slurm...@lists.schedmd.com

I believe it is still the case, but I haven't tested it.  I put this in way back when partition_job_depth was first introduced (which was eons ago now).  We run about 100 or so partitions, so this has served us well as a general rule.  What happens is that if you set partition_job_depth too deep, the scheduler may not get through all the partitions before it has to give up and start again.  This led to partition starvation in the past, where jobs were waiting to be scheduled in a partition that had space but never started because the main loop never got to them.  The backfill loop took too long to clean up, so those jobs took forever to schedule.

With the various improvements to the scheduler this may no longer be the case, but I haven't taken the time to test it on our cluster, as our current setup has worked well.
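
If you want to check how deep the loops actually get on a live system, sdiag should show it:

sdiag

Look at the "Main schedule statistics" section (mean depth cycle vs. last queue length) and the "Backfilling stats" section (depth and queue length per backfill cycle). If the main loop's depth stays consistently far below the queue length, jobs at the back of some partitions can end up waiting on the backfill loop alone.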

-Paul Edmon-
