[slurm-users] Fwd: Using PreemptExemptTime


Phil Kauffman

Feb 2, 2022, 2:12:40 PM
to slurm...@lists.schedmd.com
Does anyone have a working example using PreemptExemptTime?

My goal is to make a higher-priority job wait 24 hours before actually
preempting a lower-priority job. Put another way, any job is entitled to
24 hours of run time before being preempted. The preempted job should
ideally be suspended; if requeue is necessary, that is OK.
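For concreteness, this is roughly the configuration I was hoping would
express that goal (a sketch only, not something I have verified to work):

```
# slurm.conf (sketch): every job is guaranteed 24 hours of run time
# before it can be preempted; preempted jobs are suspended, not killed.
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
PreemptExemptTime=24:00:00
```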

It's been asked before here:
https://groups.google.com/g/slurm-users/c/mK4_M4hpXL8/m/sRhT53VYBQAJ

I've run through many iterations attempting to set `PreemptExemptTime`
in slurm.conf and in QOS.

Setting `PreemptType=preempt/partition_prio`:
- The preempted job gets suspended but `PreemptExemptTime` is ignored.

Setting `PreemptType=preempt/qos`
- Configuring inside the QOS as well as globally in slurm.conf
- `PreemptExemptTime` is respected but both jobs continue to run at the
same time using 200% of the resources, which is not wanted.


Details from my test cluster are below my signature. Any ideas on what I
should check or what I might be missing? Maybe I misunderstood something.

Cheers,

Phil



In my tests I'm using 3 mins as the PreemptExemptTime.

# Nodes
NodeName=slurm[2-5] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=2
RealMemory=1800 MemSpecLimit=200 State=UNKNOWN



### experiment using PreemptType=preempt/qos
PartitionName=DEFAULT OverSubscribe=FORCE:1 Nodes=slurm[2-4]
PartitionName=active Default=YES QOS=normal
PartitionName=hipri Default=NO QOS=expedite

PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
PreemptExemptTime=00:03:00
SchedulerParameters=preempt_strict_order
PriorityType=priority/multifactor
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# QOS
[root@slurm2 slurm-llnl]# sacctmgr show qos -p --noheader
normal|1|00:00:00||00:03:00|cluster|||1.000000||||||||||||||||||
expedite|2|00:00:00|normal|00:03:00|cluster|||1.000000||||||||||||||||||





### Experiment using PreemptType=preempt/partition_prio
PartitionName=low Default=NO OverSubscribe=NO PriorityTier=10
PreemptMode=requeue
PartitionName=med Default=NO OverSubscribe=FORCE:1 PriorityTier=20
PreemptMode=suspend
PartitionName=hi Default=NO OverSubscribe=FORCE:1 PriorityTier=30
PreemptMode=off

PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
PreemptExemptTime=00:03:00
SchedulerParameters=preempt_strict_order
PriorityType=priority/multifactor
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory


John DeSantis

Feb 3, 2022, 9:10:08 AM
to slurm...@lists.schedmd.com
Phil,

>> Does anyone have a working example using PreemptExemptTime?
>>
>> My goal is to make a higher priority job wait 24 hours before actually
>> preempting a lower priority job. Another way, any job is entitled to 24
>> hours run time before being preempted. The preempted job should be
>> suspended, ideally. If requeue is necessary that is ok.

We do and it is working as expected. Please see relevant snippets below.

> [~] $ scontrol show config|grep Preempt
> PreemptMode = CANCEL
> PreemptType = preempt/qos
> PreemptExemptTime = 00:00:00
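(If it helps, that check is easy to script; a sketch, with sample text
standing in for the live `scontrol show config` call:)

```shell
# Extract the effective PreemptExemptTime that slurmctld loaded.
# Sample output stands in for: scontrol show config | grep Preempt
config='PreemptMode               = CANCEL
PreemptType               = preempt/qos
PreemptExemptTime         = 00:00:00'
exempt=$(printf '%s\n' "$config" | awk -F' *= *' '/^PreemptExemptTime/ {print $2}')
echo "$exempt"
```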

> Name Priority PreemptExe
> -------------------- -------- ----------
> interactive 1000 01:00:00
> preempt 500 01:00:00
> preempt_short 500 00:30:00
> rchii 1000 01:00:00
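Per-QOS exempt times like those are set through sacctmgr, e.g.
(illustrative commands matching the table above, not our exact history):

```
sacctmgr modify qos where name=interactive set PreemptExemptTime=01:00:00
sacctmgr modify qos where name=preempt_short set PreemptExemptTime=00:30:00
```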


>> Details from my test cluster below my signature. Any ideas on what I
>> should check or missing? Maybe I misunderstood something.

I don't think you've missed anything. The only bit of information I can add is that we were previously using GraceTime (which requires PreemptMode=CANCEL). Unfortunately, depending on the application, it wasn't always clear from the application's output, or from the slurmctld logs, that a job had been preempted. When we switched to PreemptExemptTime, all application output and Slurm logs stated preemption as the reason.

I know you want to suspend preempted jobs, but what happens if you cancel them instead?

HTH,

John DeSantis


On 2/2/22 14:12, Phil Kauffman wrote:
> Does anyone have a working example using PreemptExemptTime?
>
> My goal is to make a higher priority job wait 24 hours before actually
> preempting a lower priority job. Another way, any job is entitled to 24
> hours run time before being preempted. The preempted job should be
> suspended, ideally. If requeue is necessary that is ok.
>
> It's been asked before here:
> https://groups.google.com/g/slurm-users/c/mK4_M4hpXL8/m/sRhT53VYBQAJ

Phil Kauffman

Feb 3, 2022, 5:12:38 PM
to slurm...@lists.schedmd.com
> I know you want to suspend preempted jobs, but what happens if you
> cancel them instead?

Thanks, John. Your response definitely helped me. I have done as you
suggested and tested CANCEL, which worked.


For John and everyone else: below are the results of my tests. My
apologies for the wall of text.

In my testing I believe I have only further confirmed that there is a
difference between what the man page says should happen and what actually
happens when attempting to use SUSPEND,GANG with PreemptType set to qos
or partition_prio.


I've verified with 'preempt/qos' that using CANCEL or REQUEUE and
launching jobs on the same partition works as you say and as the man page
describes.

Below are my tests:

For all tests the below was configured:
# sacctmgr show qos format=name,priority,preempt -p
Name|Priority|Preempt|
normal|1||
expedite|2|normal|

QOS `expedite` can preempt QOS `normal`.
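(For reference, a QOS relationship like that can be created with commands
along these lines; an illustrative sketch, not my exact history:)

```
sacctmgr -i add qos expedite
sacctmgr -i modify qos where name=expedite set Priority=2 Preempt=normal
```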


Test 1: preempt/qos, CANCEL

slurm.conf:
PreemptType: preempt/qos
PreemptMode: 'CANCEL' # requeue works with this option as well.

PreemptExemptTime: '00:00:00'

PartitionName=DEFAULT OverSubscribe=FORCE:1 Nodes=slurm[2-4]
PartitionName=active Default=YES QOS=normal
PartitionName=hipri Default=NO QOS=expedite


sacctmgr -i modify qos where name=normal set PreemptExemptTime=00:03:00
PreemptMode=CANCEL
sacctmgr -i modify qos where name=expedite set PreemptExemptTime=-1
PreemptMode=OFF



Result: PASS
'normal' QOS job gets canceled and 'expedite' job starts after waiting
for PreemptExemptTime.


Test 2: preempt/qos, REQUEUE

slurm.conf:
PreemptType: preempt/qos
PreemptMode: 'CANCEL' # requeue works with this option as well.

PreemptExemptTime: '00:00:00'

PartitionName=DEFAULT OverSubscribe=FORCE:1 Nodes=slurm[2-4]
PartitionName=active Default=YES QOS=normal
PartitionName=hipri Default=NO QOS=expedite

QOS:
sacctmgr -i modify qos where name=normal set
PreemptExemptTime=00:03:00 PreemptMode=REQUEUE
sacctmgr -i modify qos where name=expedite set PreemptExemptTime=-1
PreemptMode=OFF


Result: PASS
'normal' QOS job gets requeued and 'expedite' job starts after waiting
for PreemptExemptTime.



Test 3: preempt/qos, SUSPEND,GANG

slurm.conf
PreemptType: preempt/qos
PreemptMode: 'SUSPEND,GANG'
PreemptExemptTime: '00:00:00'

PartitionName=DEFAULT OverSubscribe=FORCE:1 Nodes=slurm[2-4]
PartitionName=active Default=YES QOS=normal
PartitionName=hipri Default=NO QOS=expedite

QOS:
sacctmgr -i modify qos where name=normal set
PreemptExemptTime=00:03:00 PreemptMode=SUSPEND
sacctmgr -i modify qos where name=expedite set PreemptExemptTime=-1
PreemptMode=OFF

This page: https://slurm.schedmd.com/preempt.html
PreemptMode > SUSPEND > NOTE


"If PreemptType=preempt/qos is configured and if the preempted job(s)
and the preemptor job are on the same partition, then they will
share resources with the Gang scheduler (time-slicing)."

Result for same partition: PASS
Submitting on the same partition with a different QOS enables the jobs to
share time on the same resource.


Now getting to the function I wanted:

"If not (i.e. if the preemptees and preemptor are on different
partitions) then the preempted jobs will remain suspended until the
preemptor ends."

Result for submitting on different, overlapping partitions: FAIL

Submitting 'normal' QOS jobs and then one 'expedite' QOS job from
another user results in both jobs running on the same node. No
suspend, requeue, or cancel occurs. This is not wanted, probably ever.

The desired behavior, suspending the preempted job, is what the man page
describes; however, I don't see that occurring.


Test 4: preempt/partition_prio, SUSPEND,GANG

slurm.conf
PreemptType: preempt/partition_prio
PreemptMode: 'SUSPEND,GANG'
PreemptExemptTime: '00:03:00'

PartitionName=active OverSubscribe=FORCE:1 PriorityTier=1
PreemptMode=suspend
PartitionName=hipri OverSubscribe=FORCE:1 PriorityTier=2 PreemptMode=off

Result: FAIL
User A's job gets preempted by user B's and gets suspended, which is
desired; however, PreemptExemptTime is not respected and the job is
preempted immediately.


I see the following possibilities:

a. The man page does *not* accurately describe the behavior, or my
interpretation of it was incorrect.
b. I have something misconfigured.
c. I have found a bug.

Cheers,

Phil
