[slurm-users] Priority access for a group of users


David Baker

Feb 15, 2019, 4:08:03 AM
to slurm...@lists.schedmd.com

Hello.


We have a small set of compute nodes owned by a group. The group has agreed that the rest of the HPC community can use these nodes providing that they (the owners) can always have priority access to the nodes. The four nodes are well provisioned (1 TByte memory each plus 2 GRID K2 graphics cards) and so there is no need to worry about preemption. In fact I'm happy for the nodes to be used as well as possible by all users. It's just that jobs from the owners must take priority if resources are scarce.  


What is the best way to achieve the above in Slurm? I'm planning to place the nodes in their own partition. The node owners will have priority access to the nodes in that partition, but will have no advantage when submitting jobs to the public resources. Does anyone have any ideas on how best to deal with this, please?


Best regards,

David


Marcus Wagner

Feb 15, 2019, 7:20:37 AM
to slurm...@lists.schedmd.com
Hi David,

As far as I know, you can use the PriorityTier partition parameter to achieve this. According to the man pages (if I remember right), jobs from higher priority tier partitions take precedence over jobs from lower priority tier partitions, without taking the normal fairshare priority into consideration.
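
In slurm.conf that would look roughly like the following, with both partitions covering the same nodes (node and partition names here are just examples):

PartitionName=owners Nodes=node[001-004] PriorityTier=10 State=UP
PartitionName=shared Nodes=node[001-004] PriorityTier=1  State=UP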

Best
Marcus
-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Paul Edmon

Feb 15, 2019, 10:06:23 AM
to slurm...@lists.schedmd.com

Yup, PriorityTier is what we use to do exactly that here. That said, unless you turn on preemption, jobs may still pend if there is no space. We run with REQUEUE on, which has worked well.
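
Roughly, the global side of that is just a couple of lines in slurm.conf (a sketch; see the preemption docs for the full set of options):

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE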


-Paul Edmon-

david baker

Feb 15, 2019, 11:57:26 AM
to Slurm User Community List
Hi Paul, Marcus,

Thank you for your replies. Using partition priority all makes sense. I was thinking of doing something similar with a set of nodes purchased by another group. That is, having a private high priority partition and a lower priority "scavenger" partition for the public. In this case scavenger jobs will get killed when preempted. 

In the present case, I did wonder if it would be possible to do something with just a single partition -- hence my question. Your replies have convinced me that two partitions will work -- with preemption leading to re-queued jobs.

Best regards,
David 

Henkel, Andreas

Feb 18, 2019, 2:28:38 AM
to Slurm User Community List
Hi David,

I think there is another option if you don't want to use preemption. If the max run limit is small (several hours, for example), working without preemption may be acceptable.
Assign a QOS with a priority boost to the owners of the nodes. Then whenever they submit jobs to the partition, their jobs go to the top of the queue.
This only works if there is one dedicated partition for those nodes, accessible to all users of course.
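
As a rough sketch (QOS name, account name and priority value are made up, and PriorityWeightQOS has to be non-zero in slurm.conf for the boost to count):

# create the QOS with a large priority value
sacctmgr add qos owners Priority=10000
# allow the owning group's account to use it
sacctmgr modify account owner_account set qos+=owners
# owners then submit with: sbatch --qos=owners ...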

Best,
Andreas 

Marcus Wagner

Feb 18, 2019, 3:08:33 AM
to slurm...@lists.schedmd.com
Hi Andreas,


doesn't it suffice to use priority tier partitions? You don't need to use preemption at all, do you?


Best
Marcus

Henkel

Feb 18, 2019, 4:53:04 AM
to slurm...@lists.schedmd.com

Hi Marcus,

Sure, using PriorityTier is fine. My point wasn't so much about preemption, but rather about using just one partition and no preemption instead of two partitions, which is what David was asking for, isn't it? But actually, I forgot that you can do it with one partition too by using preempt/qos, though we haven't used that ourselves.
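
A rough sketch of that variant, in slurm.conf:

PreemptType=preempt/qos
PreemptMode=REQUEUE

and in the accounting database, letting the owners' QOS preempt the scavenger QOS (QOS names invented):

sacctmgr modify qos owners set Preempt=scavenger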

Best,

Andreas

-- 
Dr. Andreas Henkel
Operativer Leiter HPC
Zentrum für Datenverarbeitung
Johannes Gutenberg Universität
Anselm-Franz-von-Bentzelweg 12
55099 Mainz
Telefon: +49 6131 39 26434
OpenPGP Fingerprint: FEC6 287B EFF3
7998 A141 03BA E2A9 089F 2D8E F37E

Prentice Bisbal

Feb 19, 2019, 9:13:15 AM
to slurm...@lists.schedmd.com

I just set this up a couple of weeks ago myself. Creating two partitions is definitely the way to go. I created one partition, "general" for normal, general-access jobs, and another, "interruptible" for general-access jobs that can be interrupted, and then set PriorityTier accordingly in my slurm.conf file (Node names omitted for clarity/brevity).

PartitionName=general Nodes=... MaxTime=48:00:00 State=Up PriorityTier=10 QOS=general
PartitionName=interruptible Nodes=... MaxTime=48:00:00 State=Up PriorityTier=1 QOS=interruptible

I then set PreemptMode=Requeue, because I'd rather have jobs requeued than suspended, and it's been working great. There are a few other settings I had to change; the best documentation for everything you need to change is https://slurm.schedmd.com/preempt.html
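
For anyone copying this: the partition QOSes referenced above also need to exist in the accounting database, something like (a sketch, limits omitted):

sacctmgr add qos general
sacctmgr add qos interruptible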

Everything has been working exactly as desired and advertised. My users who needed the ability to run low-priority, long-running jobs are very happy.

The one caveat is that jobs that will be killed and requeued need to support checkpoint/restart. So when this becomes a production thing, users are going to have to acknowledge that they will only use this partition for jobs that have some sort of checkpoint/restart capability.

Prentice 

david baker

Mar 1, 2019, 7:21:59 AM
to Slurm User Community List
Hello,

Following up on implementing preemption in Slurm. Thank you again for all the advice. After a short break I've been able to run some basic experiments. Initially, I have kept things very simple and made the following changes in my slurm.conf...

# Preemption settings
PreemptType=preempt/partition_prio
PreemptMode=requeue

PartitionName=relgroup nodes=red[465-470] ExclusiveUser=YES MaxCPUsPerNode=40 DefaultTime=02:00:00 MaxTime=60:00:00 QOS=relgroup State=UP AllowAccounts=relgroup Priority=10 PreemptMode=off

# Scavenger partition
PartitionName=scavenger nodes=red[465-470] ExclusiveUser=YES MaxCPUsPerNode=40 DefaultTime=00:15:00 MaxTime=02:00:00 QOS=scavenger State=UP AllowGroups=jfAccessToIridis5 PreemptMode=requeue

The nodes in the relgroup queue are owned by the General Relativity group and, of course, they have priority access to these nodes. The general population can scavenge these nodes via the scavenger queue. When I use "PreemptMode=cancel" I'm happy that the relgroup jobs can preempt the scavenger jobs (and the scavenger jobs are cancelled). When I set the preempt mode to "requeue", I see that the scavenger jobs are still cancelled/killed. Have I missed an important configuration change, or is it that lower priority jobs will always be killed and not re-queued?

Could someone please advise me on this issue? Also, I'm wondering if I really understand the "requeue" option: does it mean the job is re-queued and run from the beginning, or resumed from its current state (which would need checkpointing)?

Best regards,
David

Antony Cleave

Mar 1, 2019, 9:45:59 AM
to Slurm User Community List
I have always assumed that cancel just kills the job whereas requeue will cancel and then start from the beginning. I know that requeue does this. I never tried cancel.

I'm a fan of suspend mode myself, but that depends on users not asking for all the RAM by default. If you can educate the users then this works really well: the low priority job stays in RAM, suspended, while the high priority job completes, and then the low priority job continues from where it stopped. No checkpoints and no killing.
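
If you want to try it, suspend-based preemption needs gang scheduling enabled; roughly, in slurm.conf:

PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG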

Antony 


Thomas M. Payerle

Mar 1, 2019, 10:05:18 AM
to Slurm User Community List
My understanding is that with PreemptMode=requeue, the running scavenger job's processes on the node will be killed, but the job will be placed back in the queue (assuming the job's own parameters allow this; a job can have the --no-requeue flag set, in which case I assume it behaves the same as PreemptMode=cancel).

When a job which has been requeued starts up a second (or Nth) time, I believe Slurm basically just reruns the job script. If the job did not do any checkpointing, this means the job starts from the very beginning. If the job does checkpointing in some fashion, then depending on how the checkpointing was implemented and the cluster environment, the script might or might not have to check for the existence of checkpoint data in order to resume at the last checkpoint.
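
For illustration only (the application name, its flags and the checkpoint file are invented), a requeue-friendly batch script might look like:

#!/bin/bash
#SBATCH --requeue
#SBATCH --partition=scavenger

# if a previous (preempted) run left a checkpoint behind, resume from it
if [ -f checkpoint.dat ]; then
    ./my_app --resume checkpoint.dat
else
    ./my_app --write-checkpoints checkpoint.dat
fi
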
--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads        pay...@umd.edu
5825 University Research Park               (301) 405-6135
University of Maryland
College Park, MD 20740-3831

Michael Gutteridge

Mar 1, 2019, 10:44:23 AM
to Slurm User Community List

Along those lines, there is the slurm.conf setting _JobRequeue_, which controls whether jobs can be re-queued by default.
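
That's a one-liner in slurm.conf (1 = requeue allowed by default, 0 = not), and individual jobs can override it with sbatch --requeue or --no-requeue:

JobRequeue=1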

 - Michael

Mark Hahn

Mar 1, 2019, 3:47:56 PM
to Slurm User Community List
> I'm a fan of the suspend mode myself but that is dependent on users not
> asking for all the ram by default. If you can educate the users then this
> works really well as the low priority job stays in ram in suspended mode
> while the high priority job completes and then the low priority job
> continues from where it stopped. No checkpoints and no killing.

Me too - in fact, I'm not afraid of swap space, so I don't mind if a suspended
job gets swapped out (it won't thrash).

It's incredibly useful to support "debug" jobs, which are expected to run
only briefly, but for which someone is waiting.

In most of the previous schedulers we've used (which shall not be named),
we often ran into problems with this: once the victim was suspended, the
scheduler would start other jobs, not just the preemptor, on the resources
made available - sometimes making it impossible to resume the victim,
depending on the mixture of new jobs/sizes/priorities.

In principle, the fix for this would be either to only permit the single
preemptor onto the victim's resources, or at least to backfill only into
the bubble caused by the preemptor (no more than an hour).

I would be interested to know whether other Slurm sites do this successfully,
particularly in avoiding the victim-stays-suspended priority inversion.

thanks,
--
Mark Hahn | SHARCnet Sysadmin | ha...@sharcnet.ca | http://www.sharcnet.ca
| McMaster RHPCS | ha...@mcmaster.ca | 905 525 9140 x24687
| Compute/Calcul Canada | http://www.computecanada.ca

david baker

Mar 4, 2019, 11:52:30 AM
to Slurm User Community List
Hello,

Thank you for reminding me about the sbatch "--requeue" option. When I submit test jobs using this option, the preemption and subsequent restart of a job works as expected. I've also played around with "preemptmode=suspend", and that works too; however, I suspect we won't use that on these "diskless" nodes.
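
For reference, the test submissions were along these lines (script name invented):

sbatch --requeue --partition=scavenger test_job.sh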

As I note, I can scavenge resources and preempt jobs myself (I am a member of both the "relgroup" and the general public). That is...

            347104 scavenger    myjob     djb1 PD       0:00      1 (Resources)
            347105  relgroup    myjob     djb1  R      17:00      1 red465

On the other hand, I do not seem to be able to preempt a job submitted by a colleague. That is, my colleague submits a job to the scavenger queue and it starts to run. I then submit a job to the relgroup queue; however, that job fails to preempt my colleague's job and stays in pending state.

Does anyone understand what might be wrong, please? 

Best regards,
David

Michael Gutteridge

Mar 6, 2019, 8:09:42 AM
to Slurm User Community List
It is likely that your job still does not have enough priority to preempt the scavenger job. Have a look at the output of `sprio` to see the priority of those jobs and which factors are in play. It may be necessary to increase the partition priority or adjust some of the job priority factors to get the behavior you're wanting.
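
For example, using a job ID in the style of your earlier squeue output:

sprio -l               # priority factors for all pending jobs
sprio -l -j 347104     # or for one specific pending job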

 - Michael