[slurm-dev] Job priority calculation when submitted to multiple partitions with different priorities

Corey Keasling

Aug 11, 2017, 3:51:41 PM
to slurm-dev

Hi Slurm-Dev,

I'm trying to determine how a job's multifactor priority is calculated
when the job is submitted to multiple partitions where each partition
has a different priority factor. I'm running 16.05.6 with ill-defined
plans to move to 17.02.

My cluster is partitioned such that one partition is a subset of another
with the subset having a 10x higher PriorityJobFactor. The intent is to
give greater priority on the subset to the group that purchased it while
allowing all users to run on all nodes. Thus I hope to permit the
privileged group to submit jobs to both partitions simultaneously, but
to have their greater priority apply only to the subset. However, based
on squeue and sprio, this may not be happening.
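
For concreteness, here is a rough sketch of the per-partition result this setup is hoping for. It assumes the documented multifactor behaviour in which each partition's PriorityJobFactor is normalized against the largest PriorityJobFactor on the cluster and then scaled by PriorityWeightPartition. The factor values, the weight, and the `other_terms` placeholder (standing in for age, fairshare, and the rest) are made up for illustration; this is a toy model, not Slurm code.

```python
# Toy model of the partition term only; not Slurm source code.
# Hypothetical values: p1 is the purchased subset, p2 is the whole cluster.
PRIORITY_JOB_FACTOR = {"p1": 10, "p2": 1}
PRIORITY_WEIGHT_PARTITION = 1000  # assumed PriorityWeightPartition setting


def partition_priorities(partitions, other_terms=0):
    """Return the priority a job would ideally have in each requested partition."""
    max_factor = max(PRIORITY_JOB_FACTOR.values())
    result = {}
    for name in partitions:
        # Normalize against the largest factor, then scale by the weight.
        # Integer arithmetic keeps the toy numbers exact.
        partition_term = PRIORITY_WEIGHT_PARTITION * PRIORITY_JOB_FACTOR[name] // max_factor
        result[name] = other_terms + partition_term
    return result


print(partition_priorities(["p1", "p2"], other_terms=50))
```

With these made-up numbers the same job would rank at 1050 when considered for p1 and at 150 when considered for p2, which is the distinction the rest of this message is asking about.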

squeue -P reports identical priorities for both entries (i.e., the same
job but considered for p1 and p2). sprio seems to report the priority
as calculated for the first partition in the list (i.e., if submitted
via sbatch -p p1,p2 the job gets the p1 priority factor, while sbatch
-p p2,p1 gives the p2 priority factor).

So what's actually going on under the hood? Does the scheduler
calculate priorities for each (job,partition) pair separately, or only
once?

Thank you for your help!

--
Corey Keasling
Software Manager
JILA Computing Group
University of Colorado-Boulder
440 UCB Room S244
Boulder, CO 80309-0440
303-492-9643

Corey Keasling

Aug 11, 2017, 4:39:53 PM
to slurm-dev

Hello again,

Looks like I'll make more definite plans to upgrade. Per the Changelog
for 17.02.3:

-- Fix updating job priority on multiple partitions to be correct.

Corey

Corey Keasling

Aug 14, 2017, 3:17:13 PM
to slurm-dev

Once more, hello Slurm-Dev,

The problem remains after upgrading to 17.02.6 today. A job submitted
to multiple partitions and pending for Resources has a single priority
which reflects the PriorityJobFactor of the partition that is first in
the list. Is this a bug? I spent a while digging through the bug
tracker and couldn't find anything, although changelog entries for 17.11
might be relevant. Thoughts?

Thank you!

Corey

Skouson, Gary B

Aug 15, 2017, 11:34:32 AM
to slurm-dev
I've also seen that. I'm not sure it's a "bug". It's just a result of the current structure of the code.

The job structure doesn't have a place to put multiple priorities, so it seems like you end up with whichever priority was checked last during scheduling.

If you plow through the multifactor backfill stuff, it actually checks the job with each of the appropriate priorities for each of the partitions, but it seems to leave the value in the job structure of whatever gets checked last. Since the lowest priority is checked last, that usually ends up as the job priority in the job structure.

There are reasons why the backfill code may not get through all the jobs, so it's not always the case that the lowest priority is the one that sticks in the job record, but that seems to be the usual result.

With our configuration, I can submit a set of jobs that can't start right away and the priorities look like:

JOBID   PARTITION  USER    ACCOUNT NOD ST TIME_LEFT START_TIME SUBMIT_TIME         PRIOR NODELIST(REASON)
1878958 large,bkfi skouson mscfops 256 PD 8:00:00   N/A        2017-08-15T08:07:41 24000 (None)
1878959 large,bkfi skouson mscfops 256 PD 8:00:00   N/A        2017-08-15T08:07:41 24000 (None)
1878960 large,bkfi skouson mscfops 256 PD 8:00:00   N/A        2017-08-15T08:07:41 24000 (None)
1878961 large,bkfi skouson mscfops 256 PD 8:00:00   N/A        2017-08-15T08:07:41 24000 (None)
1878962 large,bkfi skouson mscfops 256 PD 8:00:00   N/A        2017-08-15T08:07:41   300 (None)
1878963 large,bkfi skouson mscfops 256 PD 8:00:00   N/A        2017-08-15T08:07:41   300 (None)

The large partition requires a QOS with a 4 job limit, but the bkfill partition (which has a lower priority that doesn't allow resource reservation) can run lots of jobs. The initial squeue immediately after submission shows the priorities above. Sometimes the initial squeue shows some of each priority and sometimes it's all the same priority. I'm not sure why that is.

After the backfill scheduler runs, each job gets an estimated start time, since they're in the large partition with a priority above the bf_min_prio_reserve threshold. However, the job priority is only 304, which is the priority backfill checks last. Running sprio also reports the same priority of 304.

I think this could be fixed to list each job/partition combo with its own priority. However, I didn't see an easy way to do that without changes to the job structure.
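
The behaviour described above can be sketched with a toy model: a job record with a single priority slot that is overwritten each time the job is evaluated for another partition. This is only an illustration of the description in this message, not Slurm's actual data structure or scheduler loop. `JobRecord`, `scheduling_pass`, and the `per_partition` field are hypothetical names; the two priority values are the ones from the squeue output above.

```python
# Toy model of the described behaviour; not Slurm source code.
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    job_id: int
    partitions: list   # partition names the job was submitted to
    priority: int = 0  # single slot shared by every partition
    # Hypothetical extra field: what per-partition reporting would need.
    per_partition: dict = field(default_factory=dict)


def scheduling_pass(job, partition_priority):
    """Evaluate the job once per partition, highest priority first."""
    order = sorted(job.partitions, key=lambda p: partition_priority[p], reverse=True)
    for name in order:
        # Each evaluation uses the right value but overwrites the single slot.
        job.priority = partition_priority[name]
        job.per_partition[name] = partition_priority[name]
    return job


job = scheduling_pass(
    JobRecord(1878958, ["large", "bkfill"]),
    {"large": 24000, "bkfill": 300},
)
print(job.priority)       # value left behind by the last partition checked
print(job.per_partition)  # what a per-partition report could show instead
```

In this model the job is still considered for the large partition at 24000, but anything that reads the single `priority` slot afterwards sees only the value from the lowest-priority partition.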

-----
Gary Skouson

Moe Jette

Aug 15, 2017, 12:32:37 PM
to slurm-dev

Per-partition priority information will be available in Slurm version
17.11.

Corey Keasling

Aug 15, 2017, 2:51:28 PM
to slurm-dev

Okay, that makes sense. It wasn't clear whether it was a reporting
issue or an actual calculation issue. Since it seems the priority
calculation itself is fine and its results are only misreported by
omission, it sounds like my scheme will work. I'll look forward to
17.11 for proper reporting.

Thank you!

Corey

Corey Keasling

Aug 15, 2017, 2:52:33 PM
to slurm-dev

Great. I'll upgrade to 17.11.1 when it's released :-)

Corey