[slurm-users] associations, limits,qos

Nizar Abed

Jan 23, 2021, 12:47:37 AM
to Slurm User Community List
Hi list,

I’m trying to enforce limits based on associations, but behavior is not as expected.

In slurm.conf:
AccountingStorageEnforce=associations,limits,qos

Two partitions:
part1.q
part2.q

One user:
user1

One QOS:
qos1
MaxJobsPU is not set


I’d like to have an association for user1 for each partition, with the same QOS:

      User   Def Acct   Admin   Cluster    Account  Partition  Share  MaxJobs    QOS
     user1   account1    None       cl1   account1    part1.q      1        3   qos1
     user1   account1    None       cl1   account1    part2.q      1        4   qos1
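
(Roughly how such per-partition associations can be created: a minimal sacctmgr sketch, assuming account1 already exists on cluster cl1 and qos1 is already defined.)

sacctmgr add user user1 account=account1 partition=part1.q qos=qos1 maxjobs=3
sacctmgr add user user1 account=account1 partition=part2.q qos=qos1 maxjobs=4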







User1 submits 6 jobs to part2.q:
4 start running
2 are pending (AssocMaxJobsLimit)

User1 submits 6 jobs to part1.q:
3 start running
3 are pending (AssocMaxJobsLimit)

This is the expected behavior.
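
(For reference, the per-job partition, state, and pending reason can be checked with something like the following squeue format; the exact columns are just one choice:)

squeue -u user1 -o "%.10i %.12P %.4t %.24r"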

But when user1 submits 12 jobs like this:

sbatch -p part1.q,part2.q slurm-job.sh

only 3 jobs run on part1.q (under the part1.q association);
the other 9 jobs are pending with AssocMaxJobsLimit.

Why don't 4 jobs start on part2.q?

Worst case (listing part2.q before part1.q):
sbatch -p part2.q,part1.q slurm-job.sh
4(!) jobs run on part1.q


Is it possible to let a user submit to multiple partitions and have Slurm pick the correct association for each partition?
What am I missing here?


Thanks,
Nizar




Durai Arasan

Jan 25, 2021, 8:48:08 AM
to Slurm User Community List
Hi,

Jobs submitted with sbatch cannot run on multiple partitions. The job will be submitted to the partition where it can start first. (from sbatch reference)

Best,
Durai

Nizar Abed

Jan 25, 2021, 9:04:41 AM
to Slurm User Community List
Hi,

Right, I understand that. What I'm describing is:

If a job is submitted to multiple partitions (-p part1, part2...), then when the required resources become available in a partition where the job can run, I'd expect the job to be dispatched under that partition's association (QOS and limits), but that's not the case.
Jobs start running on a partition with a QOS that apparently has higher priority, although there is no association entry for that partition/QOS combination.
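
(One way to verify which partition, account, and QOS a job was actually accounted under, using standard sacct fields; the field list is just a suggestion:)

sacct -u user1 -X -o JobID,Partition,Account,QOS,State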

Similar case:

All the best,
Nizar

Diego Zuccato

Jan 29, 2021, 2:47:22 AM
to Slurm User Community List, Durai Arasan
On 25/01/21 14:46, Durai Arasan wrote:

> Jobs submitted with sbatch cannot run on multiple partitions. The job
> will be submitted to the partition where it can start first. (from
> sbatch reference)
Did I misunderstand, or can heterogeneous jobs work around this limitation?

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Diego Zuccato

Jan 29, 2021, 3:40:56 AM
to Slurm User Community List, Durai Arasan
On 29/01/21 08:47, Diego Zuccato wrote:

>> Jobs submitted with sbatch cannot run on multiple partitions. The job
>> will be submitted to the partition where it can start first. (from
>> sbatch reference)
> Did I misunderstand, or can heterogeneous jobs work around this limitation?
My quick test seems to confirm that it works in 18.08 (as packaged in
Debian Buster).

My test jobscript:
-8<--
#!/bin/bash
#SBATCH --time 1
#SBATCH --cpus-per-task=1 --mem-per-cpu=1g --ntasks=2 --constraint=blade
#SBATCH packjob
#SBATCH --cpus-per-task=1 --mem-per-cpu=1g --ntasks=4 --constraint=matrix
srun --label : hostname
-8<--
Its output:
-8<--
1: str957-bl0-01
0: str957-bl0-01
3: str957-mtx-01
5: str957-mtx-01
4: str957-mtx-01
2: str957-mtx-01
-8<--

bl-* and mtx-* nodes are in disjoint partitions:
-8<--
$ scontrol show partitions
PartitionName=b1
[...]
Nodes=str957-bl0-[01-02]
[...]
PartitionName=m1
[...]
Nodes=str957-mtx-[00-15]
-8<--
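
A possible next step (untested, just a sketch): each heterogeneous job component can request its own partition explicitly, so in principle the per-partition association limits should apply per component, e.g.:
-8<--
#!/bin/bash
#SBATCH --time 1
#SBATCH --partition=b1 --cpus-per-task=1 --mem-per-cpu=1g --ntasks=2
#SBATCH packjob
#SBATCH --partition=m1 --cpus-per-task=1 --mem-per-cpu=1g --ntasks=4
srun --label hostname : hostname
-8<--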