[slurm-users] Association limit problem


Gestió Servidors via slurm-users

Apr 17, 2024, 7:44:28 AM
to slurm...@lists.schedmd.com

Hello,

 

I’m running some tests with associations in “sacctmgr”. I have created three users (user_1, user_2 and user_3), and for each of them I have created one or more associations:

 

[root@myserver log]# sacctmgr show user user_1 --associations

      User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS

---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------

    user_1       test      None     q50004       test    aolin.q         1                  4        2       10                                                 normal

    user_1       test      None     q50004       test cuda-staf+         1                  4        2       10                                                 normal

 

[root@myserver log]# sacctmgr show user user_2 --associations

      User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS

---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------

    user_2       test      None     q50004       test cuda-int.q         1                                    4                                                 normal

 

[root@myserver log]# sacctmgr show user user_3 --associations

      User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS

---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------

    user_3       test      None     q50004       test research.q         1                           2        1                                                 normal

    user_3       test      None     q50004       test     xeon.q         1                           2        1                                                 normal

 

All users belong to the “test” account:

[root@myserver log]# sacctmgr show account test --association

   Account                Descr                  Org    Cluster ParentName       User     Share   Priority GrpJobs GrpNodes  GrpCPUs  GrpMem GrpSubmit     GrpWall  GrpCPUMins MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS

---------- -------------------- -------------------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------

      test                 test                 test     q50004       root                    1                                                                                                                                                          normal

      test                 test                 test     q50004                user_1         1                                                                                      4        2       10                                                 normal

      test                 test                 test     q50004                user_1         1                                                                                      4        2       10                                                 normal

      test                 test                 test     q50004                user_2         1                                                                                                        4                                                 normal

      test                 test                 test     q50004                user_3         1                                                                                               2        1                                                 normal

      test                 test                 test     q50004                user_3         1                                                                                               2        1                                                 normal

 

 

When I submit with “user_1”, all tests are running fine: some jobs are queued and executed and some jobs are rejected because of the limits.

However, with “user_2” and “user_3” I can’t get jobs to run. They are either rejected at submission or stay pending with this reason (squeue output):

     11168 research.     test          user_3  PENDING         0:00  2024-04-17T12:53:21                  N/A    1    1     OK                  N/A (AssocMaxCpuPerJo (null)

     11173 research.     test          user_3  PENDING         0:00  2024-04-17T13:06:02                  N/A    1    1     OK                  N/A (AssocMaxCpuPerJo (null)

     11174 research.     test          user_3  PENDING         0:00  2024-04-17T13:06:16                  N/A    1    1     OK                  N/A (AssocMaxCpuPerJo (null)

     11176 research.     test          user_3  PENDING         0:00  2024-04-17T13:07:23                  N/A    1    1     OK                  N/A (AssocMaxCpuPerJo (null)

     11180 research.     test          user_3  PENDING         0:00  2024-04-17T13:08:45                  N/A    1    1     OK                  N/A (AssocMaxCpuPerJo (null)

 

For example, “user_3” is trying to submit jobs like this (the test.sh script is just a simple “sleep 50”):

sbatch -p aolin.q -N 2 ./test.sh -> sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

sbatch -p aolin.q -N 1 ./test.sh -> sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

sbatch -p research.q -N 1 ./test.sh -> submitted but not running -> nodelist(reason)= (AssocMaxCpuPerJobLimit) -> WHY???

sbatch -p research.q -N 1 -n 1 ./test.sh -> submitted but not running -> nodelist(reason)= (AssocMaxCpuPerJobLimit) -> WHY???

sbatch -p xeon.q -N 1 -n 1 ./test.sh -> submitted and running!!

 

[root@myserver log]# squeue

     JOBID PARTITION     NAME            USER    STATE         TIME          SUBMIT_TIME           START_TIME NODE CPUS OVER_S        TRES_PER_NODE NODELIST(REASON)  DEPENDENCY        REQ_NODES   NODELIST

     11202 research.     test          user_3  PENDING         0:00  2024-04-17T13:33:31                  N/A    1    1     OK                  N/A (AssocMaxCpuPerJo (null)

     11200 research.     test          user_3  PENDING         0:00  2024-04-17T13:33:17                  N/A    1    1     OK                  N/A (AssocMaxCpuPerJo (null)

     11212    xeon.q     test          user_3  RUNNING         0:18  2024-04-17T13:36:10  2024-04-17T13:36:10    1    1     OK                  N/A aolin-cpu-1       (null)             aolin-cpu-1
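To make my expectation explicit, here is a minimal Python sketch of how I assumed the association limits for “user_3” would be applied. It is only my own simplified model, not Slurm's actual logic: the function name and the dictionary layout are made up for illustration, and the limit values are the MaxNodes and MaxCPUs shown in the sacctmgr output above.

```python
# Simplified model of my expectation (an assumption, NOT Slurm's real logic).
# Limits copied from the sacctmgr output for user_3: MaxNodes=2, MaxCPUs=1.
ASSOCIATIONS = {
    ("user_3", "research.q"): {"max_nodes": 2, "max_cpus": 1},
    ("user_3", "xeon.q"): {"max_nodes": 2, "max_cpus": 1},
}


def expected_outcome(user, partition, nodes, cpus):
    """Return what I expected to happen to a job (hypothetical helper)."""
    limits = ASSOCIATIONS.get((user, partition))
    if limits is None:
        # No association for this user/partition pair.
        return "rejected: invalid account/partition combination"
    if nodes > limits["max_nodes"]:
        return "blocked: MaxNodes exceeded"
    if cpus > limits["max_cpus"]:
        return "blocked: MaxCPUs exceeded"
    return "runs"


# The three kinds of submission from above, each with 1 node and 1 CPU.
for partition in ("aolin.q", "research.q", "xeon.q"):
    print(partition, "->", expected_outcome("user_3", partition, 1, 1))
```

Under this model the rejection on aolin.q makes sense, because “user_3” has no association for that partition. But research.q and xeon.q have identical limits, and squeue reports 1 node and 1 CPU for the pending jobs, so I expected both to run.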

 

Why? What am I doing wrong? Where is the limit that I am not seeing?

 

 

Thanks a lot!

 
