[slurm-users] Extremely sluggish squeue -p partition


Williams, Jenny Avis

Dec 7, 2020, 11:43:47 PM
to Slurm User Community List

I have an interesting condition that has been going on for a few days, and I could use feedback from those more familiar with how Slurm works under the hood.

Conditions :

Slurm v20.02.3. The cluster is relatively quiet given the time of year, and the commands are being run on the host where slurmctld is running, so the queries should not be leaving the host. sdiag output is included at the end.

 

Problem:

Queries on partitions by either squeue or sacct are very slow, even squeue on an empty partition!  The exceptions are:

  sacct queries for running jobs

  squeue on a jobID

  squeue on a user

 

Question:

What is distinct about squeue queries on a partition, compared with sacct on running jobs or squeue on a user, that might account for this behavior?

 

# Setup

# Slow cases

squeue_running="squeue -p general -t running|wc -l"

squeue_pending="squeue -p general -t pending|wc -l"

sacct_pending="sacct -r general -s pending|wc -l"

empty_squeue="squeue -p staff |wc -l"

# Fast cases

sacct_running="sacct -r general -s running|wc -l"

squeue_user="squeue -u auser |wc -l"

# time eval $squeue_pending

17642

 

real    0m30.771s

user    0m0.392s

sys     0m0.835s

 

# time eval $squeue_running

5175

 

real    0m31.137s

user    0m0.304s

sys     0m0.831s

 

# time eval $empty_squeue

1

 

real    0m31.563s

user    0m0.279s

sys     0m0.844s

 

# time eval $sacct_pending

17529

 

real    0m25.901s

user    0m0.332s

sys     0m0.175s

#  time eval $sacct_running

15541

 

real    0m1.135s

user    0m0.149s

sys     0m0.062s

 

# time eval $squeue_user 

3086

 

real    0m0.422s

user    0m0.029s

sys     0m0.031s

 

# time sacct -r staff

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode

------------ ---------- ---------- ---------- ---------- ---------- --------

 

real    0m0.209s

user    0m0.007s

sys     0m0.002s

*******************************************************

sdiag output at Mon Dec 07 16:25:37 2020 (1607376337)

Data since      Mon Dec 07 15:09:07 2020 (1607371747)

*******************************************************

Server thread count:  16

Agent queue size:     0

Agent count:          0

Agent thread count:   0

DBD Agent queue size: 0

 

Jobs submitted: 3550

Jobs started:   5064

Jobs completed: 4683

Jobs canceled:  83

Jobs failed:    0

 

Job states ts:  Mon Dec 07 16:24:53 2020 (1607376293)

Jobs pending:   39099

Jobs running:   5915

 

Main schedule statistics (microseconds):

        Last cycle:   96597

        Max cycle:    1992372

        Total cycles: 338

        Mean cycle:   404511

        Mean depth cycle:  1195

        Cycles per minute: 4

        Last queue length: 31916

 

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)

        Total backfilled jobs (since last slurm start): 4610

        Total backfilled jobs (since last stats cycle start): 4610

        Total backfilled heterogeneous job components: 0

        Total cycles: 71

        Last cycle when: Mon Dec 07 16:24:25 2020 (1607376265)

        Last cycle: 23893637

        Max cycle:  43863303

        Mean cycle: 22141327

        Last depth cycle: 22840

        Last depth cycle (try sched): 1318

        Depth Mean: 23541

        Depth Mean (try depth): 1299

        Last queue length: 31917

        Queue length mean: 31864

        Last table size: 164

        Mean table size: 114

 

Latency for 1000 calls to gettimeofday(): 19 microseconds

 

Remote Procedure Call statistics by message type

        REQUEST_PARTITION_INFO                  ( 2009) count:16568  ave_time:314    total_time:5202676

        REQUEST_FED_INFO                        ( 2049) count:7822   ave_time:194    total_time:1520744

        REQUEST_JOB_USER_INFO                   ( 2039) count:7756   ave_time:70747  total_time:548714078

        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:5626   ave_time:119832 total_time:674179337

        REQUEST_COMPLETE_PROLOG                 ( 6018) count:5068   ave_time:2126092 total_time:10775038166

        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:4774   ave_time:1044194 total_time:4984985393

        REQUEST_STEP_COMPLETE                   ( 5016) count:4770   ave_time:1149196 total_time:5481666326

        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:4746   ave_time:4107848 total_time:19495849569

        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:3180   ave_time:893436 total_time:2841126933

        REQUEST_NODE_INFO                       ( 2007) count:2994   ave_time:61229  total_time:183321804

        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:369    ave_time:365319 total_time:134802881

        REQUEST_JOB_INFO                        ( 2003) count:219    ave_time:5586738 total_time:1223495821

        REQUEST_KILL_JOB                        ( 5032) count:83     ave_time:3698948 total_time:307012762

        REQUEST_PRIORITY_FACTORS                ( 2026) count:48     ave_time:302008 total_time:14496396

        REQUEST_UPDATE_NODE                     ( 3002) count:30     ave_time:234937 total_time:7048123

        REQUEST_BUILD_INFO                      ( 2001) count:27     ave_time:102471 total_time:2766731

        REQUEST_JOB_READY                       ( 4019) count:23     ave_time:128717 total_time:2960508

        REQUEST_PING                            ( 1008) count:15     ave_time:139    total_time:2088

        REQUEST_CANCEL_JOB_STEP                 ( 5005) count:8      ave_time:1320084 total_time:10560677

        REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:7      ave_time:2417071 total_time:16919499

        REQUEST_NODE_INFO_SINGLE                ( 2040) count:6      ave_time:189125 total_time:1134750

        REQUEST_RESOURCE_ALLOCATION             ( 4001) count:6      ave_time:4771306 total_time:28627836

        REQUEST_UPDATE_PARTITION                ( 3005) count:6      ave_time:633112 total_time:3798672

        REQUEST_JOB_STEP_CREATE                 ( 5001) count:5      ave_time:1015017 total_time:5075087

        REQUEST_STATS_INFO                      ( 2035) count:0      ave_time:0      total_time:0

..

Pending RPC statistics

        No pending RPCs

 

Bas van der Vlies

Dec 8, 2020, 3:55:00 AM
to Slurm User Community List
Just a question: do you use AllowGroups for partition access? We had similar problems where squeue and other Slurm commands hang while Slurm fetches the groups from LDAP for each partition, including ones it has already fetched. So we configured `nscd` to cache the lookups and prevent this.
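
For anyone hitting the same thing: roughly, the relevant part of /etc/nscd.conf is enabling the group cache so the repeated group lookups are answered locally instead of going to LDAP every time. The values below are only illustrative defaults, not necessarily what we run:

    # /etc/nscd.conf (excerpt) - cache group lookups locally
    enable-cache            group   yes
    positive-time-to-live   group   3600    # seconds a successful group lookup is cached
    negative-time-to-live   group   60      # seconds a miss is cached
    check-files             group   yes     # invalidate the cache when /etc/group changes

After editing, restart nscd (e.g. systemctl restart nscd) so the new cache settings take effect.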


--
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG Amsterdam
| T +31 (0) 20 800 1300 | bas.van...@surf.nl | www.surf.nl |

Williams, Jenny Avis

Dec 10, 2020, 5:22:50 PM
to Slurm User Community List

We do have one partition that uses AllowGroups instead of AllowAccounts.  Testing with that partition closed did not change things.
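
For completeness, this is roughly how one can see which partitions restrict access by group (generic scontrol filtering, nothing site-specific):

    # list each partition and its AllowGroups / AllowAccounts settings
    scontrol show partition | grep -E 'PartitionName=|AllowGroups='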

 

This started Dec 2nd or 3rd – I noticed it on the 3rd.
