[slurm-users] sacct issue: jobs staying in "RUNNING" state


Will Dennis

Jul 16, 2019, 2:44:04 PM
to slurm...@lists.schedmd.com

Hi all,

 

Was looking at the running jobs on one group's cluster, and saw there was an insane number of “running” jobs when I did a sacct -X -s R; I then looked at the output of squeue, and found a much more reasonable number...

 

root@slurm-controller1:/ # sacct -X -p -s R | wc -l

8895

root@slurm-controller1:/ # squeue | wc -l

43

 

In looking for the cause, I see a large number of entries like the following in the slurmctld.log file:

 

[2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_START:1442 request

[2019-07-16T09:40:27.515] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW

[2019-07-16T09:40:27.515] error: slurmdbd: agent queue is full (20140), discarding DBD_JOB_COMPLETE:1424 request

[2019-07-16T09:40:27.515] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_COMPLETE:1441 request

[2019-07-16T09:42:40.766] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW

[2019-07-16T09:42:40.766] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_START:1442 request

[2019-07-16T09:46:05.905] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW

[2019-07-16T09:46:05.905] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_COMPLETE:1441 request

[2019-07-16T09:46:05.905] error: slurmdbd: agent queue is full (20140), discarding DBD_JOB_COMPLETE:1424 request

[2019-07-16T09:48:42.616] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW

[2019-07-16T09:48:42.616] error: slurmdbd: agent queue is full (20140), discarding DBD_JOB_COMPLETE:1424 request

[2019-07-16T09:48:42.616] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_COMPLETE:1441 request

[2019-07-16T09:53:00.188] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW

[2019-07-16T09:53:00.188] error: slurmdbd: agent queue is full (20140), discarding DBD_JOB_COMPLETE:1424 request

[2019-07-16T09:53:00.189] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_COMPLETE:1441 request

 

What might be the cause of this issue? And is there any way now to correct the accounting records in the DB?

 

Thanks,

Will

Will Dennis

Jul 16, 2019, 9:06:02 PM
to slurm...@lists.schedmd.com

A few more things to note:

 

- (Should have mentioned this earlier) Running Slurm 17.11.7 (via https://launchpad.net/~jonathonf/+archive/ubuntu/slurm)

- Restarted slurmctld and slurmdbd, but still getting the slurmdbd errors as before in slurmctld.log

- Ran “mysqlcheck --databases slurm_acct_db --auto-repair”, output was:

slurm_acct_db.acct_coord_table                     OK

slurm_acct_db.acct_table                           OK

slurm_acct_db.clus_res_table                       OK

slurm_acct_db.cluster_table                        OK

slurm_acct_db.convert_version_table                OK

slurm_acct_db.federation_table                     OK

slurm_acct_db.macluster_assoc_table                OK

slurm_acct_db.macluster_assoc_usage_day_table      OK

slurm_acct_db.macluster_assoc_usage_hour_table     OK

slurm_acct_db.macluster_assoc_usage_month_table    OK

slurm_acct_db.macluster_event_table                OK

slurm_acct_db.macluster_job_table                  OK

slurm_acct_db.macluster_last_ran_table             OK

slurm_acct_db.macluster_resv_table                 OK

slurm_acct_db.macluster_step_table                 OK

slurm_acct_db.macluster_suspend_table              OK

slurm_acct_db.macluster_usage_day_table            OK

slurm_acct_db.macluster_usage_hour_table           OK

slurm_acct_db.macluster_usage_month_table          OK

slurm_acct_db.macluster_wckey_table                OK

slurm_acct_db.macluster_wckey_usage_day_table      OK

slurm_acct_db.macluster_wckey_usage_hour_table     OK

slurm_acct_db.macluster_wckey_usage_month_table    OK

slurm_acct_db.qos_table                            OK

slurm_acct_db.res_table                            OK

slurm_acct_db.table_defs_table                     OK

slurm_acct_db.tres_table                           OK

slurm_acct_db.txn_table                            OK

slurm_acct_db.user_table                           OK

- Nothing in /var/log/mysql/error.log as far back as the logs go

- Ran “sacctmgr show runaway”; there were a LOT of runaway jobs. Chose “Y” to fix them, after which “sacctmgr show runaway” returned nothing. A few minutes later, however, “sacctmgr show runaway” had entries again.

 

If someone knows what else I might try to isolate/resolve this issue, please kindly assist...

Chris Samuel

Jul 17, 2019, 12:27:08 AM
to slurm...@lists.schedmd.com
On 16/7/19 11:43 am, Will Dennis wrote:

> [2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140),
> discarding DBD_STEP_START:1442 request

So it looks like your slurmdbd cannot keep up with the rate of these
incoming steps and is having to throw away messages.

> [2019-07-16T09:40:27.515] error: slurmdbd: agent queue filling (20140),
> RESTART SLURMDBD NOW

Have you tried doing what it told you to?

You may want to look at the performance of your MySQL server to see if
it's failing to keep up with what slurmdbd is asking it to do.
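
One common thing to check is the InnoDB buffer pool size; the Slurm
accounting docs suggest raising it well above MySQL's 128M default.
A rough sketch (values illustrative only, and the config file path
will depend on your distro):

mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"

# e.g. in /etc/mysql/mysql.conf.d/mysqld.cnf on Ubuntu:
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

# then restart mysql, slurmdbd and slurmctld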

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Brian W. Johanson

Jul 17, 2019, 10:45:02 AM
to slurm...@lists.schedmd.com
Once you have the database performance issues addressed, sacctmgr can clean up
the entries for completed jobs listed as running.
'sacctmgr list/show runawayjobs'

RunawayJobs
    Used only with the list or show command to report current jobs that have
    been orphaned on the local cluster and are now runaway. If there are jobs
    in this state it will also give you an option to "fix" them.
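
A minimal sequence (assuming slurmdbd is keeping up again by the time you
run it) would look something like:

sacctmgr show runawayjobs
# answer 'y' when prompted; sacctmgr sets an end time on the orphaned jobs
# and marks them COMPLETED in the accounting database

# then confirm that accounting agrees with the scheduler again:
sacct -X -s R | wc -l
squeue | wc -l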


Will Dennis

Jul 17, 2019, 12:56:46 PM
to Slurm User Community List
I don't think the server (which runs both the Slurm controller daemon and the DB) is the issue... It's a Dell PowerEdge R430, with dual Intel Xeon E5-2640v3 CPUs, 256GB of memory, and a RAID-1 array of 1TB SATA disks.

top - 09:29:26 up 101 days, 14:57, 3 users, load average: 0.06, 0.02, 0.00
Tasks: 421 total, 1 running, 241 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.1 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 26392008+total, 25892228+free, 784884 used, 4212904 buff/cache
KiB Swap: 999420 total, 999420 free, 0 used. 26091148+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
331 mysql 20 0 2096608 351552 19752 S 0.3 0.1 63:15.09 mysqld
28476 slurm 20 0 3800496 20940 5320 S 0.7 0.0 10:58.34 slurmctld
[...]


mytop output:
MySQL on localhost (5.7.26) load 0.00 0.02 0.00 1/523 20112 up 77+03:04:52 [09:51:44]
Queries: 3.0M qps: 0 Slow: 0.0 Se/In/Up/De(%): 69/02/01/00
Sorts: 0 qps now: 2 Slow qps: 0.0 Threads: 3 ( 1/ 5) 25/00/00/00
Key Efficiency: 80.4% Bps in/out: 97.0/481.3 Now in/out: 118.6/ 2.9k

Id User Host/IP DB Time Cmd State Query
-- ---- ------- -- ---- --- ----- ----------
2790 slurm localhost slurm_acct 393 Sleep
2792 slurm localhost slurm_acct 1 Sleep
2786 root localhost 0 Query starting show full processlist


SHOW ENGINE INNODB STATUS output:
=====================================
2019-07-17 09:34:17 0x7f4018226700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 14 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 11655 srv_active, 0 srv_shutdown, 6650270 srv_idle
srv_master_thread log flush and writes: 6659931
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 24514
OS WAIT ARRAY INFO: signal count 26260
RW-shared spins 0, rounds 51554, OS waits 23964
RW-excl spins 0, rounds 41462, OS waits 260
RW-sx spins 1279, rounds 18628, OS waits 154
Spin rounds per wait: 51554.00 RW-shared, 41462.00 RW-excl, 14.56 RW-sx
------------
TRANSACTIONS
------------
Trx id counter 524318
Purge done for trx's n:o < 524193 undo n:o < 0 state: running but idle
History list length 5
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 421388533043752, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421388533041912, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421388533042832, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421388533040992, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 0
1135 OS file reads, 416077 OS file writes, 105781 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 31 merges
merged operations:
insert 119, delete mark 0, delete 0
discarded operations:
insert 0, delete mark 0, delete 0
Hash table size 34673, node heap has 4 buffer(s)
Hash table size 34673, node heap has 2 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 14 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 4 buffer(s)
0.07 hash searches/s, 0.71 non-hash searches/s
---
LOG
---
Log sequence number 126853542
Log flushed up to 126853542
Pages flushed up to 126853542
Last checkpoint at 126853533
0 pending log flushes, 0 pending chkp writes
62541 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137428992
Dictionary memory allocated 330655
Buffer pool size 8191
Free buffers 3112
Database pages 5051
Old database pages 1844
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 29, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1073, created 3978, written 338632
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 5051, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
Process ID=331, Main thread ID=139913181783808, state: sleeping
Number of rows inserted 2092221, updated 153579, deleted 1501, read 2793036658
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.21 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT


I don't immediately see any performance problems with either the OS or MySQL...

Will Dennis

Jul 17, 2019, 3:57:47 PM
to Slurm User Community List
OK, as it turns out, it was a problem like the one in this bug: https://bugs.schedmd.com/show_bug.cgi?id=3819 (cf. https://bugs.schedmd.com/show_bug.cgi?id=2741 as well)

Back in May, I posted the following thread: https://lists.schedmd.com/pipermail/slurm-users/2019-May/003372.html - to which I never got any response, so I had to move on and let it go.

Little did I know that it would cause an issue...

Tim Wickberg 2017-05-17 15:00:45 MDT:
"The issue is that the (apparently invalid) update message from the slurmctld to the slurmdbd will be continually replayed until successful. Until that happens, all job history is being queued by slurmctld - this can lead to increased memory consumption on there until this gets resolved.

If not fixed soon, this can eventually lead to loss of some accounting info. I'm bumping the priority up a notch, and will get you some additional info shortly."

So we had to insert fake reservations (yes, there was more than one) into the <clustername>_resv_table in the Slurm accounting DB; slurmctld could then update them, and that resolved the issue.
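
For anyone who hits the same thing, a rough outline of what we did (our
cluster prefix is "macluster"; this is from memory, so check your own schema
with DESCRIBE and take a DB backup before attempting anything similar):

mysqldump slurm_acct_db > /root/slurm_acct_db.backup.sql

# see which reservation rows exist and which columns the table requires:
mysql slurm_acct_db -e "DESCRIBE macluster_resv_table;"
mysql slurm_acct_db -e "SELECT id_resv, resv_name, time_start, time_end FROM macluster_resv_table ORDER BY id_resv;"

# for each reservation id the stuck slurmctld updates refer to, INSERT a
# placeholder row with that id_resv, filling in every NOT NULL column that
# DESCRIBE reports; once those rows existed, slurmctld's pending updates
# went through and the errors stopped.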

#funnotfun

- Will