Whole cluster stalled due to deadlock on a single node?


Momin Rashid

Mar 19, 2015, 2:23:43 AM
to codersh...@googlegroups.com
I tried posting this earlier, but I am not sure what happened to it. 

Our three-node MariaDB Galera cluster stalled today during peak hours, and the following was in the logs.

----------------------------
END OF INNODB MONITOR OUTPUT
============================
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
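
A minimal sketch of how a stall like this could be spotted automatically, assuming the error log is available as plain text lines. The marker string is copied from the excerpt above; the function names and the threshold of 10 are hypothetical choices for illustration, not Galera settings.

```python
# Message text copied from the log excerpt in this post.
BF_WAIT_MARKER = "WSREP: BF lock wait long"


def count_bf_lock_waits(log_lines):
    """Return how many lines contain the BF lock wait marker."""
    return sum(1 for line in log_lines if BF_WAIT_MARKER in line)


def looks_stalled(log_lines, threshold=10):
    """Hypothetical heuristic: flag the node when the marker repeats often.

    The default threshold is an arbitrary assumption, not a Galera default.
    """
    return count_bf_lock_waits(log_lines) >= threshold


if __name__ == "__main__":
    # Sample input mirroring the excerpt: one monitor footer, twelve warnings.
    sample = ["END OF INNODB MONITOR OUTPUT"] + [BF_WAIT_MARKER] * 12
    print(count_bf_lock_waits(sample), looks_stalled(sample))
```

In practice the lines would come from the node's error log file, and the check would be run per node so that the stalled node can be identified.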


The log entry starts with a deadlock, followed by pages and pages of monitor output.
=====================================
2015-03-18 15:24:29 7f0dbebfc700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 3 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 88651 srv_active, 0 srv_shutdown, 148822 srv_idle
srv_master_thread log flush and writes: 237472
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 38384
OS WAIT ARRAY INFO: signal count 799918
Mutex spin waits 125676, rounds 62111, OS waits 1165
RW-shared spins 368006, rounds 1392595, OS waits 31472
RW-excl spins 27058, rounds 1371282, OS waits 5049
Spin rounds per wait: 0.49 mutex, 3.78 RW-shared, 50.68 RW-excl
------------------------
LATEST DETECTED DEADLOCK
------------------------
2015-03-18 13:00:05 7f0f73bfe700
*** (1) TRANSACTION:
TRANSACTION 1529701, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 2, OS thread handle 0x7f0f84d77700, query id 29026609 Delete_rows_log_event::ha_delete_row(737612)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1674 page no 1870 n bits 208 index `PRIMARY` of table `mirth`.`d_mm11` trx table locks 2 total table locks 3  trx id 1529701 lock mode S locks rec but not gap waiting lock hold time 0 wait time before grant 0
*** (2) TRANSACTION:
TRANSACTION 1529702, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
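
Since the monitor output runs to many pages, a short script can pull out just the transactions and the table named in the deadlock section. This is a sketch that assumes the lines follow the format of the excerpt above; the regular expressions and the function name are illustrative and not part of any MariaDB tool.

```python
import re

# Patterns modelled on the excerpt above (assumed format, not a formal grammar).
TRX_RE = re.compile(r"^TRANSACTION (\d+), ACTIVE")
LOCK_RE = re.compile(r"index `([^`]+)` of table `([^`]+)`\.`([^`]+)`")


def summarize_deadlock(lines):
    """Return (transaction ids, sorted (schema, table, index) tuples)."""
    trx_ids = []
    tables = set()
    for line in lines:
        trx_match = TRX_RE.match(line)
        if trx_match:
            trx_ids.append(int(trx_match.group(1)))
        lock_match = LOCK_RE.search(line)
        if lock_match:
            index, schema, table = lock_match.groups()
            tables.add((schema, table, index))
    return trx_ids, sorted(tables)


if __name__ == "__main__":
    excerpt = [
        "*** (1) TRANSACTION:",
        "TRANSACTION 1529701, ACTIVE 0 sec updating or deleting",
        "RECORD LOCKS space id 1674 page no 1870 n bits 208 index `PRIMARY` "
        "of table `mirth`.`d_mm11` trx table locks 2 total table locks 3",
        "*** (2) TRANSACTION:",
        "TRANSACTION 1529702, ACTIVE 0 sec updating or deleting",
    ]
    print(summarize_deadlock(excerpt))
```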

This stopped the whole cluster from doing anything, and the cluster did not recover on its own, which is a big concern for us. In the end we had to shut down the offending node, after which the cluster immediately started serving requests again.

Could someone explain to me what happened here, and how to avoid or recover from such situations in the future? We are using the latest available version of MariaDB 10 and the MariaDB Galera server packages from the official MariaDB yum repositories.

Thanks!

Momin Rashid

Mar 19, 2015, 2:09:51 PM
to codersh...@googlegroups.com
The same thing happened this morning, again on the same node (perhaps that is just a coincidence?). Any ideas?

Rodrigo Bernardo

Mar 2, 2017, 11:42:52 AM
to codership
I am having the same problem with Percona Cluster 5.6.24-72.2.
We are seeing this error 3-4 times a month. I have a core file I could provide if that's helpful...

Thanks,
Rodrigo.

Rodrigo Bernardo

Mar 2, 2017, 11:42:53 AM
to codership
I am having the same problem with Percona Cluster 5.6.24-72.2-56-log
I have a core file I could provide if it's useful...
Is there a bug report already for this?

Thanks,
Rodrigo.

James Wang

Mar 3, 2017, 4:23:17 AM
to codership
Same here: PXC 5.6.32

I think the design needs to go back to the drawing board: currently all nodes are tightly coupled to each other, and it is not acceptable that any one node can bring down the whole cluster.

Momin Rashid

Mar 7, 2017, 5:00:07 PM
to James Wang, codership, rodri.b...@gmail.com
Are you using xa-datasource? For me, that was causing the issue, and once I moved to local-tx-datasource the problem went away.


James Wang

Mar 8, 2017, 4:35:36 AM
to codership, jwang...@gmail.com, rodri.b...@gmail.com
I don't use xa-datasource:

+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| Com_xa_commit   | 0     |
| Com_xa_end      | 0     |
| Com_xa_prepare  | 0     |
| Com_xa_recover  | 0     |
| Com_xa_rollback | 0     |
| Com_xa_start    | 0     |
+-----------------+-------+
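
Output of this shape is what the mysql client prints for a statement such as `SHOW GLOBAL STATUS LIKE 'Com_xa%';` (an assumption; the post does not show the statement used). The sketch below, with hypothetical helper names, parses such a table and reports whether any XA statement counter is non-zero. Note that these counters only cover the period since they were last reset, typically at server start, so the check should be run on every node.

```python
def parse_status_table(text):
    """Parse mysql client tabular output into a {variable: int value} dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +-----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] == "Variable_name":
            continue  # skip the header row and anything malformed
        values[cells[0]] = int(cells[1])
    return values


def xa_in_use(status):
    """True if any Com_xa_* counter is greater than zero."""
    return any(
        count > 0 for name, count in status.items() if name.startswith("Com_xa_")
    )


if __name__ == "__main__":
    # Abbreviated copy of the table posted above.
    table = """
    +-----------------+-------+
    | Variable_name   | Value |
    +-----------------+-------+
    | Com_xa_commit   | 0     |
    | Com_xa_start    | 0     |
    +-----------------+-------+
    """
    print(xa_in_use(parse_status_table(table)))
```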

Rodrigo Bernardo

Mar 8, 2017, 8:03:11 AM
to codership, jwang...@gmail.com, rodri.b...@gmail.com
I am not using XA datasources either... (I believe)