Issue with 3 node cluster replication stuck


Julien Semaan

May 26, 2017, 11:19:29 AM
to codership
Hi,

I am currently running MariaDB 10.1 with Galera 25.3.19, and I had a network problem between the 3 boxes that resulted in bad checksums and malformed packets being received by the Galera cluster.
Example of error message:
2017-05-22  1:54:36 140620215613184 [Warning] WSREP: checksum failed, hdr: len=0 has_crc32=0 has_crc32c=0 crc32=1144144452
2017-05-22  1:54:42 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 1: 71 (Protocol error)
   at gcomm/src/gcomm/datagram.hpp:unserialize():133
2017-05-22  1:54:48 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)
   at gcomm/src/gcomm/datagram.hpp:unserialize():133
2017-05-22  1:54:48 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 6: 71 (Protocol error)
   at gcomm/src/gcomm/datagram.hpp:unserialize():133

After the network issue was over, I expected Galera to recover gracefully, but instead 2 of the 3 instances were refusing connections because the maximum number of MariaDB connections had been reached.
Looking at the active queries, I saw queries that never finished and therefore never freed their connections.

After looking at the wsrep status variables, I saw that wsrep_local_send_queue had reached its maximum and was never going down on the 2 problematic instances.
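For anyone who wants to watch for the same symptom, here is a minimal sketch of the check described above. It assumes the status was captured as plain "name<TAB>value" text, for example with `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"`. The helper names, the threshold and the sample values are hypothetical and are not taken from the attached logs.

```python
def parse_wsrep_status(text):
    """Parse 'name<whitespace>value' lines into a dict of wsrep_* variables."""
    status = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace only
        if len(parts) == 2 and parts[0].startswith("wsrep_"):
            status[parts[0]] = parts[1].strip()
    return status


def send_queue_stuck(samples, threshold=1):
    """Return True if wsrep_local_send_queue stays at or above `threshold`
    and never decreases across consecutive samples (oldest first)."""
    queues = [int(s.get("wsrep_local_send_queue", 0)) for s in samples]
    if len(queues) < 2:
        return False  # a single sample cannot show a trend
    never_drains = all(later >= earlier
                       for earlier, later in zip(queues, queues[1:]))
    return never_drains and min(queues) >= threshold


# Hypothetical samples taken a few seconds apart on one node.
stuck_node = [
    parse_wsrep_status("wsrep_local_send_queue\t120\nwsrep_local_state_comment\tSynced\n"),
    parse_wsrep_status("wsrep_local_send_queue\t120\nwsrep_local_state_comment\tSynced\n"),
]
healthy_node = [
    parse_wsrep_status("wsrep_local_send_queue\t5\n"),
    parse_wsrep_status("wsrep_local_send_queue\t0\n"),
]

print(send_queue_stuck(stuck_node))
print(send_queue_stuck(healthy_node))
```

A queue that stays non-zero over several samples is only a hint that the node cannot push its writesets out. It does not say why, so it is best read alongside the node's error log.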

Looking at the logs, I found absolutely nothing logged after the network issue, so I'm unsure where the problem lies.

After restarting the 2 problematic instances, everything re-synced properly and went back to a healthy state (I'm unsure whether there was data loss, though).

I've attached the logs of the 3 instances, starting from the moment they began complaining about the network issue. Note that instances 2 and 3 were the ones with the maxed-out send queues and maxed-out connections.
I've also attached the output of the wsrep variables of the 3 servers.

I'd like to know whether there is any setting I could tweak to prevent this from happening again, or whether I've potentially hit a bug in Galera.

Any help will be greatly appreciated!

- Julien
mdb-1.log
mdb-2.log
mdb-3.log
mdb-statuses.log

Ralph

Jul 1, 2018, 10:51:27 AM
to codership
This issue is caused by bad network conditions, so keeping a reliable connection between the nodes is very important.
Do you have a database backup?
You can retrieve the data from all 3 nodes and then restart the cluster.


On Friday, May 26, 2017 at 11:19:29 PM UTC+8, Julien Semaan wrote: