Issue with 3 node cluster replication stuck

Julien Semaan

May 26, 2017, 11:19:29 AM

to codership

Hi,

I am currently running MariaDB 10.1 with Galera 25.3.19. I had a network problem between the 3 boxes that resulted in bad checksums and malformed packets being received by the Galera cluster.

Example of error message:

2017-05-22 1:54:36 140620215613184 [Warning] WSREP: checksum failed, hdr: len=0 has_crc32=0 has_crc32c=0 crc32=1144144452

2017-05-22 1:54:42 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 1: 71 (Protocol error)

at gcomm/src/gcomm/datagram.hpp:unserialize():133

2017-05-22 1:54:48 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)

at gcomm/src/gcomm/datagram.hpp:unserialize():133

2017-05-22 1:54:48 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 6: 71 (Protocol error)

at gcomm/src/gcomm/datagram.hpp:unserialize():133
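For anyone triaging similar logs, here is a minimal Python sketch that tallies the WSREP warnings by kind. The regular expression is an assumption based only on the log lines quoted above, and the helper names `classify` and `summarize_wsrep_warnings` are hypothetical, not part of any Galera or MariaDB tooling.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the excerpts above:
# "<date> <time> <thread id> [<Level>] WSREP: <message>"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{1,2}:\d{2}:\d{2})\s+"
    r"(?P<thread>\d+)\s+\[(?P<level>\w+)\]\s+WSREP:\s+(?P<msg>.*)$"
)


def classify(msg):
    """Map a WSREP message to a coarse category (hypothetical buckets)."""
    if msg.startswith("checksum failed"):
        return "checksum_failed"
    if msg.startswith("unserialize error"):
        return "unserialize_error"
    return "other"


def summarize_wsrep_warnings(lines):
    """Count WSREP warnings per category; continuation lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") == "Warning":
            counts[classify(m.group("msg"))] += 1
    return counts


# The four warnings quoted in this post, including one continuation line.
sample = [
    "2017-05-22 1:54:36 140620215613184 [Warning] WSREP: checksum failed, hdr: len=0 has_crc32=0 has_crc32c=0 crc32=1144144452",
    "2017-05-22 1:54:42 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 1: 71 (Protocol error)",
    "at gcomm/src/gcomm/datagram.hpp:unserialize():133",
    "2017-05-22 1:54:48 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)",
    "2017-05-22 1:54:48 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 6: 71 (Protocol error)",
]

print(dict(summarize_wsrep_warnings(sample)))
```

Running the same function over each node's full log would show when the corrupted traffic started and stopped on each node.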

After the network issue was over, I expected Galera to recover gracefully. Instead, 2 of the 3 instances were refusing connections because the maximum number of MariaDB connections had been reached.

Looking at the active queries, I saw queries that never finished and therefore never freed their connections.

After looking at the wsrep status variables I saw that the wsrep_local_send_queue had reached the maximum and was never going down on the 2 problematic instances.
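A rough sketch of how this symptom could be detected automatically: take several snapshots of the status variables and flag a node whose `wsrep_local_send_queue` stays above zero and never drains. The sketch assumes each snapshot is tab-separated `name<TAB>value` text, as produced by something like `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"`. The function names and the `min_samples` threshold are hypothetical, and the sample values are illustrative, not taken from the attached logs.

```python
def parse_status(text):
    """Parse tab-separated 'name<TAB>value' lines into a dict of strings."""
    status = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2:
            status[parts[0].strip()] = parts[1].strip()
    return status


def send_queue_stuck(samples, min_samples=3):
    """Return True if the send queue is non-empty and never shrinks.

    `samples` is a list of dicts from parse_status(), oldest first.
    `min_samples` is an arbitrary, hypothetical threshold.
    """
    queue = [int(s.get("wsrep_local_send_queue", 0)) for s in samples]
    if len(queue) < min_samples:
        return False
    never_drains = all(b >= a for a, b in zip(queue, queue[1:]))
    return queue[0] > 0 and never_drains


# Illustrative snapshots only (made-up values).
stuck = [parse_status("wsrep_local_send_queue\t512\nwsrep_local_state_comment\tSynced")] * 3
healthy = [
    parse_status("wsrep_local_send_queue\t512"),
    parse_status("wsrep_local_send_queue\t100"),
    parse_status("wsrep_local_send_queue\t0"),
]

print(send_queue_stuck(stuck), send_queue_stuck(healthy))
```

A check like this only reports the condition; it does not explain why the queue stopped draining.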

Looking at the logs, I found absolutely nothing being logged after the network issue so I'm unsure where the issue lies.

After restarting the 2 problematic instances, everything re-synced properly and went back to a healthy state (though I'm unsure whether any data was lost).

I've attached the logs of the 3 instances, starting from the moment they began complaining about the network issue. Note that instances 2 and 3 were the ones with the maxed-out send queues and maxed-out connections.

I've also attached the output of the wsrep variables from the 3 servers.

I'd like to know whether there is any setting I could tweak to prevent this from happening again, or whether I've potentially hit a bug in Galera.

Any help will be greatly appreciated!

- Julien

mdb-1.log

mdb-2.log

mdb-3.log

mdb-statuses.log

Ralph

Jul 1, 2018, 10:51:27 AM

to codership

This issue is caused by poor network conditions, so keeping the connection between the nodes reliable is very important.

Do you have a database backup?

You can retrieve all the data from the 3 nodes and restart the cluster.

On Friday, May 26, 2017 at 11:19:29 PM UTC+8, Julien Semaan wrote:
