Hi,
I am currently running MariaDB 10.1 with Galera 25.3.19. I had a network problem between the 3 boxes that resulted in bad checksums and problematic packets being received by the Galera cluster.
Example error messages:
2017-05-22 1:54:36 140620215613184 [Warning] WSREP: checksum failed, hdr: len=0 has_crc32=0 has_crc32c=0 crc32=1144144452
2017-05-22 1:54:42 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 1: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2017-05-22 1:54:48 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2017-05-22 1:54:48 140620215613184 [Warning] WSREP: unserialize error invalid protocol version 6: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
After the network issue was over, I expected Galera to recover gracefully, but instead 2 of the 3 instances were refusing connections because the maximum number of MariaDB connections had been reached.
Looking at the active queries, I saw queries that never finished and therefore never freed their connections.
Looking at the wsrep status variables, I saw that wsrep_local_send_queue had reached the maximum and was never going down on the 2 problematic instances.
Looking at the logs, I found absolutely nothing being logged after the network issue, so I'm unsure where the issue lies.
After restarting the 2 problematic instances, everything re-synced properly and went back to a healthy state (I'm unsure whether there was data loss, though).
I've attached the logs of the 3 instances, starting from the moment they began complaining about the network issue. Note that instances 2 and 3 were the ones with the maxed-out send queues and maxed-out connections.
I've also attached the output of the wsrep variables of the 3 servers.
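To make the symptom concrete, here is a minimal sketch of how the stuck send queue described above could be detected from successive status snapshots. It assumes each snapshot is tab-separated "name<TAB>value" text, such as the batch output of SHOW GLOBAL STATUS LIKE 'wsrep_%'. The helper names parse_status and send_queue_stuck and the threshold parameter are hypothetical and only for illustration; this is not part of Galera or MariaDB.

```python
def parse_status(text):
    """Parse tab-separated 'name<TAB>value' lines into a dict of strings."""
    status = {}
    for line in text.splitlines():
        parts = line.strip().split("\t", 1)
        if len(parts) == 2:  # skip lines that are not name/value pairs
            status[parts[0]] = parts[1]
    return status


def send_queue_stuck(samples, threshold=1):
    """Return True if wsrep_local_send_queue never drains across samples.

    samples: list of dicts from parse_status, oldest first.
    threshold: hypothetical minimum queue length considered a backlog.
    """
    queues = [int(s.get("wsrep_local_send_queue", 0)) for s in samples]
    if len(queues) < 2:
        return False  # a single snapshot cannot show a trend
    never_drains = all(later >= earlier
                       for earlier, later in zip(queues, queues[1:]))
    return never_drains and min(queues) >= threshold
```

The snapshots could be collected a few seconds apart with something like mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" on each node. A queue that stays at or above the threshold in every snapshot matches what I saw on instances 2 and 3.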
I'd like to know whether there is any setting I could tweak to prevent this from happening again, or whether I've potentially hit a bug in Galera.
Any help will be greatly appreciated!
- Julien