Hello,
We are trying to understand how a 3-node MariaDB Galera cluster got into a bad state, where all nodes are non-primary and have marked their peers as partitioned. We know how to recover the cluster by bootstrapping, but would like to know how this came about.
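For context, the recovery we perform is roughly the standard Galera bootstrap; the commands below are a sketch rather than the exact steps we ran, using the stock wsrep status variables and provider options:

# confirm that every node reports a non-Primary component
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'"
# pick the node with the highest committed sequence number
mysql -e "SHOW STATUS LIKE 'wsrep_last_committed'"
# on that node, force a new Primary Component
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'"
# the remaining nodes then rejoin, or are restarted with their normal wsrep_cluster_address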
In the logs, we see many messages like:
150131 0:45:59 [Note] WSREP: (801845e4-a8e2-11e4-b6f1-ba02aca81e90, 'tcp://0.0.0.0:4567') address 'tcp://10.85.15.107:4567' pointing to uuid 801845e4-a8e2-11e4-b6f1-ba02aca81e90 is blacklisted, skipping
This message sometimes appears while the cluster is healthy, and sometimes repeats for hundreds of lines while we are in the bad state. It seems that nodes occasionally blacklist themselves or other cluster members. There are also messages suggesting possible networking problems, though we are not aware of any persistent connectivity issues in our environment.
For example:
[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
and
150130 22:45:30 [Note] WSREP: (a7489639-97b5-11e4-b171-020823f89106, 'tcp://0.0.0.0:4567') reconnecting to 9e803b0f-a8d1-11e4-a798-d2a8cb4a589c (tcp://10.85.15.108:4567), attempt 0
Can someone explain how this "blacklisting" would happen? Overall, we are trying to determine whether it is a significant part of the failure or something we can safely overlook for now.
Thanks,
Raina Masand
Cloud Foundry Services Team