Hi, my name is Ross,
First post here, please be gentle :D
I'm currently setting up a Galera cluster over WAN across several data centres around the world (mostly Europe). However, we seem to be having problems with the cluster breaking up into partitions. The behaviour is quite strange: it does not happen at a particular time, and it is not specific to a certain number of nodes or group of nodes. In some cases, after an indeterminate amount of time, the nodes rejoin the `main` cluster and leave the partitioned cluster, although not always.
We are currently using the following versions across all the nodes: Server version 5.6.21 MySQL Community Server (GPL), wsrep_25.9.
We build our nodes via Puppet, so the my.cnf below is consistent on all nodes (I have obfuscated our IPs and removed node names etc.):
```ini
[mysql]
port = 54998
socket = /usr/local/mysql-galera/tmp/mysql-galera.sock
pid-file = /usr/local/mysql-galera/galera.pid

[mysqld]
federated
basedir = /usr/local/mysql-galera
port = 54998
socket = /usr/local/mysql-galera/tmp/mysql-galera.sock
datadir = /usr/local/mysql-galera/data
log-error = /var/log/galera/staging.log
tmpdir = /usr/local/mysql-galera/tmp
wsrep_cluster_name = $cluster_name
wsrep_node_name = $node_name
wsrep_node_address = 172.16.**.**
pid-file = /usr/local/mysql-galera/galera.pid
user = mysql
wsrep_provider = /usr/lib64/galera-3/libgalera_smm.so
wsrep_notify_cmd = /usr/local/bin/galeranotify.py
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = wsrep:********
wsrep_provider_options = "evs.keepalive_period = PT3S; evs.suspect_timeout = PT30S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M; evs.inactive_check_period = PT3S; evs.join_retrans_period = PT1S"
wsrep_sst_receive_address = 172.16.**.**:4444
```

(Note: I originally had each `evs.*` setting on its own `wsrep_provider_options` line, but repeated keys in my.cnf overwrite each other, so only the last one takes effect; Galera expects all of them in one semicolon-separated string as above.)
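In case it helps, this is the quick sanity check I did on the timeout values themselves (a throwaway script, not part of Galera; the ordering keepalive_period < suspect_timeout < inactive_timeout is my reading of the WAN tuning guidance):

```python
import re

def parse_iso_duration(s):
    # Parse the simple ISO-8601 durations Galera uses, e.g. PT3S, PT1M, PT1H30M.
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", s)
    h, mi, sec = (int(g) if g else 0 for g in m.groups())
    return h * 3600 + mi * 60 + sec

# The evs values from our my.cnf above.
opts = {
    "evs.keepalive_period": "PT3S",
    "evs.suspect_timeout": "PT30S",
    "evs.inactive_timeout": "PT1M",
    "evs.install_timeout": "PT1M",
}
secs = {k: parse_iso_duration(v) for k, v in opts.items()}

# Keepalives must fire well before a node can be suspected, and a suspected
# node should only be declared inactive after the suspect timeout has passed.
assert secs["evs.keepalive_period"] < secs["evs.suspect_timeout"] < secs["evs.inactive_timeout"]
print(secs)  # {'evs.keepalive_period': 3, 'evs.suspect_timeout': 30, 'evs.inactive_timeout': 60, 'evs.install_timeout': 60}
```

So the values themselves look internally consistent to me, which is why I'm stuck.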
The only relevant entries in the server logs are to do with evs::proto. I can only assume this is eviction related, but I can't find much documentation on it. In an attempt to find out, I added the evs variables to our my.cnf as you can see above, but the problem still persists:
```
2015-10-28 21:04:20 25974 [Note] WSREP: (05d47528, 'tcp://0.0.0.0:4567') address 'tcp://172.16.*.**:4567' pointing to uuid 05d47528 is blacklisted, skipping
2015-10-28 21:04:20 25974 [Note] WSREP: (05d47528, 'tcp://0.0.0.0:4567') address 'tcp://172.16.*.**:4567' pointing to uuid 05d47528 is blacklisted, skipping
2015-10-28 21:04:21 25974 [Note] WSREP: (05d47528, 'tcp://0.0.0.0:4567') reconnecting to c26a3519 (tcp://192.168.**.**:4567), attempt 0
2015-10-28 21:04:22 25974 [Note] WSREP: evs::proto(05d47528, OPERATIONAL, view_id(REG,05d47528,63)) suspecting node: c26a3519
2015-10-28 21:04:22 25974 [Note] WSREP: evs::proto(05d47528, OPERATIONAL, view_id(REG,05d47528,63)) suspected node without join message, declaring inactive
2015-10-28 21:04:23 25974 [Note] WSREP: declaring 2612d743 at tcp://192.168.**.**:4567 stable
2015-10-28 21:04:23 25974 [Note] WSREP: declaring 3b467ac0 at tcp://192.168.**.**:54567 stable
2015-10-28 21:04:23 25974 [Note] WSREP: declaring 4cfabdc3 at tcp://10.44.1.***:4567 stable
2015-10-28 21:04:23 25974 [Note] WSREP: declaring 5d9a2da1 at tcp://192.168.**.**:4567 stable
2015-10-28 21:04:23 25974 [Note] WSREP: declaring 8aa69abb at tcp://192.168.44.**:4567 stable
2015-10-28 21:04:23 25974 [Note] WSREP: declaring 910c09a4 at tcp://10.44.*.***:4567 stable
2015-10-28 21:04:23 25974 [Note] WSREP: declaring e4502e3e at tcp://192.168.**.**:4567 stable
2015-10-28 21:04:23 25974 [Note] WSREP: declaring ef10664a at tcp://192.168.**.**:4567 stable
2015-10-28 21:04:23 25974 [Note] WSREP: declaring fe8e5a2c at tcp://192.168.*.*:4567 stable
2015-10-28 21:04:23 25974 [Note] WSREP: Node 05d47528 state prim
2015-10-28 21:04:24 25974 [Note] WSREP: view(view_id(PRIM,05d47528,64) memb {
	05d47528,0
	2612d743,0
	3b467ac0,0
	4cfabdc3,0
	5d9a2da1,0
	8aa69abb,0
	910c09a4,0
	e4502e3e,0
	ef10664a,0
	fe8e5a2c,0
} joined {
} left {
} partitioned {
	c26a3519,0
})
```
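For what it's worth, my understanding of why our side still goes Primary in that view: 10 of the 11 last-known members stay connected, which is a strict majority (we don't set any node weights, so each node counts as 1). A quick sketch of that arithmetic, just to confirm my reading:

```python
# Quorum sketch: with equal (default) node weights, a component stays
# Primary when it holds a strict majority of the last known membership.
def has_quorum(remaining, total):
    return remaining > total / 2

total_nodes = 11   # 10 in "memb {}" + 1 in "partitioned {}" in the view above
remaining = 10

print(has_quorum(remaining, total_nodes))  # True: our side keeps PRIM
print(has_quorum(5, total_nodes))          # False: a minority side would go non-Primary
```

So the cluster isn't losing quorum, it just keeps shedding and re-admitting individual nodes.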
Can anyone provide any help?
Many Thanks,
Ross McFadyen