I have a PXC cluster set up with 5 nodes across 2 DCs. The WAN connection has a latency in the 5-15ms range pretty much all the time and a pretty decent 50M bandwidth. I am having an issue where my cluster goes non-primary (NON_PRIM), with all of the nodes seemingly unable to see each other, only to re-merge after a couple of minutes and then carry on fine. I have mostly ruled out a network issue by keeping a constant ping running between each pair of nodes and seeing no dropped packets or huge latency spikes during the 'outages'. I don't know where to start debugging this to find a root cause. The error logs on all of the servers show the same progression of messages during an outage:
2017-09-19T13:05:30.171752-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S
2017-09-19T13:05:30.171941-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.18.241.143:4567
2017-09-19T13:05:35.172199-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S
2017-09-19T13:05:40.058658-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S
2017-09-19T13:05:41.606075-05:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,016bbb53,11220)
memb {
016bbb53,0
}
joined {
}
left {
}
partitioned {
213178d6,0
7163045f,0
853c2fb9,1
c7e18d78,1
}
)
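To compare the logs across all five nodes, something like the following could tally the "timed out" notices per peer. This is only a rough sketch: it assumes each log entry sits on a single line in the format of the excerpt above, and the names `TIMEOUT_RE`, `count_peer_timeouts` and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Matches only the "connection to peer ... timed out" notices, assuming
# one complete log entry per line in the format of the excerpt above.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\S+) \d+ \[Note\] WSREP: \((?P<local>[0-9a-f]+), '[^']*'\) "
    r"connection to peer (?P<peer>[0-9a-f]+) with addr (?P<addr>\S+) timed out"
)


def count_peer_timeouts(lines):
    """Return a Counter keyed by (peer id, peer address)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[(match.group("peer"), match.group("addr"))] += 1
    return counts


# The four entries from the excerpt; the relay notice is not a timeout
# and is therefore skipped by the pattern.
SAMPLE = [
    "2017-09-19T13:05:30.171752-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') "
    "connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, "
    "no messages seen in PT3S",
    "2017-09-19T13:05:30.171941-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') "
    "turning message relay requesting on, nonlive peers: tcp://172.18.241.143:4567",
    "2017-09-19T13:05:35.172199-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') "
    "connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, "
    "no messages seen in PT3S",
    "2017-09-19T13:05:40.058658-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') "
    "connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, "
    "no messages seen in PT3S",
]

if __name__ == "__main__":
    for (peer, addr), n in count_peer_timeouts(SAMPLE).items():
        print(f"{peer} {addr}: {n} timeouts")
```

An open file handle for a node's error log can be passed in place of `SAMPLE`, since the function only iterates over lines.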
Any help on this would be much appreciated.