Cluster goes non-prim spontaneously


Sal Gonzalez

Sep 19, 2017, 5:30:04 PM
to Percona Discussion
Hello all,

    I have a PXC cluster set up with 5 nodes across 2 DCs.  The WAN connection has a latency in the 5-15 ms range pretty much all the time and a pretty decent 50M of bandwidth.  I am having an issue where my cluster goes non-prim, with all of the nodes seemingly unable to see each other, only to re-merge after a couple of minutes and then carry on fine.  I have mostly ruled out a network issue by keeping a constant ping running between each node and noting no dropped packets or huge latency spikes during the 'outages'.  I don't know where to start on debugging this and finding a root cause.  The error logs on the servers all show the same progression of messages during the outage:

2017-09-19T13:05:30.171752-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S
2017-09-19T13:05:30.171941-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.18.241.143:4567 
2017-09-19T13:05:31.671930-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') reconnecting to c7e18d78 (tcp://172.18.241.143:4567), attempt 0
2017-09-19T13:05:31.740160-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection established to c7e18d78 tcp://172.18.241.143:4567
2017-09-19T13:05:35.172199-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S
2017-09-19T13:05:36.672369-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') reconnecting to c7e18d78 (tcp://172.18.241.143:4567), attempt 0
2017-09-19T13:05:36.764887-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection established to c7e18d78 tcp://172.18.241.143:4567
2017-09-19T13:05:40.058658-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S
2017-09-19T13:05:41.197548-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') reconnecting to c7e18d78 (tcp://172.18.241.143:4567), attempt 0
2017-09-19T13:05:41.268988-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection established to c7e18d78 tcp://172.18.241.143:4567
2017-09-19T13:05:41.606075-05:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,016bbb53,11220)
memb {
016bbb53,0
}
joined {
}
left {
}
partitioned {
213178d6,0
7163045f,0
853c2fb9,1
c7e18d78,1
}
)
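
For comparing this pattern across all five nodes, a small script can tally the timeout notes per peer and summarize the view block. The following is a minimal Python sketch, not anything shipped with PXC: the function names (`count_peer_timeouts`, `parse_view`, `sees_majority`) are made up for illustration, the regular expressions are based only on the log lines quoted above, and the majority check is a plain node count that ignores node weights and however Galera actually computes quorum. `PT3S` in the timeout notes is an ISO 8601 duration meaning 3 seconds.

```python
import re
from collections import Counter

# Matches the "timed out" notes quoted above; captures peer id and address.
TIMEOUT_RE = re.compile(
    r"connection to peer (?P<peer>[0-9a-f]+) with addr (?P<addr>\S+) timed out"
)


def count_peer_timeouts(log_text):
    """Return a Counter mapping (peer_id, addr) to the number of timeout notes."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("peer"), match.group("addr"))] += 1
    return counts


def parse_view(view_text):
    """Parse a 'view (...)' block into (status, {section: [(node_id, number)]}).

    The second field of each entry is kept as-is; the post does not say
    what it means, so no interpretation is attached to it here.
    """
    status = re.search(r"view_id\((\w+),", view_text)
    sections = {}
    for name, body in re.findall(r"(\w+)\s*\{([^}]*)\}", view_text):
        entries = []
        for item in body.split():
            node_id, _, number = item.partition(",")
            entries.append((node_id, int(number)))
        sections[name] = entries
    return (status.group(1) if status else None), sections


def sees_majority(sections):
    """Simplified check (assumption: every node counts equally).

    True if the nodes in 'memb' are more than half of memb + partitioned.
    """
    members = len(sections.get("memb", []))
    total = members + len(sections.get("partitioned", []))
    return total > 0 and members * 2 > total
```

Run against the excerpt above, this would report three timeouts for peer c7e18d78 and a view with 1 member and 4 partitioned nodes, which is not a majority under the equal-count assumption.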

Any help on this would be much appreciated.

Lorraine Pocklington

Sep 20, 2017, 9:24:12 AM
to Percona Discussion
Hello
I found this on the Percona forum. At the bottom the poster mentions the solution they found


Also, there's a reference to jumbo frames in this presentation here which might be helpful (seems to reference the same thing)? 


I hope these links help and I haven't sent you on a goose chase. Let me know maybe?


Sal Gonzalez

Sep 20, 2017, 10:32:58 AM
to Percona Discussion
Thank you for this; there is some new information in that slideshow that I hadn't seen before.  Curiously, the Percona forum post contradicts the slideshow on the MTU setting.  I am going to talk that piece through with our network guy.

I ran the dropwatch tool and I am seeing quite a lot of drops being reported, on the order of 75/second.  I am unsure whether this is a little, a lot, or somewhere in between, but I imagine every drop means somebody is having to retransmit, creating a packet storm...
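
One way to test the guess that drops are turning into retransmissions is to sample the kernel's TCP counters before and after an interval and compare `RetransSegs` with `OutSegs`. The following is a rough Python sketch under a few assumptions: it reads the `Tcp:` lines of `/proc/net/snmp` on Linux, the helper names (`parse_tcp_counters`, `retransmit_ratio`, `sample_retransmit_ratio`) are invented for illustration, and the counters are host-wide, so they include traffic other than Galera replication.

```python
import time


def parse_tcp_counters(snmp_text):
    """Parse the 'Tcp:' header and value lines of /proc/net/snmp into a dict."""
    tcp_lines = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("expected a 'Tcp:' header line and a 'Tcp:' value line")
    names, values = tcp_lines[0], tcp_lines[1]
    return dict(zip(names, (int(v) for v in values)))


def retransmit_ratio(before, after):
    """Fraction of segments sent between two samples that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


def sample_retransmit_ratio(interval=10.0, path="/proc/net/snmp"):
    """Read the counters twice, `interval` seconds apart, and return the ratio."""
    with open(path) as f:
        before = parse_tcp_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_tcp_counters(f.read())
    return retransmit_ratio(before, after)


if __name__ == "__main__":
    print(f"retransmit ratio over 10s: {sample_retransmit_ratio():.4%}")
```

Running it once during normal operation and once during an 'outage' would give a baseline to compare against; a ratio that jumps during the outage would support the retransmission theory, while a flat ratio would point elsewhere.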



Thank you again for the leads!

Lorraine Pocklington

Sep 20, 2017, 10:43:47 AM
to Percona Discussion
You're welcome :)