Cluster goes non-prim spontaneously

Sal Gonzalez

Sep 19, 2017, 5:30:04 PM

to Percona Discussion

Hello all,

I have a PXC cluster set up with 5 nodes across 2 DCs. The WAN connection has a latency in the 5-15 ms range pretty much all the time and a pretty decent 50M bandwidth. I am having an issue where my cluster goes non-prim, with all of the nodes seemingly unable to see each other, only to re-merge after a couple of minutes and then carry on fine. I have mostly ruled out a network issue by keeping a constant ping going between the nodes and seeing no dropped packets or large latency spikes during the 'outages'. I don't know where to start debugging this to find a root cause. The error logs on the servers all show the same progression of messages during the outage:

2017-09-19T13:05:30.171752-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S

2017-09-19T13:05:30.171941-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.18.241.143:4567

2017-09-19T13:05:31.671930-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') reconnecting to c7e18d78 (tcp://172.18.241.143:4567), attempt 0

2017-09-19T13:05:31.740160-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection established to c7e18d78 tcp://172.18.241.143:4567

2017-09-19T13:05:35.172199-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S

2017-09-19T13:05:36.672369-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') reconnecting to c7e18d78 (tcp://172.18.241.143:4567), attempt 0

2017-09-19T13:05:36.764887-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection established to c7e18d78 tcp://172.18.241.143:4567

2017-09-19T13:05:40.058658-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S

2017-09-19T13:05:41.197548-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') reconnecting to c7e18d78 (tcp://172.18.241.143:4567), attempt 0

2017-09-19T13:05:41.268988-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection established to c7e18d78 tcp://172.18.241.143:4567

2017-09-19T13:05:41.606075-05:00 0 [Note] WSREP: Current view of cluster as seen by this node

view (view_id(NON_PRIM,016bbb53,11220)

memb {

016bbb53,0

}

joined {

}

left {

}

partitioned {

213178d6,0

7163045f,0

853c2fb9,1

c7e18d78,1

}

)
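For anyone comparing logs across nodes, a rough way to see which peers time out most often is to tally the "timed out" notes per peer. The following is only a sketch: the regular expression is written against the exact line format in the excerpt above and may need adjusting for other versions, and the names `TIMEOUT_RE`, `count_peer_timeouts` and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Pattern written against the log excerpt above (an assumption; other
# Galera/PXC versions may format this note differently).
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\S+) \d+ \[Note\] WSREP: \((?P<local>[0-9a-f]+), '[^']*'\) "
    r"connection to peer (?P<peer>[0-9a-f]+) with addr (?P<addr>\S+) "
    r"timed out, no messages seen in (?P<period>\S+)$"
)


def count_peer_timeouts(lines):
    """Count 'timed out' notes per (peer id, peer address)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            counts[(match.group("peer"), match.group("addr"))] += 1
    return counts


# Three timeout notes and one reconnect note copied from the excerpt.
SAMPLE = [
    "2017-09-19T13:05:30.171752-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S",
    "2017-09-19T13:05:31.671930-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') reconnecting to c7e18d78 (tcp://172.18.241.143:4567), attempt 0",
    "2017-09-19T13:05:35.172199-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S",
    "2017-09-19T13:05:40.058658-05:00 0 [Note] WSREP: (016bbb53, 'tcp://0.0.0.0:4567') connection to peer c7e18d78 with addr tcp://172.18.241.143:4567 timed out, no messages seen in PT3S",
]

for (peer, addr), n in count_peer_timeouts(SAMPLE).items():
    print(peer, addr, n)
```

Running the same tally over the error log of every node should show whether the timeouts cluster on peers in the other DC or are spread evenly.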

Any help on this would be much appreciated.

Lorraine Pocklington

Sep 20, 2017, 9:24:12 AM

to Percona Discussion

Hello

I found this on the Percona forum. At the bottom, the poster mentions the solution they found:

https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/49109-connection-to-peer-with-addr-tcp-4567-timed-out-no-messages-seen-in-pt3s

Also, there's a reference to jumbo frames in this presentation here which might be helpful (seems to reference the same thing)?

https://www.percona.com/sites/default/files/presentations/LinuxPiter-TuningForDB.potx_.pdf

I hope these links help and I haven't sent you on a wild goose chase. Let me know, maybe?

Sal Gonzalez

Sep 20, 2017, 10:32:58 AM

to Percona Discussion

Thank you for this; there is some new information in that slideshow that I hadn't seen before. Curiously, the Percona forum contradicts the slideshow on the MTU setting. I am going to talk that piece through with our network guy.

I ran the dropwatch tool and I am seeing quite a lot of drops being reported, on the order of 75/second. I am unsure whether this is a little, a lot, or somewhere in between, but I imagine every drop means somebody has to retransmit, creating a packet storm...
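To put a figure like 75/second in context, one rough approach is to compare drops against total packets received over the same interval. The sketch below does not use dropwatch (which reports kernel drop points); it swaps in the per-interface counters from Linux's /proc/net/dev, so the numbers will not match dropwatch exactly. The function names and the interface name `eth0` are assumptions for illustration.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into per-interface packet/drop counters."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        iface, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:  # expect 8 receive + 8 transmit columns
            continue
        stats[iface.strip()] = {
            "rx_packets": int(fields[1]),
            "rx_drop": int(fields[3]),
            "tx_packets": int(fields[9]),
            "tx_drop": int(fields[11]),
        }
    return stats


def rx_drop_rate(before, after, iface, seconds):
    """Receive drops per second between two samples taken `seconds` apart."""
    return (after[iface]["rx_drop"] - before[iface]["rx_drop"]) / seconds


def rx_drop_ratio(before, after, iface):
    """Receive drops per received packet between two samples."""
    drops = after[iface]["rx_drop"] - before[iface]["rx_drop"]
    packets = after[iface]["rx_packets"] - before[iface]["rx_packets"]
    return drops / packets if packets else 0.0


# Usage on a Linux host (the interface name is an assumption):
#   import time
#   with open("/proc/net/dev") as f: before = parse_net_dev(f.read())
#   time.sleep(10)
#   with open("/proc/net/dev") as f: after = parse_net_dev(f.read())
#   print(rx_drop_rate(before, after, "eth0", 10))
#   print(rx_drop_ratio(before, after, "eth0"))
```

A drop rate that is a tiny fraction of the packet rate is a different situation from one that is a noticeable share of all traffic, so the ratio is usually more telling than the raw count.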

Thank you again for the leads!

Lorraine Pocklington

Sep 20, 2017, 10:43:47 AM

to Percona Discussion

You're welcome :)
