galera db node crashes when trying to reconnect to garbd abritator node

393 views
Skip to first unread message

Tom Maguire

unread,
Mar 5, 2013, 5:58:08 AM3/5/13
to codersh...@googlegroups.com
Hi all,

I was hoping for some help regarding an issue I experienced recently. I apologise if the title of the threat does not totally reflect what is going on in the issue as I'm am having trouble deciphering what happened from the log.

There seem to be two main issues:

  1. One of our DB nodes decided it needed to reconnect with an abitrator (see section a) of log below).
  2. "[ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():712: Will never receive state. Need to abort."  (section d) of log)

Our setup is a 5 node cluster - 3 x database nodes (galera 3.2.2(r137), mysql 5.5.23-23.6), 2 x abritrators (galera galera 3.2.2(r137)). 

I was wondering what would cause the node to attempt to reconnect to the abritrator in the first place, and then why it was not possible to do so without taking itself offline.

I apologies about how vague the questions are, but if anyone could give me some explanation to what is happening in the below log, it would be much appreciated.

Many thanks
Tom

a) db node decides it needs to reconnect to abritrator node (10.177.29.92)

Mar  1 19:14:15 pwpdb2n1 mysqld: 130301 19:14:15 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.176.161.169:4567 tcp://10.176.193.179:4567 tcp://10.177.18.3:4567 tcp://10.177.29.92:4567 
Mar  1 19:14:16 pwpdb2n1 mysqld: 130301 19:14:16 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') reconnecting to f8b59949-7ad2-11e2-0800-e69380c792c8 (tcp://10.176.161.169:4567), attempt 0
Mar  1 19:14:16 pwpdb2n1 mysqld: 130301 19:14:16 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') reconnecting to f8c14d61-7ad2-11e2-0800-1e76ffed10e5 (tcp://10.176.193.179:4567), attempt 0
Mar  1 19:14:16 pwpdb2n1 mysqld: 130301 19:14:16 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') reconnecting to d5e54ccf-79d7-11e2-0800-5ed8cd887f42 (tcp://10.177.18.3:4567), attempt 0
Mar  1 19:14:16 pwpdb2n1 mysqld: 130301 19:14:16 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') reconnecting to d8d85a8e-79d7-11e2-0800-a128aa8a2bf8 (tcp://10.177.29.92:4567), attempt 0
Mar  1 19:14:17 pwpdb2n1 mysqld: 130301 19:14:17 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, OPERATIONAL, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: d8d85a8e-79d7-11e2-0800-a128aa8a2bf8
Mar  1 19:14:18 pwpdb2n1 mysqld: 130301 19:14:18 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, GATHER, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: d8d85a8e-79d7-11e2-0800-a128aa8a2bf8
Mar  1 19:14:18 pwpdb2n1 mysqld: 130301 19:14:18 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, GATHER, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: d8d85a8e-79d7-11e2-0800-a128aa8a2bf8
Mar  1 19:14:19 pwpdb2n1 mysqld: 130301 19:14:19 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, GATHER, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: d8d85a8e-79d7-11e2-0800-a128aa8a2bf8
Mar  1 19:14:19 pwpdb2n1 mysqld: 130301 19:14:19 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, GATHER, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: d8d85a8e-79d7-11e2-0800-a128aa8a2bf8
Mar  1 19:14:20 pwpdb2n1 mysqld: 130301 19:14:20 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, GATHER, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: d8d85a8e-79d7-11e2-0800-a128aa8a2bf8
Mar  1 19:14:20 pwpdb2n1 mysqld: 130301 19:14:20 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, GATHER, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: d5e54ccf-79d7-11e2-0800-5ed8cd887f42
Mar  1 19:14:20 pwpdb2n1 mysqld: 130301 19:14:20 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, GATHER, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: d8d85a8e-79d7-11e2-0800-a128aa8a2bf8
Mar  1 19:14:20 pwpdb2n1 mysqld: 130301 19:14:20 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, GATHER, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: f8b59949-7ad2-11e2-0800-e69380c792c8
Mar  1 19:14:20 pwpdb2n1 mysqld: 130301 19:14:20 [Note] WSREP: evs::proto(7d0d9f8c-79d8-11e2-0800-0e368da98bb3, GATHER, view_id(REG,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24)) suspecting node: f8c14d61-7ad2-11e2-0800-1e76ffed10e5
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: view(view_id(NON_PRIM,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,24) memb {
Mar  1 19:14:21 pwpdb2n1 mysqld: #0117d0d9f8c-79d8-11e2-0800-0e368da98bb3,
Mar  1 19:14:21 pwpdb2n1 mysqld: } joined {
Mar  1 19:14:21 pwpdb2n1 mysqld: } left {
Mar  1 19:14:21 pwpdb2n1 mysqld: } partitioned {
Mar  1 19:14:21 pwpdb2n1 mysqld: #011d5e54ccf-79d7-11e2-0800-5ed8cd887f42,
Mar  1 19:14:21 pwpdb2n1 mysqld: #011d8d85a8e-79d7-11e2-0800-a128aa8a2bf8,
Mar  1 19:14:21 pwpdb2n1 mysqld: #011f8b59949-7ad2-11e2-0800-e69380c792c8,
Mar  1 19:14:21 pwpdb2n1 mysqld: #011f8c14d61-7ad2-11e2-0800-1e76ffed10e5,
Mar  1 19:14:21 pwpdb2n1 mysqld: })
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: Flow-control interval: [16, 16]
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: Received NON-PRIMARY.
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 1239887)
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: New cluster view: global state: 9133fe08-59a7-11e2-0800-766eb736d615:1239887, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: view(view_id(NON_PRIM,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,25) memb {
Mar  1 19:14:21 pwpdb2n1 mysqld: #0117d0d9f8c-79d8-11e2-0800-0e368da98bb3,
Mar  1 19:14:21 pwpdb2n1 mysqld: } joined {
Mar  1 19:14:21 pwpdb2n1 mysqld: } left {
Mar  1 19:14:21 pwpdb2n1 mysqld: } partitioned {
Mar  1 19:14:21 pwpdb2n1 mysqld: #011d5e54ccf-79d7-11e2-0800-5ed8cd887f42,
Mar  1 19:14:21 pwpdb2n1 mysqld: #011d8d85a8e-79d7-11e2-0800-a128aa8a2bf8,
Mar  1 19:14:21 pwpdb2n1 mysqld: #011f8b59949-7ad2-11e2-0800-e69380c792c8,
Mar  1 19:14:21 pwpdb2n1 mysqld: #011f8c14d61-7ad2-11e2-0800-1e76ffed10e5,
Mar  1 19:14:21 pwpdb2n1 mysqld: })
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: Flow-control interval: [16, 16]
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: Received NON-PRIMARY.
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: New cluster view: global state: 9133fe08-59a7-11e2-0800-766eb736d615:1239887, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
Mar  1 19:14:21 pwpdb2n1 mysqld: 130301 19:14:21 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

b) two minute gap before anything further happens

Mar  1 19:16:06 pwpdb2n1 mysqld: 130301 19:16:06 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') reconnecting to f8b59949-7ad2-11e2-0800-e69380c792c8 (tcp://10.176.161.169:4567), attempt 30
Mar  1 19:16:06 pwpdb2n1 mysqld: 130301 19:16:06 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') reconnecting to d5e54ccf-79d7-11e2-0800-5ed8cd887f42 (tcp://10.177.18.3:4567), attempt 30
Mar  1 19:16:06 pwpdb2n1 mysqld: 130301 19:16:06 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') reconnecting to d8d85a8e-79d7-11e2-0800-a128aa8a2bf8 (tcp://10.177.29.92:4567), attempt 30
Mar  1 19:16:06 pwpdb2n1 mysqld: 130301 19:16:06 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') reconnecting to f8c14d61-7ad2-11e2-0800-1e76ffed10e5 (tcp://10.176.193.179:4567), attempt 30

c) another minute gap

Mar  1 19:17:01 pwpdb2n1 CRON[29393]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar  1 19:17:10 pwpdb2n1 mysqld: 130301 19:17:10 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') cleaning up duplicate 0x7fd5a000d6a0 after established 0x7fd5a0027fb0
Mar  1 19:17:10 pwpdb2n1 mysqld: 130301 19:17:10 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') cleaning up established 0x7fd5a000ec10 which is duplicate of 0x7fd5a0027df0
Mar  1 19:17:10 pwpdb2n1 mysqld: 130301 19:17:10 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') cleaning up duplicate 0x7fd5a000d7f0 after established 0x7fd5a00040a0
Mar  1 19:17:10 pwpdb2n1 mysqld: 130301 19:17:10 [Note] WSREP: (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567') turning message relay requesting off
Mar  1 19:17:10 pwpdb2n1 mysqld: 130301 19:17:10 [Note] WSREP: declaring d5e54ccf-79d7-11e2-0800-5ed8cd887f42 stable
Mar  1 19:17:10 pwpdb2n1 mysqld: 130301 19:17:10 [Note] WSREP: declaring d8d85a8e-79d7-11e2-0800-a128aa8a2bf8 stable
Mar  1 19:17:10 pwpdb2n1 mysqld: 130301 19:17:10 [Note] WSREP: declaring f8b59949-7ad2-11e2-0800-e69380c792c8 stable
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:10 [Note] WSREP: declaring f8c14d61-7ad2-11e2-0800-1e76ffed10e5 stable
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: view(view_id(PRIM,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,27) memb {
Mar  1 19:17:11 pwpdb2n1 mysqld: #0117d0d9f8c-79d8-11e2-0800-0e368da98bb3,
Mar  1 19:17:11 pwpdb2n1 mysqld: #011d5e54ccf-79d7-11e2-0800-5ed8cd887f42,
Mar  1 19:17:11 pwpdb2n1 mysqld: #011d8d85a8e-79d7-11e2-0800-a128aa8a2bf8,
Mar  1 19:17:11 pwpdb2n1 mysqld: #011f8b59949-7ad2-11e2-0800-e69380c792c8,
Mar  1 19:17:11 pwpdb2n1 mysqld: #011f8c14d61-7ad2-11e2-0800-1e76ffed10e5,
Mar  1 19:17:11 pwpdb2n1 mysqld: } joined {
Mar  1 19:17:11 pwpdb2n1 mysqld: } left {
Mar  1 19:17:11 pwpdb2n1 mysqld: } partitioned {
Mar  1 19:17:11 pwpdb2n1 mysqld: })
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 5
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 9de8f841-82a4-11e2-0800-cf9032d48067
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE EXCHANGE: sent state msg: 9de8f841-82a4-11e2-0800-cf9032d48067
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE EXCHANGE: got state msg: 9de8f841-82a4-11e2-0800-cf9032d48067 from 0 (pwpdb2n2)
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE EXCHANGE: got state msg: 9de8f841-82a4-11e2-0800-cf9032d48067 from 1 (garb)
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE EXCHANGE: got state msg: 9de8f841-82a4-11e2-0800-cf9032d48067 from 2 (garb)
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE EXCHANGE: got state msg: 9de8f841-82a4-11e2-0800-cf9032d48067 from 3 (pwpdb2n1)
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE EXCHANGE: got state msg: 9de8f841-82a4-11e2-0800-cf9032d48067 from 4 (pwpdb2n3)
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: Quorum results:
Mar  1 19:17:11 pwpdb2n1 mysqld: #011version    = 2,
Mar  1 19:17:11 pwpdb2n1 mysqld: #011component  = PRIMARY,
Mar  1 19:17:11 pwpdb2n1 mysqld: #011conf_id    = 24,
Mar  1 19:17:11 pwpdb2n1 mysqld: #011members    = 4/5 (joined/total),
Mar  1 19:17:11 pwpdb2n1 mysqld: #011act_id     = 1239961,
Mar  1 19:17:11 pwpdb2n1 mysqld: #011last_appl. = 1239850,
Mar  1 19:17:11 pwpdb2n1 mysqld: #011protocols  = 0/4/2 (gcs/repl/appl),
Mar  1 19:17:11 pwpdb2n1 mysqld: #011group UUID = 9133fe08-59a7-11e2-0800-766eb736d615
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: Flow-control interval: [36, 36]
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1239961)
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: State transfer required: 
Mar  1 19:17:11 pwpdb2n1 mysqld: #011Group state: 9133fe08-59a7-11e2-0800-766eb736d615:1239961
Mar  1 19:17:11 pwpdb2n1 mysqld: #011Local state: 9133fe08-59a7-11e2-0800-766eb736d615:1239887
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: New cluster view: global state: 9133fe08-59a7-11e2-0800-766eb736d615:1239961, view# 25: Primary, number of nodes: 5, my index: 0, protocol version 2
Mar  1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Warning] WSREP: Gap in state sequence. Need state transfer.
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: You have configured 'rsync' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Assign initial position for certification: 1239961, protocol version: 2
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Prepared IST receiver, listening at: tcp://198.61.162.118:4568
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Node 0 (pwpdb2n2) requested state transfer from '*any*'. Selected 3 (pwpdb2n1)(SYNCED) as donor.
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1239962)
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Requesting state transfer: success, donor: 3
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 134177276 bytes
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 0/1 locked buffers
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Warning] WSREP: 3 (pwpdb2n1): State transfer to 0 (pwpdb2n2) failed: -125 (Operation canceled)

d) Wsrep Error

Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():712: Will never receive state. Need to abort.
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: gcomm: terminating thread
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: gcomm: joining thread
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: gcomm: closing backend
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: view(view_id(NON_PRIM,7d0d9f8c-79d8-11e2-0800-0e368da98bb3,27) memb {
Mar  1 19:17:13 pwpdb2n1 mysqld: #0117d0d9f8c-79d8-11e2-0800-0e368da98bb3,
Mar  1 19:17:13 pwpdb2n1 mysqld: } joined {
Mar  1 19:17:13 pwpdb2n1 mysqld: } left {
Mar  1 19:17:13 pwpdb2n1 mysqld: } partitioned {
Mar  1 19:17:13 pwpdb2n1 mysqld: #011d5e54ccf-79d7-11e2-0800-5ed8cd887f42,
Mar  1 19:17:13 pwpdb2n1 mysqld: #011d8d85a8e-79d7-11e2-0800-a128aa8a2bf8,
Mar  1 19:17:13 pwpdb2n1 mysqld: #011f8b59949-7ad2-11e2-0800-e69380c792c8,
Mar  1 19:17:13 pwpdb2n1 mysqld: #011f8c14d61-7ad2-11e2-0800-1e76ffed10e5,
Mar  1 19:17:13 pwpdb2n1 mysqld: })
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: view((empty))
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Receiving IST: 74 writesets, seqnos 1239887-1239961
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: gcomm: closed
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Mar  1 19:17:13 pwpdb2n1 mysqld_safe: Number of processes running now: 0
Mar  1 19:17:13 pwpdb2n1 mysqld_safe: mysqld restarted
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Read nil XID from storage engines, skipping position init
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: wsrep_load(): Galera 23.2.2(r137) by Codership Oy <in...@codership.com> loaded succesfully.
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Found saved state: 9133fe08-59a7-11e2-0800-766eb736d615:1239887
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Passing config to GCS: base_host = 198.61.162.118; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Assign initial position for certification: 1239887, protocol version: -1
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: wsrep_sst_grab()
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Start replication
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: Setting initial position to 9133fe08-59a7-11e2-0800-766eb736d615:1239887
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: protonet asio version 0
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: backend: asio
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: GMCast version 0
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: (9f89f474-82a4-11e2-0800-2db5db65fe8f, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: (9f89f474-82a4-11e2-0800-2db5db65fe8f, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: EVS version 0
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: PC version 0
Mar  1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.176.161.169:4567,10.176.193.179:4567,10.177.29.92:4567,10.177.18.3:4567'
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: declaring d5e54ccf-79d7-11e2-0800-5ed8cd887f42 stable
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: declaring d8d85a8e-79d7-11e2-0800-a128aa8a2bf8 stable
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: declaring f8b59949-7ad2-11e2-0800-e69380c792c8 stable
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: declaring f8c14d61-7ad2-11e2-0800-1e76ffed10e5 stable
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: view(view_id(PRIM,9f89f474-82a4-11e2-0800-2db5db65fe8f,29) memb {
Mar  1 19:17:14 pwpdb2n1 mysqld: #0119f89f474-82a4-11e2-0800-2db5db65fe8f,
Mar  1 19:17:14 pwpdb2n1 mysqld: #011d5e54ccf-79d7-11e2-0800-5ed8cd887f42,
Mar  1 19:17:14 pwpdb2n1 mysqld: #011d8d85a8e-79d7-11e2-0800-a128aa8a2bf8,
Mar  1 19:17:14 pwpdb2n1 mysqld: #011f8b59949-7ad2-11e2-0800-e69380c792c8,
Mar  1 19:17:14 pwpdb2n1 mysqld: #011f8c14d61-7ad2-11e2-0800-1e76ffed10e5,
Mar  1 19:17:14 pwpdb2n1 mysqld: } joined {
Mar  1 19:17:14 pwpdb2n1 mysqld: } left {
Mar  1 19:17:14 pwpdb2n1 mysqld: } partitioned {
Mar  1 19:17:14 pwpdb2n1 mysqld: #0117d0d9f8c-79d8-11e2-0800-0e368da98bb3,
Mar  1 19:17:14 pwpdb2n1 mysqld: })
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: gcomm: connected
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: Opened channel 'my_wsrep_cluster'
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: Waiting for SST to complete.
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 5
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 9fd7b569-82a4-11e2-0800-8328a32ef77e
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: STATE EXCHANGE: sent state msg: 9fd7b569-82a4-11e2-0800-8328a32ef77e
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: STATE EXCHANGE: got state msg: 9fd7b569-82a4-11e2-0800-8328a32ef77e from 2 (garb)
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: STATE EXCHANGE: got state msg: 9fd7b569-82a4-11e2-0800-8328a32ef77e from 3 (pwpdb2n1)
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: STATE EXCHANGE: got state msg: 9fd7b569-82a4-11e2-0800-8328a32ef77e from 0 (pwpdb2n2)
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: STATE EXCHANGE: got state msg: 9fd7b569-82a4-11e2-0800-8328a32ef77e from 1 (garb)
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: STATE EXCHANGE: got state msg: 9fd7b569-82a4-11e2-0800-8328a32ef77e from 4 (pwpdb2n3)
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: Quorum results:
Mar  1 19:17:14 pwpdb2n1 mysqld: #011version    = 2,
Mar  1 19:17:14 pwpdb2n1 mysqld: #011component  = PRIMARY,
Mar  1 19:17:14 pwpdb2n1 mysqld: #011conf_id    = 25,
Mar  1 19:17:14 pwpdb2n1 mysqld: #011members    = 4/5 (joined/total),
Mar  1 19:17:14 pwpdb2n1 mysqld: #011act_id     = 1239962,
Mar  1 19:17:14 pwpdb2n1 mysqld: #011last_appl. = -1,
Mar  1 19:17:14 pwpdb2n1 mysqld: #011protocols  = 0/4/2 (gcs/repl/appl),
Mar  1 19:17:14 pwpdb2n1 mysqld: #011group UUID = 9133fe08-59a7-11e2-0800-766eb736d615
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: Flow-control interval: [36, 36]
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1239962)
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: State transfer required: 
Mar  1 19:17:14 pwpdb2n1 mysqld: #011Group state: 9133fe08-59a7-11e2-0800-766eb736d615:1239962
Mar  1 19:17:14 pwpdb2n1 mysqld: #011Local state: 9133fe08-59a7-11e2-0800-766eb736d615:1239887
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Note] WSREP: New cluster view: global state: 9133fe08-59a7-11e2-0800-766eb736d615:1239962, view# 26: Primary, number of nodes: 5, my index: 0, protocol version 2
Mar  1 19:17:14 pwpdb2n1 mysqld: 130301 19:17:14 [Warning] WSREP: Gap in state sequence. Need state transfer.
Mar  1 19:17:16 pwpdb2n1 mysqld: 130301 19:17:16 [Note] WSREP: Running: 'wsrep_sst_rsync 'joiner' '198.61.162.118' 'root:' '/var/lib/mysql/' '/etc/mysql/conf.d/wsrep.cnf' '29446' 2>sst.err'
Mar  1 19:17:16 pwpdb2n1 rsyncd[29550]: rsyncd version 3.0.9 starting, listening on port 4444
Mar  1 19:17:16 pwpdb2n1 mysqld: 130301 19:17:16 [Note] WSREP: Prepared SST request: rsync|198.61.162.118:4444/rsync_sst
Mar  1 19:17:16 pwpdb2n1 mysqld: 130301 19:17:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Mar  1 19:17:16 pwpdb2n1 mysqld: 130301 19:17:16 [Note] WSREP: Assign initial position for certification: 1239962, protocol version: 2
Mar  1 19:17:16 pwpdb2n1 mysqld: 130301 19:17:16 [Note] WSREP: Prepared IST receiver, listening at: tcp://198.61.162.118:4568
Mar  1 19:17:16 pwpdb2n1 mysqld: 130301 19:17:16 [Note] WSREP: Node 0 (pwpdb2n2) requested state transfer from '*any*'. Selected 3 (pwpdb2n1)(SYNCED) as donor.
Mar  1 19:17:16 pwpdb2n1 mysqld: 130301 19:17:16 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1239967)
Mar  1 19:17:16 pwpdb2n1 mysqld: 130301 19:17:16 [Note] WSREP: Requesting state transfer: success, donor: 3
Mar  1 19:17:16 pwpdb2n1 rsyncd[29605]: connect from pwpdb2n1 (198.101.155.56)
Mar  1 19:17:17 pwpdb2n1 rsyncd[29605]: rsync to rsync_sst/ from pwpdb2n1 (198.101.155.56)
Mar  1 19:17:17 pwpdb2n1 rsyncd[29605]: receiving file list
Mar  1 19:17:25 pwpdb2n1 monit[14810]: 'mysqld' process is not running
Mar  1 19:17:27 pwpdb2n1 monit[14810]: 'mysqld' trying to restart
Mar  1 19:17:27 pwpdb2n1 monit[14810]: 'mysqld' start: /etc/init.d/mysql
Mar  1 19:17:28 pwpdb2n1 mysqld_safe: Starting mysqld daemon with databases from /var/lib/mysql
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: Read nil XID from storage engines, skipping position init
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: wsrep_load(): Galera 23.2.2(r137) by Codership Oy <in...@codership.com> loaded succesfully.
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: Passing config to GCS: base_host = 198.61.162.118; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: wsrep_sst_grab()
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: Start replication
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: protonet asio version 0
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: backend: asio
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: GMCast version 0
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: (a85b721d-82a4-11e2-0800-0c9df1e4fd7d, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: (a85b721d-82a4-11e2-0800-0c9df1e4fd7d, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: EVS version 0
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: PC version 0
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.176.161.169:4567,10.176.193.179:4567,10.177.29.92:4567,10.177.18.3:4567'

e) Address already in use error - I think this is due to monit trying to restart the mysqld processs

Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [ERROR] WSREP: Address already in use
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'Address already in use': 98 (Address already in use)
Mar  1 19:17:28 pwpdb2n1 mysqld: #011 at gcomm/src/asio_tcp.cpp:listen():757
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -98 (Address already in use)
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'my_wsrep_cluster' at 'gcomm://10.176.161.169:4567,10.176.193.179:4567,10.177.29.92:4567,10.177.18.3:4567': -98 (Address already in use)
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [ERROR] WSREP: gcs connect failed: Address already in use
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [ERROR] WSREP: wsrep::connect() failed: 6
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [ERROR] Aborting
Mar  1 19:17:28 pwpdb2n1 mysqld: 
Mar  1 19:17:28 pwpdb2n1 mysqld: 130301 19:17:28 [Note] WSREP: Service disconnected.
Mar  1 19:17:29 pwpdb2n1 mysqld: 130301 19:17:29 [Note] WSREP: Some threads may fail to exit.
Mar  1 19:17:29 pwpdb2n1 mysqld: 130301 19:17:29 [Note] /usr/sbin/mysqld: Shutdown complete
Mar  1 19:17:29 pwpdb2n1 mysqld: 
Mar  1 19:17:29 pwpdb2n1 mysqld_safe: mysqld from pid file /var/run/mysqld/mysqld.pid ended
Mar  1 19:17:41 pwpdb2n1 /etc/init.d/mysql[30187]: 0 processes alive
Mar  1 19:17:41 pwpdb2n1 /etc/init.d/mysql[30187]: 

f) two minute gap whilst nodes re-sync

Mar  1 19:19:50 pwpdb2n1 rsyncd[29605]: sent 52774 bytes  received 1115909574 bytes  total size 1115581435
Mar  1 19:19:50 pwpdb2n1 rsyncd[30759]: connect from pwpdb2n1 (198.101.155.56)
Mar  1 19:19:50 pwpdb2n1 rsyncd[30759]: rsync to rsync_sst/ from pwpdb2n1 (198.101.155.56)
Mar  1 19:19:50 pwpdb2n1 rsyncd[30759]: receiving file list
Mar  1 19:19:50 pwpdb2n1 mysqld: 130301 19:19:50 [Note] WSREP: 3 (pwpdb2n1): State transfer to 0 (pwpdb2n2) complete.
Mar  1 19:19:50 pwpdb2n1 mysqld: 130301 19:19:50 [Note] WSREP: Member 3 (pwpdb2n1) synced with group.
Mar  1 19:19:50 pwpdb2n1 rsyncd[30759]: sent 54 bytes  received 180 bytes  total size 45
Mar  1 19:19:52 pwpdb2n1 rsyncd[29550]: rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(549) [Receiver=3.0.9]
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52 [Note] WSREP: SST complete, seqno: 1239967
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52 InnoDB: The InnoDB memory heap is disabled
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52 InnoDB: Mutexes and rw_locks use GCC atomic builtins
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52 InnoDB: Compressed tables use zlib 1.2.3.3
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52 InnoDB: Using Linux native AIO
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52 InnoDB: Initializing buffer pool, size = 128.0M
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52 InnoDB: Completed initialization of buffer pool
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52 InnoDB: highest supported file format is Barracuda.
Mar  1 19:19:52 pwpdb2n1 mysqld: InnoDB: Log scan progressed past the checkpoint lsn 6727790026
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52  InnoDB: Database was not shut down normally!
Mar  1 19:19:52 pwpdb2n1 mysqld: InnoDB: Starting crash recovery.
Mar  1 19:19:52 pwpdb2n1 mysqld: InnoDB: Reading tablespace information from the .ibd files...
Mar  1 19:19:52 pwpdb2n1 mysqld: InnoDB: Restoring possible half-written data pages from the doublewrite
Mar  1 19:19:52 pwpdb2n1 mysqld: InnoDB: buffer...
Mar  1 19:19:52 pwpdb2n1 mysqld: InnoDB: Doing recovery: scanned up to log sequence number 6727795293
Mar  1 19:19:52 pwpdb2n1 mysqld: 130301 19:19:52  InnoDB: Starting an apply batch of log records to the database...
Mar  1 19:19:53 pwpdb2n1 mysqld: InnoDB: Progress in percents: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 
Mar  1 19:19:53 pwpdb2n1 mysqld: InnoDB: Apply batch completed
Mar  1 19:19:53 pwpdb2n1 mysqld: 130301 19:19:53  InnoDB: Waiting for the background threads to start
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 InnoDB: 1.1.8 started; log sequence number 6727795293
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] Event Scheduler: Loaded 0 events
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] WSREP: Signalling provider to continue.
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] WSREP: SST received: 9133fe08-59a7-11e2-0800-766eb736d615:1239967
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] WSREP: 0 (pwpdb2n2): State transfer from 3 (pwpdb2n1) complete.
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] WSREP: Shifting JOINER -> JOINED (TO: 1239982)
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] WSREP: Member 0 (pwpdb2n2) synced with group.
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1239982)
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] /usr/sbin/mysqld: ready for connections.
Mar  1 19:19:54 pwpdb2n1 mysqld: Version: '5.5.23'  socket: '/var/run/mysqld/mysqld.sock'  port: 1470  Source distribution, wsrep_23.6.r3755
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] WSREP: Synchronized with group, ready for connections
Mar  1 19:19:54 pwpdb2n1 mysqld: 130301 19:19:54 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Mar  1 19:19:59 pwpdb2n1 mysqld: 130301 19:19:59 [Warning] IP address '10.176.194.157' could not be resolved: Name or service not known
Mar  1 19:20:01 pwpdb2n1 mysqld: 130301 19:20:01 [Warning] IP address '10.176.163.146' could not be resolved: Name or service not known
Mar  1 19:20:01 pwpdb2n1 CRON[30787]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Mar  1 19:20:24 pwpdb2n1 monit[14810]: 'mysqld' process is running with pid 29446

Alex Yurchenko

unread,
Mar 5, 2013, 10:01:22 AM3/5/13
to codersh...@googlegroups.com
1) I'm not sure why you believe that

> 1. One of our DB nodes decided it needed to reconnect with an
> abitrator (see section a) of log below).

I think it tried to reconnect with whoever was in his last primary
component. It does not really matter if it was an arbitrator.

2) That node we're talking about lost connection to a primary component
for whatever reason. Look like genuine network failure to me. Upon
reconnect it found that it misses some transactions and needs to resync:

> Mar 1 19:17:11 pwpdb2n1 mysqld: #011Group state:
> 9133fe08-59a7-11e2-0800-766eb736d615:1239961
> Mar 1 19:17:11 pwpdb2n1 mysqld: #011Local state:
> 9133fe08-59a7-11e2-0800-766eb736d615:1239887

3) Without restart it can resync only with IST:

> Mar 1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: You
> have
> configured 'rsync' state snapshot transfer method which cannot be
> performed
> on a running server. Wsrep provider won't be able to fall back to it
> if
> other means of state transfer are unavailable. In that case you will
> need
> to restart the server.

4) was chosen as donor but for whatever reason it could not serve IST,
maybe because it also disconnected from primary component and
configuration change made it to invalidate its transaction cache. This
might be a usability bug.

> Mar 1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP:
> Requesting state transfer: success, donor: 3
> Mar 1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: GCache
> DEBUG: RingBuffer::seqno_reset(): discarded 134177276 bytes
> Mar 1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Note] WSREP: GCache
> DEBUG: RingBuffer::seqno_reset(): found 0/1 locked buffers
> Mar 1 19:17:13 pwpdb2n1 mysqld: 130301 19:17:13 [Warning] WSREP: 3
> (pwpdb2n1): State transfer to 0 (pwpdb2n2) failed: -125 (Operation
> canceled)

5) Note that you seem to have messed up node names:

> Mar 1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: New
> COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 5
> Mar 1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP:
> STATE_EXCHANGE: sent state UUID: 9de8f841-82a4-11e2-0800-cf9032d48067
> Mar 1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE
> EXCHANGE: sent state msg: 9de8f841-82a4-11e2-0800-cf9032d48067
> Mar 1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE
> EXCHANGE: got state msg: 9de8f841-82a4-11e2-0800-cf9032d48067 from 0
> (pwpdb2n2)
> Mar 1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE
> EXCHANGE: got state msg: 9de8f841-82a4-11e2-0800-cf9032d48067 from 1
> (garb)
> Mar 1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE
> EXCHANGE: got state msg: 9de8f841-82a4-11e2-0800-cf9032d48067 from 2
> (garb)
> Mar 1 19:17:11 pwpdb2n1 mysqld: 130301 19:17:11 [Note] WSREP: STATE
> EXCHANGE: got state msg: 9de8f841-82a4-11e2-0800-cf9032d48067 from 3
> (pwpdb2n1)

- apparently this node has wsrep_node_name=pwpbdb2n2 whereas its
hostname is pwpdb2n1.

6) And then the miracle of automatic cluster management:

> Mar 1 19:17:17 pwpdb2n1 rsyncd[29605]: rsync to rsync_sst/ from
> pwpdb2n1
> (198.101.155.56)
> Mar 1 19:17:17 pwpdb2n1 rsyncd[29605]: receiving file list
> Mar 1 19:17:25 pwpdb2n1 monit[14810]: 'mysqld' process is not
> running

- monit decides to help things and restart mysqld in the middle of it
receiving SST. Apparently because it could not connect to it.

And I guess the rest is history?

Regards,
Alex


On 2013-03-05 12:58, Tom Maguire wrote:
> Hi all,
>
> I was hoping for some help regarding an issue I experienced recently.
> I
> apologise if the title of the threat does not totally reflect what is
> going
> on in the issue as I'm am having trouble deciphering what happened
> from the
> log.
>
> There seem to be two main issues:
>
>
> 1. One of our DB nodes decided it needed to reconnect with an
> abitrator
> (see section a) of log below).
> 2. "[ERROR] WSREP:
> gcs/src/gcs_group.c:gcs_group_handle_join_msg():712:
> Will never receive state. Need to abort." (section d) of log)
>
>
> Our setup is a 5 node cluster - 3 x database nodes (galera
> 3.2.2(r137),
> mysql 5.5.23-23.6), 2 x abritrators (galera galera 3.2.2(r137)).
>
> I was wondering what would cause the node to attempt to reconnect to
> the
> abritrator in the first place, and then why it was not possible to do
> so
> without taking itself offline.
>
> I apologies about how vague the questions are, but if anyone could
> give me
> some explanation to what is happening in the below log, it would be
> much appreciated.
>
> Many thanks
> Tom
>
> *a) db node decides it needs to reconnect to abritrator node
> (10.177.29.92)*
> *b) two minute gap before anything further happens*
>
> Mar 1 19:16:06 pwpdb2n1 mysqld: 130301 19:16:06 [Note] WSREP:
> (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567')
> reconnecting
> to f8b59949-7ad2-11e2-0800-e69380c792c8 (tcp://10.176.161.169:4567),
> attempt 30
> Mar 1 19:16:06 pwpdb2n1 mysqld: 130301 19:16:06 [Note] WSREP:
> (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567')
> reconnecting
> to d5e54ccf-79d7-11e2-0800-5ed8cd887f42 (tcp://10.177.18.3:4567),
> attempt 30
> Mar 1 19:16:06 pwpdb2n1 mysqld: 130301 19:16:06 [Note] WSREP:
> (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567')
> reconnecting
> to d8d85a8e-79d7-11e2-0800-a128aa8a2bf8 (tcp://10.177.29.92:4567),
> attempt
> 30
> Mar 1 19:16:06 pwpdb2n1 mysqld: 130301 19:16:06 [Note] WSREP:
> (7d0d9f8c-79d8-11e2-0800-0e368da98bb3, 'tcp://0.0.0.0:4567')
> reconnecting
> to f8c14d61-7ad2-11e2-0800-1e76ffed10e5 (tcp://10.176.193.179:4567),
> attempt 30
>
> *c) another minute gap*
> *d) Wsrep Error*
> *e) Address already in use error - I think this is due to monit
> trying to
> restart the mysqld processs*
> *f) two minute gap whilst nodes re-sync*
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
Reply all
Reply to author
Forward
0 new messages