Cannot restart cluster, depending on the order in which nodes are restarted


Yos Tj

Nov 24, 2015, 8:15:34 PM
to codership
Hi,

I cannot restart the cluster with the scenario below.
What's wrong with my procedure?

(1) A three-node cluster is running: node#1, #2, and #3.

(2) Kill node#1 on purpose by 'killall -9 mysqld_safe; killall -9 mysqld'.

(3) Kill node#2, too.

    In this scenario, assume that node#2 will not be available for some time.

(4) Restart node#1 by 'mysqld_safe &' to join the cluster again,
    but node#1 won't start completely.

Node#1 continuously writes error-log entries like:
151125  0:32:40 [Note] WSREP: (71e534c1, 'tcp://0.0.0.0:4567') address 'tcp://192.168.0.125:4567' pointing to uuid 71e534c1 is blacklisted, skipping
151125  0:32:40 [Note] WSREP: (71e534c1, 'tcp://0.0.0.0:4567') address 'tcp://192.168.0.125:4567' pointing to uuid 71e534c1 is blacklisted, skipping
151125  0:32:41 [Note] WSREP: (71e534c1, 'tcp://0.0.0.0:4567') address 'tcp://192.168.0.125:4567' pointing to uuid 71e534c1 is blacklisted, skipping

At the same time, node#3 continuously writes error-log entries like:
151125  0:31:37 [Note] WSREP: (a0214ca0, 'tcp://0.0.0.0:4567') reconnecting to dd1fa25c (tcp://192.168.0.124:4567), attempt 330
151125  0:32:08 [Note] WSREP: (a0214ca0, 'tcp://0.0.0.0:4567') reconnecting to dd1fa25c (tcp://192.168.0.124:4567), attempt 360
151125  0:32:38 [Note] WSREP: (a0214ca0, 'tcp://0.0.0.0:4567') reconnecting to dd1fa25c (tcp://192.168.0.124:4567), attempt 390

Their IP addresses are:
node#1 192.168.0.125
node#2 192.168.0.124
node#3 192.168.0.153

(5) As an experiment, restart node#2 by 'mysqld_safe &'.
    Then node#1 and node#3 proceed, and finally become ready to accept connections.

Node#1's error log above was followed by:
151125  0:48:24 [Note] WSREP: declaring a0214ca0 at tcp://192.168.0.153:4567 stable
151125  0:48:24 [Note] WSREP: declaring dd1fa25c at tcp://192.168.0.124:4567 stable
151125  0:48:24 [Note] WSREP: re-bootstrapping prim from partitioned components
151125  0:48:24 [Note] WSREP: view(view_id(PRIM,71e534c1,25) memb {
        71e534c1,0
        a0214ca0,0
        dd1fa25c,0
} joined {
} left {
} partitioned {
})
151125  0:48:24 [Note] WSREP: save pc into disk
(snip)

Node#3's error log above was followed by:
151125  0:48:24 [Note] WSREP: declaring 71e534c1 at tcp://192.168.0.125:4567 stable
151125  0:48:24 [Note] WSREP: declaring dd1fa25c at tcp://192.168.0.124:4567 stable
151125  0:48:24 [Note] WSREP: re-bootstrapping prim from partitioned components
151125  0:48:24 [Note] WSREP: view(view_id(PRIM,71e534c1,25) memb {
        71e534c1,0
        a0214ca0,0
        dd1fa25c,0
} joined {
} left {
} partitioned {
})
(snip)
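(For anyone reading along: the "save pc into disk" line in node#1's log above comes from Galera's pc.recovery feature, which is on by default in Galera 3.6 and later. Each node persists its last known primary component to a file named gvwstate.dat in the datadir, and once every member of that saved view is reachable again, the primary component is re-bootstrapped automatically; that is why starting node#2 unblocks node#1 and node#3. A sketch of what gvwstate.dat might contain in this setup, using the short member UUIDs from the logs, with the full UUIDs abbreviated as "...":

```
my_uuid: 71e534c1-...
#vwbeg
view_id: 3 71e534c1-... 25
bootstrap: 0
member: 71e534c1-... 0
member: a0214ca0-... 0
member: dd1fa25c-... 0
#vwend
```
)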

I'm using mariadb-galera-10.0.20-linux-x86_64 on CentOS 6.

Regards,

hunter86bg

Jan 29, 2016, 4:34:08 PM
to codership
You'd better check this link: how to recover a cluster
The main point is this: the node that was stopped last goes first. If you start it later, it will just inform you that it has more recent state than the rest, and it will shut down.
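The rule above can be checked by hand: each node records its last committed sequence number in grastate.dat under its data directory, and the node with the highest seqno should be started (bootstrapped) first, e.g. with 'mysqld_safe --wsrep-new-cluster'; the others then join normally. A minimal sketch of comparing two nodes' files; the paths and seqno values below are made up for illustration (real files live in each node's datadir, e.g. /var/lib/mysql):

```shell
#!/bin/sh
# Create two sample grastate.dat files (hypothetical contents).
mkdir -p /tmp/node1 /tmp/node2

cat > /tmp/node1/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    71e534c1-92f9-11e5-0000-000000000000
seqno:   1200
EOF

cat > /tmp/node2/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    71e534c1-92f9-11e5-0000-000000000000
seqno:   1350
EOF

# Pick the node with the highest seqno: bootstrap that one first.
best=""
best_seqno=-2
for dir in /tmp/node1 /tmp/node2; do
    seqno=$(awk '/^seqno:/ {print $2}' "$dir/grastate.dat")
    if [ "$seqno" -gt "$best_seqno" ]; then
        best_seqno=$seqno
        best=$dir
    fi
done
echo "bootstrap first: $best (seqno $best_seqno)"
```

Note that after a 'kill -9' (as in step 2 above), the stored seqno is usually -1, because the node did not shut down cleanly; in that case the real position has to be recovered first with 'mysqld --wsrep-recover' before comparing.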

Yos Tj

Feb 21, 2016, 11:29:28 PM
to codership
Sorry for my late response.
Now I can understand the behavior above, after checking the Percona blog you showed.
Thanks a lot!!
