We lost one of the nodes due to iowait (it looks like trouble with the disks). At first it simply dropped out of the cluster, but it then managed to bring the whole cluster down.
Here is the log from one of the two nodes that remained in the cluster until they also failed:
130125 11:10:49 [Note] WSREP: forgetting 0704ff7e-6674-11e2-0800-ba95e5b58df8 (tcp://192.168.0.64:4567)
130125 11:10:49 [Note] WSREP: (0dc1b343-6673-11e2-0800-ee35ea40f125, 'tcp://0.0.0.0:4567') turning message relay requesting off
130125 11:10:49 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
130125 11:10:49 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 7e514cf0-66d7-11e2-0800-7ae6c87f3d84
130125 11:10:49 [Note] WSREP: STATE EXCHANGE: sent state msg: 7e514cf0-66d7-11e2-0800-7ae6c87f3d84
130125 11:10:49 [Note] WSREP: STATE EXCHANGE: got state msg: 7e514cf0-66d7-11e2-0800-7ae6c87f3d84 from 0 (data2)
130125 11:10:49 [Note] WSREP: STATE EXCHANGE: got state msg: 7e514cf0-66d7-11e2-0800-7ae6c87f3d84 from 1 (data1)
130125 11:10:49 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 14,
members = 2/2 (joined/total),
act_id = 202213024,
last_appl. = 202212665,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = 402367df-5fd0-11e2-0800-58b321ec9eec
130125 11:10:49 [Note] WSREP: Flow-control interval: [253, 256]
130125 11:10:49 [Note] WSREP: New cluster view: global state: 402367df-5fd0-11e2-0800-58b321ec9eec:202213024, view# 15: Primary, number of nodes: 2, my index: 0, protocol version 2
130125 11:10:49 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130125 11:10:49 [Note] WSREP: Assign initial position for certification: 202213024, protocol version: 2
130125 11:10:55 [Note] WSREP: cleaning up 0704ff7e-6674-11e2-0800-ba95e5b58df8 (tcp://192.168.0.64:4567)
130125 11:25:27 [Note] WSREP: (0dc1b343-6673-11e2-0800-ee35ea40f125, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.64:4567
130125 11:25:27 [Note] WSREP: (0dc1b343-6673-11e2-0800-ee35ea40f125, 'tcp://0.0.0.0:4567') turning message relay requesting off
130125 11:26:28 [Warning] WSREP: evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15)) install timer expired
evs::proto(evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15)), GATHER) {
current_view=view(view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15) memb {
0dc1b343-6673-11e2-0800-ee35ea40f125,
91f61538-6675-11e2-0800-9e4d229b2a23,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=256363,safe_seq=256363,node_index=node: {idx=0,range=[256364,256363],safe_seq=256363} node: {idx=1,range=[256364,256363],safe_seq=256363} ,msg_index=,recovery_index= (0,256363),evs::msg{version=0
,type=1,user_type=255,order=0,seq=256363,seq_range=0,aru_seq=256362,flags=0,source=0dc1b343-6673-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-0000-000000000000,r
ange=[-1,-1],fifo_seq=22894590,node_list=()
}
(1,256363),evs::msg{version=0,type=1,user_type=1,order=4,seq=256363,seq_range=0,aru_seq=256362,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=000
00000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=22654543,node_list=()
}
},
fifo_seq=22894792,
last_sent=256363,
known={
0704ff7e-6674-11e2-0800-ba95e5b58df8,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22467029,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=11096617,seq_range=-1,aru_seq=11096619,flags=4,source=0704ff7e-6674-11e2-0800-ba95e5b58df8,source_view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),range_uuid=00000000-0000-0
000-0000-000000000000,range=[-1,-1],fifo_seq=22467029,node_list=( 0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_
range=[11096622,11096621],}
0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
0dc1b343-6673-11e2-0800-ee35ea40f125,evs::node{operational=1,suspected=0,installed=0,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=256363,seq_range=-1,aru_seq=256363,flags=0,source=0dc1b343-6673-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-
0000-000000000000,range=[-1,-1],fifo_seq=22894792,node_list=( 0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_range=[1
1096622,11096621],}
0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
91f61538-6675-11e2-0800-9e4d229b2a23,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22654746,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=256363,seq_range=-1,aru_seq=256363,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-
0000-000000000000,range=[-1,-1],fifo_seq=22654746,node_list=( 0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_range=[1
1096622,11096621],}
0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
}
}
130125 11:26:28 [Note] WSREP: no install message received
130125 11:27:28 [Warning] WSREP: evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15)) install timer expired
evs::proto(evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15)), GATHER) {
current_view=view(view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15) memb {
0dc1b343-6673-11e2-0800-ee35ea40f125,
91f61538-6675-11e2-0800-9e4d229b2a23,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=256363,safe_seq=256363,node_index=node: {idx=0,range=[256364,256363],safe_seq=256363} node: {idx=1,range=[256364,256363],safe_seq=256363} ,msg_index=,recovery_index= (0,256363),evs::msg{version=0
,type=1,user_type=255,order=0,seq=256363,seq_range=0,aru_seq=256362,flags=0,source=0dc1b343-6673-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-0000-000000000000,r
ange=[-1,-1],fifo_seq=22894590,node_list=()
}
(1,256363),evs::msg{version=0,type=1,user_type=1,order=4,seq=256363,seq_range=0,aru_seq=256362,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=000
00000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=22654543,node_list=()
}
},
fifo_seq=22894992,
last_sent=256363,
known={
0704ff7e-6674-11e2-0800-ba95e5b58df8,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22467229,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=11096617,seq_range=-1,aru_seq=11096619,flags=4,source=0704ff7e-6674-11e2-0800-ba95e5b58df8,source_view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),range_uuid=00000000-0000-0
000-0000-000000000000,range=[-1,-1],fifo_seq=22467229,node_list=( 0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_
range=[11096622,11096621],}
0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
0dc1b343-6673-11e2-0800-ee35ea40f125,evs::node{operational=1,suspected=0,installed=0,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=256363,seq_range=-1,aru_seq=256363,flags=0,source=0dc1b343-6673-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-
0000-000000000000,range=[-1,-1],fifo_seq=22894992,node_list=( 0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_range=[1
1096622,11096621],}
0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
91f61538-6675-11e2-0800-9e4d229b2a23,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22654946,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=256363,seq_range=-1,aru_seq=256363,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-
0000-000000000000,range=[-1,-1],fifo_seq=22654946,node_list=( 0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_range=[1
1096622,11096621],}
0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
}
}
130125 11:27:28 [Note] WSREP: no install message received
130125 11:27:28 [Note] WSREP: view(view_id(NON_PRIM,0dc1b343-6673-11e2-0800-ee35ea40f125,15) memb {
0dc1b343-6673-11e2-0800-ee35ea40f125,
} joined {
} left {
} partitioned {
91f61538-6675-11e2-0800-9e4d229b2a23,
})
130125 11:27:28 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130125 11:27:28 [Note] WSREP: Flow-control interval: [253, 256]
130125 11:27:28 [Note] WSREP: Received NON-PRIMARY.
130125 11:27:28 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 202467646)
130125 11:27:28 [Note] WSREP: view(view_id(NON_PRIM,0dc1b343-6673-11e2-0800-ee35ea40f125,16) memb {
0dc1b343-6673-11e2-0800-ee35ea40f125,
} joined {
} left {
} partitioned {
91f61538-6675-11e2-0800-9e4d229b2a23,
})
130125 11:27:28 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130125 11:27:28 [Note] WSREP: Flow-control interval: [253, 256]
130125 11:27:28 [Note] WSREP: Received NON-PRIMARY.
130125 11:27:28 [Note] WSREP: New cluster view: global state: 402367df-5fd0-11e2-0800-58b321ec9eec:202467646, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130125 11:27:28 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130125 11:27:28 [Note] WSREP: New cluster view: global state: 402367df-5fd0-11e2-0800-58b321ec9eec:202467646, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130125 11:27:28 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130125 11:28:26 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=4,user_type=255,order=1,seq=0,seq_range=-1,aru_seq=0,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,91f61538-6675-11e2-08
00-9e4d229b2a23,16),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=22655149,node_list=( 0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-080
0-ba95e5b58df8,14),safe_seq=11096619,im_range=[11096622,11096621],}
0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,16),safe_seq=0,im_range=[1,0],}
91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,91f61538-6675-11e2-0800-9e4d229b2a23,16),safe_seq=0,im_range=[1,0],}
)
}
130125 11:28:26 [ERROR] WSREP: state after handling message: evs::proto(evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,16)), GATHER) {
current_view=view(view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,16) memb {
0dc1b343-6673-11e2-0800-ee35ea40f125,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} ,msg_index=,recovery_index= (0,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=0,source=0dc1b343-66
73-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,16),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=22894997,node_list=()
}
},
fifo_seq=22895196,
last_sent=0,
known={
0704ff7e-6674-11e2-0800-ba95e5b58df8,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22467428,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=11096617,seq_range=-1,aru_seq=11096619,flags=4,source=0704ff
130125 11:28:26 [ERROR] WSREP: exception from gcomm, backend must be restarted:NodeMap::get_value(i).get_leave_message() == 0: (FATAL)
at gcomm/src/evs_proto.cpp:is_representative():969
130125 11:28:26 [Note] WSREP: Received self-leave message.
130125 11:28:26 [Note] WSREP: Flow-control interval: [253, 256]
130125 11:28:26 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130125 11:28:26 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 202467646)
130125 11:28:26 [Note] WSREP: RECV thread exiting 0: Success
130125 11:28:26 [Note] WSREP: New cluster view: global state: 402367df-5fd0-11e2-0800-58b321ec9eec:202467646, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
130125 11:28:26 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130125 11:28:26 [Note] WSREP: applier thread exiting (code:0)
130125 11:30:25 [Note] /usr/sbin/mysqld: Normal shutdown
130125 11:30:25 [Note] WSREP: Stop replication
130125 11:30:25 [Note] WSREP: Closing send monitor...
130125 11:30:25 [Note] WSREP: Closed send monitor.
130125 11:30:27 [Note] WSREP: killing local connection: 53264
The other remaining node has a similar log.
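
Since the interesting events are buried between the large evs::proto dumps, here is a rough sketch of a script that pulls out only the component changes and state shifts. It assumes the log line format shown above; the names `summarize`, `SHIFT_RE`, `COMPONENT_RE` and `SAMPLE` are made up for illustration, and the sample lines are copied from the log in this post.

```python
import re

# Matches lines such as:
#   130125 11:27:28 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 202467646)
SHIFT_RE = re.compile(
    r"^\d{6} (\d{2}:\d{2}:\d{2}) \[Note\] WSREP: Shifting (\w+) -> (\w+) \(TO: \d+\)"
)

# Matches lines such as:
#   130125 11:27:28 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
COMPONENT_RE = re.compile(
    r"^\d{6} (\d{2}:\d{2}:\d{2}) \[Note\] WSREP: New COMPONENT: "
    r"primary = (yes|no), .*memb_num = (\d+)"
)


def summarize(lines):
    """Return (time, kind, detail) tuples for state shifts and component changes."""
    events = []
    for line in lines:
        m = SHIFT_RE.match(line)
        if m:
            events.append((m.group(1), "shift", f"{m.group(2)} -> {m.group(3)}"))
            continue
        m = COMPONENT_RE.match(line)
        if m:
            events.append(
                (m.group(1), "component", f"primary={m.group(2)} members={m.group(3)}")
            )
    return events


# A few lines taken verbatim from the log above.
SAMPLE = [
    "130125 11:10:49 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2",
    "130125 11:26:28 [Note] WSREP: no install message received",
    "130125 11:27:28 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1",
    "130125 11:27:28 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 202467646)",
    "130125 11:28:26 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 202467646)",
]

if __name__ == "__main__":
    for time, kind, detail in summarize(SAMPLE):
        print(time, kind, detail)
```

To run it against a real error log, pass the open file object to `summarize` in place of `SAMPLE`; lines that do not match either pattern (including the multi-line dumps) are skipped.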