Galera and stalled cluster due to a weak node

Ilias Bertsimas

Jan 16, 2013, 7:20:55 AM
to codersh...@googlegroups.com
Hi guys,

I have been facing issues with cluster-wide stalls when a node fails to keep up for various reasons. The cluster may stay stalled indefinitely until the node that cannot keep up is shut down.
I know this is an issue on our side and that all nodes should be able to keep up, but for now there are things out of our control (hardware etc.) that prevent us from guaranteeing that.
I was wondering if there is any way to make a node that can't keep up for a long time drop out of the cluster, so that it does not stall the cluster the way it does now.


Kind Regards,
Ilias.

Alex Yurchenko

Jan 16, 2013, 9:51:20 AM
to codersh...@googlegroups.com
Hi Ilias,

So far this was considered a very unusual scenario, so there are no
provisions made for that at the moment. Perhaps you could explain in a
bit more detail what is actually happening so that we can better
understand how to deal with it? The thing is that to stay in the cluster
the node needs to periodically send keepalives, so I can think of only
one scenario where it can get stalled but still send keepalives: mysqld
runs out of disk space in the data directory. Is this what is happening or
something else? Is the node totally stalled or just very slow?

Thanks,
Alex
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Ilias Bertsimas

Jan 16, 2013, 10:02:06 AM
to codersh...@googlegroups.com
Hi Alex,

I can only tell that the node starts getting high system load and flow control is paused cluster-wide with a value of 1 (wsrep_flow_control_paused).
It is not a case of a full disk or anything like that.
The moment I check any node in the cluster I see queries piling up in the 'statistics' state for SELECT queries and in 'query end' for INSERT/DELETE/UPDATE queries.
There is nothing in the logs, and the only abnormal thing I can notice is one node with high system load but not high user load or iowait.

Kind Regards,
Ilias.

Alex Yurchenko

Jan 16, 2013, 10:59:18 AM
to codersh...@googlegroups.com
1) What about memory usage? Do you have swap enabled? And, have I
pointed you at
http://www.olivierdoucet.info/blog/2012/05/19/debugging-a-mysql-stall/
already? (same thing with high system usage).

2) What about wsrep_last_committed, does it advance a tiny bit or does
it stay firmly constant? (One way to check this is sketched below.)

3) Does it happen randomly on every node, or is there a concrete node
where you observe this?
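
For reference, a minimal sketch of one way to check point 2: sample
wsrep_last_committed twice through the mysql command-line client and see
whether it advances. The 10-second interval and the assumption that the
mysql client can log in without prompting (e.g. via ~/.my.cnf) are
illustrative, not part of the original discussion.

#!/usr/bin/env python
# Sample wsrep_last_committed twice to see whether the node still commits.
import subprocess
import time

def last_committed():
    out = subprocess.check_output(
        ["mysql", "-NBe",
         "SHOW GLOBAL STATUS LIKE 'wsrep_last_committed'"]).decode()
    # Output looks like: "wsrep_last_committed\t202213024"
    return int(out.split()[1])

before = last_committed()
time.sleep(10)
after = last_committed()
print("advanced by %d transactions in 10 seconds" % (after - before))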

Ilias Bertsimas

Jan 16, 2013, 11:09:56 AM
to codersh...@googlegroups.com

1) No issues with memory.
I have seen that in the past; we do not have huge pages enabled at all.

2) Everything is stopped on all nodes, no further commits.

3) It is mostly one node, and then another one very infrequently. One of the 3 nodes has never been the cause of this, and it happens to be the most loaded one as well (kind of like a master).
On the Cacti graphs of each node I can see the system load skyrocket around the time the stall starts.

Kind Regards,
Ilias.

Alex Yurchenko

Jan 21, 2013, 4:24:55 AM
to codersh...@googlegroups.com
Hi Ilias,

It seems unlikely that we will roll out a solution for your problem any
time soon as we don't see any that would not have adverse side effects.
Meanwhile you may want to try relaxing flow control:

wsrep_provider_options="gcs.fc_limit=1G; gcs.fc_factor=1.0"

on the offending nodes should give you ample time to detect the stalled
node and do what you deem appropriate.

BTW, do you happen to have anything like

kernel: INFO: task gzip:28803 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.

in dmesg or syslog? (or do you have 0 in
/proc/sys/kernel/hung_task_timeout_secs ?)

Ilias Bertsimas

Jan 21, 2013, 7:30:04 AM
to codersh...@googlegroups.com
Hi Alex,

I will try a workaround if we do not manage to find the hardware issue that causes it in the first place.
I guess making the evs keep-alives shorter might also be a workaround, albeit at the expense of frequent cluster renegotiations.

We always check the logs and the dmesg output when a stall happens, but nothing shows up.
/proc/sys/kernel/hung_task_timeout_secs is set to the default of 120.

Kind Regards,
Ilias.

Alex Yurchenko

Jan 21, 2013, 9:43:01 AM
to codersh...@googlegroups.com
On 2013-01-21 14:30, Ilias Bertsimas wrote:
> Hi Alex,
>
> I will try a workaround if we do not manage to find the hardware
> issue
> that causes it in the first place.
> I guess making the evs keep-alives shorter might also be a workaround,
> albeit at the expense of frequent cluster renegotiations.

Hardly. They are already 1 second by default and had the stalled node
been unable to send keepalives, it would have been detected in 5 seconds
and kicked out. To me it sounds like the Galera loop is running fine
there and the problem is in one of the mysqld slave threads being stalled.

Ilias Bertsimas

Jan 21, 2013, 10:09:58 AM
to codersh...@googlegroups.com
On Monday, January 21, 2013 2:43:01 PM UTC, Alexey Yurchenko wrote:
> On 2013-01-21 14:30, Ilias Bertsimas wrote:
>> Hi Alex,
>>
>> I will try a workaround if we do not manage to find the hardware issue
>> that causes it in the first place.
>> I guess making the evs keep-alives shorter might also be a workaround,
>> albeit at the expense of frequent cluster renegotiations.
>
> Hardly. They are already 1 second by default and had the stalled node
> been unable to send keepalives, it would have been detected in 5 seconds
> and kicked out. To me it sounds like the Galera loop is running fine
> there and the problem is in one of the mysqld slave threads being stalled.

Actually they are set to WAN-like values, as suggested in the wiki, due to a lot of network issues we were facing.

Ilias Bertsimas

Jan 25, 2013, 7:00:19 AM
to codersh...@googlegroups.com
Hi again,

We lost one of the nodes due to iowait (it seems like trouble with the disks), and although at first it was out of the cluster, it then managed to bring the whole cluster down.
Here is the log from one of the 2 nodes that remained in the cluster until they failed:

130125 11:10:49 [Note] WSREP: forgetting 0704ff7e-6674-11e2-0800-ba95e5b58df8 (tcp://192.168.0.64:4567)
130125 11:10:49 [Note] WSREP: deleting entry tcp://192.168.0.64:4567
130125 11:10:49 [Note] WSREP: (0dc1b343-6673-11e2-0800-ee35ea40f125, 'tcp://0.0.0.0:4567') turning message relay requesting off
130125 11:10:49 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
130125 11:10:49 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 7e514cf0-66d7-11e2-0800-7ae6c87f3d84
130125 11:10:49 [Note] WSREP: STATE EXCHANGE: sent state msg: 7e514cf0-66d7-11e2-0800-7ae6c87f3d84
130125 11:10:49 [Note] WSREP: STATE EXCHANGE: got state msg: 7e514cf0-66d7-11e2-0800-7ae6c87f3d84 from 0 (data2)
130125 11:10:49 [Note] WSREP: STATE EXCHANGE: got state msg: 7e514cf0-66d7-11e2-0800-7ae6c87f3d84 from 1 (data1)
130125 11:10:49 [Note] WSREP: Quorum results:
        version    = 2,
        component  = PRIMARY,
        conf_id    = 14,
        members    = 2/2 (joined/total),
        act_id     = 202213024,
        last_appl. = 202212665,
        protocols  = 0/4/2 (gcs/repl/appl),
        group UUID = 402367df-5fd0-11e2-0800-58b321ec9eec
130125 11:10:49 [Note] WSREP: Flow-control interval: [253, 256]
130125 11:10:49 [Note] WSREP: New cluster view: global state: 402367df-5fd0-11e2-0800-58b321ec9eec:202213024, view# 15: Primary, number of nodes: 2, my index: 0, protocol version 2
130125 11:10:49 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130125 11:10:49 [Note] WSREP: Assign initial position for certification: 202213024, protocol version: 2
130125 11:10:55 [Note] WSREP:  cleaning up 0704ff7e-6674-11e2-0800-ba95e5b58df8 (tcp://192.168.0.64:4567)
130125 11:25:27 [Note] WSREP: (0dc1b343-6673-11e2-0800-ee35ea40f125, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.64:4567
130125 11:25:27 [Note] WSREP: (0dc1b343-6673-11e2-0800-ee35ea40f125, 'tcp://0.0.0.0:4567') turning message relay requesting off
130125 11:26:28 [Warning] WSREP: evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15)) install timer expired
evs::proto(evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15)), GATHER) {
current_view=view(view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15) memb {
        0dc1b343-6673-11e2-0800-ee35ea40f125,
        91f61538-6675-11e2-0800-9e4d229b2a23,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=256363,safe_seq=256363,node_index=node: {idx=0,range=[256364,256363],safe_seq=256363} node: {idx=1,range=[256364,256363],safe_seq=256363} ,msg_index=,recovery_index=        (0,256363),evs::msg{version=0
,type=1,user_type=255,order=0,seq=256363,seq_range=0,aru_seq=256362,flags=0,source=0dc1b343-6673-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-0000-000000000000,r
ange=[-1,-1],fifo_seq=22894590,node_list=()
}
        (1,256363),evs::msg{version=0,type=1,user_type=1,order=4,seq=256363,seq_range=0,aru_seq=256362,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=000
00000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=22654543,node_list=()
}
},
fifo_seq=22894792,
last_sent=256363,
known={
        0704ff7e-6674-11e2-0800-ba95e5b58df8,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22467029,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=11096617,seq_range=-1,aru_seq=11096619,flags=4,source=0704ff7e-6674-11e2-0800-ba95e5b58df8,source_view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),range_uuid=00000000-0000-0
000-0000-000000000000,range=[-1,-1],fifo_seq=22467029,node_list=(       0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_

range=[11096622,11096621],}
        0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
        91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
        0dc1b343-6673-11e2-0800-ee35ea40f125,evs::node{operational=1,suspected=0,installed=0,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=256363,seq_range=-1,aru_seq=256363,flags=0,source=0dc1b343-6673-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-
0000-000000000000,range=[-1,-1],fifo_seq=22894792,node_list=(   0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_range=[1
1096622,11096621],}
        0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
        91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
        91f61538-6675-11e2-0800-9e4d229b2a23,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22654746,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=256363,seq_range=-1,aru_seq=256363,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-
0000-000000000000,range=[-1,-1],fifo_seq=22654746,node_list=(   0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_range=[1
1096622,11096621],}
        0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
        91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
 }
 }
130125 11:26:28 [Note] WSREP: no install message received
130125 11:27:28 [Warning] WSREP: evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15)) install timer expired
evs::proto(evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15)), GATHER) {
current_view=view(view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15) memb {
        0dc1b343-6673-11e2-0800-ee35ea40f125,
        91f61538-6675-11e2-0800-9e4d229b2a23,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=256363,safe_seq=256363,node_index=node: {idx=0,range=[256364,256363],safe_seq=256363} node: {idx=1,range=[256364,256363],safe_seq=256363} ,msg_index=,recovery_index=        (0,256363),evs::msg{version=0
,type=1,user_type=255,order=0,seq=256363,seq_range=0,aru_seq=256362,flags=0,source=0dc1b343-6673-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-0000-000000000000,r
ange=[-1,-1],fifo_seq=22894590,node_list=()
}
        (1,256363),evs::msg{version=0,type=1,user_type=1,order=4,seq=256363,seq_range=0,aru_seq=256362,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=000
00000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=22654543,node_list=()
}
},
fifo_seq=22894992,
last_sent=256363,
known={
        0704ff7e-6674-11e2-0800-ba95e5b58df8,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22467229,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=11096617,seq_range=-1,aru_seq=11096619,flags=4,source=0704ff7e-6674-11e2-0800-ba95e5b58df8,source_view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),range_uuid=00000000-0000-0
000-0000-000000000000,range=[-1,-1],fifo_seq=22467229,node_list=(       0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_
range=[11096622,11096621],}
        0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
        91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
        0dc1b343-6673-11e2-0800-ee35ea40f125,evs::node{operational=1,suspected=0,installed=0,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=256363,seq_range=-1,aru_seq=256363,flags=0,source=0dc1b343-6673-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-
0000-000000000000,range=[-1,-1],fifo_seq=22894992,node_list=(   0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_range=[1
1096622,11096621],}
        0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
        91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
        91f61538-6675-11e2-0800-9e4d229b2a23,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22654946,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=256363,seq_range=-1,aru_seq=256363,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),range_uuid=00000000-0000-0000-
0000-000000000000,range=[-1,-1],fifo_seq=22654946,node_list=(   0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-0800-ba95e5b58df8,14),safe_seq=11096619,im_range=[1
1096622,11096621],}
        0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
        91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,15),safe_seq=256363,im_range=[256364,256363],}
)
},
}
 }
 }
130125 11:27:28 [Note] WSREP: no install message received
130125 11:27:28 [Note] WSREP: view(view_id(NON_PRIM,0dc1b343-6673-11e2-0800-ee35ea40f125,15) memb {
        0dc1b343-6673-11e2-0800-ee35ea40f125,
} joined {
} left {
} partitioned {
        91f61538-6675-11e2-0800-9e4d229b2a23,
})
130125 11:27:28 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130125 11:27:28 [Note] WSREP: Flow-control interval: [253, 256]
130125 11:27:28 [Note] WSREP: Received NON-PRIMARY.
130125 11:27:28 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 202467646)
130125 11:27:28 [Note] WSREP: view(view_id(NON_PRIM,0dc1b343-6673-11e2-0800-ee35ea40f125,16) memb {
        0dc1b343-6673-11e2-0800-ee35ea40f125,
} joined {
} left {
} partitioned {
        91f61538-6675-11e2-0800-9e4d229b2a23,
})
130125 11:27:28 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130125 11:27:28 [Note] WSREP: Flow-control interval: [253, 256]
130125 11:27:28 [Note] WSREP: Received NON-PRIMARY.
130125 11:27:28 [Note] WSREP: New cluster view: global state: 402367df-5fd0-11e2-0800-58b321ec9eec:202467646, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130125 11:27:28 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130125 11:27:28 [Note] WSREP: New cluster view: global state: 402367df-5fd0-11e2-0800-58b321ec9eec:202467646, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130125 11:27:28 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130125 11:28:26 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=4,user_type=255,order=1,seq=0,seq_range=-1,aru_seq=0,flags=4,source=91f61538-6675-11e2-0800-9e4d229b2a23,source_view_id=view_id(REG,91f61538-6675-11e2-08
00-9e4d229b2a23,16),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=22655149,node_list=( 0704ff7e-6674-11e2-0800-ba95e5b58df8,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0704ff7e-6674-11e2-080
0-ba95e5b58df8,14),safe_seq=11096619,im_range=[11096622,11096621],}
        0dc1b343-6673-11e2-0800-ee35ea40f125,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,16),safe_seq=0,im_range=[1,0],}
        91f61538-6675-11e2-0800-9e4d229b2a23,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,91f61538-6675-11e2-0800-9e4d229b2a23,16),safe_seq=0,im_range=[1,0],}
)
}
130125 11:28:26 [ERROR] WSREP:  state after handling message: evs::proto(evs::proto(0dc1b343-6673-11e2-0800-ee35ea40f125, GATHER, view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,16)), GATHER) {
current_view=view(view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,16) memb {
        0dc1b343-6673-11e2-0800-ee35ea40f125,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} ,msg_index=,recovery_index=     (0,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=0,source=0dc1b343-66
73-11e2-0800-ee35ea40f125,source_view_id=view_id(REG,0dc1b343-6673-11e2-0800-ee35ea40f125,16),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=22894997,node_list=()
}
},
fifo_seq=22895196,
last_sent=0,
known={
        0704ff7e-6674-11e2-0800-ba95e5b58df8,evs::node{operational=1,suspected=0,installed=0,fifo_seq=22467428,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=11096617,seq_range=-1,aru_seq=11096619,flags=4,source=0704ff
130125 11:28:26 [ERROR] WSREP: exception from gcomm, backend must be restarted:NodeMap::get_value(i).get_leave_message() == 0:  (FATAL)
         at gcomm/src/evs_proto.cpp:is_representative():969
130125 11:28:26 [Note] WSREP: Received self-leave message.
130125 11:28:26 [Note] WSREP: Flow-control interval: [253, 256]
130125 11:28:26 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130125 11:28:26 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 202467646)
130125 11:28:26 [Note] WSREP: RECV thread exiting 0: Success
130125 11:28:26 [Note] WSREP: New cluster view: global state: 402367df-5fd0-11e2-0800-58b321ec9eec:202467646, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
130125 11:28:26 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130125 11:28:26 [Note] WSREP: applier thread exiting (code:0)
130125 11:30:25 [Note] /usr/sbin/mysqld: Normal shutdown

130125 11:30:25 [Note] WSREP: Stop replication
130125 11:30:25 [Note] WSREP: Closing send monitor...
130125 11:30:25 [Note] WSREP: Closed send monitor.
130125 11:30:27 [Note] WSREP: killing local connection: 53264

The other one has a similar log.

Kind Regards,
Ilias.

Ilias Bertsimas

Jan 26, 2013, 1:54:24 PM
to codersh...@googlegroups.com
Hi guys,

The cluster stall issue has been ongoing for 2 months now, only getting worse lately, and my colleagues wanted to abandon Galera, blaming the stalls on bad design or some bug in it, which left me alone to prove that is not the case. Upper management decided to hire Percona to help us figure out the issue once and for all. Working with them, I discovered that the increased system load when the cluster stalls was the flashcache module's fault, due to bad configuration/tuning or some bug in it. So it is not Galera after all.

Of course there are some suggestions on my part that I would like to make.
I believe it would be a good idea to have an option for Galera to make a node leave the cluster if it pauses flow control X times within a given amount of time, so that the cluster remains functioning and we avoid downtime like in our case. It is not a matter of consistency: the node does not fail completely, it remains in the cluster but stalls it so badly that, in the eyes of a demanding MySQL application, the whole cluster becomes unavailable. After all, the idea of a cluster is to provide High Availability. Please let me know what you think.

Kind Regards,
Ilias.

Henrik Ingo

Jan 26, 2013, 4:15:25 PM
to Ilias Bertsimas, codership
On Sat, Jan 26, 2013 at 8:54 PM, Ilias Bertsimas <awar...@gmail.com> wrote:
> some bug in it. So it is not galera after all.

Glad to hear :-)

> Of course there are some suggestions on my part that I would like to make.
> I believe it would be a good idea to have an option for Galera to make a
> node leave the cluster if it pauses flow control X times within a given
> amount of time, so that the cluster remains functioning and we avoid
> downtime like in our case. It is not a matter of consistency: the node
> does not fail completely, it remains in the cluster but stalls it so
> badly that, in the eyes of a demanding MySQL application, the whole
> cluster becomes unavailable. After all, the idea of a cluster is to
> provide High Availability. Please let me know what you think.

This is a difficult choice to actually implement correctly. The idea
is of course tempting, if there's one node you can blame. For example
in the cloud you can have a situation where one node has performance
issues and the right thing to do is to abandon it.

However, quite often HA clusters also fail because of the kind of
functionality that you propose. It could be the case that the load is
just so high that you are at maximum capacity of what the system can
really do. And in that case one of the nodes just happens to be the
first one to show any symptoms. Now, if that node is removed from the
cluster, its load will be targeted at the remaining nodes, which are
also at full capacity, and this failover just makes the situation
worse for them. This would then lead to a cascading failure where the
next node is removed from the cluster, then the next, until you have
only one left.

For a piece of code making this kind of a decision, it is impossible
to know whether the poor performance (in galera's case lots of paused
flow control) is due to some error condition or due to the client load
actually being high. So it is better not to have such functionality
and instead have a human make this decision.

henrik

Alex Yurchenko

Jan 26, 2013, 4:18:34 PM
to codersh...@googlegroups.com
Hi Ilias,

First of all, thanks for sharing the outcome of this story. It
illustrates the difference between professional developers and
professional support ;)

Secondly, this is the first time we have encountered such an issue in
practice. We had speculated about it a bit before, but it was only
speculation and it was not given much thought. Not only because it was
speculation, but also because algorithmically it is a very complex task
despite its apparent simplicity. It is not clear how to distinguish
between a node that is stuck and a node that is simply busy working. So
naturally it was pushed back.

Anyway, now we see that the problem is real and we'll give it a higher
priority. Yet, no ETA on this as we're just starting.

Regards,
Alex

Alex Yurchenko

Jan 26, 2013, 5:14:21 PM
to codersh...@googlegroups.com
On 2013-01-26 23:15, Henrik Ingo wrote:
> However, quite often HA clusters also fail because of the kind of
> functionality that you propose. It could be the case that the load is
> just so high that you are at maximum capacity of what the system can
> really do. And in that case one of the nodes just happens to be the
> first one to show any symptoms. Now, if that node is removed from the
> cluster, its load will be targeted at the remaining nodes, which are
> also at full capacity, and this failover just makes the situation
> worse for them. This would then lead to a cascading failure where the
> next node is removed from the cluster, then the next, until you have
> only one left.

Good point there. And in general, automatic removal of the node from
the cluster means that it is UNCONTROLLED. If the point is to relieve
the human from making decisions, then it means that it can happen ANY
time, and no questions will be asked. Of course, High Availability is
paramount. But who said that it can be achieved without human attention?

Another thing is that planning for the worst, i.e. designing for faulty
hardware, buggy libraries or misconfigured systems at some point becomes
counterproductive: the development and runtime costs of checks and
workarounds become increasingly taxing. Say, that particular task may
require one more mutex lock per writeset. And it will have to be locked
on a perfectly healthy system as well - just because we're working
around a potential stall that will never happen. Yes, the main purpose
of replication is to provide a workaround for some crashes and failures
(and it does so at a noticeable overhead cost), but there should be
limits to everything if we ever want to make any progress.




Ilias Bertsimas

Jan 26, 2013, 5:46:29 PM
to codersh...@googlegroups.com
Hi guys,

I understand the complexity of such a solution and the implications it poses, and that it is better for human intervention to handle the issue.
But, as in our case, by the time the problem becomes apparent we have already ended up with 15-20 minutes of downtime and angry clients.

The solution could be optional and off by default, so it does not impose an overhead at all times or a risk of a cascading failure.
Also, maybe a self-leave functionality, like the one a node performs when it realises it is inconsistent, might be less of an overhead than a cluster-wide coordinated solution.

Kind Regards,
Ilias.

Henrik Ingo

Jan 27, 2013, 9:37:37 AM
to Ilias Bertsimas, codership
On Sun, Jan 27, 2013 at 12:46 AM, Ilias Bertsimas <awar...@gmail.com> wrote:
> But, as in our case, by the time the problem becomes apparent we have
> already ended up with 15-20 minutes of downtime and angry clients.

Yes, but accidentally taking down the whole cluster will also cause
angry clients :-)

FWIW, what you propose is of course perfectly possible to implement
with an external script already. You could monitor values such as
wsrep_flow_control_paused and whatever else seems relevant, then let the
script kill mysqld if you feel that is the right thing to do.
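
A rough sketch of such a watchdog follows. It polls wsrep_flow_control_paused
through the mysql command-line client and shuts mysqld down if the node stays
almost fully paused for several consecutive samples. The thresholds, the
sampling interval, the use of ~/.my.cnf for credentials and the choice of
"mysqladmin shutdown" are illustrative assumptions, not something from this
thread; on the suspect node you may also want to check wsrep_flow_control_sent
and wsrep_local_recv_queue to confirm it really is the culprit before acting.

#!/usr/bin/env python
# Watchdog sketch: take this node out of the cluster if it keeps the
# cluster flow-control-paused almost all of the time.
import subprocess
import time

PAUSE_THRESHOLD = 0.9   # fraction of time spent paused (example value)
BAD_SAMPLES     = 5     # consecutive bad samples before acting
INTERVAL        = 60    # seconds between samples

def flow_control_paused():
    out = subprocess.check_output(
        ["mysql", "-NBe",
         "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'"]).decode()
    return float(out.split()[1])

bad = 0
while True:
    # wsrep_flow_control_paused reports the fraction of time replication
    # was paused since the last FLUSH STATUS, so reset it before sampling.
    subprocess.check_call(["mysql", "-e", "FLUSH STATUS"])
    time.sleep(INTERVAL)
    bad = bad + 1 if flow_control_paused() >= PAUSE_THRESHOLD else 0
    if bad >= BAD_SAMPLES:
        # Shut this node down; an operator decides what happens next.
        subprocess.call(["mysqladmin", "shutdown"])
        break

As Henrik points out above, this is exactly the kind of automation that can
turn one slow node into a cascading failure, so it only makes sense where a
human has already decided that dropping the node is the right trade-off.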

henrik




--
henri...@avoinelama.fi
+358-40-8211286 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://www.linkedin.com/profile/view?id=9522559