High packet loss on a single node brings the whole cluster down


Vince

Nov 1, 2013, 5:33:22 AM11/1/13
to codersh...@googlegroups.com
Has anyone experienced a similar issue?

A similar case I found on Google.

I have a Galera cluster with 4 nodes and a single garbd. High packet loss on the garbd network alone brings all the other nodes into the initialized state. It's easy to reproduce.

wsrep_provider_options="pc.weight=1;evs.join_retrans_period=PT2S;evs.keepalive_period=PT3S;evs.inactive_check_period=PT10S;evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M"
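
For reference, here is a small Python sketch of my own (not anything produced by Galera) that splits the options string above into key/value pairs; the per-parameter comments describe what each setting controls as I understand it from the Codership documentation, so treat them as an interpretation.

# Splits a wsrep_provider_options string into a dict and prints it.
# The comments reflect my reading of the Galera docs, not Galera output.

OPTIONS = ("pc.weight=1;evs.join_retrans_period=PT2S;evs.keepalive_period=PT3S;"
           "evs.inactive_check_period=PT10S;evs.suspect_timeout=PT30S;"
           "evs.inactive_timeout=PT1M;evs.install_timeout=PT1M")

def parse_provider_options(raw):
    """Turn 'key=value;key=value' into a dictionary."""
    return dict(item.split("=", 1) for item in raw.split(";") if item)

if __name__ == "__main__":
    opts = parse_provider_options(OPTIONS)
    # pc.weight                 - this node's weight in the quorum calculation
    # evs.join_retrans_period   - how often JOIN messages are retransmitted
    #                             while the group re-forms (the GATHER state)
    # evs.keepalive_period      - how often keepalives are sent when idle
    # evs.inactive_check_period - how often peers are checked for inactivity
    # evs.suspect_timeout       - silence after which a peer is suspected dead
    # evs.inactive_timeout      - hard limit after which a peer is declared dead
    # evs.install_timeout       - how long to wait for the install message
    #                             ("install timer expired" in the log below)
    for key, value in opts.items():
        print(f"{key:28} {value}")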

All nodes show similar log entries, as follows:

131031 15:25:07 [Warning] WSREP: evs::proto(b138644e-3a4f-11e3-a890-cbf6831de099, GATHER, view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497)) install timer expired
evs::proto(evs::proto(b138644e-3a4f-11e3-a890-cbf6831de099, GATHER, view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497)), GATHER) {
current_view=view(view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497) memb {
        2853b263-3a4c-11e3-8d08-bb69a1cadba2,
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,
        b138644e-3a4f-11e3-a890-cbf6831de099,
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=4216,safe_seq=4187,node_index=node: {idx=0,range=[4217,4216],safe_seq=4216} node: {idx=1,range=[4217,4216],safe_seq=4216
} node: {idx=2,range=[4217,4216],safe_seq=4216} node: {idx=3,range=[4217,4216],safe_seq=4216} node: {idx=4,range=[4217,4216],safe_seq=4187} },
fifo_seq=8055201,
last_sent=4216,
known={
        2853b263-3a4c-11e3-8d08-bb69a1cadba2,evs::node{operational=1,suspected=0,installed=0,fifo_seq=8062200,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=4187,seq_range=-1,aru_seq=4216,flags=4,source=2853b263-3a4c-11e3-8d08-bb69a1cadba2,source_view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8062200,node_list=(  2853b263-3a4c-11e3-8d08-bb69a1cadba2,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        b138644e-3a4f-11e3-a890-cbf6831de099,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4187,im_range=[4217,4216],}
)
},
}

      4e6e7b63-3a59-11e3-83c8-4b32aeffee17,evs::node{operational=1,suspected=0,installed=0,fifo_seq=8011820,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=4187,seq_range=-1,aru_seq=4216,flags=4,source=4e6e7b63-3a59-11e3-83c8-4b32aeffee17,source_view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8011820,node_list=(  2853b263-3a4c-11e3-8d08-bb69a1cadba2,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        b138644e-3a4f-11e3-a890-cbf6831de099,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4187,im_range=[4217,4216],}
)
},
}
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,evs::node{operational=1,suspected=0,installed=0,fifo_seq=7999212,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=4187,seq_range=-1,aru_seq=4216,flags=4,source=63e5752e-3a4d-11e3-a457-76d5146fbfb5,source_view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=7999212,node_list=(  2853b263-3a4c-11e3-8d08-bb69a1cadba2,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        b138644e-3a4f-11e3-a890-cbf6831de099,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4187,im_range=[4217,4216],}
)
},
}
        b138644e-3a4f-11e3-a890-cbf6831de099,evs::node{operational=1,suspected=0,installed=0,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=4187,seq_range=-1,aru_seq=4216,flags=0,source=b138644e-3a4f-11e3-a890-cbf6831de099,source_view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8055201,node_list=(  2853b263-3a4c-11e3-8d08-bb69a1cadba2,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        b138644e-3a4f-11e3-a890-cbf6831de099,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4216,im_range=[4217,4216],}
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4187,im_range=[4217,4216],}
)
},
}
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,evs::node{operational=1,suspected=0,installed=0,fifo_seq=7929206,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=4187,seq_range=-1,aru_seq=4187,flags=4,source=e4fb10ff-3a77-11e3-9767-863d8c1156cd,source_view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=7929206,node_list=(  2853b263-3a4c-11e3-8d08-bb69a1cadba2,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4190,im_range=[4217,4216],}
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,node: {operational=1,suspected=1,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4190,im_range=[4191,4190],}
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4190,im_range=[4188,4216],}
        b138644e-3a4f-11e3-a890-cbf6831de099,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4190,im_range=[4217,4216],}
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497),safe_seq=4187,im_range=[4217,4216],}
)
},
}
 }
 }
131031 15:25:07 [Note] WSREP: no install message received
131031 15:25:07 [Note] WSREP: view(view_id(NON_PRIM,2853b263-3a4c-11e3-8d08-bb69a1cadba2,497) memb {
        b138644e-3a4f-11e3-a890-cbf6831de099,
} joined {
} left {
} partitioned {
        2853b263-3a4c-11e3-8d08-bb69a1cadba2,
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,
})
131031 15:25:07 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
131031 15:25:07 [Note] WSREP: Flow-control interval: [16, 16]
131031 15:25:07 [Note] WSREP: Received NON-PRIMARY.
131031 15:25:07 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 2021088)
131031 15:25:07 [Note] WSREP: New cluster view: global state: 1ddd4eed-ee2e-11e2-0800-45d2899001c9:2021088, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
131031 15:25:07 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131031 15:25:07 [Note] WSREP: view(view_id(NON_PRIM,b138644e-3a4f-11e3-a890-cbf6831de099,498) memb {
        b138644e-3a4f-11e3-a890-cbf6831de099,
} joined {
} left {
} partitioned {
        2853b263-3a4c-11e3-8d08-bb69a1cadba2,
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,
})
131031 15:25:07 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
131031 15:25:07 [Note] WSREP: Flow-control interval: [16, 16]
131031 15:25:07 [Note] WSREP: Received NON-PRIMARY.
131031 15:25:07 [Note] WSREP: New cluster view: global state: 1ddd4eed-ee2e-11e2-0800-45d2899001c9:2021088, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
131031 15:25:07 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131031 15:25:07 [Note] WSREP: declaring 2853b263-3a4c-11e3-8d08-bb69a1cadba2 stable
131031 15:25:07 [Note] WSREP: view(view_id(NON_PRIM,2853b263-3a4c-11e3-8d08-bb69a1cadba2,499) memb {
        2853b263-3a4c-11e3-8d08-bb69a1cadba2,
        b138644e-3a4f-11e3-a890-cbf6831de099,
} joined {
} left {
} partitioned {
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,
})
131031 15:25:07 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
131031 15:25:07 [Note] WSREP: Flow-control interval: [23, 23]
131031 15:25:07 [Note] WSREP: Received NON-PRIMARY.
131031 15:25:07 [Note] WSREP: New cluster view: global state: 1ddd4eed-ee2e-11e2-0800-45d2899001c9:2021088, view# -1: non-Primary, number of nodes: 2, my index: 1, protocol version 2
131031 15:25:07 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131031 15:25:07 [Note] WSREP: declaring 2853b263-3a4c-11e3-8d08-bb69a1cadba2 stable
131031 15:25:07 [Note] WSREP: declaring 63e5752e-3a4d-11e3-a457-76d5146fbfb5 stable
131031 15:25:07 [Note] WSREP: view(view_id(NON_PRIM,2853b263-3a4c-11e3-8d08-bb69a1cadba2,500) memb {
        2853b263-3a4c-11e3-8d08-bb69a1cadba2,
        63e5752e-3a4d-11e3-a457-76d5146fbfb5,
        b138644e-3a4f-11e3-a890-cbf6831de099,
} joined {
} left {
} partitioned {
        4e6e7b63-3a59-11e3-83c8-4b32aeffee17,
        e4fb10ff-3a77-11e3-9767-863d8c1156cd,
})
131031 15:25:07 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 2, memb_num = 3
131031 15:25:07 [Note] WSREP: Flow-control interval: [28, 28]
131031 15:25:07 [Note] WSREP: Received NON-PRIMARY.
131031 15:25:07 [Note] WSREP: New cluster view: global state: 1ddd4eed-ee2e-11e2-0800-45d2899001c9:2021088, view# -1: non-Primary, number of nodes: 3, my index: 2, protocol version 2
131031 15:25:07 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.


Thanks in advance.

  

Alex Yurchenko

Nov 1, 2013, 8:23:38 AM11/1/13
to codersh...@googlegroups.com
Hi,

What is it that you want to know? That one punctured tyre brings the
whole car down? Or that you can hardly talk over the phone if half of
the sound is lost?

Like anything, Galera has its limits - and this is one of them. And it
is more of a logical limit: maybe the packet loss tolerance can be
improved, but not by much, and the post you're referring to is a great
example: there is always a grey area between 0% and 100% packet loss
where you can neither declare the node failed nor communicate with it
reliably.
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Vince

Nov 1, 2013, 11:34:28 PM11/1/13
to codersh...@googlegroups.com
I don't think it's a logical limit. Take the 5 nodes, including the garbd, with pc.weight=1 for each of them: why would a packet-loss problem on a single node affect the quorum calculation and bring all the other nodes to the initialized state?

Alex Yurchenko

Nov 2, 2013, 5:29:27 AM11/2/13
to codersh...@googlegroups.com
On 2013-11-02 05:34, Vince wrote:
> I don't think it's a logical limit. Take the 5 nodes, including the
> garbd, with pc.weight=1 for each of them: why would a packet-loss
> problem on a single node affect the quorum calculation and bring all
> the other nodes to the initialized state?

Well, before you even start thinking about *quorum*, you need to come to
a *consensus* about membership. And you know what *consensus* means:
*everybody* must agree. Now suppose you have nodes A, B, C, D, E, and
for the sake of argument E has such (say 50%) packet loss that in a
given round it sees only B and D. The overall picture is:

A sees A, B, C, D
B sees A, B, C, D, E
C sees A, B, C, D
D sees A, B, C, D, E
E sees B, D, E

So not only can all 5 of them not agree on the membership, even the 4
"well connected" nodes can't. And more than half of the nodes do see E.
It may still seem quite simple to you, since you know that the packet
loss is on E. But how do the nodes know? Maybe the packet loss is
actually on A and C?

So by default Galera tries 3 times and gives up (instead of trying
indefinitely, which would just look like hanging).

Had the packet loss been lower, one of the three attempts would have
resulted in all of A, B, C, D seeing E and reaching consensus on a
5-node membership. Had the packet loss been higher, one of the attempts
would have resulted in none of A, B, C, D seeing E and reaching
consensus on a 4-node membership.

Only then can you get to the "quorum" calculation.
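
To make this concrete, here is a purely illustrative Python sketch (not Galera's actual EVS code) that simulates a few rounds of such lossy views and checks whether all five nodes report the same membership set. With 50% packet loss on E the views rarely agree, which is exactly the state in which no install message can be produced.

import random

# Toy simulation of the membership-agreement problem: each node reports the
# set of peers it heard from in a round, and a new view can only be installed
# if everybody reports exactly the same set. Illustration only, not Galera code.

NODES = ["A", "B", "C", "D", "E"]
LOSS = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": 0.5}  # 50% loss on E

def observed_view(node):
    """The set of peers `node` heard from in one round (it always sees itself)."""
    view = {node}
    for peer in NODES:
        if peer == node:
            continue
        # A message gets through only if neither endpoint drops it.
        if random.random() > LOSS[node] and random.random() > LOSS[peer]:
            view.add(peer)
    return view

random.seed(1)
for rnd in range(1, 6):
    views = {n: observed_view(n) for n in NODES}
    agreed = len({frozenset(v) for v in views.values()}) == 1
    print(f"round {rnd}: everyone agrees on membership: {agreed}")
    for n in NODES:
        print(f"  {n} sees {sorted(views[n])}")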

Vince

Nov 2, 2013, 8:24:01 AM11/2/13
to codersh...@googlegroups.com
Thanks for the info.

I do understand that Galera's goal is data consistency and that consensus is required for membership. I've been using/testing Galera for only a few months and all the information I have is from the Codership wiki, so I may have some misunderstandings. I assumed that if any node passed the hard limit of evs.inactive_timeout, that would be broadcast to the whole cluster and the node would be removed. Having read your reply, I now see I was too naive; it's hard to know which node is actually suffering packet loss. Is it possible to configure Galera to drop a node out of the cluster if more than 2 of the members report that it has passed evs.suspect_timeout or evs.inactive_timeout?

Anyway, I know the actual mechanism can be more complex, but the result should still be that the problem (packet-loss) node is either removed or remains in the cluster. If it is removed, the problem is solved. If it keeps its membership despite the high packet loss, it becomes a split-brain situation, but I still don't see why a single split-brain node in a five-node cluster can affect the whole cluster. Shouldn't a split-brain situation be avoided and corrected by the quorum calculation too? It's weird to have the whole cluster in the "initialized" state while the remaining four nodes are still in sync.

Alex Yurchenko

Nov 2, 2013, 12:22:39 PM11/2/13
to codersh...@googlegroups.com
On 2013-11-02 14:24, Vince wrote:
> Thanks for the info.
>
> I do understand that Galera's goal is data consistency and that
> consensus is required for membership. I've been using/testing Galera
> for only a few months and all the information I have is from the
> Codership wiki, so I may have some misunderstandings. I assumed that
> if any node passed the hard limit of evs.inactive_timeout, that would
> be broadcast to the whole cluster and

As you should have seen already, the tricky word is "whole". The rest
is trivial.

> the node would be removed. Having read your reply, I now see I was too
> naive; it's hard to know which node is actually suffering packet loss.
> Is it possible to configure Galera to drop a node out of the cluster
> if more than 2 of the members report that it has passed
> evs.suspect_timeout or evs.inactive_timeout?

You're missing the point about evs.suspect_timeout and
evs.inactive_timeout: they are not relevant to establishing the
membership - there is a membership protocol for that. And again you're
falling into the "2 of the members" construct. Who are the members?

> Anyway, I know the actual mechanism can be more complex, but the
> result

With packet loss it is really complex. Getting back to the old example,
in the first round it was

>> A sees A, B, C, D
>> B sees A, B, C, D, E
>> C sees A, B, C, D
>> D sees A, B, C, D, E
>> E sees B, D, E

then in the next round it can be

>> A sees A, B, C, D, E
>> B sees A, B, C, D
>> C sees A, B, C, D, E
>> D sees A, B, C, D, E
>> E sees A, C, D, E

and now there is no way to figure out which node is the culprit. Add to
this that communication is two-way, with retransmissions and TCP
timeouts, and you should understand how hairy it can get.
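
As a small illustration of that ambiguity (again just a sketch, not anything Galera runs), take the two rounds above and ask, from each node's local point of view, whom it failed to hear from. The answers point at different nodes in different rounds, so no single culprit can be identified.

# The two rounds quoted above, written as "who each node heard from".
NODES = {"A", "B", "C", "D", "E"}

ROUND1 = {"A": {"A", "B", "C", "D"}, "B": set(NODES), "C": {"A", "B", "C", "D"},
          "D": set(NODES), "E": {"B", "D", "E"}}
ROUND2 = {"A": set(NODES), "B": {"A", "B", "C", "D"}, "C": set(NODES),
          "D": set(NODES), "E": {"A", "C", "D", "E"}}

for name, views in (("round 1", ROUND1), ("round 2", ROUND2)):
    print(name)
    for node in sorted(views):
        missed = sorted(NODES - views[node])   # peers this node did not hear from
        print(f"  {node} missed {missed if missed else 'nobody'}")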

> should still be that the problem (packet-loss) node is either removed
> or remains in the cluster. If it is removed, the problem is solved. If
> it keeps its membership despite the high packet loss, it becomes a
> split-brain

And you're missing what split-brain is.

> situation, but I still don't see why a single split-brain node in a
> five-node cluster can affect the whole cluster. Shouldn't a split-brain
> situation be avoided and corrected by the quorum calculation too? It's
> weird to have the whole cluster in the "initialized" state while the
> remaining four nodes are still in sync.

Group Communication IS weird. If you really want to make educated
guesses about it, I'd suggest you read some academic research on it.
Last time I checked there were no textbooks, so you'll have to make do
with CS papers. Start with Yair Amir's PhD thesis.

But the bottom line is: everything has its LIMITS for correct operation.
It is naive to expect that any cluster can be made operational in an
arbitrarily poor network (if not for anything else, then because it
simply won't be able to replicate). And before you say that the network
between the 4 nodes is great: you had a cluster of 5 nodes, and the
network in the *cluster* was crappy. Don't use crappy links (there are
so many good ones). It is plain pointless. There is nothing you can gain
by running the 5th node over a crappy link, except really bad
performance on the verge of downtime. (Note, performance will be crappy
long before the cluster finally disintegrates.)

Galera is a SYNCHRONOUS replication cluster. It is only natural to
expect that it needs at least a semi-decent network to operate. For
crappy links they invented ASYNCHRONOUS replication. (In CAP theorem
terms "synchronous" means "consistency", so you end up trading off
between "partition tolerance" and "availability". In this case you
avoided a partition (stayed with 5 nodes) and ended up with 0
availability.) So, use asynchronous replication for crappy links. It
works like a charm.

Vince

Nov 2, 2013, 2:31:42 PM11/2/13
to codersh...@googlegroups.com

You're missing the point about evs.suspect_timeout and
evs.inactive_timeout: they are not relevant to establishing the
membership - there is a membership protocol for that. And again you're
falling into the "2 of the members" construct. Who are the members?

In this case we have to trust the other four members with a specific rating, maybe keeping any node with 3 out of 5 synced connections.
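
To make the idea concrete, here is a toy sketch of that rating (my own illustration only, nothing Galera supports): keep a node only if enough of the other members report a synced connection to it. The health reports below are made up.

# Toy "rating" heuristic: keep a node only if at least THRESHOLD of the other
# members report a healthy (synced) link to it. Illustration only.

THRESHOLD = 3

# reports[x][y] is True when "x says its connection to y is fine".
reports = {
    "A": {"B": True,  "C": True,  "D": True,  "E": False},
    "B": {"A": True,  "C": True,  "D": True,  "E": True},
    "C": {"A": True,  "B": True,  "D": True,  "E": False},
    "D": {"A": True,  "B": True,  "C": True,  "E": True},
    "E": {"A": False, "B": True,  "C": False, "D": True},
}

def votes_for(node):
    """How many other members report a healthy link to `node`."""
    return sum(1 for peer, links in reports.items()
               if peer != node and links.get(node, False))

for node in sorted(reports):
    votes = votes_for(node)
    verdict = "keep" if votes >= THRESHOLD else "drop"
    print(f"{node}: {votes}/4 peer votes -> {verdict}")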


And you're missing what split-brain is.


Sorry about that. I really have no in-depth knowledge of how Galera actually works; what I mean here is out of sync. OK, dropping the problem node is the easiest solution, even if you can't determine which node should be dropped just from the keepalive/membership protocol. If you need to maintain consistency, there should be a sequence number and a time-to-live record, which should make it easier to determine which node is the problem.
We are evaluating Galera precisely because of its consistency guarantees; our situation does need data consistency. I am just looking for a way to make Galera work as a consistent database: when facing any server lag or packet loss, I don't mind the cluster simply dropping that node. I can't trade off consistency or availability, but I would rather drop a doubtful node.

We already have a custom-built peer quorum script to detect packet loss/server lag in our cluster. However, my tests show that even when we can detect the problem node and shut it down, the rest of the servers won't come back online, I guess because of the packet loss. Since all the other nodes are in the "initialized" state, I am not sure whether bootstrapping any single node would lose any data. Do they roll everything back and remain the same as before going into the "initialized" state?

I am also wondering: if I tune down evs.suspect_timeout/evs.inactive_timeout and tune up evs.join_retrans_period to, say, 15 minutes, would it make the problem node easier to drop?

I am certainly not knowledgeable enough to go through those theses. But I do have some experience with simple peer scripting to maintain data synchronization on clusters; we just rely on a sequence number and a time-to-live to drop outdated nodes. I know Galera is more complex, but I think it should be doable. To be clear, by "doable" I mean configuring it to drop the doubtful node while maintaining consistency and availability.



Alex Yurchenko

Nov 4, 2013, 2:48:57 PM11/4/13
to codersh...@googlegroups.com
You can check the wsrep_last_committed status variable. If you stick to
the practice of bootstrapping from the node with the highest seqno, you
should not lose any data.
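
As a sketch only (it assumes the pymysql package; hostnames and credentials below are placeholders), checking wsrep_last_committed on each reachable node and picking the highest could look like this:

import pymysql

NODES = ["node1", "node2", "node3", "node4"]   # placeholder hostnames

def last_committed(host):
    """Read wsrep_last_committed from one node."""
    conn = pymysql.connect(host=host, user="monitor", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_last_committed'")
            _, value = cur.fetchone()
            return int(value)
    finally:
        conn.close()

if __name__ == "__main__":
    seqnos = {host: last_committed(host) for host in NODES}
    for host, seqno in sorted(seqnos.items(), key=lambda kv: -kv[1]):
        print(f"{host}: wsrep_last_committed={seqno}")
    best = max(seqnos, key=seqnos.get)
    print(f"bootstrap from {best} (highest seqno)")

If mysqld is already stopped on a node, the seqno has to come from grastate.dat or from mysqld --wsrep-recover instead.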

> I am also wondering: if I tune down
> evs.suspect_timeout/evs.inactive_timeout and tune up
> evs.join_retrans_period to, say, 15 minutes, would it make the problem
> node easier to drop?

I doubt that.

> I am certainly not knowledgeable enough to go through those theses.
> But I do have some experience with simple peer scripting to maintain
> data synchronization on clusters; we just rely on a sequence number
> and a time-to-live to drop outdated nodes. I know Galera is more
> complex, but I think it should be doable. To be clear, by "doable" I
> mean configuring it to drop the doubtful node while maintaining
> consistency and availability.

If you managed to make a script that can reliably detect and shut down
the node which loses packets - that's awesome. We could not. The above
was to explain why we don't have it as a high priority. But maybe we'll
have to get back to it one day.

Vince

Nov 5, 2013, 5:40:26 AM11/5/13
to codersh...@googlegroups.com

You can check the wsrep_last_committed status variable. If you stick to
the practice of bootstrapping from the node with the highest seqno, you
should not lose any data.


Oh, thanks a lot, that will solve our problem for now.

If you managed to make a script that can reliably detect and shut down
the node which loses packets - that's awesome. We could not. The above
was to explain why we don't have it as a high priority. But maybe we'll
have to get back to it one day.



Our implementation is quite simple; the main problem we have is that we can't do long transaction commits.

Anyway, here's our logic; so far it works, with the trade-off of slow commits.

Peer connections use heartbeat and TTL, retrying only a limited number of times if a connection is lost. A node is live as long as it is connected to any single node of the cluster.
We only have sequence-number requests and commit requests; a request will route through another node if there is no direct connection.
The problem we had is getting the majority to agree on a sequence number; especially with an even number of nodes, there is a delay in getting it done. Packet loss is not a problem, since any node that does not respond to requests will be dropped after the retries anyway.
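
Very roughly, the commit path described above looks like the toy sketch below (all names and loss rates are made up for illustration; the real network calls, heartbeats, and TTL handling are elsewhere):

import random

# A commit is accepted only once a strict majority of the cluster (including
# ourselves) acknowledges the proposed sequence number. Retries per peer are
# what makes commits slow under packet loss.

PEERS = ["n1", "n2", "n3", "n4"]               # the other nodes; we are the 5th
LOSS = {"n1": 0.0, "n2": 0.0, "n3": 0.0, "n4": 0.6}
RETRIES = 3
CLUSTER_SIZE = len(PEERS) + 1

def request_ack(peer, seqno):
    """Fake network call: True when the ack makes it back."""
    return random.random() > LOSS[peer]

def commit(seqno):
    acks = 1                                   # our own vote
    for peer in PEERS:
        if any(request_ack(peer, seqno) for _ in range(RETRIES)):
            acks += 1                          # peer answered within the retries
    return acks > CLUSTER_SIZE // 2            # strict majority required

random.seed(7)
for seqno in range(100, 105):
    print(f"seqno {seqno}: committed={commit(seqno)}")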

As you can see, we have a performance issue on commit; that's why we are looking at Galera at the moment. I strongly disagree that data consistency has to be traded off against availability; the main purpose of a cluster should be availability.

 
 