Cluster Startup - Understanding and implications of modifications


Philip Haynes

Oct 23, 2015, 12:26:59 AM
To: raft-dev

Hi,


I am implementing a high-availability logging system and am seeking to validate my understanding of some of the leader election criteria and their implications for HA situations.


Our system has three nodes that each listen for upstream events on multicast addresses using the Aeron comms library (https://github.com/real-logic/Aeron). As Aeron provides reliable rather than guaranteed messaging, we temporarily log events and then forward them to upstream processes for further processing, provided those processes are available and flow control says it is safe to push events along our processing pipeline. We are using RocksDB as the temporary data store and must process 1B events per day.
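
For scale, that averages out to roughly 11,600 events per second, as a back-of-envelope figure only (real traffic will of course be bursty):

    package main

    import "fmt"

    func main() {
        const eventsPerDay = 1e9
        const secondsPerDay = 24 * 60 * 60
        // ~11,574 events/sec sustained average; peaks will be far higher.
        fmt.Printf("average rate: %.0f events/sec\n", eventsPerDay/secondsPerDay)
    }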


A key property of our system is that it must log all upstream events since failure to capture an event would mean the system could miss detection of fraudulent activity.


My first question: am I correct in my understanding that in a three-node system at least 2 nodes must start before a leader can be elected? This implies that if we lose two nodes the system is down, and that the third node would sit in a loop putting itself into Candidate mode and voting for itself.


From an availability point of view, this is less than ideal since it would be better if the remaining third node continued to operate.


My second question: from a consistency point of view, what sort of errors would I expose myself to if, on a two-node failure, the remaining node automatically changes the cluster size to one and thus elects itself as the leader?


Kind Regards,

Philip



Arnaud Bailly

Oct 23, 2015, 3:08:24 AM
To: raft...@googlegroups.com
What about the following scenario:
 - 2 nodes fail in a 3-node cluster
 - the remaining node starts operating as a 1-node cluster
 - the remaining node fails
 - one of the other 2 nodes comes back and starts operating

The new 1-node cluster would have no knowledge of what happened on the node that just failed, and no way of acquiring that knowledge from another source, because the data would not have been replicated.

Generally speaking, Raft errs on the side of consistency over availability, hence the requirement that a majority of the cluster's nodes be available. Removing that constraint means forfeiting strong consistency.

My 50 cts
Arnaud 


-- 
Arnaud Bailly

twitter: abailly
skype: arnaud-bailly

Henrik Ingo

Oct 23, 2015, 3:21:20 AM
To: raft...@googlegroups.com
On Fri, Oct 23, 2015 at 7:26 AM, Philip Haynes <haynes...@gmail.com> wrote:
> A key property of our system is that it must log all upstream events since
> failure to capture an event would mean the system could miss detection of
> fraudulent activity.
>

This is an aside, but I can't resist digressing a bit: If your goal is
to detect fraudulent activity, then even if you missed a few events it
would be highly unlikely that exactly those events are the ones that
needed to be detected. Otoh if your requirement is to provide a 100%
accurate audit trail, then of course that's a much harder requirement.

> My first question: am I correct in my understanding that in a three-node
> system at least 2 nodes must start before a leader can be elected? This
> implies that if we lose two nodes the system is down, and that the third
> node would sit in a loop putting itself into Candidate mode and voting for
> itself.
>

Correct.

> From an availability point of view, this is less than ideal since it would
> be better if the remaining third node continued to operate.

From your point of view, yes. The point of requiring a majority of
nodes to elect a primary is to guarantee that there is never more than
one primary in existence. This is to avoid split-brain situations.
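
(For concreteness: the majority needed is floor(N/2)+1 votes, so two
disjoint quorums can never exist in the same cluster. A rough Go sketch
of that arithmetic, purely for illustration:)

    package main

    import "fmt"

    // quorum is the minimum number of votes needed to elect a leader in a
    // Raft cluster of n voting members: a strict majority.
    func quorum(n int) int {
        return n/2 + 1
    }

    func main() {
        for _, n := range []int{1, 2, 3, 4, 5} {
            fmt.Printf("cluster of %d needs %d votes\n", n, quorum(n))
        }
        // A 3-node cluster needs 2 votes, so a lone surviving node can
        // never elect itself under standard Raft.
    }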

> My second question: from a consistency point of view, what sort of errors
> would I expose myself to if, on a two-node failure, the remaining node
> automatically changes the cluster size to one and thus elects itself as
> the leader?

In the case of a network interruption, all 3 nodes would be running
and elect themselves as primary and continue to process log events
independently, leading to 3 diverged databases.

From your use case, it's not clear to me that Raft is an ideal HA
choice for you. It seems to me you don't need to worry about split
brain; rather, you want to maximize write availability. In other words,
you just want to capture as many upstream log events as possible and
then store them in a redundant fashion. But it doesn't seem like you
need, for example, strict ordering of the incoming events. Since your
use case is what I call "insert only", you also don't need to worry
about split brain.

If you need to avoid storing the same upstream log event twice, you
only need some way of detecting duplicate entries. The easiest way to
do this is if the upstream events already contain some unique id from
their respective source systems.
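
A rough sketch of that idea in Go (everything here, including the
EventID field, is invented for illustration; a real system would
persist the seen-set or lean on the store's own keying):

    package main

    import "fmt"

    // Event stands in for an upstream log event; EventID is assumed to be
    // a unique id already assigned by the source system.
    type Event struct {
        EventID string
        Payload []byte
    }

    // Store keeps events keyed by their upstream id, so re-inserting the
    // same event is a harmless no-op rather than a duplicate.
    type Store struct {
        seen map[string]Event
    }

    func NewStore() *Store {
        return &Store{seen: make(map[string]Event)}
    }

    // Insert reports whether the event was newly stored (false = duplicate).
    func (s *Store) Insert(e Event) bool {
        if _, dup := s.seen[e.EventID]; dup {
            return false
        }
        s.seen[e.EventID] = e
        return true
    }

    func main() {
        s := NewStore()
        fmt.Println(s.Insert(Event{EventID: "tx-42"})) // true
        fmt.Println(s.Insert(Event{EventID: "tx-42"})) // false: duplicate ignored
    }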

In short, what you probably want to look into is some form of
multi-master replication with eventual consistency semantics.

henrik






--
henri...@avoinelama.fi
+358-40-5697354 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://fi.linkedin.com/pub/henrik-ingo/3/232/8a7

Philip Haynes

Oct 24, 2015, 12:32:59 AM
To: raft-dev, henri...@avoinelama.fi
Thanks for your clarifications Henrik and Arnaud.

RE: your aside, Henrik. Whilst at a certain level you are correct, around 5% of the transactions we process will be relevant for further processing, or ~50M per day. There is no uniform correlation between an event and the financial / security implications of the activity being monitored. So we are electing to minimise, as far as possible, the risk of missing fund-transfer events by terrorists.

Henrik, you are correct that Raft may not be the perfect choice for my use case. However, having had less than positive experiences implementing consensus algorithms previously, from an engineering perspective the understandability of Raft is very attractive. Thus I am looking to potentially "bend" the algorithm rather than use other approaches that are harder to understand and test. Again, given the cost and complexity of implementing and testing, I am hoping for a single implementation of this.

So it seems the key risk of allowing a single server to act as master is split brain. However, as you point out, "since your use case is what I call "insert only", you also don't need to worry about split brain" - and in the much more likely scenario where the system segments into 3 split brains, none of the nodes is receiving any data anyway.
Thus it would appear that I am not being too naughty in weakening Raft's consistency guarantees slightly by transitioning to Leader state in the event of a vote timeout (even if done as a configurable option, for other scenarios where consistency is more important than availability).
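
In rough pseudocode (Go-flavoured; every name below is a placeholder rather than code from a real Raft library), the change I have in mind is along these lines:

    package main

    import "fmt"

    // Schematic sketch only of the proposed "availability over consistency"
    // option: what happens when a candidate's election times out without a
    // quorum being reachable. All types and fields are placeholders.
    type node struct {
        state         string // "follower", "candidate" or "leader"
        votesGranted  int
        clusterSize   int
        lonePromotion bool // config flag enabling the weakened behaviour
    }

    func (n *node) onElectionTimeout() {
        if n.state == "candidate" && n.lonePromotion {
            // Weakened behaviour: stop waiting for an unreachable quorum
            // and serve alone, accepting the divergence risk discussed.
            n.state = "leader"
            return
        }
        // Standard Raft behaviour: start another election and keep waiting
        // for votes from a majority (clusterSize/2 + 1).
        n.state = "candidate"
        n.votesGranted = 1 // vote for self; RequestVote RPCs would go out here
    }

    func main() {
        n := &node{state: "candidate", clusterSize: 3, lonePromotion: true}
        n.onElectionTimeout()
        fmt.Println(n.state) // "leader" despite never reaching a quorum
    }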

Thanks for your clarifications and feedback.

Kind Regards,
Philip

Oren Eini (Ayende Rahien)

Oct 24, 2015, 12:37:21 AM
To: raft...@googlegroups.com, henri...@avoinelama.fi
Do you actually need to process things in isolated mode? Or do you just need to save them?
Stuff that fails to be saved can get dumped to disk, to be retried when the cluster is up again.

Hibernating Rhinos Ltd
Oren Eini | CEO | Mobile: +972-52-548-6969
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811


Philip Haynes

Oct 24, 2015, 6:56:44 PM
To: raft-dev, henri...@avoinelama.fi
In practice, the system is a bit more complicated than I previously described.
 
Events are received in two packets (high priority and low priority), and based on a decision from a rules engine service as to whether the event is of interest to the rest of the system, we pass the event up the line or not.

The pattern of usage depends on whether the system is in steady state, upstream processes are down, or the system is in recovery mode.

In the steady state, all three nodes receive event information, which is then multicast up the line by the master. Once the upstream processes receive an event, a delete-event request is issued. The system has an average network latency of less than 50 microseconds, so in the normal case complete event processing occurs before the event needs to be stored. A strong master minimises network traffic and limits the complexities associated with different nodes seeing a different event order - that is, a full event process could complete on the master before a follower sees it.
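
Schematically (placeholder names only, not code from our actual system):

    package main

    // Sketch of the steady-state flow described above: every node logs the
    // event, the master forwards it upstream, and an upstream ack triggers
    // a delete so the temporary store stays small.
    type steadyStateNode struct {
        isMaster bool
        store    map[string][]byte // temporary event store (RocksDB in practice)
    }

    func (n *steadyStateNode) onEvent(id string, payload []byte, sendUpstream func(string, []byte)) {
        n.store[id] = payload // every node logs the event first
        if n.isMaster {
            sendUpstream(id, payload) // only the master multicasts up the line
        }
    }

    func (n *steadyStateNode) onUpstreamAck(id string) {
        delete(n.store, id) // upstream has the event; drop the temporary copy
    }

    func main() {
        n := &steadyStateNode{isMaster: true, store: map[string][]byte{}}
        n.onEvent("evt-1", []byte("payload"), func(string, []byte) {})
        n.onUpstreamAck("evt-1")
    }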

When upstream processes are not available, the system logs and monitors for upstream process recovery. As upstream processes recover, the backlog must be processed with appropriate levels of flow control prior to transitioning back to normal operation. 

My first stab at an implementation did as you suggest; however, under load, edge cases emerged, leading to memory leaks and errors in the logic around upstream processes, particularly flow control. For example, if upstream processes have been down for a period, they become overwhelmed almost immediately on recovery, locking the entire system irrecoverably.

Henrik Ingo

Oct 25, 2015, 4:41:14 AM
To: raft...@googlegroups.com
On Sat, Oct 24, 2015 at 7:32 AM, Philip Haynes <haynes...@gmail.com> wrote:
> Henrik, you are correct that Raft may not be the perfect choice for my use
> case. However, having had less than positive experiences implementing
> consensus algorithms previously, from an engineering perspective the
> understandability of Raft is very attractive. Thus I am looking to
> potentially "bend" the algorithm rather than use other approaches that are
> harder to understand and test. Again, given the cost and complexity of
> implementing and testing, I am hoping for a single implementation of this.

You're correct that the elegance and simplicity of Raft are a gift to humanity!

> So it seems the key risk of allowing a single server to act as master is
> split brain. However, as you point out, "since your use case is what I
> call "insert only", you also don't need to worry about split brain" - and
> in the much more likely scenario where the system segments into 3 split
> brains, none of the nodes is receiving any data anyway.
> Thus it would appear that I am not being too naughty in weakening Raft's
> consistency guarantees slightly by transitioning to Leader state in the
> event of a vote timeout (even if done as a configurable option, for other
> scenarios where consistency is more important than availability).

In a way. However, I think the resulting system is fundamentally
different from Raft:

- After nodes have been split, you need to somehow reconcile the
nodes with each other. If you assume insert-only operations, this is
trivial, but it's still a fundamental addition to Raft. (Raft is
concerned with avoiding split brain, you with reconciliation.)

- Once you have implemented some form of reconciliation, you can then
ask yourselves why you would handle network splits as a special case
where single nodes are allowed to become leaders. You could just as
well allow all of them to be leaders all the time, as this will
simplify your implementation.


At this point you're quite far away from Raft, but if it helps you to
think of Raft as a starting point, by all means, enjoy it!

henrik

Oren Eini (Ayende Rahien)

Oct 25, 2015, 4:46:53 AM
To: raft...@googlegroups.com
Note that if you have an insert-only system, you can change things to use a gossip protocol.
Nodes will gossip about their changes, and everything works even in a split-brain scenario.
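
(A toy illustration of why insert-only makes that work: merging two replicas is just a set union, which is commutative and idempotent, so it doesn't matter who gossips with whom or how often. Everything below is invented for the example.)

    package main

    import "fmt"

    // Each node's state is just the set of event ids it has seen.
    type state map[string]bool

    // merge folds another node's gossip into ours; set union is commutative,
    // associative and idempotent, so delivery order and repeats don't matter.
    func (s state) merge(other state) {
        for id := range other {
            s[id] = true
        }
    }

    func main() {
        a := state{"e1": true, "e2": true}
        b := state{"e2": true, "e3": true}
        a.merge(b)
        fmt.Println(len(a)) // 3: both replicas converge on {e1, e2, e3}
    }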

Hibernating Rhinos Ltd
Oren Eini | CEO | Mobile: +972-52-548-6969
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811


Philip Haynes

Oct 26, 2015, 2:21:02 AM
To: raft-dev, henri...@avoinelama.fi
Had a busy day today.

I realised this part of the system can be implemented using an OR-Set CRDT. So mostly the system will work like a normal Raft system, but in the restricted case when things go completely pear-shaped I have a consistency backup with no loss of availability.
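
For anyone following along, the observed-remove set in a heavily simplified nutshell (tags would be UUIDs in practice, and a real implementation also needs a merge / anti-entropy step; this is just a sketch):

    package main

    import "fmt"

    // ORSet is a heavily simplified observed-remove set: each add carries a
    // unique tag, and a remove only tombstones the tags it has observed, so
    // a concurrent add on another replica survives the merge (add wins).
    type ORSet struct {
        adds    map[string]map[string]bool // element -> add-tags
        removes map[string]map[string]bool // element -> removed tags
    }

    func NewORSet() *ORSet {
        return &ORSet{
            adds:    map[string]map[string]bool{},
            removes: map[string]map[string]bool{},
        }
    }

    func (s *ORSet) Add(elem, tag string) {
        if s.adds[elem] == nil {
            s.adds[elem] = map[string]bool{}
        }
        s.adds[elem][tag] = true
    }

    // Remove tombstones every tag currently observed for elem.
    func (s *ORSet) Remove(elem string) {
        if s.removes[elem] == nil {
            s.removes[elem] = map[string]bool{}
        }
        for tag := range s.adds[elem] {
            s.removes[elem][tag] = true
        }
    }

    // Contains: the element is present if some add-tag has not been removed.
    func (s *ORSet) Contains(elem string) bool {
        for tag := range s.adds[elem] {
            if !s.removes[elem][tag] {
                return true
            }
        }
        return false
    }

    func main() {
        s := NewORSet()
        s.Add("event-17", "tag-a")
        s.Remove("event-17")
        s.Add("event-17", "tag-b") // e.g. a concurrent add from another replica
        fmt.Println(s.Contains("event-17")) // true: the unremoved tag survives
    }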

Other parts of the system don't require such high levels of availability.

Thanks for the help - much appreciated.

Philip

Diego Ongaro

Oct 26, 2015, 6:14:09 PM
To: raft...@googlegroups.com
Philip,

Just to clarify, does this mean that a part of your system will be
handled using CRDTs, and beside it another part will be handled using
"normal" Raft? Because I have to agree with Henrik and the others: if
you weaken Raft to maximize availability, you've given up consistency
and are better off with an entirely different approach.

-Diego

Philip Haynes

Oct 30, 2015, 1:33:11 AM
To: raft-dev
Thanks Diego,

At first glance I had been thinking some of the software infrastructure needed for CRDTs would be shared with Raft anyway. However, this implementation has been getting way too complicated and I am backing off this approach.

It turns out I have a much simpler solution to my availability problem. If no master exists, simply log but don't process the event until enough nodes are available and a master can be elected. When the cluster comes back online, process the logs through the Raft cluster.
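
In outline (an illustrative sketch only; the names are placeholders and the real version would persist the backlog in RocksDB rather than memory):

    package main

    import "fmt"

    // Sketch of the "log now, process later" idea: while there is no Raft
    // leader, events are only appended to local storage; once a leader
    // exists again the backlog is drained through the cluster.
    type ingester struct {
        backlog []string // pending raw events, oldest first
    }

    func (in *ingester) onEvent(ev string, leaderAvailable bool, propose func(string)) {
        in.backlog = append(in.backlog, ev) // always log first
        if !leaderAvailable {
            return // no quorum yet: hold the event, don't process it
        }
        for _, pending := range in.backlog {
            propose(pending) // replicate through the Raft cluster, in order
        }
        in.backlog = in.backlog[:0]
    }

    func main() {
        in := &ingester{}
        in.onEvent("e1", false, nil) // buffered only, no leader yet
        in.onEvent("e2", true, func(e string) { fmt.Println("proposed", e) })
    }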

Again help appreciated.
Philip

Diego Ongaro

Nov 6, 2015, 1:44:07 PM
To: raft...@googlegroups.com
Great, that does seem simpler. Glad you ended up in a good place here.
-Diego