Breaking out the failure detector

Martin Furmanski

Jan 30, 2018, 9:47:46 AM
to raft-dev
Does anyone have thoughts on the feasibility and appropriateness of breaking out the failure detection from the core of Raft?

The basic idea is that failure detection would run as a service alongside Raft and they would thus only interact when there are suspicions.

The followers would monitor the current leader and start elections on suspicions.
The leader would optionally monitor followers and step-down if it suspects that it has not got a quorum.

The failure detector would then be pluggable and could for example be an implementation of the phi-accrual failure detector.
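
To make it concrete, a pluggable detector might look roughly like the sketch below. This is only a sketch with made-up names, not code from an existing implementation, and the phi calculation uses a simplified exponential model of heartbeat gaps rather than the windowed normal distribution from the phi-accrual paper.

```go
package detector

import (
	"math"
	"time"
)

// FailureDetector is a hypothetical interface the Raft layer would consult
// only when it needs a suspicion verdict about a peer.
type FailureDetector interface {
	Heartbeat(peer string, at time.Time) // record an observed heartbeat
	Phi(peer string, now time.Time) float64
	Suspect(peer string, now time.Time) bool
}

// PhiAccrual is a simplified phi-accrual detector: it tracks the mean gap
// between heartbeats per peer and reports how implausible the current
// silence is given that history.
type PhiAccrual struct {
	Threshold float64 // e.g. 8.0; higher means fewer false positives but slower detection
	meanGap   map[string]time.Duration
	lastBeat  map[string]time.Time
}

func NewPhiAccrual(threshold float64) *PhiAccrual {
	return &PhiAccrual{
		Threshold: threshold,
		meanGap:   make(map[string]time.Duration),
		lastBeat:  make(map[string]time.Time),
	}
}

func (d *PhiAccrual) Heartbeat(peer string, at time.Time) {
	if last, ok := d.lastBeat[peer]; ok {
		gap := at.Sub(last)
		// exponentially weighted moving average of the inter-arrival time
		d.meanGap[peer] = (d.meanGap[peer]*9 + gap) / 10
	} else {
		d.meanGap[peer] = time.Second // arbitrary prior before the first real sample
	}
	d.lastBeat[peer] = at
}

// Phi returns -log10(P(the next heartbeat is still on its way)), assuming
// exponentially distributed gaps: phi = elapsed / (mean * ln 10).
func (d *PhiAccrual) Phi(peer string, now time.Time) float64 {
	last, ok := d.lastBeat[peer]
	if !ok {
		return 0
	}
	elapsed := now.Sub(last).Seconds()
	mean := d.meanGap[peer].Seconds()
	return elapsed / (mean * math.Ln10)
}

func (d *PhiAccrual) Suspect(peer string, now time.Time) bool {
	return d.Phi(peer, now) > d.Threshold
}
```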

Are there any major benefits of having the failure detection baked into the core of the protocol, which I am missing here?
The Raft papers are generally quite good at separating out various concerns in most other areas.

The elections themselves of course need a bit of randomness in when they are scheduled in order to behave nicely, but I think that is also a separate concern.
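
Something like the following is what I have in mind for keeping the scheduling concern separate: the detector only raises a suspicion, and the Raft side still adds jitter and re-checks before actually campaigning. This is just a sketch with made-up names, not a concrete proposal for an API.

```go
package raftelection

import (
	"math/rand"
	"time"
)

// scheduleElection sketches keeping election scheduling separate from failure
// detection: a suspicion only arms a randomized timer, and the node re-checks
// the suspicion when the timer fires before actually campaigning.
// stillSuspected and campaign are hypothetical hooks; base is illustrative.
func scheduleElection(base time.Duration, stillSuspected func() bool, campaign func()) *time.Timer {
	// base timeout plus uniform jitter, so followers that suspect the leader
	// at the same moment do not all campaign at once
	delay := base + time.Duration(rand.Int63n(int64(base)))
	return time.AfterFunc(delay, func() {
		if stillSuspected() {
			campaign()
		}
	})
}
```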

In practical systems I think this would work better, for example in situations where you are sending a lot of data through the Raft machinery and heartbeats, i.e. the failure detection, thus compete with a potentially very loaded data-processing path. Separating the concerns opens up the possibility to better engineer the QoS of the failure detection: separate channels, separate processes, separate priority levels, etc.

Has anyone been experimenting with or thinking something similar?

Best Regards,
Martin Furmanski

Oren Eini (Ayende Rahien)

Jan 30, 2018, 1:01:05 PM
to raft...@googlegroups.com
The major problem with this is that it is quite easy to get into a state where the leader cannot proceed but the failure detector thinks everything is fine.
In particular, if you are sending events and the followers can't catch up quickly enough, you want to know that, and not continue accepting commands and making things worse.

Martin Furmanski

Jan 30, 2018, 1:19:08 PM
to raft-dev
I'm not really able to follow the logic here. By “sending events” I have to assume you mean AppendEntries.Request, since that is what you mainly send to followers. You are arguing that if they are not able to handle those and thus respond quickly enough, then I guess you mean the Leader should stop accepting NewEntry.Requests. But there is nothing that limits that when sending heartbeats in the traditional Raft sense either. Could you expand on your argument here, perhaps being clearer about which messages, events and commands you are referring to, and give some example of where it would break down?

Karl Nilsson

Jan 30, 2018, 1:38:42 PM
to raft...@googlegroups.com
I'm curious - in this scenario would you have the leader step down voluntarily if the followers seem slow? If the followers are slow, what makes any of them more likely to be a suitable leader?

Do you use pipelining of commands? If so, it is fairly trivial to put a limit on the number of in-flight entries allowed.
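
A bound like that can be as small as the sketch below; the names are illustrative and it is not taken from my implementation.

```go
package raftpipeline

import "errors"

// ErrPipelineFull signals that the leader already has too many
// unacknowledged AppendEntries outstanding to a follower.
var ErrPipelineFull = errors.New("too many in-flight entries")

// inFlightWindow sketches a per-follower bound on pipelined entries: the
// leader stops pushing new AppendEntries once maxInFlight are unacknowledged,
// which naturally applies backpressure to new client proposals.
type inFlightWindow struct {
	maxInFlight int
	inFlight    int
}

func (w *inFlightWindow) trySend() error {
	if w.inFlight >= w.maxInFlight {
		return ErrPipelineFull
	}
	w.inFlight++
	return nil
}

func (w *inFlightWindow) onAck(acked int) {
	w.inFlight -= acked
	if w.inFlight < 0 {
		w.inFlight = 0
	}
}
```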

Martin Furmanski

Jan 30, 2018, 1:48:47 PM
to raft-dev
I'm not entirely sure what you are asking or which scenario you are referring to.
Leader step-down is already described in the Raft literature and it would work basically the same regardless of the failure detector. If the leader cannot reaffirm its leadership, whether through heartbeat responses in the traditional model or through monitoring of follower connectivity with a separate failure detector, then it will step down.
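
Whatever detector produces the per-follower verdicts, the step-down decision itself is just a quorum count, roughly like this (a sketch with made-up names):

```go
package raftlease

// shouldStepDown sketches the leader-side check: the leader steps down when,
// according to whichever failure detector is plugged in, it can no longer
// reach a quorum of the cluster (counting itself). peers excludes the leader,
// and reachable is a hypothetical callback returning the detector's verdict.
func shouldStepDown(peers []string, reachable func(peer string) bool) bool {
	alive := 1 // the leader counts itself
	for _, p := range peers {
		if reachable(p) {
			alive++
		}
	}
	quorum := (len(peers)+1)/2 + 1
	return alive < quorum
}
```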

“Followers being slow” is not a precise enough description to analyse, really. What type of slowness are you talking about? Most assumptions in this context, and in distributed systems in general, do not consider partially faulty processes; we generally consider two failure cases, a crash or a network partition. As far as I know, slow instances are not specifically argued about in any of the Raft literature.

Feel free to expand on your problem statement if you wish, but please keep it on topic and related to the breaking out of the failure detection component in Raft. Otherwise perhaps you should create another topic for it to be discussed.

Karl Nilsson

Jan 30, 2018, 2:54:31 PM
to raft...@googlegroups.com
I was trying to work out why Oren felt append entries were essential for detecting “follower slowness” independently of a separate failure detector. I think this qualifies as on topic, but as the OP I guess you are the ultimate arbiter of that.

I think a separate failure detector could work well and that is what I use in an implementation I am currently working on. Combined with pre-vote it should allow leaders to remain stable whilst achieving low partition detection latency as well as adapting better to changing network conditions. For systems where you may have many independent Raft clusters running inside the same system (multi-raft) it soon becomes unhelpful to have them all do their own failure detection.
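
To give a feel for the multi-raft point: with one node-level detector shared by all shards, the per-shard work reduces to a lookup against that shared detector, roughly as below. All names are hypothetical and this is not my actual implementation.

```go
package multiraft

import "time"

// NodeDetector stands in for whatever node-level failure detector the system
// runs once per process (e.g. phi-accrual over node-to-node heartbeats).
type NodeDetector interface {
	Suspect(nodeID string, now time.Time) bool
}

// group is one Raft shard; its current leader lives on some node.
type group struct {
	id         string
	leaderNode string
}

// suspectedGroups returns the shards whose leader sits on a node the shared
// detector currently suspects, so only those shards need to consider an
// election. With one detector per shard, every shard would instead pay for
// its own heartbeating.
func suspectedGroups(groups []group, d NodeDetector, now time.Time) []string {
	var suspected []string
	for _, g := range groups {
		if d.Suspect(g.leaderNode, now) {
			suspected = append(suspected, g.id)
		}
	}
	return suspected
}
```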

Oren Eini (Ayende Rahien)

Jan 30, 2018, 2:56:57 PM
to raft...@googlegroups.com
Follower slowness may indicate a problem for the cluster as well.
For example, if the I/O is swamped, the follower may not be able to process events quickly enough. In that case, the leader will abort and an error will be sent to the user.
With a side channel for errors, you might be able to continue being the leader, and start stockpiling problems for later.

Oren Eini (Ayende Rahien)

Jan 30, 2018, 2:57:22 PM
to raft...@googlegroups.com
Better to have an explicit and immediate error. The admin can then choose to increase capacity, increase the election timeout or change behavior.

Karl Nilsson

Jan 30, 2018, 3:24:13 PM
to raft...@googlegroups.com
That sounds more like a metrics and monitoring concern than something that needs to be an explicit error.
Unless you mean that the followers aren’t making any progress at all, which should be detectable independently of any external failure detector. I believe the OP suggests that the leader is monitored by the followers to detect potential non-fail-stop failure scenarios.

If I understand correctly, your concern is that if the leader detects “slowness” and steps down, the followers won’t detect this because the failure detector still makes it appear as if all is well? If so, that is a reasonable concern; however, in my implementation I value stability and would not have a leader step down because progress is slow. Different trade-offs, I guess.



Oren Eini (Ayende Rahien)

Jan 30, 2018, 3:27:16 PM
to raft...@googlegroups.com
Leader stepping down is something that happens. 
I would rather have everything (code, ops, etc.) be ready for that, rather than have the world melt when it happens.

jordan.h...@gmail.com

Jan 30, 2018, 3:31:52 PM
to raft...@googlegroups.com
We do exactly this in Atomix. We use a separate phi-accrual failure detector both for leader election timeouts and for session expiration. To maintain randomness for leader election timeouts, we check failure detectors at semi-random intervals. To maintain consistency for sessions, leaders detect failures and commit an entry to expire the sessions. Also, we do what someone termed “multi-Raft” (sharding) and use a single failure detector for all shards.
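
Roughly, the semi-random checks look something like the sketch below; this is not our actual code, just an illustration of the idea, and the hooks and interval bounds are made up.

```go
package electioncheck

import (
	"math/rand"
	"time"
)

// monitorLeader sketches checking a failure detector at semi-random intervals
// instead of using a fixed election timeout: each follower polls the detector
// with jitter, so followers that notice a dead leader do not all campaign in
// the same instant. Requires maxInterval > minInterval; suspect and campaign
// are hypothetical hooks.
func monitorLeader(minInterval, maxInterval time.Duration, suspect func() bool, campaign func(), stop <-chan struct{}) {
	for {
		jitter := time.Duration(rand.Int63n(int64(maxInterval - minInterval)))
		select {
		case <-stop:
			return
		case <-time.After(minInterval + jitter):
			if suspect() {
				campaign()
				return
			}
		}
	}
}
```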

The reason we had to do this in Atomix is because we use clients’ sessions for external leader elections via a Raft state machine, and detecting failures and electing new leaders quickly is important for us (working in SDN we need to be able to elect new leaders quickly to maintain control of the network). But the problem was, when a Raft leader was killed, we would have to wait for a new leader to be elected before the leader election state machine could make any changes related to that same failure. And because of the time it can take to elect a Raft leader, sessions’ timeouts need to be reset after a new leader is elected so arbitrarily long leader changes don’t lead to expired sessions. So, when a Raft leader crashes, it’s important for us to detect the failure quickly, elect a new leader, and reset session timeouts as quickly as possible.

In practice this has significantly decreased our clients’ leader election failover times. The risk here is just unnecessary leader elections from false positives in the failure detector. But with the pre-vote protocol also implemented, there’s less risk of false positives from aggressive failure detection leading to unnecessary leader changes.