Election timeout in geo-distributed servers


ruby...@gmail.com

Dec 4, 2020, 2:14:14 AM
to raft-dev
I am developing a Raft library (https://github.com/akiradeveloper/lol) and want to improve its heartbeat and election timeout handling.

In Raft, leadership is maintained by periodic heartbeats. When a follower receives no heartbeat within the election timeout, it starts a new election.

In the Raft paper, both the heartbeat interval and the election timeout are fixed. This works only in environments where the latency between any two nodes is roughly the same and is known prior to deployment.

This assumption is broken when we deploy Raft nodes in a geo-distributed environment, where the latencies between nodes aren't necessarily the same.

If you know of a good solution, please let me know.

I have an idea but am not sure it is correct. In my algorithm, a follower decides the next election timeout based on the times x and y of the previous two heartbeats; the next election timeout is calculated as f(y - x). We can extend this to the previous N heartbeats and drop outliers. This looks reasonable to me, but I'm not sure, so please give me comments.
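
Here is a rough sketch of what I mean (just an illustration; the choice of f, the window size, and the outlier handling are exactly the parts I'm unsure about):

use std::time::{Duration, Instant};

// Keep the arrival times of the last N heartbeats and derive the next
// election timeout from the observed intervals.
struct TimeoutEstimator {
    arrivals: Vec<Instant>, // most recent last
    window: usize,          // N
}

impl TimeoutEstimator {
    fn record_heartbeat(&mut self, at: Instant) {
        self.arrivals.push(at);
        if self.arrivals.len() > self.window {
            self.arrivals.remove(0);
        }
    }

    // f(y - x): here just a multiple of the median interval, so a single
    // delayed heartbeat (an outlier) does not blow up the timeout.
    fn next_election_timeout(&self) -> Duration {
        let mut intervals: Vec<Duration> =
            self.arrivals.windows(2).map(|w| w[1] - w[0]).collect();
        if intervals.is_empty() {
            return Duration::from_secs(1); // nothing observed yet
        }
        intervals.sort();
        let median = intervals[intervals.len() / 2];
        median * 10
    }
}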

- Akira

Jordan Halterman

Dec 4, 2020, 2:20:47 AM
to raft...@googlegroups.com
You might want to look at adaptive failure detection algorithms. We’ve used a phi accrual failure detector for this purpose, so the cluster can be easily tuned for different environments (e.g. LAN and WAN).
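
The gist of it, assuming heartbeat inter-arrival times are roughly normally distributed (that is the usual simplification; this is only a sketch, not the code we actually run):

// Instead of a boolean alive/dead verdict, compute a suspicion level
// phi = -log10(P(the next heartbeat arrives later than now)) from the
// distribution of recent heartbeat inter-arrival times.
struct PhiAccrualDetector {
    intervals_ms: Vec<f64>, // recent inter-arrival times, in milliseconds
}

impl PhiAccrualDetector {
    fn phi(&self, elapsed_since_last_ms: f64) -> f64 {
        let n = self.intervals_ms.len() as f64;
        let mean = self.intervals_ms.iter().sum::<f64>() / n;
        let var =
            self.intervals_ms.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
        let std_dev = var.sqrt().max(1.0); // floor to avoid division by zero
        // Logistic approximation of the normal tail probability.
        let y = (elapsed_since_last_ms - mean) / std_dev;
        let e = (-y * (1.5976 + 0.070566 * y * y)).exp();
        let p_later = if elapsed_since_last_ms > mean {
            e / (1.0 + e)
        } else {
            1.0 - 1.0 / (1.0 + e)
        };
        -p_later.log10()
    }
}

A node then reacts (e.g. starts an election) once phi crosses a configured threshold, and the same threshold translates into very different wall-clock timeouts on a LAN vs. a WAN without retuning.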


Oren Eini (Ayende Rahien)

Dec 4, 2020, 2:25:47 AM
to raft...@googlegroups.com
You must ensure that the election timeout is known and roughly the same across the board.
Otherwise, you'll have instability in the cluster.
My implementation rejects nodes whose election timeout is not within 10% of the node's own election timeout.

The usual rule of thumb we use is that the election timeout should be roughly 5-10 times the ping time between instances.
For example, using https://www.cloudping.info/:

I'm getting a ping time of about 250 ms to US West from my location, so I'll probably use 3 seconds as the election timeout.
For Sydney, on the other hand, I'm getting 475 ms, which means I'll put the election timeout at 5 seconds.
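
Roughly, the rule of thumb and the 10% check above look like this (a sketch of the idea, not the actual implementation):

use std::time::Duration;

// Election timeout ~ 10x the measured ping time, rounded up to whole
// seconds so operators can reason about it.
fn election_timeout_for(ping: Duration) -> Duration {
    Duration::from_secs((ping * 10).as_secs_f64().ceil() as u64)
}

// Reject a joining node whose configured election timeout differs from
// ours by more than 10%.
fn timeout_compatible(ours: Duration, theirs: Duration) -> bool {
    let (a, b) = (ours.as_secs_f64(), theirs.as_secs_f64());
    (a - b).abs() <= 0.1 * a
}

// 250 ms ping -> 2.5 s -> 3 s election timeout
// 475 ms ping -> 4.75 s -> 5 s election timeout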

Note that, as usual, we need to balance liveness with failure detection.

The idea of dynamically changing the election timeout is... suspect.
First, you'll need to make sure that the change propagates through the cluster as well, so that a majority of the nodes are up to date on it.
Second, you are assuming that the network conditions are stable. It is very common to get wildly different times.
Just in the time it took me to write this email, I checked Beijing a few times and got [1126, 295] ms. And for Ningxia: [388, 3085] ms.


But you have to account for failures, etc. What happens when some of your nodes have different election timeouts? You may end up in a situation where:
* The election timeout started out as 5 seconds.
* After a long period of stability, the election timeout drops to 500 ms.
* A network disruption / slowdown pushes the average ping time to 400 ms.
* You now have to deal with a cluster set up in such a way that it cannot elect a leader.

Another thing to consider is that the heartbeat measurements cover only _one_ set of paths in the network: leader to followers.
What about the latency between the other nodes?
Let's assume the latency from node 1 to nodes 2 and 3 is 100 ms, but from node 2 to node 3 it is 400 ms.
When node 1 is the leader, if you set things up based on its heartbeats, you'll fail to recover when that node goes down.






--
Oren Eini
CEO   /   Hibernating Rhinos LTD
Skype:  ayenderahien
Support:  sup...@ravendb.net
  

jordan.h...@gmail.com

Dec 4, 2020, 2:49:40 AM
to raft...@googlegroups.com
I agree you do have to be careful here, but the scenarios you’re describing where a leader cannot be elected seem to assume an improper implementation of adaptive failure detection for this use case. Sure, in the event there’s a drastic swing in latency you could actually see *slower* leader election times due to failure detectors needing to adapt over multiple election rounds. But you’re implying that failure detectors cannot adapt without a leader.

In the event an election times out due to an over-optimized election timeout like the ones you’re describing, that timeout should just be treated as a high-latency round trip, feeding back into the adaptive failure detector and causing the election timeout to increase for the next round. For example, in your last scenario, if the election timeout on nodes 2 and 3 is 300 ms and the election times out, the failure detector sees that as a 300 ms latency and adjusts the timeout to e.g. 500 ms for the next round.

As long as the failure detector continues to adapt the election timeout even when a leader is not present, I don’t see how it could prevent a leader from being elected. Indeed, a static timeout would be the more likely culprit in that case.
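
Sketching that feedback loop (the names are made up, this is not our actual code):

use std::time::Duration;

// Hypothetical adaptive detector interface: it records observed intervals
// and derives the next election timeout from them.
trait AdaptiveDetector {
    fn record_interval(&mut self, observed: Duration);
    fn election_timeout(&self) -> Duration;
}

// When an election round times out, treat the time we waited as a
// high-latency observation so the next round waits longer.
fn on_election_timed_out<D: AdaptiveDetector>(detector: &mut D, waited: Duration) {
    detector.record_interval(waited);
    // e.g. a 300 ms round that timed out nudges the next timeout toward
    // 500 ms instead of retrying at 300 ms forever.
}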

Vilho Raatikka

Dec 4, 2020, 3:20:38 AM
to raft...@googlegroups.com
Thanks Jordan, very interesting (and previously missed) information. The paper has only been out there for the last 16 years, so I wonder how I've never come across it..

Vilho

Oren Eini (Ayende Rahien)

Dec 4, 2020, 4:30:22 AM
to raft...@googlegroups.com
The actual issue I have is that you may have different timeouts in different locations.

Consider the case of a 300 ms timeout, where the latency has spiked.
Now there is just enough time to _initiate_ the connection, but not to complete the election within that time frame.

It is something that we have seen happen. The latency got just bad enough that we would keep re-running elections, but not bad enough that we would extend the timeout.

Henrik Ingo

Dec 4, 2020, 7:40:52 AM
to raft...@googlegroups.com
To add to the other replies...

It's worth pointing out that it's not wrong to call for an election too often. Yes, it's not desirable, but it's not incorrect. Any node can call for an election "if it feels like it". From this point of view, experimenting with what you are doing is reasonably safe.

The other thing you may want to look into is which nodes can participate in elections in the first place. Or, to put it another way, which nodes can become leaders.

Assume you have 2 DCs on the US West Coast with 10 ms latency between each other, a 3rd DC on the US East Coast with 60 ms latency to the first two, and a 4th DC in Sydney with 500 ms latency. It would now make sense to configure your cluster so that the Sydney nodes will always be followers and never even participate in elections. They are just passive followers. You could then configure the 3 US-based nodes to have an election timeout relative to the 60 ms latency between those nodes. The nodes in Sydney can be ignored for the purposes of election and timeout.
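
For example, a membership configuration along these lines (the exact mechanism and naming depend on your library; Raft implementations usually call these non-voting members or learners):

// Sydney nodes replicate the log but never vote or stand for election,
// so only the three US nodes (and the ~60 ms worst-case latency between
// them) matter when choosing the election timeout.
enum Role {
    Voter,   // counts for quorum, can become leader
    Learner, // passive follower, replication only
}

struct Member {
    id: u64,
    region: &'static str,
    role: Role,
}

fn cluster() -> Vec<Member> {
    vec![
        Member { id: 1, region: "us-west-1", role: Role::Voter },
        Member { id: 2, region: "us-west-2", role: Role::Voter },
        Member { id: 3, region: "us-east-1", role: Role::Voter },
        Member { id: 4, region: "ap-southeast-2", role: Role::Learner },
    ]
}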

henrik


Philip O'Toole

Dec 4, 2020, 7:59:52 AM
to raft...@googlegroups.com
BTW (and for fun) I replicated SQLite using Raft across 3 globally-spaced datacenters: https://www.philipotoole.com/rqlite-v3-0-1-globally-replicating-sqlite/

(This is an older version of rqlite, but the basic findings remain the same)

ruby...@gmail.com

Dec 5, 2020, 9:46:07 AM
to raft-dev
Phi Accrual Failure Detection is a very interesting idea!
It is similar to my idea but more mathematically sophisticated.

I learned about it today and implemented it today.


I am thinking of integrating this into my Raft library as the next step.

- Akira

ruby...@gmail.com

Dec 11, 2020, 11:25:12 PM
to raft-dev
Thank you for the advice!

I've finished embedding the phi detector into my Raft library.

For future readers, here is how I integrated the phi detector into my Raft library (a rough code sketch follows the list):
1. Initially, the detector watches itself. It starts with a single interval of 1 sec, which is long enough and gets diluted later by the actual heartbeats.
2. When a new heartbeat arrives and its leader_id is not the one the detector is watching, the detector is reinitialized.
3. When phi > 3, the node waits for a random timeout before starting an election. This is needed to avoid indefinite election conflicts. The random timeout is drawn based on +4 SD of the heartbeat-interval distribution (assumed normal).
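
Very roughly, the follower-side logic looks like this (a simplified sketch; the names are not the actual API of the library or of the detector crate):

use std::time::{Duration, Instant};

// Minimal stand-in for the phi detector itself (the real one keeps a
// window of intervals and computes phi as in the phi accrual sketch
// earlier in this thread).
struct PhiDetector { intervals: Vec<Duration> }
impl PhiDetector {
    fn new() -> Self { Self { intervals: Vec::new() } }
    fn add_interval(&mut self, d: Duration) { self.intervals.push(d); }
    fn phi(&self, _elapsed: Duration) -> f64 { 0.0 /* placeholder */ }
}

struct FollowerState {
    detector: PhiDetector,
    watched_leader: Option<u64>, // None = the detector is watching itself (step 1)
    last_heartbeat: Instant,
}

impl FollowerState {
    // Step 1: start with a single 1 sec interval; long enough, and it is
    // diluted later by the real heartbeats.
    fn new(now: Instant) -> Self {
        let mut detector = PhiDetector::new();
        detector.add_interval(Duration::from_secs(1));
        Self { detector, watched_leader: None, last_heartbeat: now }
    }

    // Step 2: a heartbeat from a different leader than the one we are
    // watching reinitializes the detector.
    fn on_heartbeat(&mut self, leader_id: u64, now: Instant) {
        if self.watched_leader != Some(leader_id) {
            *self = Self::new(now);
            self.watched_leader = Some(leader_id);
            return;
        }
        self.detector.add_interval(now - self.last_heartbeat);
        self.last_heartbeat = now;
    }

    // Step 3: once phi > 3, wait a randomized delay (drawn around +4 SD of
    // the observed intervals) before actually starting the election, to
    // avoid indefinite split votes.
    fn leader_suspected(&self, now: Instant) -> bool {
        self.detector.phi(now - self.last_heartbeat) > 3.0
    }
}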

- Akira