Paper: 3 modifications to Raft consensus


Henrik Ingo

Aug 21, 2015, 10:54:58 AM
to raft-dev
Hi all

I work at MongoDB, not on the dev team but as a consultant. Since our home-grown replication happens to be similar to Raft, for our next release we are updating it to be a Raft implementation, so that we can benefit from the peer review and correctness proofs that went into the algorithm. (In particular, the lack of election terms has historically been a problem.)

Based on my experience working with several database replication technologies, I've written a paper that proposes 3 modifications to the Raft algorithm. The goal is to improve robustness in some corner cases and against operational errors. My motivation, of course, is to make these improvements available to our own dev team, but I ended up writing them up first in the style of an academic paper, and in the context of Raft itself rather than MongoDB. Maybe this way the proposals can benefit the rest of the world as well.

The 3 modifications are:

***
1. Universally Unique Database Identifier: to identify whether a snapshot or a log on one server is in fact some (predecessor or successor) state of the same state machine on other servers of the cluster, or whether it's a snapshot of some completely different state machine that has nothing to do with this cluster.

2. Pre-vote algorithm: this paper provides a more detailed specification of the idea suggested only in passing in §9.6 of (Ongaro, 2014).

3. Leader stickiness: Building on the pre-vote algorithm, we also modify Raft to reject servers that attempt to elect themselves as leader, if the current leader appears to still be healthy to the rest of the cluster. This is to avoid flip-flopping between two competing leaders.
***

I welcome any comments and corrections, should somebody end up reading it.

http://openlife.cc/blogs/2015/august/3-modifications-raft-consensus-algorithm-paper

henrik

Kijana Woodard

Aug 21, 2015, 11:32:59 AM
to raft...@googlegroups.com
Skimmed it. Initial thoughts:

- It seems aggressive/dangerous to delete the log based on mismatched database ids. Rejecting requests seems simpler and safer. 

"Or, put in more political wording: followers will be loyal to their existing leader, and only servers that agree that an election is needed (Candidates) will vote for a leader from within themselves."

- But candidates, in politics and in Raft, first vote for themselves. They never vote for other candidates. IIRC, the §9.6 PreVote mechanism specifies that "yes" responses can be sent in the period between the "minimum election timeout" and the randomized election timeout for a Follower.

Fwiw, I've always wondered whether the "Requesting PreVotes activity" is part of the Candidate state, the Follower state or some "in between not yet named" state. Given a new state [PreCandidate?], maybe waiting for full election timeout would be viable. I'm not convinced it's necessary/desirable.


Oren Eini (Ayende Rahien)

Aug 21, 2015, 12:54:33 PM
to raft...@googlegroups.com
We have implemented much the same thing; see my comments below. I think most people with a real-world implementation have had to do this.

> Persistent State:
> databaseId 

Is that meant to be a node id or a cluster id?
In our impl, we have a cluster id, which is generated as the first action of the initial node. Messages from a different cluster are rejected.

> Append Entries
> 1. If databaseId of caller doesn't match: delete all
> snapshots and log and reset to initial state, set
> databaseId from caller.

This led to a misconfiguration (a new node created its own cluster, then was connected to an existing node) causing data loss.
In our system, we only update the cluster id if the current node isn't part of a cluster. We don't allow you to move between clusters.

You have the same behavior specified in AddServer. But you need to make sure that _all_ requests carry the database id, and that all handlers reject messages from outside the cluster.
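To make the two policies concrete, here is a minimal Go sketch of the conservative reject-instead-of-delete variant (the type and field names are invented for illustration only, not taken from the paper or from any real implementation):

    package sketch

    // AppendEntriesArgs is trimmed down to the fields relevant here; a real
    // Raft message also carries prevLogIndex, prevLogTerm, entries, etc.
    type AppendEntriesArgs struct {
        DatabaseID string
        Term       int
    }

    type Server struct {
        DatabaseID  string // persisted together with the data, part of backups
        CurrentTerm int
    }

    // HandleAppendEntries rejects a request whose databaseId does not match,
    // instead of the paper's optional "delete everything and adopt the
    // caller's databaseId" step.
    func (s *Server) HandleAppendEntries(a AppendEntriesArgs) bool {
        if a.DatabaseID != s.DatabaseID {
            return false // different state machine: refuse, keep local data
        }
        if a.Term < s.CurrentTerm {
            return false // stale leader
        }
        // ... normal Raft log handling continues here ...
        return true
    }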

PreVote - you silently introduced another state, and that is something that requires explanation.
In this state, you are a candidate, and you run a pre-vote cycle, then a vote cycle.
I assume you mean to vote for yourself only if you got a PreVote quorum, but that is dangerous, because you basically force all servers to time out before a PreVote can succeed, in which case you are pretty much guaranteed to have conflicting candidates soliciting votes.

In our impl, we move to candidate mode and ask for a PreVote.
After we get quorum agreement, we increment the term, vote for ourselves, and call a real election.

Note that a node that recently got a message (such as AppendEntries _or_ PreVote) will NOT go into candidate mode. This ensures that in the common case you'll have one server timing out first, and its pre-vote will ensure that the other servers in the system stay followers, so it can become leader quickly.

We ensure leader stickiness by rejecting candidates that have sent us a pre-vote request if we got a message from the leader within the last timeout/2.
Note that we don't reject actual votes, because parts of the cluster may not be able to talk to the leader, and we want to move to the new quorum.
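If I read that description right, the receiver side could be sketched roughly like this (a sketch only; the names and the timeout/2 constant are placeholders, not anyone's actual code):

    package sketch

    import "time"

    type Node struct {
        currentTerm   int
        lastLeaderMsg time.Time     // last AppendEntries (or other leader traffic)
        timeout       time.Duration // election timeout
    }

    // HandlePreVote applies the stickiness rule described above: refuse a
    // pre-vote if we heard from the leader within the last timeout/2.
    // A real RequestVote with a higher term is NOT filtered this way.
    func (n *Node) HandlePreVote(nextTerm int) bool {
        if time.Since(n.lastLeaderMsg) < n.timeout/2 {
            return false // our leader still looks healthy
        }
        if nextTerm <= n.currentTerm {
            return false // the candidate would not even advance the term
        }
        return true // non-binding: no vote recorded, no term change
    }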

> RequestVote modification:
> 1. Reply false if not in Candidate state (leader stickiness)

If I'm in candidate state, I already voted for myself. Therefore I won't vote for you.

Table 3 seems too complex to me.

#3 - dangerous. You created a data loss event here.
If I called AddServer with "db1/orders" instead of "tst1/orders", I just lost all my orders.
It is better to reject such a scenario entirely. If the admin wants to recreate the db, they can delete it and create a new one manually.

#4 & #2 are the same thing. We have a backup of the cluster log; the leader shouldn't need to make a distinction between them.
Why do you NEED to distinguish between them?

And this seems to put too much logic in the hands of the leader. The way we have it, when you add a new server, we send AppendEntries.
It fails if there is no data (new server) or the data is too old for our current log. We then send InstallSnapshot.
If the admin restored from backup, the new node just accepts the AppendEntries and moves on.

The cluster id issue is important, and I really don't like the way you set it up; it seems like it can cause problems.

The way it works for us:
All servers start as followers, and CANNOT become leaders until they have a cluster id set.
The admin is responsible for letting one of the servers know that it can be a leader; it registers a database id and can then add other servers, which will get their cluster id from it, and only then can they become leader.

This avoids the issue where multiple new servers starting at the same time all assume that they can start accepting requests.
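A minimal Go sketch of that gating rule (Bootstrap and JoinCluster are invented names, used only to illustrate the idea):

    package sketch

    import "errors"

    type Member struct {
        clusterID string // empty until bootstrapped or joined
    }

    // Bootstrap is the explicit admin action that creates a new cluster.
    func (m *Member) Bootstrap(newID string) error {
        if m.clusterID != "" {
            return errors.New("already part of a cluster")
        }
        m.clusterID = newID
        return nil
    }

    // JoinCluster is driven by an AddServer call from an existing leader.
    func (m *Member) JoinCluster(id string) error {
        if m.clusterID != "" && m.clusterID != id {
            return errors.New("refusing to move between clusters")
        }
        m.clusterID = id
        return nil
    }

    // CanStartElection gates candidacy on having a cluster identity.
    func (m *Member) CanStartElection() bool {
        return m.clusterID != ""
    }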


Leader stickiness in your impl means that you now can't wait for the first node to time out, but have to have _all_ the nodes time out first, which increases failover time.

See the previous discussion about waiting until we haven't gotten a message of any kind for election-timeout/2 before responding. In practice, this gives the robustness necessary for avoiding flip-flops.
You also need at least N/2 election cycles (a follower times out, tries to run an election, is rejected, a second one runs an election, etc.).

As a side effect, this will tend to choose the _slowest_ node to be the leader.

Hibernating Rhinos Ltd
Oren Eini | CEO | Mobile: +972-52-548-6969
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811

Henrik Ingo

Aug 21, 2015, 4:28:42 PM
to raft...@googlegroups.com
On Fri, Aug 21, 2015 at 6:32 PM, Kijana Woodard
<kijana....@gmail.com> wrote:
> Skimmed it. Initial thoughts:
>

Wow, that was fast! Thanks so much for commenting.

> - It seems aggressive/dangerous to delete the log based on mismatched
> database ids. Rejecting requests seems simpler and safer.

Correct. While I have labeled that behavior as "Optional" in the
paper, most databases would surely take the more conservative
approach of not automatically deleting all your data. Otoh, from the
algorithm's point of view, both approaches are correct. For example,
a cache server using Raft for replication could very well take the
automatic deletion approach. Raft will work correctly either way.

> "Or, put in more political wording: followers will be loyal to their
> existing leader, and only servers that agree that an election is needed
> (Candidates) will vote for a leader from within themselves."
>
> - But candidates, in politics and in raft, first vote for themselves. They
> never vote for other candidates. IIRC, the 9.6 PreVote mechanism defines
> that "yes" responses can be sent in the period between the "minimum election
> timeout" and the randomized election timeout for a Follower.
>
> Fwiw, I've always wondered whether the "Requesting PreVotes activity" is
> part of the Candidate state, the Follower state or some "in between not yet
> named" state. Given a new state [PreCandidate?], maybe waiting for full
> election timeout would be viable. I'm not convinced it's
> necessary/desirable.

Ah, this is an interesting point! I suppose I didn't explicitly think
about it when writing.

The answer is that when calling the PreVote RPC, the server is still
in the follower state. The paper is actually written that way.

One reason it is a follower is that before PreVote returns true, the
server doesn't increment its currentTerm. This means that, should the
leader be able to reconnect to the server and send an AppendEntries
RPC, it can actually continue to receive those entries and remain
a follower. (Without PreVote this is not the case: currentTerm is
immediately incremented, and this server will therefore never again
accept entries from the previous term's leader.)

I did however have a bug in the description of the PreVote RPC: instead
of "term", it is better to label the parameter "nextTerm", and its value
is actually "currentTerm + 1", i.e. the term we would use if we proceed
to candidate status.
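To spell that out, here is a small Go-style sketch of the corrected caller side (the struct and field names are only illustrative, they are not the paper's pseudo-code):

    package sketch

    // PreVoteArgs carries nextTerm rather than term: the caller has NOT yet
    // incremented its currentTerm, it only announces the term it would use
    // if the pre-vote succeeds and it becomes a candidate.
    type PreVoteArgs struct {
        NextTerm     int
        CandidateID  int
        LastLogIndex int
        LastLogTerm  int
    }

    type follower struct {
        id           int
        currentTerm  int
        lastLogIndex int
        lastLogTerm  int
    }

    // buildPreVote is called while the server is still a follower.
    func (f *follower) buildPreVote() PreVoteArgs {
        return PreVoteArgs{
            NextTerm:     f.currentTerm + 1, // the term we would campaign in
            CandidateID:  f.id,
            LastLogIndex: f.lastLogIndex,
            LastLogTerm:  f.lastLogTerm,
        }
    }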

henrik
--
henri...@avoinelama.fi
+358-40-5697354 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://fi.linkedin.com/pub/henrik-ingo/3/232/8a7

Henrik Ingo

Aug 21, 2015, 5:12:28 PM
to raft...@googlegroups.com
On Fri, Aug 21, 2015 at 11:28 PM, Henrik Ingo <henri...@avoinelama.fi> wrote:
>> Fwiw, I've always wondered whether the "Requesting PreVotes activity" is
>> part of the Candidate state, the Follower state or some "in between not yet
>> named" state. Given a new state [PreCandidate?], maybe waiting for full
>> election timeout would be viable. I'm not convinced it's
>> necessary/desirable.
>
> Ah, this is an interesting point! I suppose I didn't explicitly think
> about it when writing.
>
> The answer is that when calling PreVote RPC, the server is still in
> follower. The paper is actually written that way.
>

I need to correct myself: I had a feeling I had been inconsistent with
this at some point (beyond the missing +1), and indeed I had.

In the RequestVote RPC receiver implementation, I have written:
"1. Reply false if not in Candidate state (leader stickiness)"

This is of course wrong - nobody could ever get into Candidate state
since everyone else would always reply false in this step. The correct
way to write this is "1. Reply false if AppendEntries was received
less than election timeout ago."

In other words, the way I'm trying to construct this is: the servers
that eventually end up returning true will be followers that have
reached their election timeout. (So maybe it would indeed be clearer
with a pre-candidate phase?)
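In a Go-flavoured sketch, the corrected receiver step would look roughly like this (assuming the server tracks when it last received an AppendEntries; the names are illustrative only):

    package sketch

    import "time"

    type voter struct {
        currentTerm     int
        votedFor        int // 0 means "not voted in this term"
        lastAppendHeard time.Time
        electionTimeout time.Duration
    }

    // handleRequestVote shows the corrected step 1: reply false if an
    // AppendEntries was received less than one election timeout ago.
    func (v *voter) handleRequestVote(term, candidateID int) bool {
        if time.Since(v.lastAppendHeard) < v.electionTimeout {
            return false // leader stickiness: we heard from a leader recently
        }
        if term < v.currentTerm {
            return false // stale term
        }
        // ... the remaining standard RequestVote checks (votedFor,
        // log up-to-date, term update) would continue here ...
        v.votedFor = candidateID
        return true
    }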

henrik


> One reason it is a follower is that before PreVote returns true, the
> server doesn't increment its currentTerm. This means, should the
> leader be able to reconnect to the server, and send an appendEntries
> RPC, then it can actually continue to receive those entries and remain
> a follower. (Before introducing PreVote, this is not the case.
> currentTerm is immediately incremented, and this server will therefore
> never again accept entries from the previous term leader.)
>
> I did however have a bug in the description of PreVote RPC: Instead of
> "term", it is better to label the parameter "nextTerm", and it's value
> is actually "currentTerm + 1", ie the term we would use if we proceed
> to candidate status.
>
> henrik
>

Oren Eini (Ayende Rahien)

Aug 21, 2015, 5:15:52 PM
to raft...@googlegroups.com
That is going to ensure that the full election timeout expires, and the slowest node is likely to win.



Henrik Ingo

Aug 21, 2015, 5:41:13 PM
to raft...@googlegroups.com
On Fri, Aug 21, 2015 at 7:54 PM, Oren Eini (Ayende Rahien)
<aye...@ayende.com> wrote:
> We have implemented much the same thing, see my comments below. I think that
> most people with real world impl had to do that.
>

Hi Oren! Thanks, that is good to know. Gives some validation to hear this.


>> Persistent State:
>> databaseId
>
> Is that meant to be a node id or a cluster id?
> In our impl, we have a cluster id, which is generated as the first action of
> the initial node. Messages from a different cluster are rejected.

No, this is neither a node id nor a cluster id.

While a real-world implementation would probably have some kind of
cluster id (LogCabin has one too, even if it is never mentioned in any
of the Raft papers), to me that's not an important part of Raft. It
basically falls under the heading of "authentication", and in
authentication lingo such an id would commonly be called a "shared
secret". There could also be any other form of authentication, such as
cryptographic keys or even just username+password. But that is not
what I'm talking about here.

The databaseId is there to identify the *database*: the data,
regardless of which server it is located on. So it should be stored as
part of the database, and should be part of backups, etc.

The databaseId essentially brings the Log Matching Property into the
real world. The idea of the Log Matching Property is that if two
servers have the same term + log index, then they are in identical
states. However, this is not true in the real world: you could have
two servers holding two completely different databases that simply
happen to be at the same term + log index. The databaseId provides the
missing piece of information.

A clusterUUID that is part of a configuration file or otherwise
provided by a client does not fix this. (But it is useful for
authentication/identification of servers.)

>
>> Append Entries
>> 1. If databaseId of caller doesn't match: delete all
>> snapshots and log and reset to initial state, set
>> databaseId from caller.
>
> This lead to a misconfiguration ( a new node created is own cluster, then
> was connected to an existing node) causing a data loss.
> In our system, we only update the cluster id if there current node isn't
> part of a cluster. We don't allow you to move between clusters.

This is a bit unintuitive, but the gatekeeping happens at the AddServer
RPC. As discussed in this thread, a typical implementation (at least
for databases) would fail the AddServer RPC if the new server contains
data that is not a copy of the same database the cluster has. A corner
case, which just happens to be how I wrote it, is that a node with an
empty database will pass the AddServer RPC even when it has a
different databaseId. It will then get the correct databaseId at this
step in AppendEntries. But there is no data to be deleted, so while
this sounds like a scary step, it is not.

Otoh, if the optional step 2 in the AddServer RPC is not implemented,
then this step actually does what it says: deletes all data in order
to start from scratch.

I'll think about whether there is a less scary way to construct this,
but anyway, this is how the various additions interact. I hope you can
see that the typical implementation would not actually delete all data
automatically.


> You have the same behavior specified in the AddServer. But you need to make
> sure that _all_ requests carry the database id, and all behaviors will
> reject non cluster messages.
>

No. This is not a cluster Id.

A real-world implementation might very well pass along the databaseId
in every call, and check it every time with an assertion. That's
probably good programming practice. But from the algorithm's point of
view it's only needed in AddServer, and in the
AppendEntries/InstallSnapshot immediately following an AddServer. So
I've only added it in the places where it is necessary.

> PreVote - you silently introduced another state, and it is something that
> require explanation.

Good point. But after thinking about it while writing my previous two
mails to this thread, it seems I tried to write it so that PreVote
happens while still in the follower state. I didn't quite get it
right, though, as discussed in those mails (the missing +1 in the
PreVote RPC and the first step in the RequestVote RPC).

It would be a valid approach to introduce a new "Pre-candidate" state.
However, I feel the minimalistic design of Raft suggests I shouldn't
add a state if I can avoid it. But you might argue that it would
actually be clearer to have a separate new state. This is subjective
and both approaches are correct. I'm open to readers' opinions on
this.

> In this state, you are a candidate, and you run a prevote cycle, then a vote
> cycle.

As explained above, the way I wrote it, the PreVote call is supposed
to happen while still in follower state. But I understand why you're
thinking the way you're thinking.

> I assume that you mean to vote for yourself only if you got a PreVote
> quorom,

Correct.

> but that is dangerous, because you basically force all servers to
> timeout before a PreVote can succeed, in which case you are pretty much
> guaranteed to have conflicting candidates soliciting votes.
>

No. You need a majority to time out before the first PreVote will succeed.

This looks and feels odd, but it doesn't really change anything.
Instead of the first server to time out becoming the new leader, it is
the ceiling(n/2)th server to time out that becomes the new leader.
Current MongoDB elections actually have this behavior, and while it
looks odd when you first see it ("Why did the server fail an election
even though the leader is dead?"), it does work correctly.

> In our impl, we move to the candidate mode, and ask for PreVote.
> After we get quorum agreement, we increment the term, vote for ourselves,
> and call real election.

This is also ok, but doesn't give you leader stickiness. That's not a
big deal, because Raft is already sticky in many situations (an
isolated node will often be behind in its log). But it doesn't achieve
stickiness in some of the corner cases I'm trying to cover.

> Note that a node that got a message (such as Append Entries _or_ PreVote)
> will NOT go into candidate mode. This ensures that in the common case,
> you'll have one server timing out first, and its pre vote action will ensure
> that the other servers in the system will stay followers, so it can become a
> leader quickly.

What happens if a server S1 sends PreVote to all other servers, and
all other servers also respond to it, and then server S1 crashes? I
assume after some additional timeout the other nodes do try to become
candidates again?

> We ensure leader stickiness by rejecting candidates that have sent us a pre
> vote request when we got a message in the last timeout/2 timeframe.
> Note that we don't reject actual votes, because parts of the cluster may not
> talk to the leader, and we want to move to the new quorum.

This is correct and what I try to do too. (except I don't see why you
do timeout/2?)

You cannot reject actual RequestVotes. Once you see a RequestVote with
a higher term than the currentTerm, then you must abandon the current
leader and respond to the RequestVote. This is a very fundamental
property of Raft, it is not an optimization.

>> RequestVote modification:
>> 1. Reply false if not in Candidate state (leader stickiness)
>
> If I'm in candidate state, I already voted for myself. Therefor I won't vote
> for you.

Thanks for pointing this out. (I just figured it out myself while
responding to the previous mail.)

This is an inconsistency; I explained above what I tried to say.


> Table 3 seems too complex to me.
>
> #3 - dangerous. You created data loss event here.
> If I called AddServer with "db1/orders" instead of "tst1/orders", I just
> lost all my orders.
> It is better to reject such scenario entirely. If the admin want to recreate
> the db, they can delete it and create a new one manually.

Correct. A typical database implementation would be conservative and
fail, rather than automatically delete data. From Raft's point of view
both are correct behaviors.


> #4 & #2 are the same thing. We have a backup of the cluster log, the leader
> shouldn't need to make distinction between them.
> Why do you NEED to distinguish between them?

In #2 it is the same server that rejoins the cluster; in #4 it is a
brand new server that was provisioned with a copy of the database.
From Raft's point of view these are actually the same (in both cases
you would just call AddServer), but from the DBA's point of view they
are of course completely different use cases.


> And this seems to put too much logic at the hands of the leader. The way we
> have it, when you add a new server, we send AppendEntries.
> It fails if there is no data (new server) or too old for our current log. We
> then send InstallSnapshot.
> If the admin restored from backup, the new node just accepts the
> AppendEntries and move on.
>
> The clusterid issue is important, and I really don't like the way you set it
> up, it seems like it can cause problems.

We are trying to solve different problems. ClusterId is important, but
it is not what I'm discussing here. I hope my above replies have
clarified things.

> The way it works for us.
> All servers starts as followers, and CANNOT become leaders until they have a
> cluster id set.
> The admin is responsible for letting one of the servers knows that it can be
> a leader, it register a database id and then can add other servers, which
> will get their cluster id from it, and only then they can become leader.

I think LogCabin also does something like this; it has a
--bootstrap option.

However, this is again not how the Raft paper is actually written. In
the world of Ongaro's thesis, a single node is its own majority, and
*will* become its own leader after election timeout. The paper doesn't
discuss any further initialization or boot activities.


>
> This avoid the issue where multiple new servers starting at the same time
> all assume that they can start accepting requests.

Probably sensible for a real-world implementation, yes. Otoh, the
opposite is also sensible: developers will often run a single-instance
database on their laptop, and it's not wrong to default to supporting that.

> Leader stickiness in your impl means that you now can't wait for the first
> node to timeout, but have to have _all_ the nodes time out first, which
> increase failover time.

To be precise: a majority has to time out.

Yes, this does increase failover time. For a "large cluster" the
increase is likely to be about election timeout / 2. Otoh, the
motivation for this change is that it would allow you to configure the
election timeout to be shorter, since flip-flopping is now explicitly
prevented. Whether this holds is open to debate and remains to be
proven. (It probably depends on the kind of network conditions you
expect to deploy clusters in.)
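As a rough back-of-the-envelope estimate (my own, assuming each follower's randomized timeout is drawn uniformly from [T, 2T]): the expected firing time of the k-th of n timeouts is about T + T*k/(n+1). The first timeout fires around T + T/(n+1), i.e. close to T for larger n, while the ceiling(n/2)-th fires around T + T/2. So waiting for a majority instead of the first server adds roughly half an election timeout, which is where the "election timeout / 2" figure above comes from.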

> See the previous discussion about waiting until we didn't get a message of
> any kind for election-timeout /2 before responding. In practice, this gives
> the robustness necessary for avoiding flip flops.
> You also need to make at least N/2 election cycles (follower times out, try
> to run elections, rejected, second runs election, etc).

PreVote cycles, but yes. You're understanding it correctly. Note that
all those PreVote calls still happen within the same election timeout,
and the one that succeeds happens approximately at election timeout / 2.

> As a side effect, this will tend to choose the _slowest_ node to be the
> leader.

No, it is random. It just depends on when each of the servers happened
to receive the last AppendEntries from the leader.

Thanks a lot for your comments! I really appreciate the review.

henrik

Henrik Ingo

Aug 21, 2015, 5:43:46 PM
to raft...@googlegroups.com
We are criss-crossing, so I already said this, but to hunt down a
loose end: you need a majority of nodes to return success, not all
nodes. So instead of the first node winning, this moves the winner to
the middle node, not the last.

henrik

Oren Eini (Ayende Rahien)

Aug 21, 2015, 6:22:55 PM
to raft...@googlegroups.com
Replying inline below.



On Sat, Aug 22, 2015 at 12:41 AM, Henrik Ingo <henri...@avoinelama.fi> wrote:
> On Fri, Aug 21, 2015 at 7:54 PM, Oren Eini (Ayende Rahien)
> <aye...@ayende.com> wrote:
>> We have implemented much the same thing, see my comments below. I think that
>> most people with real world impl had to do that.
>>
>
> Hi Oren! Thanks, that is good to know. Gives some validation to hear this.
>
>>> Persistent State:
>>> databaseId
>>
>> Is that meant to be a node id or a cluster id?
>> In our impl, we have a cluster id, which is generated as the first action of
>> the initial node. Messages from a different cluster are rejected.
>
> No, this is not a node id nor a cluster id.
>
> While a real world implementation probably would have some kind of
> cluster id (LogCabin has too, even if it is never mentioned in any of
> the Raft papers), to me that's not an important part of Raft. It
> basically falls under the heading of "authentication", and in
> authentication lingo, such an id would commonly be called "shared
> secret". But there could also be any other form of authentication,
> such as using a variety of cryptographic keys or even just
> username+password. But this is not what I'm talking about.
>
> The databaseId is there to identify the *database*. The data,
> regardless of the server it is located on. So it should be stored as
> part of the database, and should be part of backups, etc.


The term database here is confusing. A better term would be state machine id, since that is clearly shared, and if you have multiple state machines, it makes it obvious why you need to distinguish them.
Raft also doesn't talk about databases at all, so this would reuse an existing term.

 


> This is a bit unintuitive. But the gatekeeping happens at AddServer
> RPC. As discussed in this thread, a typical implementation (at least
> for databases) would fail the AddServer RPC, if the new server
> contains some data that is not a copy of the same database as the
> cluster has. A corner case, that just happens to be how I wrote it, is
> that a node with an empty database will pass AddServer RPC when it has
> a different databaseId. It will then get the correct databaseId at
> this step in AppendEntries. But there is no data to be deleted, so
> while this sounds like a scary step, it is not.

But it is: you just removed one of the nodes from _another_ cluster. It will reject all messages from its origin cluster, just because someone else called it with AddServer before it got any data?
That seems suspicious.


> A real world implementation might very well pass along the databaseId
> in every call, and check it every time with an assertion. That's
> probably good programming practice. But from the algorithm point of
> view it's only needed in the AddServer, and the
> AppendEntries/InstallSnapshot immediately following and AddServer. So
> I've only added it in the places where it is necessary.

You _are_ trying to talk about real world systems. And in the real world, misconfiguration happens all the time.
It is better to be explicit about it, otherwise delayed messages will kill you.
Even in your own system, the following sequence of events will kill you:

AddServer A to Cluster 1 - the A node is empty, so it accepts.
AddServer A to Cluster 2 - the A node is still empty, so it accepts.
AppendEntries (term 1, entry 1) (from 1) - accepted blindly.
AppendEntries (term 1, entry 1) (from 2) - replies that it already has it; the leader for 2 commits.
AppendEntries (term 1, entry 2) (from 2) - accepted blindly.
AppendEntries (term 1, entry 1) (from 1) - replies that it already has it; the leader for 1 commits.

Data corruption, in both clusters.


>> but that is dangerous, because you basically force all servers to
>> timeout before a PreVote can succeed, in which case you are pretty much
>> guaranteed to have conflicting candidates soliciting votes.
>>
> No. You need a majority to timeout before the first PreVote will succeed.


The problem is that you are extending the failover time until the full N/2 have timed out.
Since Raft already states that the follower timeout varies (within some range), you can minimize the time to start fixing the failure by using the first server, instead of waiting for the N/2th server to figure it out.


> This is also ok, but doesn't give you leader stickiness. That's not a
> big deal, because Raft is already sticky in many situations (an
> isolated node will often be behind in its log). But it doesn't achieve
> stickiness in some of the corner cases I'm trying to cover.

This gives stickiness, because servers will not grant a PreVote if they got a message from the leader within twice the heartbeat time. So if the leader is well, a PreVote will fail.
 
> What happens if a server S1 sends PreVote to all other servers, and
> all other servers also respond to it, and then server S1 crashes? I
> assume after some additional timeout the other nodes do try to become
> candidates again?

The same as what happens if a server crashes immediately after winning an election.
A follower will time out and run another election. In this case, it will use the same term (because the pre-vote isn't binding), but that is about it.

> You cannot reject actual RequestVotes. Once you see a RequestVote with
> a higher term than the currentTerm, then you must abandon the current
> leader and respond to the RequestVote. This is a very fundamental
> property of Raft, it is not an optimization.

I'm not, I'm rejecting a PreVote, because I have a leader that talks to this node on a regular basis. That is where the timeout/2 comes into play.

>> Table 3 seems too complex to me.
>>
>> #3 - dangerous. You created data loss event here.
>> If I called AddServer with "db1/orders" instead of "tst1/orders", I just
>> lost all my orders.
>> It is better to reject such scenario entirely. If the admin want to recreate
>> the db, they can delete it and create a new one manually.
> Correct. A typical database implementation would be conservative and
> fail, rather than automatically delete data. From Raft point of view
> both are correct behaviors.

How can this be correct? You just moved a server from one cluster to the other.
This means that you just permanently lost a node. For fun, the way it will actually work is that now you have two clusters that will keep sending InstallSnapshot to the node, flipping it between the clusters all the time.



>> #4 & #2 are the same thing. We have a backup of the cluster log, the leader
>> shouldn't need to make distinction between them.
>> Why do you NEED to distinguish between them?
> In #2 it is the same server that rejoins the cluster, in #4 it is a
> brand new server that was provisioned with a copy of the database.
> From Raft point of view these are actually the same (in both you would
> just call AddServer) but from the DBA point of view they are of course
> completely different use cases.

You need to decide which side you want to be on :-)
In parts of your answer you say "this is how a db would do it, but the algorithm doesn't care" and here you say the reverse.
And from the DBA perspective, if I clone a machine (a pretty common scenario in most virtual environments), what is the difference?
The cloning process took the machine down for a long while, and now I have two clones of the original machine, and there is no way to tell which is which (except maybe through IPs, which aren't always the same).

>> And this seems to put too much logic at the hands of the leader. The way we
>> have it, when you add a new server, we send AppendEntries.
>> It fails if there is no data (new server) or too old for our current log. We
>> then send InstallSnapshot.
>> If the admin restored from backup, the new node just accepts the
>> AppendEntries and move on.
>>
>> The clusterid issue is important, and I really don't like the way you set it
>> up, it seems like it can cause problems.
> We are trying to solve different problems. ClusterId is important, but
> it is not what I'm discussing here. I hope my above replies have
> clarified things.

I think we are using different terms for the same things. See my state machine id comment above.


>> The way it works for us.
>> All servers starts as followers, and CANNOT become leaders until they have a
>> cluster id set.
>> The admin is responsible for letting one of the servers knows that it can be
>> a leader, it register a database id and then can add other servers, which
>> will get their cluster id from it, and only then they can become leader.
> I think also LogCabin seems to do something like this. It has a
> --bootstrap option.

That is the idea, yes.


> However, this is again not how the Raft paper is actually written. In
> the world of Ongaro's thesis, a single node is its own majority, and
> *will* become its own leader after election timeout. The paper doesn't
> discuss any further initialization or boot activities.


And in the real world, you run the risk of spinning up a 3-node cluster where each node starts as an independent leader and starts accepting requests (and committing them!), and you can't reconcile them.
That is why you need to bootstrap.

>>
>> This avoid the issue where multiple new servers starting at the same time
>> all assume that they can start accepting requests.
> Probably sensible for a real world implementation, yes. Otoh, also the
> opposite is sensible: developers will often run a single instance
> database on their laptop, it's not wrong to default to doing that.

An extra step saying "this is development" would make a lot of difference; it can be done at install/setup time.
But it is a very important action. In our UI, we have a "create cluster action" and a "join to cluster action". And they are very distinct. 
In the cluster config, you create a cluster, and if you have just one, that is great. 
If you want to add a new node, you go ahead and do that on the current single leader. That works for both dev & ops.

>> Leader stickiness in your impl means that you now can't wait for the first
>> node to timeout, but have to have _all_ the nodes time out first, which
>> increase failover time.
> To be precise: a majority has to timeout.
> Yes, this does increase failover time. For a "large cluster" it is
> likely to be election timeout / 2. Otoh, the motivation for this
> change is that it would allow you to configure the election timeout to
> be shorter, since flip-flopping is now explicitly prevented. Whether
> this is true is open to debate and to be proven. (And probably depends
> on the kind of network conditions you expect to deploy clusters in.)

The other option, of rejecting a pre-vote if you have heard from the leader in the last two heartbeats, means that you have a reasonable election time.
Note that if you reduce the election timeout, you run the risk of false elections.
If you reduce the election timeout, this ends up being the same thing, and a matter of terminology.

Henrik Ingo

Aug 22, 2015, 7:49:02 AM
to raft...@googlegroups.com
On Sat, Aug 22, 2015 at 1:22 AM, Oren Eini (Ayende Rahien)
<aye...@ayende.com> wrote:
> On Sat, Aug 22, 2015 at 12:41 AM, Henrik Ingo <henri...@avoinelama.fi>
> wrote:
>> The databaseId is there to identify the *database*. The data,
>> regardless of the server it is located on. So it should be stored as
>> part of the database, and should be part of backups, etc.
>
>
>
> The term database here is confusing. A better term would be the state
> machine id, since that is clearly shared, and if you have multiple state
> machines, it make it obvious why you need to distinguish it.
> Raft is also not talking about databases as all, so that would reuse an
> existing term

Correct. I actually considered both. The problem is that people
implementing clustered databases typically don't think of the database
as a state machine either, so whichever you pick, half of the audience
is lost. I guess I just erred on the side of practical
implementations, since that's where I come from too. It's true that
state-machine-id would be more consistent with the academic vocabulary
used in Raft literature.

Also, semantics, but: If I take a backup of a Raft cluster, is that
backup copy still a state machine, or is it just a data file that
becomes a state machine when instantiated on a server? I'm neither
native English nor academic, so can't tell, but was guessing at the
latter. An important property of the databaseId is that it is also
part of a backup copy.

As for your two-cluster AddServer example: no, of course this should not be allowed.

Look, I'm not trying to argue that something like a cluster id isn't
needed, just that I'm not talking about that right now and not trying
to define it. If you ask me, I think it is kind of implied in Raft
that each server needs to remember what cluster it is part of, and can
only be part of a single cluster, and must reject messages from any
other servers. (This is why the joint-consensus approach for cluster
reconfiguration used in the first Raft papers was odd, since a cluster
ended up accepting messages from a leader no longer part of the
cluster. That's simply not something I would ever implement.)

For example, it would be wrong and silly for a server to accept an
AddServer request from another leader if it's already part of some
other cluster. But Raft doesn't really consider that point of view; it
only describes AddServer as something called on the leader. And I
haven't tried to extend that part, even if it is important for a
real-world implementation.

>> No. You need a majority to timeout before the first PreVote will succeed.
>>
>>
>
> The problem is that you are extending the failover time until the full N/2
> has timed out.
> Since Raft already states that follower timeout is variant (between some
> range).
> You can minimize the time to start fixing the failure by using the first
> server, instead of waiting for the N/2th server to figure it out.

I'm not disagreeing. Even without the leader stickiness I'm proposing,
it is simply generally true that there is a tradeoff between
minimizing failover time and the risk of flip-flops. You can avoid
flip-flops with an arbitrarily long election timeout, but of course
that's not useful, since you want the election timeout to be as short
as possible.

My proposal for leader stickiness adds to this spectrum of trade-offs.
From an algorithmic point of view it does increase the failover time.
Otoh if it helps avoid flip flops, that can in itself be valuable and
worth the trade off. (Note that trade offs are always somewhat
subjective.)

Further, since it helps avoid flip flops, you might be able to
configure the election timeout to a lower value than without
stickiness. Hence, leader stickiness adds a new dimension to the
trade-off.

To fully appreciate the value of leader stickiness, its effect would
have to be measured either in simulations or in some real world
system. I have not done so.

>> This is also ok, but doesn't give you leader stickiness. That's not a
>> big deal, because Raft is already sticky in many situations (an
>> isolated node will often be behind in its log). But it doesn't achieve
>> stickiness in some of the corner cases I'm trying to cover.
>
>
> This give stickiness, because servers will not PreVote if they got a message
> from the leader
> in twice the heartbeat time. So if the leader is well, a PreVote will fail.
>

Sorry, I have a hard time following you. (But please don't give up, I
find this very valuable.)

What you write now, here, sounds like exactly what I'm proposing too.
PreVote will only succeed/return true if the receiver also has not
heard from the leader for a while.

(Our descriptions differ in that your implementation goes to candidate
state first, then does PreVote, while I'm trying to specify it the
opposite way. But it looks like both of these are equivalent in terms
of outcome.)

Your implementation also tries to favor whoever did the first PreVote
- that is a different optimization. If we ignore that part, it seems
like what you write above is exactly what my paper proposes too.
Agree, or am I still not following?

>> What happens if a server S1 sends PreVote to all other servers, and
>> all other servers also respond to it, and then server S1 crashes? I
>> assume after some additional timeout the other nodes do try to become
>> candidates again?
>
>
> The same as what happens if a server crashes immediately after winning an
> election.
> A follower will timeout and run another election. In this case, it will use
> the same term (because the pre vote isn't binding), but that is about it.

Ok, then this makes sense. Thanks.

>> You cannot reject actual RequestVotes. Once you see a RequestVote with
>> a higher term than the currentTerm, then you must abandon the current
>> leader and respond to the RequestVote. This is a very fundamental
>> property of Raft, it is not an optimization.
>
>
> I'm not, I'm rejecting a PreVote, because I have a leader that talks to this
> node on a regular basis. That is where the timeout/2 comes into play.

Sorry. In the above I wasn't disagreeing with you, just emphasizing.
This part is clear to me.

>> > Table 3 seems too complex to me.
>> >
>> > #3 - dangerous. You created data loss event here.
>> > If I called AddServer with "db1/orders" instead of "tst1/orders", I just
>> > lost all my orders.
>> > It is better to reject such scenario entirely. If the admin want to
>> > recreate
>> > the db, they can delete it and create a new one manually.
>> Correct. A typical database implementation would be conservative and
>> fail, rather than automatically delete data. From Raft point of view
>> both are correct behaviors.
>
>
> How can this be correct? You just tanked a server from one cluster to the
> other.
> This means that you just permanently lost a node. For fun, the way it will
> actually work is that now you have a two clusters that will keep sending
> InstallSnapshot to the node, flipping it between the clusters all the time.

As discussed above, for a real world implementation to allow this to
happen would be silly. I'm not saying an implementation doesn't need
to address this, just that it is not what I am addressing right now.

>> > #4 & #2 are the same thing. We have a backup of the cluster log, the
>> > leader
>> > shouldn't need to make distinction between them.
>> > Why do you NEED to distinguish between them?
>> In #2 it is the same server that rejoins the cluster, in #4 it is a
>> brand new server that was provisioned with a copy of the database.
>> From Raft point of view these are actually the same (in both you would
>> just call AddServer) but from the DBA point of view they are of course
>> completely different use cases.
>
>
> You need to decide which side you want to be on :-)
> In parts of your answer you say "this is how a db would do it, but the
> algorithm doesn't care" and here you say the reverse.

That is because I'm talking about different things in each part :-)

> And from the DBA perspective, if I clone a machine (pretty common scenario
> in most virtual env), what is the difference?
> The cloning process took the machine down for a long while, and now I have
> two clones of the original machine, and there is no way to tell which is
> which (except maybe through IPs, which isn't always the same).

Interesting use case, but to give a meaningful answer you'd have to
give more details. For example, if the two clones have the same IP
address and hostname, we end up in a very weird world from a
networking perspective. It's not clear to me that an implementation
needs to be robust against such misconfiguration. At least on first
thought, it seems to me that to guard against this you would just be
reimplementing what TCP/IP is supposed to do for you?

>> We are trying to solve different problems. ClusterId is important, but
>> it is not what I'm discussing here. I hope my above replies have
>> clarified things.
>
>
> I think we are using different terms for the same things. See my state
> machine id comment above.

Maybe :-) More below...

>>
>> However, this is again not how the Raft paper is actually written. In
>> the world of Ongaro's thesis, a single node is its own majority, and
>> *will* become its own leader after election timeout. The paper doesn't
>> discuss any further initialization or boot activities.
>
>
>
> And in the real world, you run the risk on spinning 3 node cluster, each
> starting as independent leader, and each starting to accept requests (and
> committing them!) and you can't reconcile them.
> That is why you need to bootstrap.
>

I think you're right. In fact, now that I think about all the
clustered database implementations I have experience with, all of them
do something like this, and it does make sense. When starting a node
for the first time, you either have to initialize a new cluster, or
add the node to an existing cluster. Before that, it will not do
anything. (This is ignoring technologies like MySQL replication, which
just moves data around, but doesn't itself elect a leader, so is not
really a HA cluster.)

Ok, I think making this a more explicit part of the Raft description
actually would help clarify things. For example, it would then be much
easier to add what I have called the databaseId: it is generated at
bootstrapping.

Having admitted this, I'm also starting to agree that a databaseId and
what you call clusterId could be the same thing. While we have
approached the topic from trying to solve slightly different problems
(looking at servers vs looking at copies of data), they do share at
least one important property - their lifecycle is the same:

- Id is generated at bootstrapping.

- Id is received in AddServer.

- You may face situations where no node can ever become leader again,
because a majority of nodes is lost and you won't be able to replace
them. In such a situation a sysadmin would typically take one or more
of the remaining nodes and call bootstrap again, in order to start a
new cluster. In this case the id must also be regenerated to a new value.


Without knowing much about your implementation, I can't say for sure,
but it seems to me most people thinking about a "cluster id" wouldn't
implement what I have added to the AddServer implementation: if you
add a new server that seemingly has never been part of this cluster
before, but actually has a valid copy of the database/state machine,
then (as an optimization) you can add it as if it were an old server,
without doing InstallSnapshot (which is expensive for large databases).

Allowing this to work is the main reason I have written it the way I
did (e.g. the databaseId is persisted and is part of backups). Other
than this use case, it is rather similar to a cluster id.
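To illustrate, here is a rough Go sketch of that leader-side AddServer decision. The type names and the snapshotIndex threshold are my own placeholders, not something from the paper or from a real implementation:

    package sketch

    import "errors"

    type addServerRequest struct {
        databaseID   string // empty if the new server holds no data at all
        lastLogIndex int
    }

    type leader struct {
        databaseID    string
        snapshotIndex int // entries before this are only available via snapshot
    }

    // planCatchUp decides how to bring the new server up to date.
    func (l *leader) planCatchUp(req addServerRequest) (string, error) {
        switch {
        case req.databaseID == "":
            return "InstallSnapshot", nil // brand-new empty server
        case req.databaseID != l.databaseID:
            return "", errors.New("databaseId mismatch, refusing AddServer")
        case req.lastLogIndex < l.snapshotIndex:
            return "InstallSnapshot", nil // same database, but too far behind
        default:
            // Same database, e.g. restored from a recent backup: the cheap
            // path, just replicate the missing log entries.
            return "AppendEntries", nil
        }
    }

The interesting case is the last one: a matching databaseId is what lets the leader skip the expensive InstallSnapshot.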

henrik

Oren Eini (Ayende Rahien)

Aug 22, 2015, 12:27:36 PM
to raft...@googlegroups.com
In terms of operations, it is also important to add unsafe operations to the system.
What do I mean by unsafe operations? Adding/removing servers to/from the cluster manually.

