bug in single-server membership changes

Diego Ongaro

Jul 10, 2015, 12:58:53 AM
to raft...@googlegroups.com
Hi raft-dev,

Unfortunately, I need to announce a bug in the dissertation version of membership changes (the single-server changes, not joint consensus). The bug is potentially severe, but the fix I'm proposing is easy to implement.

When Huanchen Zhang and Brandon Amos were working on a class project at CMU to formalize single-server membership changes, Huanchen found the bug and the counter-example below by hand. They contacted me over email on May 14th, and I chose to keep this quiet for a while until we had agreed upon a solution to propose to the list. After several incorrect and/or ugly attempts, I came up with the solution proposed below. I apologize for keeping this information from you for so long.


Recapping the single-server membership change algorithm in my dissertation:

A leader may only create a configuration entry in its log if the prior one is committed and the new one differs from the prior one by at most one server (i.e., adding or removing a single server at a time). Each server always uses the latest configuration in its log, regardless of whether that entry is committed, for counting votes and determining commitment. The full details are in Chapter 4 of my dissertation: https://github.com/ongardie/dissertation
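For illustration only (my sketch, not code from the dissertation), here is roughly what the "latest configuration in its log" lookup might look like, assuming hypothetical log entries that carry a type and a data field:

    def effective_config(log, snapshot_config):
        # Scan backwards for the most recent configuration entry,
        # committed or not; fall back to the snapshot's configuration.
        for entry in reversed(log):
            if entry.type == "config":
                return entry.data  # e.g., {"S1", "S2", "S3", "S4"}
        return snapshot_config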

The Raft paper "In search of an understandable consensus algorithm" and its extended version present an earlier, different membership change algorithm called joint consensus. The joint consensus algorithm is not affected by this bug (see paragraph under Scope below).


The bug:

It's essential to Raft that if one decision (vote or commitment) is made in one quorum (majority), a subsequent quorum will contain at least one server that's aware of the decision, and this is needed even across membership changes. My dissertation shows how if two configurations differ by at most one server, a majority from the first and a majority from the second will have at least one server in common. Within a single term, a single leader can easily ensure that one configuration and the next differ by at most one server. This bug shows up across term boundaries. It's possible for two concurrent, competing changes across term boundaries to have quorums that don't overlap with each other, causing a safety violation (split brain).
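To see the non-overlap concretely before the full traces, here's a quick sanity check (my addition, using the same sets as counter-example 1 below). D adds S5 to C and E removes S1 from C, so each overlaps C, yet a majority of D and a majority of E can be disjoint:

    C = {"S1", "S2", "S3", "S4"}
    D = C | {"S5"}   # add S5: majority size is 3 of 5
    E = C - {"S1"}   # remove S1: majority size is 2 of 3

    quorum_D = {"S1", "S4", "S5"}  # a valid majority of D
    quorum_E = {"S2", "S3"}        # a valid majority of E

    assert len(quorum_D) > len(D) // 2 and quorum_D <= D
    assert len(quorum_E) > len(E) // 2 and quorum_E <= E
    assert not (quorum_D & quorum_E)  # disjoint: no server saw both decisions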

Below I've reproduced Huanchen's counter-examples, starting from a 4-server cluster. The three are similar, and I include all three for completeness, but you might get the idea after one or two.

Counter-example 1 with one add and one remove:

This involves an attempt to add a server to a 4-server cluster and a concurrent attempt to remove a server from the same 4-server cluster.

0. Initial state:

S1 (L1): [C]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): []

where C is the configuration {S1,S2,S3,S4}.

1. S1 catches up S5 and appends configuration D to its log:

S1 (L1): [C, D]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): [C]

where D is the configuration {S1,S2,S3,S4,S5}.

2. S1 replicates D to S5 and goes offline for a while (X marks an offline server):

S1 (L1): [C, D] X
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): [C, D]

3. S2 starts an election and becomes leader in term 2 with votes from {S2, S3, S4}:

S1 (L1): [C, D] X
S2 (L2): [C]
S3 (F2): [C]
S4 (F2): [C]
S5 (F1): [C, D]

4. S2 appends configuration E:

S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C]
S4 (F2): [C]
S5 (F1): [C, D]

where E is the configuration {S2, S3, S4}.

5. S2 replicates configuration E to S3 and marks it committed:

S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C, E]
S4 (F2): [C]
S5 (F1): [C, D]

6. S1 starts an election and becomes leader in term 3 with votes from {S1, S4, S5}:

S1 (L3): [C, D]
S2 (L2): [C, E]
S3 (F2): [C, E]
S4 (F3): [C]
S5 (F3): [C, D]

Note that S1 does not have the committed entry E in its log.

7. S1 replicates D to everyone else, overwriting a committed entry (E):

S1 (L3): [C, D]
S2 (F3): [C, D]
S3 (F3): [C, D]
S4 (F3): [C, D]
S5 (F3): [C, D]


Counter-example 2 with two adds:

This involves an attempt to add a server to a 4-server cluster and a concurrent attempt to add a different server to the same 4-server cluster.

0. Initial state:

S1 (L1): [C]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): []
S6 (F1): []

where C is the configuration {S1, S2, S3, S4}

1. S1 catches up S5 and appends configuration D to its log:

S1 (L1): [C, D]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): [C]
S6 (F1): []

where D is the configuration {S1, S2, S3, S4, S5}

2. S1 replicates D to S5 and goes offline for a while:

S1 (L1): [C, D] X
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): [C, D]
S6 (F1): []

3. S2 starts an election and becomes leader in term 2 with votes from {S2, S3, S4}:

S1 (L1): [C, D] X
S2 (L2): [C]
S3 (F2): [C]
S4 (F2): [C]
S5 (F1): [C, D]
S6 (F1): []

4. S2 catches up S6 and appends configuration E to its log:

S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C]
S4 (F2): [C]
S5 (F1): [C, D]
S6 (F1): [C]

where E is the configuration {S1, S2, S3, S4, S6}.

5. S2 replicates E to S3 and S6, and marks it committed:

S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C, E]
S4 (F2): [C]
S5 (F1): [C, D]
S6 (F1): [C, E]

6. S1 starts an election and becomes leader in term 3 with votes from {S1, S4, S5}:

S1 (L3): [C, D]
S2 (L2): [C, E]
S3 (F2): [C, E]
S4 (F3): [C]
S5 (F3): [C, D]
S6 (F1): [C, E]

Note that S1 does not have the committed entry E in its log.

7. S1 replicates D to everyone else, overwriting a committed entry (E):

S1 (L3): [C, D]
S2 (F3): [C, D]
S3 (F3): [C, D]
S4 (F3): [C, D]
S5 (F3): [C, D]
S6 (F3): [C, D]


Counter-example 3 with two removes:

This involves an attempt to remove a server from a 4-server cluster and a concurrent attempt to remove a different server from the same 4-server cluster.

0. Initial state:

S1 (L1): [C]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]

where C is the configuration {S1, S2, S3, S4}

1. S1 appends configuration D to its log and goes offline for a while:

S1 (L1): [C, D] X
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]

where D is the configuration {S1, S2, S3}

2. S2 starts an election and becomes leader in term 2 with votes from {S2, S3, S4}:

S1 (L1): [C, D] X
S2 (L2): [C]
S3 (F2): [C]
S4 (F2): [C]

3. S2 appends configuration E to its log:

S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C]
S4 (F2): [C]

where E is the configuration {S1, S2, S4}

4. S2 replicates E to S4, and marks it committed:

S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C]
S4 (F2): [C, E]

5. S1 starts an election and becomes leader in term 3 with votes from {S1, S3}:

S1 (L3): [C, D]
S2 (L2): [C, E]
S3 (F3): [C]
S4 (F2): [C, E]

Note that S1 does not have the committed entry E in its log.

6. S1 replicates D to everyone else, overwriting the committed entry E:

S1 (L3): [C, D]
S2 (F3): [C, D]
S3 (F3): [C, D]
S4 (F3): [C, D]


Scope and severity:

This affects Raft implementations that use single-server membership changes when:
1. A change starts from an even-sized cluster,
2. Multiple different changes are requested concurrently, and
3. Leadership is lost.
In this event, data loss and permanent split brain could occur.

This does not affect Raft implementations that use joint consensus, since:
1. Any two competing joint configurations have the old configuration in common and therefore overlap with themselves and their predecessor, and
2. Any two competing simple configurations have the new configuration in common and therefore overlap with themselves and their predecessor.
(I can go into more detail on this upon request, but it may not be necessary.)
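For reference, here is the joint-consensus quorum rule that makes point 1 immediate (my sketch, assuming set-valued configurations):

    def joint_quorum(voters, c_old, c_new):
        # A joint-consensus quorum must include a majority of BOTH
        # the old and the new configuration.
        return (len(voters & c_old) > len(c_old) // 2 and
                len(voters & c_new) > len(c_new) // 2)

Any two joint quorums each contain a majority of c_old, and two majorities of the same set always intersect.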


Proposed solution:

The solution I'm proposing is exactly what the dissertation describes, except that a leader may not append a new configuration entry until it has committed an entry from its current term.

In a typical Raft implementation, a leader appends a no-op entry to the log when it is elected. This change would mean rejecting or delaying membership change requests until the no-op entry is committed.

Once a leader has committed an entry in its current term, it knows it has the latest committed configuration, and no existing uncommitted configurations from prior terms can be committed anymore (the servers that store the current leader's entry won't vote for the servers that have the uncommitted configuration). So then it is safe for the leader to create a new configuration (that differs by at most one server) and begin replicating it.
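Put as code, the leader-side guard might look like this (a sketch with made-up names; the second condition is the new requirement, while the first was already part of the dissertation's algorithm):

    def can_append_new_config(log, commitIndex, currentTerm, configIndex):
        # 1. The previous configuration entry must already be committed.
        prev_config_committed = commitIndex >= configIndex
        # 2. New: the leader must have committed at least one entry
        #    (e.g., its no-op) in its current term.
        committed_in_current_term = (
            commitIndex > 0 and log[commitIndex - 1].term == currentTerm
        )
        return prev_config_committed and committed_in_current_term

(Here log is a 0-indexed list while Raft indexes entries from 1, hence the - 1.)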

In John Ousterhout's words, which tend to be better than mine, "The overall requirement is that a leader must not begin replicating a new configuration entry if there is an older configuration entry that is incompletely replicated (it is stored on at least one server, is not currently committed, but could become committed in the future). This ensures that if two configurations 'compete', they differ by at most one server and hence have overlapping consensuses. In the algorithm from the dissertation, leaders were careful to make sure there were no incomplete configuration entries from the current term, but the algorithm did not properly handle incomplete configuration entries from previous leaders (which might not even be visible to the current leader). The new approach fixes that by ensuring that any such entries, if they exist, cannot be committed in the future, hence cannot be used for making decisions."


Safety argument:

We don't yet have a formal safety proof for the correctness of membership changes (of either type) in Raft. I've made an attempt at a safety argument for why the proposed solution works, but it's a little ugly.

It extends the safety argument in the paper/dissertation. The only part that's different (as far as I can tell) for membership changes is Step 2, which claims the existence of "the voter". The voter is the server that is part of the quorum that commits the entry in term T and the quorum that votes for the leader of term U. With a static configuration, it's easy to see the voter must exist (overlap of two majorities of the same set). With membership changes, it's more difficult; this aims to show it for the dissertation algorithm patched with the solution proposed above.

Those interested can find the safety argument here: https://gist.github.com/ongardie/a11f32b70581e20d6bcd . I don't have any current plans to flesh this out into a full proof, but I'd be happy to discuss merits of the argument and ways to simplify it.


Closing remarks:

Let's use this thread to discuss the issue with respect to the single-server change algorithm described in my dissertation and the proposed solution(s). If you want to discuss how this affects other membership change algorithms, including joint consensus or whatever else you folks might have come up with, please keep those to separate clearly-labeled threads to avoid confusion.

I hope this bug hasn't affected anyone yet, and the events needed to cause it seem pretty unlikely. Still, this can cause data loss and split brain, so it should be taken seriously. If you have an implementation that's affected, you should file a bug report right away and get this patched up soon. Even if the solution I've proposed is ultimately superseded by something better, it won't do any harm to add this if-statement now.

I also plan to start a list of errata for my dissertation, so that new people reading it are aware of this issue.

Your feedback is welcome, and I'm looking forward to the discussion.

-Diego

Li Xiang

Jul 10, 2015, 2:06:50 AM
to raft...@googlegroups.com
Hi Diego,

In the etcd/raft implementation, we actually use a different but similar approach: we only apply a configuration change after it is committed by the old quorum (rather than as soon as it is added to the log). I thought this was safer, and it means the configuration change command goes through the same commit/apply path as a normal command.

It does have some limitations, which we documented here:
https://github.com/coreos/etcd/blob/master/raft/doc.go#L127-L150.

Thanks,
Xiang

Li Xiang

Jul 10, 2015, 2:21:02 AM
to raft...@googlegroups.com
Hi,

I think there is another difference that might be worth sharing.

In etcd/raft, a newly joined follower first fetches the most recent snapshot and appends the log entries outside of the Raft protocol.

I feel this can also help to simplify the configuration change protocol a little bit.

Here is the original discussion between etcd and cockroach team:
https://github.com/coreos/etcd/issues/2397#issuecomment-76497331

Thanks,
Xiang

Diego Ongaro

Jul 10, 2015, 3:37:29 AM
to raft...@googlegroups.com
Xiang, you have broken the rules. Shame. As your punishment, I hereby order you to prominently display a LogCabin sticker on your CoreOS laptop for a minimum of one year.

As a reminder, "If you want to discuss how this affects other membership change algorithms, including joint consensus or whatever else you folks might have come up with, please keep those to separate clearly-labeled threads to avoid confusion."

Please do not engage Xiang about etcd's bastardized membership change approach on this thread.

-Diego

Li Xiang

Jul 10, 2015, 12:06:06 PM
to raft...@googlegroups.com
> you have broken the rules.

Sorry about that everybody! I should have read through the original post.

> As your punishment, I hereby order you to prominently display a LogCabin sticker on your CoreOS laptop for a minimum of one year.

Such a harsh punishment!

jordan.h...@gmail.com

Jul 10, 2015, 12:34:12 PM
to raft...@googlegroups.com
Good find Huanchen!

I think your solution is elegant in that it relies upon no-ops, an already-implemented component of the algorithm. Should be an easy change.

"After several incorrect and/or ugly attempts..."

Any interest in sharing a quick snippet of what those alternate ideas were? Or is that a part of your life you want to forget about now? :-)

Diego Ongaro

Jul 10, 2015, 3:41:38 PM
to raft...@googlegroups.com
On Fri, Jul 10, 2015 at 9:34 AM, jordan.h...@gmail.com <jordan.h...@gmail.com> wrote:
Good find Huanchen!

I think your solution is elegant in that it relies upon no-ops, an already-implemented component of the algorithm. Should be an easy change.

"After several incorrect and/or ugly attempts..."

Any interest in sharing a quick snippet of what those alternate ideas were? Or is that a part of your life you want to forget about now? :-)


Sure, I suppose that's fair. In chronological order:

1. ugly and incorrect: I first thought that only removes were an issue, so I proposed that if you're removing a server from the cluster, you use the previous quorum (the larger one) until the configuration entry is committed. Then Huanchen showed counter-example 2, a safety violation using only two additions, no removes.

2. at least ugly: Huanchen suggested having servers use their latest committed configs; a leader would send a round of messages before creating a config to make sure a majority already had its latest committed config. That one seemed ugly to me, but it inspired the next one.

3. ugly: In the second idea, I saw some parallel to joint consensus, and pushed things pretty far in that direction. First, a leader would append the transitional config, in which servers continue to use the majority of Cold (only) to form a quorum, but they promise to move to Cnew next. Once the transitional config was committed (according to Cold), a quorum of Cold must have been using Cold; those servers would necessarily move to Cnew next, and any conflicting proposed configurations would have been neutralized. So the leader would append the Cnew config, in which servers would immediately start using the majority of Cnew to form a quorum. This was so close to joint consensus that reverting entirely to joint consensus seemed preferable.

That third idea was an interesting point because I could show it was safe without too much difficulty, using the explicit structure in how configuration entries would alternate in the log and paying no attention to who was leader. Still, John and I were bothered by why pieces of it were necessary. I started pulling those pieces out one by one, until I ended up with the really easy solution I proposed in my email above. The safety argument takes a pretty different shape for this one, having to argue about leaders and terms, but I do believe it to be safe.

-Diego

Oren Eini (Ayende Rahien)

Jul 11, 2015, 2:22:58 AM
to raft...@googlegroups.com
Thanks for reporting this, and I think that the solution is pretty elegant.
Just to verify my understanding, this would mean:

def ModifyConfiguration(...):
    if state != "leader":
        raise Exception("not a leader")
    if termFor(commitIndex) != currentTerm:
        raise Exception("wait until we commit an entry")



Diego Ongaro

Jul 11, 2015, 12:07:11 PM
to raft...@googlegroups.com


On Jul 10, 2015 11:22 PM, "Oren Eini (Ayende Rahien)" <aye...@ayende.com> wrote:
>
> Thanks for reporting this, and I think that the solution is pretty elegant.
> Just to verify my understanding, this would mean:
>
> def ModifyConfiguration(...):
>     if state != "leader":
>         raise Exception("not a leader")
>     if termFor(commitIndex) != currentTerm:
>         raise Exception("wait until we commit an entry")

Yep, I think that's the only change.

You should have already had a check like

    if commitIndex < configurationIndex:
        raise Exception("previous change still in progress")

in there somewhere, too.

-Diego

Юрий Соколов

Jun 24, 2016, 11:44:15 AM
to raft-dev
Were this bug and its workaround documented anywhere outside this Google group?

Why does the paper referenced from https://raft.github.io/ still not mention single-server membership changes?

With regards,
Sokolov Yura

On Saturday, July 11, 2015 at 19:07:11 UTC+3, Diego Ongaro wrote:

Юрий Соколов

Jun 24, 2016, 11:52:26 AM
to raft-dev
Ah, I see: there is a link from the README in the dissertation repository to this bug, so my first question can be counted as answered.

On Friday, June 24, 2016 at 18:44:15 UTC+3, Юрий Соколов wrote:

Diego Ongaro

Jun 24, 2016, 9:46:59 PM
to raft...@googlegroups.com
Why does the paper referenced from https://raft.github.io/ still not mention single-server membership changes?

There are two versions of the paper: a slightly shorter version published at USENIX ATC '14, and the "extended version" posted on the Internet at the same time. These were written before single-server-at-a-time membership changes were developed, so they describe the older joint consensus approach instead.

The ATC '14 paper is immutable, but we could update the "extended version" and post a new revision. That just hasn't happened.

Maybe the easiest thing would be to add a short note in the margin and not change the existing text or layout. I'll volunteer to do that much eventually, though I don't look forward to battling LaTeX on this.

-Diego

Andy Chen

Aug 24, 2016, 7:35:35 PM
to raft-dev
Hi Xiangli,

This bug does not apply to a two-server cluster; in that case there is no limitation.

/*
 * If there are only two servers in the cluster, it is safe to remove a server
 * directly from the peers, since at most one config change can happen at a
 * time (assuming the leader allows at most one uncommitted config in the log
 * store). Proof sketch: suppose two config changes could be in flight at the
 * same time. That would require a new leader to be elected after the previous
 * leader went offline, which is impossible: no leader can be elected after
 * one of the two servers goes offline.
 */

Thanks,
Andy

Diego Ongaro

Aug 25, 2016, 1:47:26 PM
to raft...@googlegroups.com
Andy, I don't know where that's coming from, but I really don't want to talk about etcd's membership changes in this same thread. etcd's approach is different from what's discussed in my dissertation, and it's extremely confusing to mix the two approaches here. Quoting my original post:

If you want to discuss how this affects other membership change algorithms, including joint consensus or whatever else you folks might have come up with, please keep those to separate clearly-labeled threads to avoid confusion.

Andy Chen

Sep 15, 2016, 12:05:09 PM
to raft-dev
Okay, but my original thinking was: this bug has some exceptions, and we should generalize it through proof so we can give better guidelines for implementations. Based on my own work, I found it can be proved that this bug only applies to clusters with an even number of servers (except two). That means an implementation only needs to take special care when the current cluster size is even (using the committed config in that case); otherwise, a new config can be applied directly without waiting for it to commit.

Diego Ongaro

Sep 15, 2016, 12:24:16 PM
to raft...@googlegroups.com
Andy,

I honestly don't know which protocol you're talking about anymore, and that confusion is what I'm hoping to avoid.

I continue to stand by the fix to my dissertation that I proposed on this thread back in July of 2015, which only requires adding a single if-statement to most implementations. Unless you find a problem with that fix, I doubt we'd find any easier solution. There's no point in adding more complexity to that if-statement, if that's what you're suggesting.

If you're still talking about etcd's membership changes in this thread, then stop. I can't make that any more clear.

-Diego


Willy Schultz

Nov 9, 2020, 1:54:01 PM
to raft-dev
I have one high-level question about the safety argument sketch linked above. If my understanding is correct, it seems that the argument aims to establish the "leader completeness property" under reconfiguration, i.e., if a log entry is committed in term T, then all future leaders will contain the entry in their log. I'm curious, though, whether this alone would be sufficient to establish safety of reconfiguration (using either single-server membership change or joint consensus).

My main thought is that it would also be necessary to ensure that the election safety property (at most one leader per term) holds, since leader completeness, and other core properties of Raft, e.g., "log matching", seem to depend on election safety. To put this another way, leader completeness seems to be about ensuring proper intersection between future vote quorums and past write quorums, whereas election safety is about ensuring intersection of any two vote quorums. It would seem necessary to prove both.

I understand that the sketch was only a first attempt at fleshing out a proof here and that you didn't spend much time expanding on it, but I would be curious if you have any thoughts on this and how you approached this when first thinking about these correctness arguments. 
