Hi raft-dev,
Unfortunately, I need to announce a bug in the dissertation version of membership changes (the single-server changes, not joint consensus). The bug is potentially severe, but the fix I'm proposing is easy to implement.
When Huanchen Zhang and Brandon Amos were working on a class project at CMU to formalize single-server membership changes, Huanchen found the bug and the counter-example below by hand. They contacted me over email on May 14th, and I chose to keep this quiet for a while until we had agreed upon a solution to propose to the list. After several incorrect and/or ugly attempts, I came up with the solution proposed below. I apologize for keeping this information from you for so long.
Recapping the single-server membership change algorithm in my dissertation:
A leader may only create a configuration entry in its log if the prior one is committed and the new one differs from the prior one by at most one server (i.e., adding or removing a single server at a time). Each server always uses the latest configuration in its log, regardless of whether that entry is committed, for counting votes and determining commitment. The full details are in Chapter 4 of my dissertation:
https://github.com/ongardie/dissertation
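To make the rule concrete, here's a rough Python sketch of the check a leader could perform before appending a configuration entry (the names are mine and purely illustrative; a configuration is modeled as a set of server IDs):

    def can_append_config(latest_config, latest_config_committed, new_config):
        # Rule 1: the previous configuration entry must already be committed.
        if not latest_config_committed:
            return False
        # Rule 2: the change adds or removes at most one server, i.e. the
        # symmetric difference of the two configurations has size <= 1.
        return len(latest_config ^ new_config) <= 1

    # Adding S5 to a committed {S1, S2, S3, S4} is allowed:
    print(can_append_config({1, 2, 3, 4}, True, {1, 2, 3, 4, 5}))  # True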
The Raft paper "In search of an understandable consensus algorithm" and its extended version present an earlier, different membership change algorithm called joint consensus. The joint consensus algorithm is not affected by this bug (see paragraph under Scope below).
The bug:
It's essential to Raft that if one decision (a vote or a commitment) is made by one quorum (majority), any subsequent quorum contains at least one server that's aware of that decision; this must hold even across membership changes. My dissertation shows that if two configurations differ by at most one server, any majority of the first and any majority of the second have at least one server in common. Within a single term, a single leader can easily ensure that one configuration and the next differ by at most one server. The bug shows up across term boundaries: two concurrent, competing changes started in different terms can produce configurations that differ by two servers, whose quorums need not overlap, causing a safety violation (split brain).
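Here's a small brute-force Python check (mine, purely illustrative) of both halves of that argument: majorities of two configurations that differ by at most one server always intersect, but two competing changes can produce configurations that differ by two servers, whose majorities need not intersect:

    from itertools import combinations

    def majorities(config):
        # All minimal majorities (quorums) of a configuration.
        need = len(config) // 2 + 1
        return [set(q) for q in combinations(sorted(config), need)]

    def always_overlap(c1, c2):
        return all(q1 & q2 for q1 in majorities(c1) for q2 in majorities(c2))

    C = {1, 2, 3, 4}
    print(always_overlap(C, C | {5}))        # True: differs by one server
    print(always_overlap(C | {5}, C - {1}))  # False: these differ by two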
Below I've reproduced Huanchen's counter-examples, each starting from a 4-server cluster. In the diagrams, (Ln) marks a leader in term n, (Fn) a follower in term n, X a server that is offline, and the bracketed list shows the configuration entries in each server's log. The three examples are similar, and I include all three for completeness, but you might get the idea after one or two.
Counter-example 1 with one add and one remove:
This involves an attempt to add a server to a 4-server cluster and a concurrent attempt to remove a server from the same 4-server cluster.
0. Initial state:
S1 (L1): [C]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): []
where C is the configuration {S1, S2, S3, S4}.
1. S1 catches up S5 and appends configuration D to its log:
S1 (L1): [C, D]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): [C]
where D is the configuration {S1, S2, S3, S4, S5}.
2. S1 replicates D to S5 and goes offline for a while:
S1 (L1): [C, D] X
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): [C, D]
3. S2 starts an election and becomes leader in term 2 with votes from {S2, S3, S4}:
S1 (L1): [C, D] X
S2 (L2): [C]
S3 (F2): [C]
S4 (F2): [C]
S5 (F1): [C, D]
4. S2 appends configuration E:
S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C]
S4 (F2): [C]
S5 (F1): [C, D]
where E is the configuration {S2, S3, S4}.
5. S2 replicates configuration E to S3 and marks it committed (S2 and S3 are a majority of E = {S2, S3, S4}):
S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C, E]
S4 (F2): [C]
S5 (F1): [C, D]
6. S1 starts an election and becomes leader in term 3 with votes from {S1, S4, S5}, a majority of its latest configuration D:
S1 (L3): [C, D]
S2 (L2): [C, E]
S3 (F2): [C, E]
S4 (F3): [C]
S5 (F3): [C, D]
Note that S1 does not have the committed entry E in its log, yet it can still win the election: neither of its other voters (S4 and S5) has E either, so S1's log is at least as up-to-date as theirs.
7. S1 replicates D to everyone else, overwriting a committed entry (E):
S1 (L3): [C, D]
S2 (F3): [C, D]
S3 (F3): [C, D]
S4 (F3): [C, D]
S5 (F3): [C, D]
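The disjoint quorums are easy to check concretely (Python, with integers standing in for S1 through S5): D's quorum from step 6 and E's quorum from step 5 share no server.

    D = {1, 2, 3, 4, 5}  # majority size 3; S1 was elected by {S1, S4, S5}
    E = {2, 3, 4}        # majority size 2; E was committed on {S2, S3}
    print({1, 4, 5} & {2, 3})  # set() -- the two deciding quorums are disjoint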
Counter-example 2 with two adds:
This involves an attempt to add a server to a 4-server cluster and a concurrent attempt to add a different server to the same 4-server cluster.
0. Initial state:
S1 (L1): [C]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): []
S6 (F1): []
where C is the configuration {S1, S2, S3, S4}.
1. S1 catches up S5 and appends configuration D to its log:
S1 (L1): [C, D]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): [C]
S6 (F1): []
where D is the configuration {S1, S2, S3, S4, S5}.
2. S1 replicates D to S5 and goes offline for a while:
S1 (L1): [C, D] X
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
S5 (F1): [C, D]
S6 (F1): []
3. S2 starts an election and becomes leader in term 2 with votes from {S2, S3, S4}:
S1 (L1): [C, D] X
S2 (L2): [C]
S3 (F2): [C]
S4 (F2): [C]
S5 (F1): [C, D]
S6 (F1): []
4. S2 catches up S6 and appends configuration E to its log:
S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C]
S4 (F2): [C]
S5 (F1): [C, D]
S6 (F1): [C]
where E is the configuration {S1, S2, S3, S4, S6}.
5. S2 replicates E to S3 and S6, and marks it committed (S2, S3, and S6 are a majority of E):
S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C, E]
S4 (F2): [C]
S5 (F1): [C, D]
S6 (F1): [C, E]
6. S1 starts an election and becomes leader in term 3 with votes from {S1, S4, S5}:
S1 (L3): [C, D]
S2 (L2): [C, E]
S3 (F2): [C, E]
S4 (F3): [C]
S5 (F3): [C, D]
S6 (F1): [C, E]
Note that S1 does not have the committed entry E in its log.
7. S1 replicates D to everyone else, overwriting a committed entry (E):
S1 (L3): [C, D]
S2 (F3): [C, D]
S3 (F3): [C, D]
S4 (F3): [C, D]
S5 (F3): [C, D]
S6 (F3): [C, D]
Counter-example 3 with two removes:
This involves an attempt to remove a server from a 4-server cluster and a concurrent attempt to remove a different server from the same 4-server cluster.
0. Initial state:
S1 (L1): [C]
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
where C is the configuration {S1, S2, S3, S4}.
1. S1 appends configuration D to its log and goes offline for a while:
S1 (L1): [C, D] X
S2 (F1): [C]
S3 (F1): [C]
S4 (F1): [C]
where D is the configuration {S1, S2, S3}.
2. S2 starts an election and becomes leader in term 2 with votes from {S2, S3, S4}:
S1 (L1): [C, D] X
S2 (L2): [C]
S3 (F2): [C]
S4 (F2): [C]
3. S2 appends configuration E to its log:
S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C]
S4 (F2): [C]
where E is the configuration {S1, S2, S4}.
4. S2 replicates E to S4, and marks it committed (S2 and S4 are a majority of E = {S1, S2, S4}):
S1 (L1): [C, D] X
S2 (L2): [C, E]
S3 (F2): [C]
S4 (F2): [C, E]
5. S1 starts an election and becomes leader in term 3 with votes from {S1, S3}, a majority of its latest configuration D = {S1, S2, S3}:
S1 (L3): [C, D]
S2 (L2): [C, E]
S3 (F3): [C]
S4 (F2): [C, E]
Note that S1 does not have the committed entry E in its log.
6. S1 replicates D to everyone else, overwriting the committed entry E:
S1 (L3): [C, D]
S2 (F3): [C, D]
S3 (F3): [C, D]
S4 (F3): [C, D]
Scope and severity:
This affects Raft implementations that use single-server membership changes when all of the following hold:
1. The cluster starts from an even number of servers (as in the 4-server clusters above),
2. Multiple competing membership changes are requested concurrently, and
3. Leadership is lost while a change is in flight.
In this event, data loss and permanent split brain could occur.
This does not affect Raft implementations that use joint consensus, since:
1. Any two competing joint configurations have the old configuration in common, and agreement in a joint configuration requires a majority of that old configuration, so their quorums overlap with each other and with their predecessor's (sketched below), and
2. Any two competing simple configurations have the new configuration in common and likewise overlap with each other and with their predecessor's.
(I can go into more detail on this upon request, but it may not be necessary.)
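To make point 1 concrete, here's a sketch of the joint-consensus agreement rule from the paper (the names are mine): a decision in the joint configuration C_old,new requires majorities of both the old and the new configuration, which is what forces competing quorums to overlap.

    def is_majority(servers, config):
        return len(servers & config) > len(config) // 2

    def is_joint_quorum(servers, c_old, c_new):
        # A decision in C_old,new needs a majority of BOTH configurations.
        return is_majority(servers, c_old) and is_majority(servers, c_new)

    c_old = {1, 2, 3, 4}
    print(is_joint_quorum({1, 2, 3, 5}, c_old, c_old | {5}))  # True

Two competing joint quorums each contain a majority of the shared old configuration, and two majorities of the same set always intersect, so the quorums overlap even across term boundaries.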
Proposed solution:
The solution I'm proposing is exactly what the dissertation describes, except that a leader may not append a new configuration entry until it has committed an entry from its current term.
In a typical Raft implementation, a leader appends a no-op entry to the log when it is elected. This change would mean rejecting or delaying membership change requests until the no-op entry is committed.
Once a leader has committed an entry in its current term, it knows it has the latest committed configuration, and no existing uncommitted configuration from a prior term can be committed anymore (the servers that store the current leader's committed entry form a majority, and they won't grant votes to servers whose logs end with the stale, uncommitted configuration). At that point it is safe for the leader to create a new configuration (one that differs by at most one server) and begin replicating it.
In John Ousterhout's words, which tend to be better than mine, "The overall requirement is that a leader must not begin replicating a new configuration entry if there is an older configuration entry that is incompletely replicated (it is stored on at least one server, is not currently committed, but could become committed in the future). This ensures that if two configurations "compete", they differ by at most one server and hence have overlapping consensuses. In the algorithm from the dissertation, leaders were careful to make sure there were no incomplete configuration entries from the current term, but the algorithm did not properly handle incomplete configuration entries from previous leaders (which might not even be visible to the current leader). The new approach fixes that by ensuring that any such entries, if they exist, cannot be committed in the future, hence cannot be used for making decisions."
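As code, the fix is one extra guard in front of the dissertation's existing checks. Here's an illustrative Python sketch, extending the earlier one (the names are mine; log_terms maps a log index to the term of the entry at that index, and commit_index is the index of the last committed entry):

    def can_append_config(current_term, commit_index, log_terms,
                          latest_config, latest_config_committed,
                          new_config):
        # NEW: an entry from the leader's current term must already be
        # committed (in practice, the no-op appended on election). Until
        # then, reject or delay membership change requests.
        if log_terms[commit_index] != current_term:
            return False
        # Unchanged dissertation rules: the prior configuration entry is
        # committed, and the change adds or removes at most one server.
        if not latest_config_committed:
            return False
        return len(latest_config ^ new_config) <= 1

In counter-example 1, this guard blocks step 4: S2 would first have to commit a term-2 entry on a majority of C ({S2, S3, S4}, since S1 is offline), and then S4 would refuse to vote for S1 in step 6 because S1's log would no longer be at least as up-to-date as S4's.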
Safety argument:
We don't yet have a formal safety proof for the correctness of membership changes (of either type) in Raft. I've made an attempt at a safety argument for why the proposed solution works, but it's a little ugly.
It extends the safety argument in the paper/dissertation. The only part that differs (as far as I can tell) for membership changes is Step 2, which claims the existence of "the voter": a server that belongs both to the quorum that commits the entry in term T and to the quorum that elects the leader of term U. With a static configuration, it's easy to see the voter must exist (two majorities of the same set must overlap). With membership changes it's harder; my argument aims to show it for the dissertation algorithm patched with the solution proposed above.
Closing remarks:
Let's use this thread to discuss the issue with respect to the single-server change algorithm described in my dissertation and the proposed solution(s). If you want to discuss how this affects other membership change algorithms, including joint consensus or whatever else you folks might have come up with, please keep those discussions in separate, clearly-labeled threads to avoid confusion.
I hope this bug hasn't affected anyone yet, and the events needed to cause it seem pretty unlikely. Still, this can cause data loss and split brain, so it should be taken seriously. If you have an implementation that's affected, you should file a bug report right away and get this patched up soon. Even if the solution I've proposed is ultimately superseded by something better, it won't do any harm to add this if-statement now.
I also plan to start a list of errata for my dissertation, so that new people reading it are aware of this issue.
Your feedback is welcome, and I'm looking forward to the discussion.
-Diego