Is "one-phase commit" Joint Consensus still correct?


Chi Li

Mar 25, 2025, 10:52:45 PM
to raft-dev
Hi all,

The paper describes joint consensus based on a two-phase commit (2PC). I am wondering whether it can be done as a one-phase commit:
1. Leader appends and replicates Cold,new, and members start to use Cold,new as soon as the config is persisted in the log (same as paper)
2. Members start to use Cnew as soon as Cold,new is committed (different from the paper).

Since Cold,new is committed, Cold alone can never form a quorum (no split brain with Cold). Thus, it should be safe for those who have committed Cold,new to use Cnew. Is my understanding correct? I could not think of any corner case that proves me wrong.

Best,
Chi

A. Jesse Jiryu Davis

Mar 26, 2025, 12:22:12 PM
to raft...@googlegroups.com
One problem (at least) is if Cnew is larger than Cold. See Figure 10 in the Raft paper. Cold includes servers 1 through 3. Cnew includes servers 1 through 5. Let's say server 3 is the leader. You send it Cnew, it commits "Cold,new" to its log by replicating it to servers 3, 4, and 5. Servers 3, 4, and 5 are a majority of Cnew, so they can elect one of themselves as the leader in Cnew. If servers 1 and 2 don't learn of Cnew, they can elect one of themselves as the leader of Cold, because servers 1 and 2 are a majority of Cold. Now you have two leaders in the same term, and all of Raft's guarantees can now be violated.
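To make the counterexample concrete, here is a minimal Python sketch, assuming plain majority quorums. The helper `is_majority` and the variable names are illustrative only and do not come from any Raft implementation.

```python
def is_majority(votes, config):
    """True if `votes` contains more than half of the servers in `config`."""
    return len(set(votes) & set(config)) > len(config) // 2

c_old = {1, 2, 3}
c_new = {1, 2, 3, 4, 5}

old_side = {1, 2}      # servers that never learned of Cnew
new_side = {3, 4, 5}   # servers that switched directly to Cnew

# Split brain is possible when both sides are majorities of their own
# configuration and share no server that could refuse one of the votes.
split_brain = (
    is_majority(old_side, c_old)
    and is_majority(new_side, c_new)
    and old_side.isdisjoint(new_side)
)
print(split_brain)  # True
```

The two groups are disjoint, so neither election can be blocked by the other.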

The Raft paper solves this with a joint consensus phase. The Raft thesis also proposes single-node changes: so long as each reconfig only adds one server or removes one server, any majority of Cold overlaps any majority of Cnew, so there's no need for a distinct joint consensus phase. MongoDB adopted this, plus an enhancement: we don't commit the reconfig in the oplog, so it's easier for MongoDB to use reconfig to remove lagging followers.


Chi Li

Mar 26, 2025, 12:56:03 PM
to raft-dev
Hi,

Thanks for the reply. 

Maybe I did not describe it correctly. I understand the difference between joint consensus and single-node changes. In the counterexample you provided, leader 3 could not commit Cold,new. Raft uses a new config as soon as it is received in the log, so for server 3 to commit Cold,new it needs a joint quorum: a majority of Cold and a majority of Cnew. The majority of Cold requires at least one of servers 1 and 2. Since one of servers 1 and 2 has received Cold,new, servers 1 and 2 could not form a quorum under Cold.
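This argument can be checked exhaustively for the Figure 10 configuration. The sketch below assumes the joint rule is "a majority of Cold and, separately, a majority of Cnew"; the function names are hypothetical, chosen only for illustration.

```python
from itertools import combinations

def is_majority(votes, config):
    """True if `votes` contains more than half of the servers in `config`."""
    return len(set(votes) & set(config)) > len(config) // 2

def is_joint_quorum(votes, c_old, c_new):
    """Joint rule: separate majorities of both Cold and Cnew are required."""
    return is_majority(votes, c_old) and is_majority(votes, c_new)

def all_subsets(servers):
    """Yield every subset of `servers` as a set."""
    servers = sorted(servers)
    for size in range(len(servers) + 1):
        for combo in combinations(servers, size):
            yield set(combo)

c_old = {1, 2, 3}
c_new = {1, 2, 3, 4, 5}

# Servers 3, 4, 5 alone hold only one of the three Cold votes.
print(is_joint_quorum({3, 4, 5}, c_old, c_new))  # False

joint_quorums = [s for s in all_subsets(c_old | c_new)
                 if is_joint_quorum(s, c_old, c_new)]
old_majorities = [s for s in all_subsets(c_old) if is_majority(s, c_old)]

# Every joint quorum shares at least one server with every Cold majority.
always_overlap = all(q & m for q in joint_quorums for m in old_majorities)
print(always_overlap)  # True
```

So once Cold,new is committed under the joint rule, every Cold-only majority includes at least one server that already holds Cold,new in its log.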

Best,
Chi

A. Jesse Jiryu Davis

Mar 28, 2025, 9:16:06 AM
to raft...@googlegroups.com
Sorry, I think I understand your proposal now. So in the Raft paper/thesis, joint consensus works like:
  1. Leader L writes Cold,new to its log at position i.
  2. L replicates all subsequent entries to all members of Cold and Cnew.
  3. When L's commitIndex >= i, L writes Cnew to its log, and L starts using Cnew.
After this, the servers only in Cold can be shut down.

I think you propose this change?:
  1. L writes Cold,new to its log at position i.
  2. L replicates all subsequent entries to all members of Cold and Cnew.
  3. When L's commitIndex >= i, L starts using Cnew.
So, what's the purpose of writing the Cnew entry? Why can't L just assume Cnew is the configuration if it knows Cold,new is committed?

I think the Cnew entry is useful if L crashes and a new leader L2 is elected. Let's say L2 has Cold,new in its log, but L2's commitIndex < i, so L2 doesn't know whether Cold,new was committed. (The commitIndex is communicated asynchronously, so a new leader's commitIndex lags the previous leader's commitIndex by an unknown amount.) What should L2 do? If it commits its next entry with a majority of Cnew only, it risks the "two disjoint majorities" problem of Figure 10 in the paper. So for safety, it should commit an entry with a joint majority of Cold,new. But what if the servers that were only in Cold have been shut down? If a majority of Cold servers are now unavailable, L2 can't proceed.

So I think there's at least one use of the Cnew entry. Future leaders can look in their logs for a Cnew entry. If they see one, they know that Cold,new was committed, so they won't try to contact servers only in Cold anymore. An administrator can wait for the Cnew entry to appear before shutting down the servers only in Cold.
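As a rough illustration of that log lookup, here is a Python sketch. The `ConfigEntry` type, the `kind` field, and both functions are assumptions made up for this example; a real implementation would store configuration entries differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ConfigEntry:
    index: int
    kind: str            # "joint" for Cold,new, "new" for Cnew
    c_old: frozenset
    c_new: frozenset

def latest_config(log) -> Optional[ConfigEntry]:
    """Return the last configuration entry in the log, committed or not."""
    for entry in reversed(log):
        if isinstance(entry, ConfigEntry):
            return entry
    return None

def servers_to_contact(log):
    """Servers a new leader must reach, based only on its own log."""
    cfg = latest_config(log)
    if cfg is None:
        return frozenset()
    if cfg.kind == "new":
        # A Cnew entry is only written after Cold,new was committed,
        # so servers that are only in Cold can be ignored.
        return cfg.c_new
    # Only Cold,new is visible: its commit status is unknown, stay joint.
    return cfg.c_old | cfg.c_new

c_old = frozenset({1, 2, 3})
c_new = frozenset({3, 4, 5})

log_without_cnew = ["x=1", ConfigEntry(2, "joint", c_old, c_new), "y=2"]
log_with_cnew = log_without_cnew + [ConfigEntry(4, "new", c_old, c_new)]

print(sorted(servers_to_contact(log_without_cnew)))  # [1, 2, 3, 4, 5]
print(sorted(servers_to_contact(log_with_cnew)))     # [3, 4, 5]
```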



Chi Li

Mar 31, 2025, 3:28:17 PM
to raft-dev
Hi,

Thanks again for your reply.

The example you gave is an availability problem. I think Raft's guarantee is strong consistency, and availability is out of scope. To be more specific, what you described can be generalized to the situation where a quorum of Cold is down during the configuration change. In the paper's "two-step" joint consensus approach, this can still happen. After Cold,new is committed, bringing down Cold will also cause a loss of quorum availability, which is no different from the "one-step" approach I proposed. To avoid availability loss in the "one-step" approach, the admin should make sure the current leader has committed Cold,new before shutting down Cold.

Does this make sense? I think you are asking the right question: do we really need a separate "Cnew" in the log given we already have Cold,new? I think we might not need it.

Best,
Chi

dr-dr xp

Apr 6, 2025, 12:44:53 PM
to raft-dev
You are right. This algo can safely change the config.

But it brings in other issues: when a node starts up, it does not know whether the config is `C_old,new` or `C_new`, so it has to assume it is `C_old,new`. If the cluster has already entered `C_new`, the nodes in `C_old` may have already been permanently purged and removed. You have to deal with this carefully; otherwise it might lead to a deadlock, because nodes in `C_old` may have been removed and thus no leader can be elected.
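A small sketch of that deadlock, assuming a hypothetical reconfiguration from servers {1, 2, 3} to {3, 4, 5} (a different example from Figure 10, picked so that some servers are really removed). The names `can_elect_leader` and `mode` are illustrative, not taken from any implementation.

```python
def is_majority(votes, config):
    """True if `votes` contains more than half of the servers in `config`."""
    return len(set(votes) & set(config)) > len(config) // 2

def can_elect_leader(live, c_old, c_new, mode):
    """Can the live servers elect a leader under the assumed config mode?"""
    if mode == "joint":
        # A restarted node that only sees C_old,new must require both majorities.
        return is_majority(live, c_old) and is_majority(live, c_new)
    return is_majority(live, c_new)

c_old = {1, 2, 3}
c_new = {3, 4, 5}
live = {3, 4, 5}   # servers 1 and 2 were purged after the switch to C_new

stuck_if_joint = not can_elect_leader(live, c_old, c_new, "joint")
fine_if_new = can_elect_leader(live, c_old, c_new, "new")
print(stuck_if_joint, fine_if_new)  # True True
```

The live servers are a full `C_new`, yet under the joint assumption they hold only one of three `C_old` votes and cannot elect anyone.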

The original joint consensus naturally adds a **barrier** between using `C_old,new` and using `C_new`.

And a patch to this one-phase commit algo just needs to append one more log entry to play the role of the **barrier**. The patched version of one-phase commit then becomes a variant of the original joint consensus algo.