Can a configuration entry be used only after it is committed, rather than as soon as it is added, in two-phase joint consensus?

程栋彬

Apr 19, 2024, 3:11:29 AM
to raft-dev
Hi all,

Glad to find this group!

I have read the Raft paper (In Search of an Understandable Consensus Algorithm). Its section on cluster membership changes describes two-phase joint consensus and emphasizes:
"Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed)."  
I'm wondering whether a server could start using a cluster reconfiguration entry only after it is committed, rather than using it as soon as it is in the log regardless of whether it is committed.

Could anybody explain why a configuration entry must be used as soon as it is added to a server's log? I have tried and failed to find a bad case for using a configuration entry only after it is committed, and that rule also seems workable for keeping safety during cluster reconfiguration.

Thanks in advance!
Dongbin Cheng

Diego Ongaro

Apr 29, 2024, 11:56:06 PM
to raft...@googlegroups.com
Hi Dongbin Cheng,

I'm sure I could have easily answered this question in the past. Thankfully, I wrote a little bit about it in my dissertation on the pages physically numbered 35 and 36, in the context of single server at a time membership changes:

As stated above, servers always use the latest configuration in their logs, regardless of whether that configuration entry has been committed. This allows leaders to easily avoid overlapping configuration changes (the third item above), by not beginning a new change until the previous change’s entry has committed. It is only safe to start another membership change once a majority of the old cluster has moved to operating under the rules of Cnew. If servers adopted Cnew only when they learned that Cnew was committed, Raft leaders would have a difficult time knowing when a majority of the old cluster had adopted it. They would need to track which servers know of the entry’s commitment, and the servers would need to persist their commit index to disk; neither of these mechanisms is required in Raft. Instead, each server adopts Cnew as soon as that entry exists in its log, and the leader knows it’s safe to allow further configuration changes as soon as the Cnew entry has been committed. [...]

See https://github.com/ongardie/dissertation and please note the errata mentioned there for chapter 4.
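The bookkeeping argument in the quoted passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the dissertation: `match_index` stands in for the leader's usual per-server replication progress (assumed here to include the leader itself), while `follower_commit_index` is a hypothetical extra map that the leader would need under the adopt-on-commit rule and that Raft does not normally maintain.

```python
from typing import Dict, Set


def adopted_on_append(match_index: Dict[str, int], config_index: int) -> Set[str]:
    """Servers whose log contains the configuration entry (the paper's rule)."""
    return {s for s, m in match_index.items() if m >= config_index}


def adopted_on_commit(known_commit_index: Dict[str, int], config_index: int) -> Set[str]:
    """Servers that know the configuration entry is committed (the proposed rule)."""
    return {s for s, c in known_commit_index.items() if c >= config_index}


def is_majority(servers: Set[str], cluster: Set[str]) -> bool:
    """True if `servers` contains a strict majority of `cluster`."""
    return len(servers & cluster) > len(cluster) // 2


# Hypothetical three-server old cluster; the Cnew entry sits at log index 5.
c_old = {"s1", "s2", "s3"}
config_index = 5

# What the leader (s1) already tracks: how far each server's log matches its own.
match_index = {"s1": 5, "s2": 5, "s3": 4}

# What the leader would additionally have to learn under adopt-on-commit:
# each server's own view of the commit index (followers lag behind the leader).
follower_commit_index = {"s1": 5, "s2": 4, "s3": 4}

print(is_majority(adopted_on_append(match_index, config_index), c_old))            # True
print(is_majority(adopted_on_commit(follower_commit_index, config_index), c_old))  # False
```

Under the paper's rule the leader can answer "has a majority of the old cluster adopted Cnew?" from state it already has, because that condition is the same as the entry being committed. Under the proposed rule it needs the second map, which is the extra tracking the passage refers to.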

Your question was about joint consensus, which is the older version of Raft membership changes that we included in the paper and is a bit different. Similar issues apply to joint consensus when using the proposed rule of adopting a configuration only once it is committed:

1. Suppose a leader committed a configuration change and started using the new configuration. Then, the leader restarts or leadership changes to another server. Since Raft does not normally persist or replicate the commit index, the rebooted or new leader would probably revert to an older configuration, which could be unsafe.

2. In joint consensus, suppose a leader committed the Cold,new entry. The leader can't safely commit the Cnew entry until the leader knows that a majority of Cold and a majority of Cnew have adopted the Cold,new configuration -- so those followers must have marked Cold,new as committed and persisted that and informed the leader. (Otherwise, the leader could operate under the rules of Cnew while other members of the cluster operated under Cold.)
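The revert described in item 1 can be sketched as follows. This is a minimal illustration, assuming a log modeled as a list of `(term, config)` tuples and a volatile `commit_index` that resets to 0 on restart; the function name and the `adopt_on_commit` flag are hypothetical. It shows only that the configuration in use changes after a restart, not whether that is unsafe.

```python
from typing import FrozenSet, List, Optional, Tuple

# Each log entry is (term, config); config is None for ordinary entries.
LogEntry = Tuple[int, Optional[FrozenSet[str]]]


def config_in_use(log: List[LogEntry], initial: FrozenSet[str],
                  commit_index: int, adopt_on_commit: bool) -> FrozenSet[str]:
    """Return the latest configuration visible under the chosen rule.

    commit_index is the number of leading log entries known to be committed.
    """
    visible = log[:commit_index] if adopt_on_commit else log
    for _term, config in reversed(visible):
        if config is not None:
            return config
    return initial


c_old = frozenset({"a", "b", "c"})
c_new = frozenset({"a", "b", "c", "d"})
log: List[LogEntry] = [(1, None), (1, c_new)]

# Before the restart the leader knows both entries are committed.
before = config_in_use(log, c_old, commit_index=2, adopt_on_commit=True)

# After a restart the log is still on disk, but the commit index is lost.
after_commit_rule = config_in_use(log, c_old, commit_index=0, adopt_on_commit=True)
after_append_rule = config_in_use(log, c_old, commit_index=0, adopt_on_commit=False)

print(before == c_new)             # True
print(after_commit_rule == c_old)  # True: the server reverted to the old configuration
print(after_append_rule == c_new)  # True: the paper's rule is unaffected by the restart
```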

It may be possible to extend Raft to persist and replicate commit indexes and for leaders to track when followers persist new commit indexes. I don't believe I ever explored that because it seemed immediately harder.

etcd uses a different membership change protocol that changes a single server at a time but waits for configuration entries to be committed before using them. The doc comments at https://github.com/etcd-io/raft/blob/main/doc.go#L260 mention some issues with the approach:

1. This can fail when removing a server from a two-server cluster. I think this refers to when you try to remove the leader of a two-server cluster: the leader commits the new entry, so it steps down, but the other server doesn't know the entry committed, so it's stuck.

2. They restrict membership changes to happen only when the leader has committed everything to prevent overlapping changes. In a busy cluster, you may never reach this state, so I imagine they'll block the creation of new entries and cause some brief unavailability before starting a membership change.

-Diego



程栋彬

May 6, 2024, 12:18:39 PM
to raft-dev
Hi Diego

I'm excited to get your reply, and thank you very much for your explanation. I still have some questions about the cases you listed for the proposed rule of adopting a configuration only once it is committed. I think they are not bad cases and won't lead to an unsafe situation. Would you mind reviewing the following descriptions and correcting me if I have misunderstood joint consensus?

For the first case, a leader committed a configuration change and started using the new configuration. We know this configuration entry is committed because a majority of the old cluster holds it. Suppose the leader crashes or restarts before the followers learn that the entry has been committed. The remaining nodes then vote for a new leader, and the new leader must hold this configuration entry, otherwise it could not win the election. As a new leader with a higher term, it must commit all existing entries, which includes the configuration entry, and so it learns that the new configuration should be adopted. It does this by committing a no-op entry, so the new leader uses the new configuration before persisting and replicating any other new entries in its term.

For the second case, once the leader has committed the Cold,new entry, it generates the Cnew entry and sends it to followers in an AppendEntries request. As described in your paper, AppendEntries carries a leaderCommit field that holds the leader's commit index. So as soon as a follower receives and persists the Cnew entry, it knows that Cold,new is committed and adopts it. I therefore think the leader can safely commit Cnew once it knows that a majority of Cnew and a majority of Cold have replicated the Cnew entry, because that means both majorities have adopted the Cold,new configuration. The commit index need not be persisted, and the leader need not track when followers persist new commit indexes. The commit index does, however, need to be replicated from leader to followers through AppendEntries requests.
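The second case above can be sketched as a simplified follower. This is an illustration only: it omits term checks and the log consistency check, configurations are plain string labels, and the `Follower` class and its method names are hypothetical. The commit index update follows the rule in the Raft paper (advance to the minimum of leaderCommit and the index of the last new entry).

```python
from typing import List, Optional, Tuple

LogEntry = Tuple[int, Optional[str]]  # (term, config label or None)


class Follower:
    """Follower that adopts a configuration only once it knows it is committed."""

    def __init__(self, initial_config: str) -> None:
        self.log: List[LogEntry] = []
        self.commit_index = 0  # volatile; number of entries known committed
        self.initial_config = initial_config

    def append_entries(self, prev_index: int, entries: List[LogEntry],
                       leader_commit: int) -> None:
        # Simplified: assume the request is valid and matches at prev_index.
        self.log = self.log[:prev_index] + list(entries)
        last_new_index = prev_index + len(entries)
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, last_new_index)

    def config_in_use(self) -> str:
        for _term, config in reversed(self.log[:self.commit_index]):
            if config is not None:
                return config
        return self.initial_config


f = Follower("C_old")

# The leader replicates C_old,new at index 1; nothing is committed yet.
f.append_entries(prev_index=0, entries=[(1, "C_old,new")], leader_commit=0)
config_before = f.config_in_use()

# Having committed C_old,new, the leader sends C_new with leader_commit=1.
f.append_entries(prev_index=1, entries=[(1, "C_new")], leader_commit=1)
config_after = f.config_in_use()

print(config_before)  # C_old
print(config_after)   # C_old,new
```

In this sketch, any follower that has stored the Cnew entry has necessarily seen a leaderCommit covering Cold,new, which is the step the argument relies on.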
 
For the case of removing one node from a two-node cluster: if the removed node is the leader, using the new configuration only after it is committed will indeed leave the cluster stuck. So I think this is a genuine bad case caused by using a configuration only after it is committed. Thanks again for your description and explanation!

Best Wishes
Dongbin

Diego Ongaro

May 7, 2024, 6:21:54 PM
to raft...@googlegroups.com
Hi Dongbin,

I'll begin with a disclaimer that I've forgotten many of Raft's details, so my responses aren't as reliable as they once were. I should probably re-read my own dissertation. I had forgotten that leaders transmit their commit index in the AppendEntries request. (This is useful for keeping the followers' state machines nearly up-to-date. Without it, when a follower becomes leader, its state machine may have to process many entries, and it wouldn't be able to keep up with log compaction.) Thanks for pushing back and reminding me of this.

I think you're saying that a server may assume (or MUST assume) a configuration entry is committed if it finds another configuration entry later in its log. That seems valid; it wasn't something I considered when I wrote my last reply, and it may not be something I considered when I wrote my dissertation.

With that change, I also don't see a safety issue with the first or second cases we've been discussing. The leader may only create the C_{n+2} entry after committing the C_{n+1} entry to the C_{n} configuration. This implies a majority of C_{n} have the C_{n+1} entry, so a majority of C_{n} infer that the C_{n} entry is committed and use it. Then, by the time a leader is able to commit C_{n+2} and then start using it, a majority of C_{n+1} infer that the C_{n+1} entry is committed and use it. Since two consecutive configurations have overlapping quorums, that seems like it could be OK.
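The inference rule discussed above (treat a configuration entry as committed if a later configuration entry follows it in the log) might look like the following. This is a sketch under assumptions: the log is a list where each slot holds a configuration label or `None`, `commit_index` counts entries known committed, and the function name is hypothetical. It relies on the stated precondition that a leader creates the next configuration entry only after committing the previous one.

```python
from typing import List, Optional


def config_in_use(log: List[Optional[str]], initial: str, commit_index: int) -> str:
    """Pick the configuration to use when entries take effect only once committed.

    The latest configuration entry is used if it is known committed. Otherwise
    the one before it is used, since the existence of a later configuration
    entry implies that the earlier one was committed.
    """
    configs = [(i, c) for i, c in enumerate(log, start=1) if c is not None]
    if not configs:
        return initial
    last_index, last_config = configs[-1]
    if last_index <= commit_index:
        return last_config
    return configs[-2][1] if len(configs) >= 2 else initial


log = [None, "C1", None, "C2"]

# Even with the commit index lost (for example after a restart), C1 is inferred
# to be committed because C2 appears later in the log.
print(config_in_use(log, "C0", commit_index=0))  # C1

# Once the server learns that index 4 is committed, it moves to C2.
print(config_in_use(log, "C0", commit_index=4))  # C2
```

With this rule a server never falls back more than one configuration behind the latest entry in its log, which matches the overlapping-quorum argument for consecutive configurations.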

Back to removing the leader of a two-server cluster:

- As mentioned in my dissertation (section 4.2.2 first paragraph), you could sidestep the issue of removing leaders by transferring leadership to another server that will remain in the cluster, then doing the membership change.

- A similar issue applies to larger clusters with unavailable servers. For example, consider removing the leader of a 4-server cluster where another server is unavailable. The 3 available servers, including the to-be-removed leader, are needed to form an available majority of the old configuration. If the to-be-removed leader steps down and shuts down too soon, the two other available servers won't know they can use the new configuration.
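The quorum arithmetic behind both examples above can be checked with a short sketch. The server names and the `can_form_majority` helper are hypothetical; the sketch only counts servers and does not model elections or log state.

```python
from typing import Set


def can_form_majority(config: Set[str], available: Set[str]) -> bool:
    """True if the available servers include a strict majority of `config`."""
    return len(config & available) >= len(config) // 2 + 1


# Two-server cluster: remove the leader s1.
two_old, two_new = {"s1", "s2"}, {"s2"}
after_leader_gone = {"s2"}
print(can_form_majority(two_old, after_leader_gone))  # False: stuck if s2 still uses Cold
print(can_form_majority(two_new, after_leader_gone))  # True: fine if s2 uses Cnew

# Four-server cluster: remove the leader s1 while s4 is unavailable.
four_old = {"s1", "s2", "s3", "s4"}
four_new = {"s2", "s3", "s4"}
print(can_form_majority(four_old, {"s1", "s2", "s3"}))  # True: needs the outgoing leader
print(can_form_majority(four_old, {"s2", "s3"}))        # False: s1 shut down too soon
print(can_form_majority(four_new, {"s2", "s3"}))        # True: only if both use Cnew
```

In both cases the remaining servers can make progress only under Cnew, so they are stuck if they have not yet learned that Cnew is committed when the old leader goes away.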

I think the more general question is: when are we allowed to turn off removed servers? In normal Raft membership changes, that's as soon as Cnew is committed. If servers wait until Cnew is committed to use Cnew, shutting down then is too soon, because some servers may still use/revert to the previous configuration. I guess you could introduce another log entry?

-Diego
