Is there a bug w.r.t when Raft leader return OK to client?

143 views
Skip to first unread message

David Kao

unread,
Oct 30, 2019, 8:02:36 PM10/30/19
to raft-dev
Has anyone ever had similar unease feeling regarding how a entry is considered committed in Raft?

Here is what I found in "In Search of an Understandable Consensus Algorithm (Extended Version)" after about 10 pages.

Let's say:
  • Leader append an entry to majority of followers upon receiving a write request from client
  • leader considers the entry committed after majority has the entry in its log; but only he knows it is committed.
  • leader returns to client OK <--- this is how I read the paper.
  • but only in next heartbeat (appendEntries) that clients start to learn about a log index being committed <-- only the current term leader decides if an entry in its term is committed; subsequent term leader cannot decide on entries of the previous term.
  • if the leader fails before any follower (or < majority followers) learns about this committed value, then it seems possible for that entry to be viewed as uncommitted, which means the next term leader could overwrite it; but that entry was already returned OK to client.
    • remember, having an entry in the Raft log doesn't mean it is committed; the highest committed index needs to increment
  • To fix the above, the leader may have to broadcast the committed entry (by incrementing highest committed log index) to a majority before returning to client. That way, whoever has the latest committed value can be elected the new leader (leader completeness).
    • But this makes the algorithm a naive Paxos, taking two rounds (append actual entry to Raft log & then telling followers it was committed, before returning OK to client)
Raft is so widely received I doubt it has any bugs, so can someone point out where I understood the algorithm to be wrong? What do you think?

Alternatively, my understanding of the section "committing entries from previous terms" might be wrong w.r.t how leaders are elected ... but that's why I am writing this post.

Keine Neco

unread,
Oct 30, 2019, 10:37:28 PM10/30/19
to raft...@googlegroups.com
Hi David,

At 5th stage, e.g. this is a 5 nodes Raft group, there are at least 3 nodes contains this committed entry (one is the failed leader), so there are just two nodes which don't contains this entry. The majority can't be organized by only two node, so the new majority have to contains at least one node with this committed entry (But this node may think it is not committed now.). this entry contains maximum log term and log index, so this node must be elected as new leader. And its all uncommitted entries will be committed after its first no-op entry committed.

To suffer twice RTT is not so reasonable for Raft to achieve a lower latency.

I think you can read the election part again to get more details. Thanks for your question.

Dong.

David Kao <a.li...@gmail.com> 于2019年10月31日周四 上午8:02写道:
--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/f9d87f46-e5b8-42bb-a7df-03c334b80758%40googlegroups.com.

Changqing Li

unread,
Oct 31, 2019, 5:15:24 AM10/31/19
to raft...@googlegroups.com
Hi David,

Regarding "if the leader fails before any follower (or < majority followers) learns about this committed value, then it seems possible for that entry to be viewed as uncommitted, which means the next term leader could overwrite it; but that entry was already returned OK to client.", I think followers do not need to know the committed value immediately. Raft algorithm guarantees that only the nodes from majority group can be elected as new leader after old leader replicates the latest log to majority and then crashes, so the entry will never be overwritten.

Log entry may be overwritten when old leader crashes at the time the latest log entry has not  been store on a majority of nodes. To avoid consistent issue, leader does commits log entries from previous terms even though these logs have been stored on a majority.

Keine Neco <neco...@gmail.com> 于2019年10月31日周四 上午10:37写道:

David Kao

unread,
Oct 31, 2019, 8:32:44 AM10/31/19
to raft...@googlegroups.com
Hi Keine,

I may misunderstand the statement "there are at least 3 nodes contains this committed entry (one is the failed leader)" - I want to emphasize on the term "committed". My read of raft is that there is a difference between having the (uncommitted) in log vs. having the latest (committed) log index advance (basically, follower knowing the entry in its log is committed).

I fully understand if the entry is marked committed in a majority of followers, then sure, during election, another majority will at least contain one follower containing that latest committed entry. My problem was between broadcasting the committed value to the majority of follows (the 2nd round trip).

There *is* a difference between a mere appendEnries & a 2nd call to tell followers about the entry being committed, right? I don't think you can tell the follower "append and also mark committed" because the leader doesn't know the append has hit a majority yet.

David Kao


David Kao

unread,
Oct 31, 2019, 8:55:19 AM10/31/19
to raft...@googlegroups.com
Hi Changqing Li,

A little hard to understand exactly what you are trying to say, especially regarding "leader does commits log entries from previous terms even though these logs have been stored on a majority."

I'd agree that uncommitted log entries can be overwritten.

The question is how a leader commit log entries from previous terms. But Raft paper specifically says, on page 9,

"Raft never commits log entries from previous terms by count-ing replicas. Only log entries from the leader’s current term are committed by counting replicas"

On the other hand, Figure 8 would say

"log entry from term 2 has been replicated on a majority of the servers, but it is not committed."

So, besides current leader counting majority replicas, and THEN deciding it is committed, how else can a new leader decide a previous term replicas committed? As far as I know, Raft says leader doesn't decide uncommitted entry to be committed from previous terms.

David Kao


Ed M

unread,
Oct 31, 2019, 9:31:04 AM10/31/19
to raft-dev
David, 

Hey folks!

I agree the abridged raft paper is a little hard to follow in this area. 

We have to remember our guarantees:
Leader Completeness Property
Log Matching Property

First: A candidate can't transition to leader unless it has the most up-to-date replication log (or the longer of the two as a tiebreaker). 
Second: If two logs contain an entry w/ same matching term, then the logs are considered equal up to that point, which means they all have the same entries. 

The paper suggests that it doesn't commit from previous terms by counting replicas. It commits them based on proof by induction from leader completeness and log matching in combination.


On Thursday, October 31, 2019 at 8:55:19 AM UTC-4, David Kao wrote:
Hi Changqing Li,

A little hard to understand exactly what you are trying to say, especially regarding "leader does commits log entries from previous terms even though these logs have been stored on a majority."

I'd agree that uncommitted log entries can be overwritten.

The question is how a leader commit log entries from previous terms. But Raft paper specifically says, on page 9,

"Raft never commits log entries from previous terms by count-ing replicas. Only log entries from the leader’s current term are committed by counting replicas"

On the other hand, Figure 8 would say

"log entry from term 2 has been replicated on a majority of the servers, but it is not committed."

So, besides current leader counting majority replicas, and THEN deciding it is committed, how else can a new leader decide a previous term replicas committed? As far as I know, Raft says leader doesn't decide uncommitted entry to be committed from previous terms.

David Kao


On Thu, Oct 31, 2019 at 2:15 AM Changqing Li <lich...@gmail.com> wrote:
Hi David,

Regarding "if the leader fails before any follower (or < majority followers) learns about this committed value, then it seems possible for that entry to be viewed as uncommitted, which means the next term leader could overwrite it; but that entry was already returned OK to client.", I think followers do not need to know the committed value immediately. Raft algorithm guarantees that only the nodes from majority group can be elected as new leader after old leader replicates the latest log to majority and then crashes, so the entry will never be overwritten.

Log entry may be overwritten when old leader crashes at the time the latest log entry has not  been store on a majority of nodes. To avoid consistent issue, leader does commits log entries from previous terms even though these logs have been stored on a majority.

Keine Neco <neco...@gmail.com> 于2019年10月31日周四 上午10:37写道:
Hi David,

At 5th stage, e.g. this is a 5 nodes Raft group, there are at least 3 nodes contains this committed entry (one is the failed leader), so there are just two nodes which don't contains this entry. The majority can't be organized by only two node, so the new majority have to contains at least one node with this committed entry (But this node may think it is not committed now.). this entry contains maximum log term and log index, so this node must be elected as new leader. And its all uncommitted entries will be committed after its first no-op entry committed.

To suffer twice RTT is not so reasonable for Raft to achieve a lower latency.

I think you can read the election part again to get more details. Thanks for your question.

Dong.

David Kao <a.l...@gmail.com> 于2019年10月31日周四 上午8:02写道:
Has anyone ever had similar unease feeling regarding how a entry is considered committed in Raft?

Here is what I found in "In Search of an Understandable Consensus Algorithm (Extended Version)" after about 10 pages.

Let's say:
  • Leader append an entry to majority of followers upon receiving a write request from client
  • leader considers the entry committed after majority has the entry in its log; but only he knows it is committed.
  • leader returns to client OK <--- this is how I read the paper.
  • but only in next heartbeat (appendEntries) that clients start to learn about a log index being committed <-- only the current term leader decides if an entry in its term is committed; subsequent term leader cannot decide on entries of the previous term.
  • if the leader fails before any follower (or < majority followers) learns about this committed value, then it seems possible for that entry to be viewed as uncommitted, which means the next term leader could overwrite it; but that entry was already returned OK to client.
    • remember, having an entry in the Raft log doesn't mean it is committed; the highest committed index needs to increment
  • To fix the above, the leader may have to broadcast the committed entry (by incrementing highest committed log index) to a majority before returning to client. That way, whoever has the latest committed value can be elected the new leader (leader completeness).
    • But this makes the algorithm a naive Paxos, taking two rounds (append actual entry to Raft log & then telling followers it was committed, before returning OK to client)
Raft is so widely received I doubt it has any bugs, so can someone point out where I understood the algorithm to be wrong? What do you think?

Alternatively, my understanding of the section "committing entries from previous terms" might be wrong w.r.t how leaders are elected ... but that's why I am writing this post.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft...@googlegroups.com.
Message has been deleted

David Kao

unread,
Oct 31, 2019, 7:46:54 PM10/31/19
to raft-dev
(edited and reposted)

Hi Ed,

Thanks for agreeing that the abridged paper is at least insufficient in explaining the details.

However, I get leader completeness and log matching property. I did read through the proof by contradiction (section 5.4.3 Safety argument). The whole argument is surrounded around the idea of "committed" entries, which is largely similar to the idea of "accepted" value in Paxos.

The problem is, if you follow my original email (first email of this thread), how is an entry determined to be committed?

I'd appreciate if anyone can point out, using my example, which step I understood it to be wrong.

John Ousterhout

unread,
Oct 31, 2019, 7:48:18 PM10/31/19
to raft...@googlegroups.com
An entry is committed in either of two ways:

1. The entry is accepted by a majority of the servers in the cluster *in the same term as the entry* (if the leader crashes before receiving the results of those AppendEntries calls, it's possible that no one in the cluster will know that the entry is committed, but it is indeed committed; any future leader is guaranteed to store that entry, and it will finish propagating it to the rest of the cluster, if needed).

2. An entry in the same log, but with higher index, is committed.

Your error is in your fourth step: "subsequent term leader cannot decide on entries of the previous term." This is not true. A subsequent leader can commit entries in earlier terms; the way it does this is by committing a new entry in the current term, after which rule 2 above applies.

-John-

On Thu, Oct 31, 2019 at 4:38 PM David Kao <a.li...@gmail.com> wrote:
Hi Ed,

Thanks for agreeing that the abridged paper is at least insufficient in explaining the details.

However, I get leader completeness and log matching property. I did read through the proof by contradiction (section 5.4.3 Safety argument). The whole argument is surrounded around the idea of "committed" entries, which is largely similar to the idea of learned value in Paxos.

The problem is, if you follow my original email (first email of this thread), how is an entry determined to be committed?

I'd appreciate if anyone can point out, using my example, which step I understood it to be wrong.

On Thursday, October 31, 2019 at 6:31:04 AM UTC-7, Ed M wrote:
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/f24a6315-c67a-4b0a-995b-afe49b7ad03f%40googlegroups.com.

Changqing Li

unread,
Oct 31, 2019, 10:54:39 PM10/31/19
to raft...@googlegroups.com
Hi David,

Regarding "if the leader fails before any follower (or < majority followers) learns about this committed value, then it seems possible for that entry to be viewed as uncommitted,", I think you may misunderstand the meaning of "Committed". In latest term, as long as entry is stored by a majority of servers, the entry can be regarded as 'Committed". ''Committed" means that the log will never be overwritten and leader can apply it to State Machine and return to client. It does not mean followers learned the committed value. Followers do not have to explicitly receive a message which indicates the committed value when a entry is committed. 

Regarding how a leader commits log entries from previous terms, actually new leader does not decide whether a previous term log entry should be committed or not directly. New leader just replicates the log entries of previous terms. When a log entry of current term are stored on a majority, the previous entries must have been stored on a majority. When new leader commits a log entry of current term, the log entries of previous term are committed indirectly. When a log entry of current term stored on a majority, it is guaranteed that this log entry and entries before that will not be overwritten. Otherwise, it is not guaranteed. 

You may ask, if no client submitts log entry in new term, does it mean the entries of previous term will never be committed? Dong gave you the answer that new leader can launch a no-op entry at the beginning of a new term to trigger the replication of previous log entries, which is also mentioned in <<CONSENSUS: BRIDGING THEORY AND PRACTICE>> of Diego.

Hope that helps.

David Kao

unread,
Nov 2, 2019, 6:26:08 PM11/2/19
to raft-dev
Thanks John! I think I get it now. Raft is brilliant!


For future readers, this is how I understood it:

Regarding "Only log entries from the leader’s current term are committed by counting replicas", figure 8 of the paper is really the key.
  • when an entry is present in the majority of the servers, it matters how it arrived at the majority, and which term
    • If the original leader only replicated it to a minority when it received the request, that means the original leader could not return OK to client
      • to the client as well as server, the status of that write is unknown & could go either committed or overwritten in the future.
      • there is still a majority server out there that could put in a different entry at the same log index, with higher term; see (b) of figure 8
        • higher term is key because regardless of replication, the term number always monotonically increases in every round of elections. Election requires majority, so some server always remembers the highest term number from the last around and will increment it.
      • the rest is all in figure 8, I'll try to add my own explanations (hopefully giving out the right information):
    • If the current term leader is able to replicate to the majority an entry of current term (including a no op entry, if only to commit the uncommitted entries before it, see Changqing Li's recent response), that entry (and the entries before it) is considered committed by current leader, and will be committed by all future leaders if somehow the current leader crashes before telling others to move the highest committed index.
      • why? because current term is obviously the highest current term number so far; and any candidate with the highest term number entry in the log will win the election; election requires a majority; someone from these candidates containing the highest term entry will be in some majority set to win that election.

Wow it is a mouthful. Hope I don't confuse people even more.

Raft is brilliant; I can see where Paxos concepts are scattered in all of this, yet it is for a mutli-round/multi-log-entries; I can't find the right words to describe how beautiful this is.

Thanks all! I am just really happy I got it.


On Thursday, October 31, 2019 at 4:48:18 PM UTC-7, John Ousterhout wrote:
An entry is committed in either of two ways:

1. The entry is accepted by a majority of the servers in the cluster *in the same term as the entry* (if the leader crashes before receiving the results of those AppendEntries calls, it's possible that no one in the cluster will know that the entry is committed, but it is indeed committed; any future leader is guaranteed to store that entry, and it will finish propagating it to the rest of the cluster, if needed).

2. An entry in the same log, but with higher index, is committed.

Your error is in your fourth step: "subsequent term leader cannot decide on entries of the previous term." This is not true. A subsequent leader can commit entries in earlier terms; the way it does this is by committing a new entry in the current term, after which rule 2 above applies.

-John-

Reply all
Reply to author
Forward
0 new messages