what will happend to replicated but uncommited logs in raft protocol

raftwalking

unread,

Jan 8, 2016, 4:57:58 AM1/8/16

to raft-dev

supposed a 3-member raft cluster a[leader],b,c

client send command to a, a replicate it to b and c, a apply the log to the status machine and response to client, then crash before replicate the committed state to b and c.

b replaces a to be the cluster leader. then what will happen to the uncommitted log while the log has been respond to the client? will it be replicated by b again or just discarded?

then supposed a 4-member raft cluster a[master],b,c,d

client send command to a, a replicate it to b and c(no d), a apply the log to the status machine and response to client, then crash before replicate the committed state to b and c and d.

d replaces a to be the cluster leader. then what will happen to the uncommitted log while the log has been respond to the client? will it be just discarded?

Oren Eini (Ayende Rahien)

unread,

Jan 8, 2016, 5:27:33 AM1/8/16

to raft...@googlegroups.com

Because the log is in a majority of the cluster, a leader crush will not cause the data to be lost.

The next leader will commit those entries.

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

raftwalking

unread,

Jan 8, 2016, 6:13:25 AM1/8/16

to raft-dev

does it mean in the 4-member cluster in my example, d can never be promoted to be the leader?

在 2016年1月8日星期五 UTC+8下午6:27:33，Ayende Rahien写道：

Oren Eini (Ayende Rahien)

unread,

Jan 8, 2016, 6:53:00 AM1/8/16

to raft...@googlegroups.com

D cannot be leader, correct

John Ousterhout

unread,

Jan 8, 2016, 12:02:15 PM1/8/16

to raft...@googlegroups.com

In your first example:

"client send command to a, a replicate it to b and c, a apply the log to the status machine and response to client, then crash before replicate the committed state to b and c."

It sounds like you are assuming that the leader has to replicate additional state (the "committed state"?) after it has applied a log entry to its state machine. This is not the case. The only thing replicated is the log entry describing the state machine operation (which doesn't occur until after the entry has been replicated). Once this entry has been replicated, the state machine changes can be reproduced by simply replaying that log entry. There's no need to replicate anything extra once the state machine update has occurred.

-John-

On Fri, Jan 8, 2016 at 1:53 AM, raftwalking <bigwes...@gmail.com> wrote:

supposed a 3-member raft cluster a[master],b,c

client send command to a, a replicate it to b and c, a apply the log to the status machine and response to client, then crash before replicate the committed state to b and c.

b replaces a to be the cluster leader. then what will happen to the uncommitted log while the log has been respond to the client? will it be replicated by b again or just discarded?

then supposed a 4-member raft cluster a[master],b,c,d

client send command to a, a replicate it to b and c(no d), a apply the log to the status machine and response to client, then crash before replicate the committed state to b and c and d.

d replaces a to be the cluster leader. then what will happen to the uncommitted log while the log has been respond to the client? will it be just discarded?

jordan.h...@gmail.com

unread,

Jan 9, 2016, 10:07:00 PM1/9/16

to raft...@googlegroups.com

On Jan 8, 2016, at 1:53 AM, raftwalking <bigwes...@gmail.com> wrote:

supposed a 3-member raft cluster a[master],b,c

client send command to a, a replicate it to b and c, a apply the log to the status machine and response to client, then crash before replicate the committed state to b and c.

As was mentioned, each server has a separate state machine. All that needs to be sent to followers after the applied the entry to its state machine and responded is the commitIndex which will cause followers to apply entries up to that index to their own state machines.

b replaces a to be the cluster leader. then what will happen to the uncommitted log while the log has been respond to the client? will it be replicated by b again or just discarded?

When b becomes the leader it will commit the entry for which it never received the updated commitIndex from the prior leader. Indeed, part of the new leader's responsibilities when it becomes the leader is to ensure all entries it logged prior to the start of its term are stored on a majority of servers and then commit them. That means server B sends an AppendEntries RPC to C, verifies that C has all the entries prior to the start of leader B's term, then increases the commitIndex beyond the start of its term (usually by committing a no-op entry) and applies the entries left over from the prior term to its state machine.

then supposed a 4-member raft cluster a[master],b,c,d

client send command to a, a replicate it to b and c(no d), a apply the log to the status machine and response to client, then crash before replicate the committed state to b and c and d.

d replaces a to be the cluster leader. then what will happen to the uncommitted log while the log has been respond to the client? will it be just discarded?

D cannot win an election because B and C have more entries from the prior term than D. That means even if D times out and starts an election before B and C, neither B nor C will vote for D, thus no entries committed by the prior leader will be discarded.

raftwalking

unread,

Jan 10, 2016, 9:05:29 PM1/10/16

to raft-dev

this answer is what i'm looking for.

committing process is just like 2pc, except that the final commited state is attached on the next appendentries rpc with the leadercommit property.

and d never wins the election cause it's not so up-to-date than majority of this cluster.

Message has been deleted

jordan.h...@gmail.com

unread,

Mar 20, 2018, 2:27:13 PM3/20/18

to raft...@googlegroups.com

Yes this is described in the “read only queries” section of the Raft dissertation:

“If the leader has not yet marked an entry from its current term committed, it waits until it has done so. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. As soon as this no-op entry is committed, the leader’s commit index will be at least as large as any other servers’ during its term.”

So a leader should reject reads until an entry from its term has been applied to ensure it committed all prior entries. Then it still needs to do a round of heartbeats to ensure it’s still the leader.

On Mar 19, 2018, at 11:46 AM, yihao yang <yangyi...@gmail.com> wrote:

Hi, Jordan:

Based on your comments:

When b becomes the leader it will commit the entry for which it never received the updated commitIndex from the prior leader. Indeed, part of the new leader's responsibilities when it becomes the leader is to ensure all entries it logged prior to the start of its term are stored on a majority of servers and then commit them. That means server B sends an AppendEntries RPC to C, verifies that C has all the entries prior to the start of leader B's term, then increases the commitIndex beyond the start of its term (usually by committing a no-op entry) and applies the entries left over from the prior term to its state machine.

Every time a new term (leadership) begins, it is possible that a Read request will read a stale state machine. Right? Since the new leader needs to send and collect one round of heartbeat (AppendEntries). Does this means the before the first heartbeat round is finishing, no read is available?

Thanks,
Yihao

jordan.h...@gmail.com

unread,

Mar 20, 2018, 2:28:44 PM3/20/18

to raft...@googlegroups.com

Until an entry from its term has been committed* I mean, not applied (to the state machine).

chosen0ne

unread,

Apr 19, 2018, 5:50:27 AM4/19/18

to raft-dev

What about this situation:

a 5-node raft cluster a[leader], b, c, d, e. And current index and commit index for each node are:

a b c d e

current_idx n n n n n

commit_idx n n-1 n-1 n-1 n-1

A write request is issued to 'a'. 'a' is crashed after it send the log entry to 'd'. No other nodes received the entry except 'd'. And current index and commit index for each node are:

b c d e

current_idx n n n+1 n

commit_idx n-1 n-1 n n-1

Log entry indexed by n+1 isn't committed, as it's not replicated to the majority. In the implementation https://github.com/willemt/raft, it uses current index to check whether a candidate has up-to-date logs. Node 'd' which has a uncommitted entry has the largest current index, and it will win the leader election. This will make an uncommitted log entry to be visible, that is not expected.

Because of the asynchronous commitment(a AppendEntry RPC will make the log entry in the previous AppendEntry commit) in followers, commit index can't be used to check whether a candidate has up-to-date logs. I want to know if I misunderstanded? And how to avoid an uncommitted log entry to be visible?

Thanks!

David B Murray

unread,

Apr 19, 2018, 2:06:16 PM4/19/18

to raft...@googlegroups.com

As Jordan said:

> [A] leader should reject reads until an entry from its term has been [committed] to ensure it committed all prior entries.

D cannot serve reads until it has committed a (generally no-op) entry at index n+2. The process of doing so will also ensure the entry at n+1 becomes committed, so it is never visible before it is committed.

-d

To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+unsubscribe@googlegroups.com.

chosen0ne

unread,

Apr 23, 2018, 5:12:27 AM4/23/18

to raft-dev

As you said, the entry at n+1 will be committed if 'd' win the election. However, the client can't get a response(timeout) from 'a' when it issue the write request which generate the entry at n+1. The client will think that the request is failed. If 'd' win the election, the effect of a failed request will be visible. This is unexpected.

在 2018年4月20日星期五 UTC+8上午2:06:16，David Murray写道：

David B Murray

unread,

Apr 23, 2018, 9:53:40 AM4/23/18

to raft...@googlegroups.com

A timeout is not a guarantee that the request failed. For example, it’s always possible the request completed successfully from ‘a’s perspective, but the response to the client was lost on the network. A timeout means the client doesn’t know what happened, and needs to issue a read request if it wants to find out.

-d

Reply all

Reply to author

Forward