Confusion regarding majority appends and retries

Arya

unread,

Apr 27, 2021, 3:28:37 PM4/27/21

to raft-dev

Hi all,

While working on the raft implementation I am a little perplexed by the following statement in the paper.

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

It seems like the paper is suggesting spawning separate AppendEntry request threads and have a countdown latch (with the majority count) that is decremented upon a successful completion of the AppendEntry request. Once the majority count has reached the leader applies the command to the resident state machine and then responding to the client with an ACK.

My question is regarding the subset of threads from the original request that are still in a retry loop or in a delayed state due to either network/server failures, network flaps or in a state of maintaining a log-matching property. If we are keeping those threads alive, what happens when a subsequent client request arrives at the node? Do we cancel the execution of the threads from the previous request before re-spawning a new set of requests from the leader?

Here is an example:

We have a set of 5 nodes: [A, B, C, D, E]

' A' is the leader.

Here's the chronology of events:

1. Client request C1 is intercepted by 'A'.

2. A appends the entry to its own log.

3. A initializes a counting semaphore of size 3 (5 nodes, the majority will be 3)

4. A spawns 4 RPC threads , [RPC-B, RPC-C, RPC-D, RPC-E]

5. [RPC-B,RPC-C, RPC-D] completes, decrementing the semaphore upon each completion.

6. A meets the majority requirement and commits the entry.

7. A returns an OK response for client request C1. [RPC-E] is still running in the background.

8. Client request C2 is intercepted by 'A'

9. Same as steps 2 and 3.

10. Spawn 4 RPC threads for B, C, D and E. (Since E is still in progress, do we dispatch another thread for E???)

It will be great if somebody could clarify what I am missing here.

Thanks

Archie Cobbs

unread,

Apr 27, 2021, 3:55:34 PM4/27/21

to raft-dev

Hi Arya,

Strictly speaking, whether to "dispatch a thread" is an implementation detail. From the outside, looking at your leader A, you just see messages being emitted without any notion of what "thread" they came from. There is no notion of a "thread", only messages passing back & forth at certain times.

So the "Raft" question here is how should the leader behave (looking at it like a black box) in this situation? The answer is that when the leader sends an AppendRequest, it always sends the most up-to-date information that it has at that moment.

So in your example, the next message A sends to E should incorporate both log entries (i.e., the most recent one added at step 9).

-Archie

Oren Eini (Ayende Rahien)

unread,

Apr 27, 2021, 4:16:47 PM4/27/21

to raft...@googlegroups.com

I tried to answer this here:

https://ayende.com/blog/193762-A/practical-considerations-for-implementing-raft?key=95de157709f94d2cb792d0b6eee4f9c1

Basically, you aren't spawning a separate thread per entry in the log, those threads are long lived.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/1e9c13cd-b7e3-4cf9-ad51-cf1aa6114c11n%40googlegroups.com.

Arya

unread,

Apr 27, 2021, 6:21:25 PM4/27/21

to raft-dev

Archie/Ayende,

Thanks for your quick response. I think I over-indexed a bit too much on the implementation detail while missing out on the crux of the protocol which is fundamentally based on message passing. The explanations seem to imply that the order in which the messages are being sent by the leader is not important. In my example, the previous AppendEntry call (C1) being stuck in a retry loop for one of the followers (E) upon the arrival of a new AppendEntry call (C2) is a non-issue. The protocol is designed to be self-correcting even for out of order messages originating from the leader since the log-matching will take care of any gaps in the write as long as the invariants for LM are enforced on the receiver end.

This essentially means (I am once again delving a bit into a pseudo-implementation), that the leader (thread/worker/whatever) can maintain a retry buffer for all the AppendEntry payloads that "failed" to "append". The order in which the client requests are queued up is irrelevant since the protocol will take care of any gaps in the log due to out of order arrivals in an idempotent manner. The leader can keep retrying all the payloads in its retry buffer for as long as it is a "leader" without any side-effects.

I hope I am on the right track here.

Archie Cobbs

unread,

Apr 27, 2021, 9:44:05 PM4/27/21

to raft-dev

Hi Arya,

On Tuesday, April 27, 2021 at 5:21:25 PM UTC-5 Arya wrote:

This essentially means (I am once again delving a bit into a pseudo-implementation), that the leader (thread/worker/whatever) can maintain a retry buffer for all the AppendEntry payloads that "failed" to "append". The order in which the client requests are queued up is irrelevant since the protocol will take care of any gaps in the log due to out of order arrivals in an idempotent manner. The leader can keep retrying all the payloads in its retry buffer for as long as it is a "leader" without any side-effects.

That sounds reasonable. FWIW the way my implementation works there are no threads or queues, instead it is written like a giant state machine (with timers). Each time you need to send a message (e.g., retry timer expires) it is created from scratch based on the current state. Each time you receive a message you update your current state accordingly, and this may in turn trigger further internal events (e.g., that cause you to apply a log entry to the state machine). It's somewhat tedious to implement this way but from another perspective it's easier to understand and debug because the logic is "flat" so to speak. You just need to be able to answer the question, "What do I need to do now?"

-Archie

Supratik Chakraborty

unread,

May 4, 2021, 2:29:52 PM5/4/21

to raft...@googlegroups.com

Thanks for the feedback. This was very helpful.

--
You received this message because you are subscribed to a topic in the Google Groups "raft-dev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/raft-dev/PnvThDlMczU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/983f957d-3eb3-43ef-9f64-b62938010322n%40googlegroups.com.

Reply all

Reply to author

Forward