Can we make the client do the quorum check to reduce message delay?


Jinkun Geng

Jul 28, 2022, 3:24:50 PM
to raft-dev
Hi, Raft guys.

I am considering the following modification to Raft:
We know Raft needs four message delays to commit a request: client->leader->followers->leader->client

During the third message delay, the leader performs the quorum check, so after the third message delay the leader already knows the request is committed, but it still needs one additional message delay to let the client know.

I am wondering: what if the client could also do the quorum check, or do it in parallel with the leader?

1st message delay: the client sends the request to the leader.
2nd message delay:
        (1) The leader sends an acknowledgment reply to the client; the reply includes the leader's state information (i.e., the term and the sequence number of the entry).
        (2) The leader also multicasts the request to the followers.

3rd message delay:
        (1) Each follower sends a reply to the leader.
        (2) Each follower also sends a reply to the client.

So after 3 message delays, the client can do the quorum check itself and learn whether the request is committed (a minimal sketch of such a check is below).
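
Here is a minimal Go sketch of such a client-side quorum check; the message type and field names are assumptions for illustration, not part of Raft itself:

```go
package main

import "fmt"

// Reply is what each replica (leader or follower) sends to the client in
// this hypothetical scheme: the term and log index it accepted for the
// request. All names here are made up for illustration.
type Reply struct {
	From  int    // replica ID
	Term  uint64 // replica's current term
	Index uint64 // log index assigned to the entry
}

// committed reports whether replies from distinct replicas that agree on
// (term, index) form a majority of the 2f+1 replicas, i.e. at least f+1.
func committed(replies []Reply, f int) bool {
	seen := make(map[[2]uint64]map[int]struct{})
	for _, r := range replies {
		key := [2]uint64{r.Term, r.Index}
		if seen[key] == nil {
			seen[key] = make(map[int]struct{})
		}
		seen[key][r.From] = struct{}{}
		if len(seen[key]) >= f+1 {
			return true
		}
	}
	return false
}

func main() {
	// f = 1 (3 replicas): two matching replies form a majority.
	replies := []Reply{{From: 0, Term: 2, Index: 7}, {From: 1, Term: 2, Index: 7}}
	fmt.Println(committed(replies, 1)) // true
}
```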


I know that the above design places a heavier burden on clients and also increases the number of messages for the replicas. Set that aside for now; let's talk only about the number of message delays. I am simply asking whether the 3-message workflow can work, i.e., can we reduce the latency from 4 message delays to 3 without violating correctness? If it violates correctness, can you give me an explanation or an error trace?


Thanks,
Jinkun

Vilho Raatikka

Jul 28, 2022, 3:50:46 PM
to raft...@googlegroups.com
Hi Jinkun,

I think you'd have to describe how to handle cases where, for any reason, the client receives a different number of responses from the followers than the leader does.

This could happen, say, if a follower's network fails between sending its response to the leader and sending it to the client.

Regards 

Vilho

Archie Cobbs

Jul 28, 2022, 3:53:57 PM
to raft-dev
Specifically, if the client receives all the messages and the leader receives none (or not enough).

In that case the modification would not actually be committed on the leader, but the client would incorrectly assume that it is.

A network disruption at this point could cause the leader to be deposed, etc., and that transaction might never actually be committed.

-Archie

John Ousterhout

Jul 28, 2022, 4:19:18 PM
to raft...@googlegroups.com
Actually, once a majority of followers have stored the entry locally and replied to the leader, the entry is (or will eventually be) committed, even if the leader receives none of the followers' responses. So I think a mechanism like the 3-step one might work, at least for some purposes.

However, a problem with this approach is that after 3 message delays the operation has not yet been applied to the state machine: it has only been committed to the log. Thus, if the client needs a result produced by applying the operation to the state machine, then that has to happen on the leader before a response is sent to the client.
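
A minimal sketch of the distinction John describes, loosely following the Raft paper's commitIndex/lastApplied variables; the types below are hypothetical:

```go
package sketch

// StateMachine consumes committed commands and produces results.
type StateMachine interface {
	Apply(cmd []byte) []byte
}

// Server is a hypothetical slice of replica state.
type Server struct {
	log         [][]byte // index 0 unused; entries start at index 1
	commitIndex int      // highest log index known to be committed
	lastApplied int      // highest log index applied to the state machine
	sm          StateMachine
}

// applyCommitted applies entries that are committed but not yet applied.
// Only after this step does a result for the client's operation exist, so
// a client-side quorum check alone (commit) does not yield the result.
func (s *Server) applyCommitted() {
	for s.lastApplied < s.commitIndex {
		s.lastApplied++
		s.sm.Apply(s.log[s.lastApplied])
	}
}
```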

-John-

Jinkun Geng

Jul 28, 2022, 4:24:35 PM
to raft-dev
Thanks John.
Continuing with your point: what if we make the leader execute the request immediately and include the result in its reply, and require that the client's quorum check include the leader's reply?
See the prototype I described in my last email in reply to Archie.

Jinkun Geng

Jul 28, 2022, 4:28:08 PM
to raft-dev
I think the closest thing to what I describe here is NOPaxos: https://www.usenix.org/system/files/conference/osdi16/osdi16-li.pdf

If we let the leader replica of Multi-Paxos/Raft act as the sequencer, it should be exactly the fast path of NOPaxos.

Vilho Raatikka

Jul 28, 2022, 5:47:14 PM
to raft...@googlegroups.com
Hi Jinkun, 

the leader and the client may end up with different response counts, even for quite a long period of time, but eventually they should converge; at the very least, they shouldn't end up with conflicting results. So, as mentioned already, it should be OK.

Surely it'd be costly with 10k clients, but with a small number of them I don't see a problem with the increased communication.

Interesting idea, though.

Regards

Vilho



Jinkun Geng

Jul 28, 2022, 7:39:28 PM
to raft-dev
Thanks, Vilho.

As for the potential concern, I think it is not about network traffic (communication) but about the CPU overhead added to the client. Previously the leader did the quorum check; now it becomes the client's burden, which matters especially when the client is not beefy.

Speaking of communication, I actually think it can get better. Previously, every time a follower received the request broadcast from the leader, it sent a reply back to the leader, so the leader absorbed the traffic from all the followers. Now each follower sends its reply directly to the client (of course, the follower should later also let the leader know, but that is no longer on the critical path, so it can batch its replies to the leader), which relieves the network bottleneck at the leader. In short, I believe offloading the quorum check can commit a request in 3 message delays; it adds CPU burden to the clients but avoids a communication bottleneck at the leader. If all clients are beefy enough, that should not be a problem.

Another concern, raised after discussion with some colleagues:

If leader changes (called view changes in Viewstamped Replication) are very frequent, then the ousted leader needs to abandon its state and retrieve state from the new leader, because its state may have diverged from the other replicas. A similar problem and solution are briefly discussed in the NOPaxos technical report:
"                                                
The only rare case when a replica will execute an operation that is not eventually committed is if a functioning leader is incorrectly replaced through a view change, losing some operations it executed. Because this case is rare, it is reasonable to handle it by having the ousted leader transfer application state from another replica, rather than application-level rollback.               
 "

John Ousterhout

Jul 28, 2022, 8:12:22 PM
to raft...@googlegroups.com
This sounds reasonable, but I'd have to think about it a lot more before having confidence. One of the big lessons that distributed systems keep teaching me (often the hard way) is that ideas that initially seem reasonable don't always turn out to be correct...

-John-

Oren Eini (Ayende Rahien)

Jul 29, 2022, 2:00:30 AM
to raft...@googlegroups.com
I really dislike this idea, because this _only_ handles the recording of the message in the log.

Consider the simplest scenario: you have a calculator implemented with Raft, and it supports an "Increment" operation.

The followers may send you their confirmations, and on the client side you can tell that this is committed, but you won't know the _result_.

For this example, it is cheap to retain the entire state and send the "if it were committed, the value would be X". 

However... consider what happens when the command is "ReserveUniqueUsername".
The size of the state you need can be quite large (the total number of users), so it isn't feasible to track it client side.
The followers could still send "speculative" results, of course, but the result of the command may be quite large, and speculative execution on every follower in this manner is costly.





--
Oren Eini
CEO   /   Hibernating Rhinos LTD
Skype:  ayenderahien
Support:  sup...@ravendb.net
  

Jinkun Geng

Jul 29, 2022, 2:05:05 AM
to raft-dev
Hi, Ayende.

My idea is: the leader replica immediately executes the request and replies to the client, and the followers also reply to the client.

During the quorum check, the client must ensure the leader's reply is included in the quorum, along with replies from f followers (which do not carry an execution result).
In this way, the client can get the result of the committed request in 3 message delays (a minimal sketch is below).
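
A minimal sketch of this quorum rule, assuming hypothetical LeaderReply/FollowerAck message types; the quorum must contain the leader's reply, since only the leader executed the request:

```go
package sketch

// LeaderReply carries the leader's speculative execution result along
// with the (term, index) it assigned to the request.
type LeaderReply struct {
	Term, Index uint64
	Result      []byte // result of speculatively applying the command
}

// FollowerAck says a follower appended the entry at (term, index).
type FollowerAck struct {
	From        int
	Term, Index uint64
}

// resultIfCommitted returns the leader's result once the client holds the
// leader's reply plus matching acks from at least f distinct followers.
// The leader's reply is mandatory, so the quorum always contains the one
// replica that actually executed the request.
func resultIfCommitted(lr *LeaderReply, acks []FollowerAck, f int) ([]byte, bool) {
	if lr == nil {
		return nil, false // no quorum without the leader's reply
	}
	matching := make(map[int]struct{})
	for _, a := range acks {
		if a.Term == lr.Term && a.Index == lr.Index {
			matching[a.From] = struct{}{}
		}
	}
	if len(matching) >= f {
		return lr.Result, true
	}
	return nil, false
}
```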

Jinkun Geng

Jul 29, 2022, 2:08:35 AM
to raft-dev
In other words, the leader is the first to know the execution result, because it executes the request immediately after receiving it. But even so, the leader is not the first to know whether the request is committed.
The client is the first to both know the execution result and know that this result is the committed result.

Of course, the leader eventually syncs with the other followers and also learns that the request is committed, but that happens in the background, not on the critical path of the 3 message delays.

Vilho Raatikka

Jul 29, 2022, 6:49:51 AM
to raft...@googlegroups.com
Hi Jinkun,

out of curiosity, what kind of clients do you have in mind? This may be due to my limited imagination, but I don't see how the quorum check would become an issue CPU-wise. Sure, each incoming response must be handled, and a thread, memory+cache, and CPU are involved; with high throughput the accumulated load might be substantial.

Having a database background, I'd expect large memory consumption caused by large results to be a potential concern, assuming the client stores them longer than it otherwise would need to, especially in cases where, due to a failure, the quorum checks at the client and the leader are out of sync.

The other issue you refer to is something I need to read up on. In my first response I had a gut feeling that configuration changes might cause issues, but I couldn't quickly find any problematic cases. Maybe reading the NOPaxos paper will help. The answer to "what could possibly go wrong" in distributed processing is often interesting.

Regards

Vilho


Oren Eini (Ayende Rahien)

Jul 29, 2022, 7:06:05 AM
to raft...@googlegroups.com
So the pattern is:

* The leader executes the command to compute the result
* The leader sends the reply and the command's position in the log to the client
* The leader broadcasts the command to the followers
* The followers broadcast their acceptance as well
* The client listens to the follower broadcasts and computes the majority directly


The cost here is that you're doing the waiting on the client, and assuming your network topology allows it, you can save a network trip.
However, that is a really strange topology, to be honest, where clients can receive broadcasts from all the nodes.
It is also more fragile, since if the followers aren't able to broadcast to the client, it can't make progress.

Another issue here is the handling of edge cases.
Consider a leader failure after it sends the result back but before the commit. Another follower is elected, and then that record is _removed_.

Or it may be committed at some far future time.

Most worryingly, there is the rule about only committing entries from previous terms after committing a no-op entry from the current term.
How do you intend to handle that? Because otherwise, there are sequences of operations that would show a commit of the message (a majority of the nodes) when it isn't actually committed.
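
For reference, the rule in question (Raft paper, §5.4.2) can be sketched as follows; the leader state here is hypothetical:

```go
package sketch

// Entry is a log entry; only its term matters for the commit rule.
type Entry struct{ Term uint64 }

// Leader is a hypothetical slice of leader state.
type Leader struct {
	currentTerm uint64
	log         []Entry // log[i] is the entry at index i
	matchIndex  []int   // highest index known replicated on each follower
}

// canCommit reports whether the leader may advance commitIndex to idx.
// A Raft leader only commits by counting replicas for entries from its
// *current* term; prior-term entries become committed indirectly once a
// current-term entry (often a no-op) commits.
func (l *Leader) canCommit(idx int) bool {
	if l.log[idx].Term != l.currentTerm {
		return false // never commit prior-term entries by counting
	}
	replicated := 1 // the leader's own copy
	for _, m := range l.matchIndex {
		if m >= idx {
			replicated++
		}
	}
	n := len(l.matchIndex) + 1 // cluster size 2f+1
	return replicated*2 > n    // strict majority
}
```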



John Ousterhout

Jul 29, 2022, 11:26:12 AM
to raft...@googlegroups.com
One other issue with Jinkun's scheme is that it increases overhead. Each follower must now send 2 response messages: one to the leader and one to the client. This additional overhead will reduce the throughput of the cluster.

-John-

Jinkun Geng

Jul 29, 2022, 11:33:04 AM
to raft-dev
It is true that each follower now sends 2 responses: one to the client and one to the leader. But the one to the leader can be batched without adding latency. E.g., every time a follower receives a request, it immediately sends its reply to the client; in the background, it sends a batched reply to the leader periodically (say, every 1 second). The leader learns the commit status later, but the client needs only 3 message delays to learn the commit result. So this does not reduce throughput; instead, it reduces the leader's burden and should benefit the overall throughput of the cluster (a minimal sketch is below).
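
A minimal sketch of this batching scheme, with hypothetical types and send functions:

```go
package sketch

import (
	"sync"
	"time"
)

// Ack identifies an appended entry by (term, index).
type Ack struct{ Term, Index uint64 }

// Follower holds acks that have not yet been flushed to the leader.
type Follower struct {
	mu      sync.Mutex
	pending []Ack
}

// onAppend is called when the follower appends an entry: it replies to
// the client immediately (critical path) and queues the ack for the
// leader (background path).
func (f *Follower) onAppend(a Ack, sendToClient func(Ack)) {
	sendToClient(a)
	f.mu.Lock()
	f.pending = append(f.pending, a)
	f.mu.Unlock()
}

// runBatcher periodically flushes queued acks to the leader, off the
// critical path of client-visible commit.
func (f *Follower) runBatcher(every time.Duration, sendToLeader func([]Ack)) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for range ticker.C {
		f.mu.Lock()
		batch := f.pending
		f.pending = nil
		f.mu.Unlock()
		if len(batch) > 0 {
			sendToLeader(batch)
		}
	}
}
```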

Best
Jinkun


Jinkun Geng

Jul 29, 2022, 11:42:15 AM
to raft-dev
Related to Ayende's concerns:

1. "since if the followers aren't able to broadcast to the client, it can't make progress"
It is true that the client cannot make progress if ENOUGH (f) of the followers cannot broadcast to the client, but why should we expect that they cannot? By the same logic: (1) if the followers cannot send replies to the leader, the leader cannot make progress either; (2) if the leader cannot send its reply to the client, the client cannot make progress either. Why does nobody worry about those two cases?
Besides, in etcd's implementation, both the leader and the followers are exposed to clients.

2. Consider a leader failure after it sends the result back but before the commit. Another follower is elected, and then that record is _removed_. Or it may be committed at some far future time.

I mentioned this in a previous email; copying it here:

Another concern, raised after discussion with some colleagues:

If leader changes (called view changes in Viewstamped Replication) are very frequent, then the ousted leader needs to abandon its state and retrieve state from the new leader, because its state may have diverged from the other replicas. A similar problem and solution are briefly discussed in the NOPaxos technical report:
"The only rare case when a replica will execute an operation that is not eventually committed is if a functioning leader is incorrectly replaced through a view change, losing some operations it executed. Because this case is rare, it is reasonable to handle it by having the ousted leader transfer application state from another replica, rather than application-level rollback."

3. Most worryingly, there is the rule about only committing entries from previous terms after committing a no-op entry from the current term. How do you intend to handle that? Because otherwise, there are sequences of operations that would show a commit of the message (a majority of the nodes) when it isn't actually committed.

Based on point 2, that will not happen: there are no "sequences of operations that would show a commit of the message (a majority of the nodes) when it isn't actually committed".

The rule still holds: entries committed in a previous term/view remain committed in future terms/views.

Jinkun Geng

Jul 29, 2022, 11:51:16 AM
to raft-dev
Related to Vilho's question:
out of curiosity, what kind of clients do you have in mind? This may be due to my limited imagination, but I don't see how the quorum check would become an issue CPU-wise. Sure, each incoming response must be handled, and a thread, memory+cache, and CPU are involved; with high throughput the accumulated load might be substantial.

Suppose the client is an open-loop client and its CPU is not beefy, i.e., it has only 1 thread for the quorum check, and it wants to submit at a high rate, say 100K req/sec.

Previously, for each request to commit, the client only needed to receive one reply, from the leader. Now, for each request, it receives replies from all 2f+1 replicas and must check for a quorum of f+1 matching replies.

Suppose you have 2f+1 = 9 replicas: every request now leads to 9 replies for the client to handle. If you submit 100K req/sec, you receive 900K replies/sec, and with only 1 thread the client becomes CPU-bound.


The story is like this: previously, Raft/Multi-Paxos had a leader bottleneck, because the leader's burden was much heavier. Now the leader's burden is removed, but it does not disappear; it just migrates to the client.


2. Having a database background, I'd expect large memory consumption caused by large results to be a potential concern, assuming the client stores them longer than it otherwise would need to, especially in cases where, due to a failure, the quorum checks at the client and the leader are out of sync.

Clients do not need to hold results for long. If a client finds a request has been outstanding for a long time, it just retries or abandons it.

3. The other issue you refer to is something I need to read up on. In my first response I had a gut feeling that configuration changes might cause issues, but I couldn't quickly find any problematic cases. Maybe reading the NOPaxos paper will help. The answer to "what could possibly go wrong" in distributed processing is often interesting.

True, totally agree. Actually, I posted this discussion because we are designing a similar mechanism. One of my collaborators is worried, but he cannot find an error trace to prove it violates correctness, so I am putting it here to see whether anybody can come up with an error trace showing that the quorum offloading is problematic.