Fast Paxos in etcd-raft lib

Yifei Ma

Oct 11, 2023, 1:43:49 PM
to etcd-dev
Hi etcd dev community,

    I just started working with the etcd-raft lib recently, so I don't know the development history of etcd-raft, and I have a question about that history.
    I know the raft lib implements the consensus protocol introduced by the Raft research paper. Raft borrows many key ideas from the classic Paxos protocol and is easier to understand and to implement.
    The original Raft protocol, in my understanding, is based on classic Multi-Paxos. In Raft or classic Paxos, there is at least a two-RTT message delay from the moment a command is proposed until it can be committed, even when a stable leader has already been elected. A revised version of Paxos, Fast Paxos by Lamport (https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2005-112.pdf), can reduce the delay to one RTT at the cost of a larger quorum. Fast Paxos has been studied in academia for many years. When we looked into the raft lib, we didn't see an implementation of Fast Paxos (if I overlooked it, I would appreciate it if you could share where it is in the lib).
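    To make the quorum trade-off concrete, here is a rough sketch of my own (not from the paper; it assumes the common choice of a majority classic quorum and a ceil(3n/4) fast quorum):

package main

import "fmt"

// classicQuorum is the usual Raft / Multi-Paxos majority: more than half of n nodes.
func classicQuorum(n int) int { return n/2 + 1 }

// fastQuorum uses one common Fast Paxos choice, ceil(3n/4), so that any two fast
// quorums and any classic quorum always have a node in common.
func fastQuorum(n int) int { return (3*n + 3) / 4 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("n=%d classic=%d fast=%d\n", n, classicQuorum(n), fastQuorum(n))
	}
	// prints: n=3 classic=2 fast=3 / n=5 classic=3 fast=4 / n=7 classic=4 fast=6
}

    So on a 5-node cluster the fast path needs 4 acknowledgements instead of 3; that larger quorum is the price paid for saving a round trip.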
    My question about etcd-raft is whether the community has ever considered adding Fast Paxos to this lib. If the community has worked on it (e.g. planning or a high-level design), what was the outcome? Are there any technical concerns or challenges in implementing it? If the community has not worked on it, does it believe the idea is worth a try (e.g. a POC)?

    Looking forward to hearing your opinions on it. Thank you,
    Yifei

James

Oct 11, 2023, 1:47:50 PM
to Yifei Ma, etcd-dev
I expect we would welcome the patch if it were an improvement to etcd
overall. Sounds like a good idea. One challenge we currently face is a
lack of maintainers, so I think it would be more likely to be accepted
if you and others were around to maintain this longer term.

HTH

Josh Berkus

Oct 12, 2023, 12:42:15 PM
to Yifei Ma, etcd-dev
On 10/11/23 10:43, Yifei Ma wrote:
>     My question about etcd-raft is whether the community has ever
> considered adding Fast Paxos to this lib. If the community has worked
> on it (e.g. planning or a high-level design), what was the outcome?
> Are there any technical concerns or challenges in implementing it? If
> the community has not worked on it, does it believe the idea is worth
> a try (e.g. a POC)?

As far as I know, nobody's considered it. Etcd originated as the
demonstration case of Raft. I don't remember at this point why the Raft
researchers rejected Fast Paxos, and it may not matter. I know that the
folks at CitusDB adopted Fast Paxos instead of Raft, because we had a
long argument about it.

The technical challenge would be demonstrating that an implementation
based on Fast Paxos was correct under real-world circumstances. At this
point, Etcd has 9 years of engineering behind consistency, data
retention, and crash recovery -- most of which is closely tied to how
etcd-raft works. So doing this would probably require adding to Etcd's
testing in order to make sure that we haven't introduced a new data loss
bug. We'd also need testing to show that real-world performance was
actually better.

So realistically you're looking at a year-long project here, during
which you'd become an expert in Etcd. You can decide if that's worth it
for you. Regardless of how it came out, the results would be interesting.

--
-- Josh Berkus
Kubernetes Community Architect
OSPO, OCTO

Yifei Ma

Oct 18, 2023, 1:13:50 PM
to Josh Berkus, Katie Gioioso, etcd-dev
Hi Josh,

Thank you for sharing your opinions. We are evaluating the challenges of, and our plan for, reducing network latency in Raft. There is a key question we would like to answer before actually working on it: does the Etcd community know the key bottleneck of Etcd under high load (e.g., a large-scale k8s cluster)? If the key bottleneck in Etcd is the network, then our proposal for reducing network latency could bring a lot of value. Otherwise, the work may not make much sense.

Best,
Yifei

Danielle Lancashire

Oct 18, 2023, 1:15:57 PM
to Yifei Ma, Josh Berkus, Katie Gioioso, etcd-dev
My testing has historically shown that disk becomes a bottleneck before the network does, even when running entirely on solid-state storage, especially in cloud environments. It's probably worth running some more structured benchmarks to verify, though.

Josh Berkus

Oct 18, 2023, 1:17:11 PM
to Yifei Ma, Katie Gioioso, etcd-dev
On 10/18/23 10:13, Yifei Ma wrote:
> Thank you for sharing your opinions. We are evaluating the challenges
> of, and our plan for, reducing network latency in Raft. There is a key
> question we would like to answer before actually working on it: does
> the Etcd community know the key bottleneck of Etcd under high load
> (e.g., a large-scale k8s cluster)? If the key bottleneck in Etcd is the
> network, then our proposal for reducing network latency could bring a
> lot of value. Otherwise, the work may not make much sense.

I don't personally know; it used to be storage but we've done a lot to
speed that up. Someone else might. Is there a reason not to ask about
this on the etcd-dev mailing list?

Yifei Ma

Oct 18, 2023, 1:19:05 PM
to Josh Berkus, Katie Gioioso, etcd-dev
Sorry, Josh, we didn't know there was a mailing list for dev. We will send technical questions to this list in the future.

Josh Berkus

Oct 18, 2023, 1:24:38 PM
to Yifei Ma, Katie Gioioso, etcd-dev
On 10/18/23 10:18, Yifei Ma wrote:
> Sorry, Josh, we didn't know there was a mailing list for dev. We will
> send technical questions to this list in the future.

Oh, whoops, you have the mailing list on CC. Didn't realize it because
nobody else replied.

Hey, folks, has anyone done testing lately that would show whether
network is a bottleneck in addition to storage?

Yifei Ma

Oct 18, 2023, 1:26:59 PM
to Josh Berkus, Katie Gioioso, etcd-dev
We are more than happy to profile Etcd if anyone could point us to a reasonable setup and benchmark workloads.

Josh Berkus

Oct 18, 2023, 1:41:27 PM
to Yifei Ma, Katie Gioioso, etcd-dev
On 10/18/23 10:26, Yifei Ma wrote:
> We are more than happy to profile Etcd if anyone could point us to a
> reasonable setup and benchmark workloads.

It's worth checking whether, even if disk is the main bottleneck,
network messaging time nevertheless adds lag on top of that.

It would also be fun to do a "distributed cluster" test where Etcd
nodes are located in different availability zones/datacenters. I suspect
that you'd have a very different network/disk balance in that case.

Xiang Li

Oct 18, 2023, 1:42:54 PM
to Yifei Ma, Josh Berkus, Katie Gioioso, etcd-dev
Hi Yifei,

Probably not. gRPC between client and server is the major bottleneck for throughput, and disk I/O is the major bottleneck for latency.
Moreover, write latency (reducing it by ~100ms) is probably not that important for k8s anyway. Read throughput and peak write throughput are more critical.

You probably also want to explain how you would like to extend Raft to support Fast Paxos before starting the work.

Thanks,
Xiang

Xiang Li

Oct 18, 2023, 1:49:59 PM
to Yifei Ma, Josh Berkus, Katie Gioioso, etcd-dev
If I understand Fast Paxos correctly, the client sends its messages directly to the servers to get the latency gain. In the k8s case, the client talks to the server with the k8s protocol, which is then translated to the etcd protocol, and then to the raft protocol.

To get the full latency gain, would we need to change all of these layers so the client can send messages directly to raft?


Yifei Ma

Oct 18, 2023, 2:10:14 PM
to Xiang Li, Josh Berkus, Katie Gioioso, etcd-dev
Hi Xiang,

   What you said about Fast Paxos is correct: a client broadcasts its proposals to all replicas, and if the replicas that reply form a fast-path quorum (larger than a regular quorum), the command can be considered committed. The leader will eventually learn the proposal and its position (e.g., next Paxos vote).
   Of course, we prefer to make the minimum design and code changes in the existing etcd and its raft lib for Fast Paxos. Our initial design effort is to make changes only in the Raft lib. The message flow in our current plan is that when followers in the Raft lib receive a new proposal, they create a new type of fast-path message and broadcast it to the other known followers (a rough sketch follows below). In this way, new message types are added and the Step() of each role needs code changes in the Raft lib only. We don't plan to make any changes outside the Raft lib, if possible.
   Do you think this idea makes sense and is feasible?
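   To illustrate the shape of the change we have in mind, here is a rough, self-contained sketch (the MsgFastProp type, the field names, and the quorum tracker are hypothetical and are not part of etcd-raft's raftpb or its Step() handlers):

package main

import "fmt"

// Hypothetical message plumbing for the sketch; etcd-raft's real raftpb does not
// define a fast-path proposal message, and these names are made up.
type MessageType int

const (
	MsgProp     MessageType = iota // normal proposal forwarded to the leader
	MsgFastProp                    // made-up fast-path broadcast of a proposal
)

type Message struct {
	Type  MessageType
	From  uint64 // node ID of the sender
	Index uint64 // log slot the sender proposes for the entry
	Data  []byte
}

// fastTracker counts per-slot acknowledgements. A slot is fast-committed once a
// fast quorum (here ceil(3n/4) of the n members) has echoed the same proposal.
type fastTracker struct {
	n    int
	acks map[uint64]map[uint64]bool // index -> set of node IDs seen
}

func (t *fastTracker) record(m Message) bool {
	if t.acks == nil {
		t.acks = map[uint64]map[uint64]bool{}
	}
	if t.acks[m.Index] == nil {
		t.acks[m.Index] = map[uint64]bool{}
	}
	t.acks[m.Index][m.From] = true
	return len(t.acks[m.Index]) >= (3*t.n+3)/4
}

func main() {
	tr := &fastTracker{n: 5} // five-node cluster: fast quorum is 4
	for _, from := range []uint64{1, 2, 3, 4} {
		if tr.record(Message{Type: MsgFastProp, From: from, Index: 7}) {
			fmt.Println("slot 7 fast-committed after ack from node", from)
		}
	}
}

   In the real lib this counting would have to sit next to the existing quorum tracking and handle conflicting proposals competing for the same slot, which is the part Fast Paxos's collision-recovery rules address.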

   Thanks,
   Yifei

Yifei Ma

Oct 18, 2023, 2:18:33 PM
to Josh Berkus, Katie Gioioso, etcd-dev
The types of setup where we believe the idea makes sense are 1) a cloud environment, where disk and network latency can be arbitrarily large, and 2) replicas spread across different network regions (WAN). We can show that the idea should help, at least in theory, in case 2).

Xiang Li

Oct 18, 2023, 2:18:47 PM
to Yifei Ma, Josh Berkus, Katie Gioioso, etcd-dev
Thanks for your explanation.

I am not trying to discourage you from doing so, but you should probably run some experiments first (e.g. entirely bypass consensus without conflict resolution: just send the proposal to all nodes, collect the responses, and commit).

That would probably take you a few days to get results with the etcd bench command. If you see a significant latency gain, we may want to proceed.
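To make that experiment concrete, here is a minimal, self-contained sketch of what I mean (peers are simulated with made-up RTTs rather than real etcd nodes; the numbers are placeholders):

package main

import (
	"fmt"
	"time"
)

// broadcastAndWait "sends" one proposal to every peer concurrently and returns how
// long it takes until quorum of them have answered. Peers are simulated with sleeps.
func broadcastAndWait(rtts []time.Duration, quorum int) time.Duration {
	start := time.Now()
	done := make(chan struct{}, len(rtts))
	for _, rtt := range rtts {
		go func(d time.Duration) {
			time.Sleep(d) // stand-in for send + ack over the real network
			done <- struct{}{}
		}(rtt)
	}
	for i := 0; i < quorum; i++ {
		<-done
	}
	return time.Since(start)
}

func main() {
	// Made-up round-trip times to the other four members of a five-node cluster.
	rtts := []time.Duration{
		5 * time.Millisecond, 7 * time.Millisecond,
		40 * time.Millisecond, 90 * time.Millisecond,
	}
	// A fast quorum of 4 out of 5; the proposer counts itself, so it waits for 3 peers.
	fmt.Println("one-round commit latency:", broadcastAndWait(rtts, 3))
	// Compare this against the measured latency of a normal etcd write on the same
	// machines to see whether the saved round trip actually matters.
}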

Yifei Ma

Oct 18, 2023, 2:24:32 PM
to Xiang Li, Josh Berkus, Katie Gioioso, etcd-dev
Your proposed experiment plan is similar to what we plan to start with (also as a learning experience), but actually even smaller in scope and simpler than ours. Thanks for the input.