Hi,

I'm working on a Raft implementation in C# consisting of two parts: a transport-agnostic implementation of the algorithm, and a concrete realization for ASP.NET Core based on HTTP/S. It is about 99% complete. I found that some real-world aspects are not covered in the related docs. That's probably fine for an academic paper, but I would like to discuss them and find answers.

First, the communication protocol between cluster members. Connections fall into the following categories: persistent (like HTTP with keep-alive, or raw TCP), non-persistent (like HTTP without keep-alive), and connectionless datagram-based (like UDP). The last one is outside my interest, so I'll leave it aside. Persistent connections produce roughly N^2 interconnections between nodes in the worst case (where N is the number of members): if every node dials every other node, 20 members yield N(N-1) = 380 TCP connections.
Network engineers will be unhappy with that in a real production environment. Moreover, such an approach consumes a lot of system and network resources. When the cluster is operating normally, Raft uses only about N connections at a time (leader to followers). As a result, the efficiency of resource consumption is roughly N/N^2 = 1/N, and it decreases as the number of nodes grows. That's why I decided to use the Connection header with the Close value, which closes the underlying TCP connection after each response. Of course, non-persistent connections cost some performance.
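As a back-of-the-envelope check of the numbers above, here is a small Python sketch (illustrative only; not part of the C# implementation under discussion):

```python
def full_mesh_connections(n: int) -> int:
    """Directed TCP connections in a full mesh: each node dials every other node."""
    return n * (n - 1)

def steady_state_connections(n: int) -> int:
    """During normal Raft operation only the leader dials the N-1 followers."""
    return n - 1

print(full_mesh_connections(20))     # 380 connections in the worst case
print(steady_state_connections(20))  # 19 connections while one leader is stable
```

The efficiency ratio quoted above follows directly: (n - 1) / (n * (n - 1)) = 1/n.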
According to the HTTP specification, there are three kinds of methods: safe, idempotent, and non-idempotent. By default, HTTP offers an at-least-once delivery pattern, which means that RequestVote and AppendEntries can be duplicated in case of network failures. For instance, if the HTTP client sends a request but doesn't receive a response, it may decide to re-send the request. There are two possible approaches to this problem:
- Assume that RequestVote and AppendEntries are idempotent (for the same term, prevLogIndex and commitIndex)
- Introduce a RequestId and detect duplicate requests on the receiver side. Personally, I don't like this approach because I would have to keep a buffer of such IDs to track duplicates.
Which way is correct?
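The first option can be illustrated with a minimal sketch. This simplified receiver ignores terms and the conflict check from the paper, but it shows why an exact duplicate delivery of the same AppendEntries request is harmless: re-applying it leaves the log unchanged.

```python
def append_entries(log, prev_log_index, entries):
    """Simplified AppendEntries receiver: overwrite the suffix after
    prev_log_index with the incoming entries. Re-delivering the exact
    same request produces the exact same log (idempotence)."""
    # Reject if our log is too short to contain prev_log_index.
    if prev_log_index > len(log):
        return False
    # Replace everything after prev_log_index with the incoming entries.
    del log[prev_log_index:]
    log.extend(entries)
    return True

log = []
append_entries(log, 0, ["a", "b"])
append_entries(log, 0, ["a", "b"])  # duplicate delivery: no change
print(log)  # ['a', 'b']
```

A real implementation also compares terms so that a stale duplicate arriving after newer appends is rejected rather than truncating the log.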
The next thing is the request timeout. I chose it equal to the upper bound of the range from which the election timeout is randomly picked. I did that because .NET HttpClient doesn't allow specifying a timeout for each request individually. Is that the correct choice? I guess that if the HTTP client did allow a per-request timeout, it should equal the randomly chosen election timeout.
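To make the relationship concrete, here is a small sketch; the timeout values are illustrative, not taken from the implementation:

```python
import random

# Range from which each election timeout is drawn (seconds; illustrative).
ELECTION_TIMEOUT_RANGE = (0.15, 0.30)

def random_election_timeout() -> float:
    """Each node picks a fresh random election timeout from the range."""
    return random.uniform(*ELECTION_TIMEOUT_RANGE)

# With a per-client (not per-request) timeout, the safe static choice is the
# upper bound: no RPC then outlives even the slowest election timer.
RPC_TIMEOUT = ELECTION_TIMEOUT_RANGE[1]
```

The design choice is conservative: a static RPC timeout equal to the upper bound never fires before the election timer of any node would.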
Now let's discuss the behavior of a Raft-based cluster in an unstable network where partitioning is possible. I emulated such behavior on my computer using localhost connections and found that, from the point of view of an external observer, multiple leaders are possible: there was one leader in each partition. That's OK, I know.
But as a result, splits and joins between partitions may happen very often in an unstable network. That is bad for applications that use leader election as an alternative to a distributed exclusive lock. I decided to introduce an absoluteMajority configuration parameter which treats unreachable members as negative votes.
Now at most one partition can have a leader while the others have none. Such a configuration may lead to a situation where the partitioned cluster becomes totally unavailable, because no partition contains an absolute majority of members. From my point of view, that is better than having a distributed lock acquired in every partition.
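A minimal sketch of counting votes against the full cluster configuration, where a silent or unreachable member is effectively a "no" vote:

```python
def has_quorum(votes_granted: int, cluster_size: int) -> bool:
    """A candidate wins only with votes from a strict majority of the FULL
    cluster configuration; members that never answer simply don't count."""
    return votes_granted > cluster_size // 2

# A 5-node cluster split 3/2: only the 3-node side can elect a leader.
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
```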
The last thing I would like to ask about is synchronous commit to a Raft cluster. If I understand correctly, Raft allows implementing weak consistency only. This happens because the leader node responds to the client immediately after the necessary log records have been written to its own audit trail. There is no way for the client to ensure that the change is committed to a majority of nodes before the response is received. Is it possible to modify the Raft algorithm somehow to provide strong consistency?
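For reference, the commit rule this question is about can be sketched as follows: the leader considers an entry committed once it is stored on a majority, which it can compute from its matchIndex view (names follow the Raft paper; the sketch is simplified):

```python
def commit_index(match_index, cluster_size):
    """Highest log index replicated on a majority of nodes.
    match_index is the leader's view of each node's last replicated index,
    including the leader's own last index."""
    majority = cluster_size // 2 + 1
    # The majority-th largest matchIndex is stored on at least that many nodes.
    return sorted(match_index, reverse=True)[majority - 1]

# Leader at index 7; followers at 7, 5, 3, 2 -> index 5 is on 3 of 5 nodes.
print(commit_index([7, 7, 5, 3, 2], 5))  # 5
```

A leader that delays its client response until the entry's index is at or below this commit index (and applied) gives the synchronous behavior asked about.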
Thanks in advance for your help!
--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/4f544054-88ea-4f65-a304-e11342a52459%40googlegroups.com.
Note that in Raft you should indeed have an absolute majority. In a partitioned cluster, the unreachable nodes are still part of the cluster configuration. This is analogous to standards bodies, where an abstained vote is really a no vote.
You need to implement a feature where your leader will step down if it can't talk to its followers for an election timeout.
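A rough sketch of such a step-down check (names and the timeout value are illustrative): the leader records when each follower last acknowledged an AppendEntries, and steps down once it cannot account for a majority within one election timeout.

```python
class LeaderLease:
    """Step-down rule sketch: a leader that has not heard from a majority
    of the cluster within one election timeout stops acting as leader."""

    def __init__(self, cluster_size: int, election_timeout: float):
        self.cluster_size = cluster_size
        self.election_timeout = election_timeout
        self.last_ack = {}  # follower id -> time of last successful reply

    def record_ack(self, follower_id: str, now: float) -> None:
        self.last_ack[follower_id] = now

    def should_step_down(self, now: float) -> bool:
        recent = sum(1 for t in self.last_ack.values()
                     if now - t < self.election_timeout)
        # +1 counts the leader itself toward the majority.
        return recent + 1 <= self.cluster_size // 2

lease = LeaderLease(cluster_size=5, election_timeout=0.3)
lease.record_ack("b", now=0.0)
lease.record_ack("c", now=0.0)
print(lease.should_step_down(now=0.1))  # False: 2 followers + self = 3 of 5
print(lease.should_step_down(now=1.0))  # True: all acks are stale
```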
There is a lot of vague discussion around the 'consistency' topic, and not only for the Raft protocol.
> Note that in Raft you should indeed have absolute majority. In a partitioned cluster the unreachable nodes are still part of the cluster configuration. This is analogous to standards bodies where an abstain vote is really a no vote.

Are you sure? Look at http://thesecretlivesofdata.com/raft/#replication where you can see that two partitions may each have their own leader. Additionally, you can observe the same picture on https://raft.github.io/:
- Wait for leader election
- Separate leader from all other nodes (just stop all other nodes except leader)
- Leader continues operating
- Now we have two partitions: the first one has 1 node that is leader already, the second one has N-1 nodes and successfully elects the new leader
- As a result, we have two partitions with separate leader in each of them
Given that, I should add the feature mentioned by Ayende Rahien:

> You need to implement a feature where your leader will step down if it can't talk to its followers for an election timeout.

Now let's talk about consistency:

> There are many opaque talks about the 'consistency' topic not only on the raft protocol.

Let me rephrase my question. If we have three nodes A, B, C and A is the leader, then what can I do as a client of such a cluster? The possible options are:
- I can read/write from/to node A because it is leader
- I can read (but not write) from nodes B, C.
If #2 is allowed by the Raft algorithm, then we have a weak-consistency scenario. As a client, I can add a new uncommitted entry on leader node A, then go to node B (because of a load balancer) and observe the previous consistent state, not the most recent one. There is no guarantee that the client sees recent changes on follower nodes, right? If so, the only option for the client is to read/write through the single leader node, and the read path cannot be scaled across the cluster. That's why I'm asking about a mechanism to wait for cluster-wide replication before the client receives a response.
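One well-known way to serve reads from followers without stale results is a ReadIndex-style check: the follower asks the leader for the current commit index and serves the read only after applying at least that far. A minimal sketch (function and parameter names are illustrative):

```python
def follower_can_serve_read(read_index: int, applied_index: int) -> bool:
    """ReadIndex-style follower read: the follower obtains the leader's
    current commit index (read_index) and may answer the client only once
    its own applied index has caught up. Before that, the client could
    observe a stale state."""
    return applied_index >= read_index

# Leader's commit index is 10; a follower applied only through 8 must wait.
print(follower_can_serve_read(read_index=10, applied_index=8))   # False
print(follower_can_serve_read(read_index=10, applied_index=10))  # True
```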
As for the implementation, the only thing I find difficult is how to deal with message disorder (messages received out of order on the follower side, and completed out of order on the leader side) correctly in an asynchronous, multi-threaded environment.
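One common way to cope with out-of-order responses on the leader side is to discard anything stale by term or by matchIndex, so reordering can never move the leader's view backwards. A hedged sketch (names are illustrative):

```python
def is_stale_response(current_term: int, resp_term: int,
                      match_index: int, resp_match_index: int) -> bool:
    """Drop out-of-order AppendEntries responses: anything from an old term,
    or anything that carries no new information about the follower's log,
    is simply ignored. matchIndex then only ever advances."""
    return resp_term < current_term or resp_match_index <= match_index

print(is_stale_response(3, 2, 5, 7))  # True: response from an old term
print(is_stale_response(3, 3, 5, 7))  # False: fresh progress, accept it
print(is_stale_response(3, 3, 7, 5))  # True: late response, already ahead
```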
Cmd applied on the leader
Oren Eini
CEO / Hibernating Rhinos LTD
Mobile: +972-52-548-6969 Sales: sa...@ravendb.net
Skype: ayenderahien Support: sup...@ravendb.net
--
Karl Nilsson
In terms of guarantees, that is the best you're going to get. You either know the command was processed successfully, or you're not sure (timeout). For the scenarios where you are waiting for a command to be applied and it times out, you simply cannot know whether it was processed successfully or not, or will be at some later point (assuming it made it into some leader's log). You will need to retry the command on timeout. If exactly-once processing is important to you, you need some kind of deduplication in your state machine: each command carries a unique identifier, and the state machine keeps a buffer of recently processed identifiers so it can tell whether an incoming command has already been processed.

Cheers,
Karl

On Mon, 8 Jul 2019 at 11:44, Roman Sakno <roman...@gmail.com> wrote:
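A minimal sketch of the deduplicating state machine described above (the capacity and names are illustrative; a real implementation would bound the buffer per client session):

```python
from collections import OrderedDict

class DedupStateMachine:
    """Each command carries a client-chosen unique id; the state machine
    remembers recent ids and returns the cached result for duplicates, so
    a retried command is applied exactly once."""

    def __init__(self, capacity: int = 1024):
        self.applied = OrderedDict()  # command id -> cached result
        self.capacity = capacity
        self.counter = 0              # stands in for real state

    def apply(self, command_id: str):
        if command_id in self.applied:        # duplicate: replay cached result
            return self.applied[command_id]
        self.counter += 1                     # the actual state transition
        self.applied[command_id] = self.counter
        if len(self.applied) > self.capacity:
            self.applied.popitem(last=False)  # evict the oldest id
        return self.counter

sm = DedupStateMachine()
print(sm.apply("cmd-1"))  # 1
print(sm.apply("cmd-1"))  # 1: retried command, applied only once
print(sm.apply("cmd-2"))  # 2
```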
On Monday, July 8, 2019 at 1:31:36 PM UTC+3, Ayende Rahien wrote:
Cmd applied on the leader
That's what I'm looking for. A command can be considered applied (committed) on the leader if it has been replicated to a majority of nodes. But the actual implementation of such behavior is hard, because I need to synchronize on that event and provide guarantees to the client. For instance, the client sends a command to the leader and waits for the commit event. The commit happens, but at the same moment the network connection breaks down. The client assumes its command was not committed to the cluster and retries the request, but in reality the command has already been committed by the leader and a majority of nodes.
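The client side of the deduplication scheme Karl describes can be sketched like this: the unique id is generated once per logical command and reused across retries, so a command that committed just before the connection dropped cannot be applied twice. The `submit` callable is a hypothetical stand-in for the real transport:

```python
import uuid

def send_with_retry(submit, max_attempts: int = 3):
    """Generate the command id ONCE, then reuse it on every retry. The
    server-side dedup buffer turns a retried-but-already-committed command
    into a cached-result replay instead of a second application."""
    command_id = str(uuid.uuid4())  # one id per logical command
    for _ in range(max_attempts):
        try:
            return submit(command_id)
        except ConnectionError:
            continue                # retry with the SAME id
    raise TimeoutError("command outcome unknown after retries")

# Simulated transport: the first attempt commits but the reply is lost.
applied = []
def flaky_submit(cid):
    applied.append(cid)
    if len(applied) == 1:
        raise ConnectionError       # reply lost after commit
    return cid

send_with_retry(flaky_submit)
print(len(set(applied)))  # 1: the cluster only ever saw one command id
```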