Hi,

I'm working on a Raft implementation in C# consisting of two parts: a transport-agnostic implementation of the algorithm, and a concrete realization for ASP.NET Core based on HTTP/S. It is about 99% complete. I found that some real-world aspects are not covered in the related docs. That's probably fine for an academic paper, but I would like to discuss them and find answers.

First, the communication protocol between cluster members. Connections fall into the following categories: persistent (like HTTP with keep-alive, or raw TCP), non-persistent (like HTTP without keep-alive), and connectionless datagram-based (like UDP). The last one is outside my interest, so I'll leave it aside. Persistent connections produce roughly N^2 interconnections between nodes in the worst case (where N is the number of members): if every node dials every other node, 20 members yield N(N-1) = 380 TCP connections.
Network engineers will be unhappy with that in a real production environment. Moreover, such an approach consumes a lot of system and network resources. When the cluster is operating normally, Raft uses only about N connections at a time (leader to followers). As a result, the efficiency of resource consumption is roughly N/N^2 = 1/N, and it decreases as the number of nodes grows. That's why I decided to use the Connection header with the Close value, which closes the underlying TCP connection after each response. Of course, non-persistent connections cost some performance.
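As a back-of-the-envelope check of the numbers above, here is a small Python sketch (illustrative only; not part of the C# implementation under discussion):

```python
def full_mesh_connections(n: int) -> int:
    """Directed TCP connections in a full mesh: each node dials every other node."""
    return n * (n - 1)

def steady_state_connections(n: int) -> int:
    """During normal Raft operation only the leader dials the N-1 followers."""
    return n - 1

print(full_mesh_connections(20))     # 380 connections in the worst case
print(steady_state_connections(20))  # 19 connections while one leader is stable
```

The efficiency ratio quoted above follows directly: (n - 1) / (n * (n - 1)) = 1/n.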
According to the HTTP specification, there are three kinds of methods: safe, idempotent, and non-idempotent. By default, HTTP offers an at-least-once delivery pattern, which means that RequestVote and AppendEntries can be duplicated in case of network failures. For instance, if the HTTP client sends a request but doesn't receive a response, it may decide to re-send the request. There are two possible approaches to this problem:
- Assume that RequestVote and AppendEntries are idempotent (for the same term, prevLogIndex and commitIndex)
- Introduce a RequestId and detect duplicate requests on the receiver side. Personally, I don't like this approach because I would have to keep a buffer of such IDs to track duplicates.
Which way is correct?
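The first option can be illustrated with a minimal sketch. This simplified receiver ignores terms and the conflict check from the paper, but it shows why an exact duplicate delivery of the same AppendEntries request is harmless: re-applying it leaves the log unchanged.

```python
def append_entries(log, prev_log_index, entries):
    """Simplified AppendEntries receiver: overwrite the suffix after
    prev_log_index with the incoming entries. Re-delivering the exact
    same request produces the exact same log (idempotence)."""
    # Reject if our log is too short to contain prev_log_index.
    if prev_log_index > len(log):
        return False
    # Replace everything after prev_log_index with the incoming entries.
    del log[prev_log_index:]
    log.extend(entries)
    return True

log = []
append_entries(log, 0, ["a", "b"])
append_entries(log, 0, ["a", "b"])  # duplicate delivery: no change
print(log)  # ['a', 'b']
```

A real implementation also compares terms so that a stale duplicate arriving after newer appends is rejected rather than truncating the log.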
The next thing is the request timeout. I chose it equal to the upper bound of the range from which the election timeout is randomly picked. I did that because .NET HttpClient doesn't allow specifying a timeout for each request individually. Is that the correct choice? I guess that if the HTTP client did allow a per-request timeout, it should equal the randomly chosen election timeout.
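To make the relationship concrete, here is a small sketch; the timeout values are illustrative, not taken from the implementation:

```python
import random

# Range from which each election timeout is drawn (seconds; illustrative).
ELECTION_TIMEOUT_RANGE = (0.15, 0.30)

def random_election_timeout() -> float:
    """Each node picks a fresh random election timeout from the range."""
    return random.uniform(*ELECTION_TIMEOUT_RANGE)

# With a per-client (not per-request) timeout, the safe static choice is the
# upper bound: no RPC then outlives even the slowest election timer.
RPC_TIMEOUT = ELECTION_TIMEOUT_RANGE[1]
```

The design choice is conservative: a static RPC timeout equal to the upper bound never fires before the election timer of any node would.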
Now let's discuss the behavior of a Raft-based cluster in an unstable network where partitioning is possible. I emulated such behavior on my computer using localhost connections and found that, from the point of view of an external observer, multiple leaders are possible: there was one leader in each partition. That's OK, I know.
But as a result, splits and joins between partitions may happen very often in an unstable network. That is bad for applications that use leader election as an alternative to a distributed exclusive lock. I decided to introduce an absoluteMajority configuration parameter which treats unreachable members as negative votes.
Now at most one partition can have a leader while the others have none. Such a configuration may lead to a situation where the partitioned cluster becomes totally unavailable, because no partition contains an absolute majority of members. From my point of view, that is better than having a distributed lock acquired in every partition.
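A minimal sketch of counting votes against the full cluster configuration, where a silent or unreachable member is effectively a "no" vote:

```python
def has_quorum(votes_granted: int, cluster_size: int) -> bool:
    """A candidate wins only with votes from a strict majority of the FULL
    cluster configuration; members that never answer simply don't count."""
    return votes_granted > cluster_size // 2

# A 5-node cluster split 3/2: only the 3-node side can elect a leader.
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
```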
The last thing I would like to ask about is synchronous commit to a Raft cluster. If I understand correctly, Raft allows implementing weak consistency only. This happens because the leader node responds to the client immediately after the necessary log records have been written to its own audit trail. There is no way for the client to ensure that the change is committed to a majority of nodes before the response is received. Is it possible to modify the Raft algorithm somehow to provide strong consistency?
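For reference, the commit rule this question is about can be sketched as follows: the leader considers an entry committed once it is stored on a majority, which it can compute from its matchIndex view (names follow the Raft paper; the sketch is simplified):

```python
def commit_index(match_index, cluster_size):
    """Highest log index replicated on a majority of nodes.
    match_index is the leader's view of each node's last replicated index,
    including the leader's own last index."""
    majority = cluster_size // 2 + 1
    # The majority-th largest matchIndex is stored on at least that many nodes.
    return sorted(match_index, reverse=True)[majority - 1]

# Leader at index 7; followers at 7, 5, 3, 2 -> index 5 is on 3 of 5 nodes.
print(commit_index([7, 7, 5, 3, 2], 5))  # 5
```

A leader that delays its client response until the entry's index is at or below this commit index (and applied) gives the synchronous behavior asked about.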
Thanks in advance for your help!
--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/4f544054-88ea-4f65-a304-e11342a52459%40googlegroups.com.
Note that in Raft you should indeed have an absolute majority. In a partitioned cluster, the unreachable nodes are still part of the cluster configuration. This is analogous to standards bodies, where an abstained vote is really a no vote.
You need to implement a feature where your leader will step down if it can't talk to its followers for an election timeout.
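A rough sketch of such a step-down check (names and the timeout value are illustrative): the leader records when each follower last acknowledged an AppendEntries, and steps down once it cannot account for a majority within one election timeout.

```python
class LeaderLease:
    """Step-down rule sketch: a leader that has not heard from a majority
    of the cluster within one election timeout stops acting as leader."""

    def __init__(self, cluster_size: int, election_timeout: float):
        self.cluster_size = cluster_size
        self.election_timeout = election_timeout
        self.last_ack = {}  # follower id -> time of last successful reply

    def record_ack(self, follower_id: str, now: float) -> None:
        self.last_ack[follower_id] = now

    def should_step_down(self, now: float) -> bool:
        recent = sum(1 for t in self.last_ack.values()
                     if now - t < self.election_timeout)
        # +1 counts the leader itself toward the majority.
        return recent + 1 <= self.cluster_size // 2

lease = LeaderLease(cluster_size=5, election_timeout=0.3)
lease.record_ack("b", now=0.0)
lease.record_ack("c", now=0.0)
print(lease.should_step_down(now=0.1))  # False: 2 followers + self = 3 of 5
print(lease.should_step_down(now=1.0))  # True: all acks are stale
```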
There is a lot of vague discussion around the 'consistency' topic, and not only for the Raft protocol.
> Note that in Raft you should indeed have absolute majority. In a partitioned cluster the unreachable nodes are still part of the cluster configuration. This is analogous to standards bodies where an abstain vote is really a no vote.

Are you sure? Look at http://thesecretlivesofdata.com/raft/#replication where you can see that two partitions may each have their own leader. Additionally, you can observe the same picture on https://raft.github.io/:
- Wait for leader election
- Separate leader from all other nodes (just stop all other nodes except leader)
- Leader continues operating
- Now we have two partitions: the first one has 1 node that is leader already, the second one has N-1 nodes and successfully elects the new leader
- As a result, we have two partitions with separate leader in each of them
Given that, I should add the feature mentioned by Ayende Rahien:

> You need to implement a feature where your leader will step down if it can't talk to its followers for an election timeout.

Now let's talk about consistency:

> There are many opaque talks about the 'consistency' topic not only on the raft protocol.

Let me rephrase my question. If we have three nodes A, B, C and A is the leader, then what can I do as a client of such a cluster? The possible options are:
- I can read/write from/to node A because it is leader
- I can read (but not write) from nodes B, C.
If #2 is allowed by the Raft algorithm, then we have a weak-consistency scenario. As a client, I can add a new uncommitted entry on leader node A, then go to node B (because of a load balancer) and observe the previous consistent state, not the most recent one. There is no guarantee that the client sees recent changes on follower nodes, right? If so, the only option for the client is to read/write through the single leader node, and the read path cannot be scaled across the cluster. That's why I'm asking about a mechanism to wait for cluster-wide replication before the client receives a response.
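One well-known way to serve reads from followers without stale results is a ReadIndex-style check: the follower asks the leader for the current commit index and serves the read only after applying at least that far. A minimal sketch (function and parameter names are illustrative):

```python
def follower_can_serve_read(read_index: int, applied_index: int) -> bool:
    """ReadIndex-style follower read: the follower obtains the leader's
    current commit index (read_index) and may answer the client only once
    its own applied index has caught up. Before that, the client could
    observe a stale state."""
    return applied_index >= read_index

# Leader's commit index is 10; a follower applied only through 8 must wait.
print(follower_can_serve_read(read_index=10, applied_index=8))   # False
print(follower_can_serve_read(read_index=10, applied_index=10))  # True
```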
As for the implementation, the only thing I find difficult is how to deal with message disorder (messages received out of order on the follower side, and completed out of order on the leader side) correctly in an asynchronous, multi-threaded environment.
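One common way to cope with out-of-order responses on the leader side is to discard anything stale by term or by matchIndex, so reordering can never move the leader's view backwards. A hedged sketch (names are illustrative):

```python
def is_stale_response(current_term: int, resp_term: int,
                      match_index: int, resp_match_index: int) -> bool:
    """Drop out-of-order AppendEntries responses: anything from an old term,
    or anything that carries no new information about the follower's log,
    is simply ignored. matchIndex then only ever advances."""
    return resp_term < current_term or resp_match_index <= match_index

print(is_stale_response(3, 2, 5, 7))  # True: response from an old term
print(is_stale_response(3, 3, 5, 7))  # False: fresh progress, accept it
print(is_stale_response(3, 3, 7, 5))  # True: late response, already ahead
```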
Cmd applied on the leader
Oren Eini
CEO / Hibernating Rhinos LTD
Mobile: +972-52-548-6969 Sales: sa...@ravendb.net
Skype: ayenderahien Support: sup...@ravendb.net
--
Karl Nilsson
In terms of guarantees, that is the best you're going to get. You either know the command was processed successfully, or you're not sure (timeout). For the scenarios where you are waiting for a command to be applied and it times out, you simply cannot know whether it was processed successfully or not, or will be at some later point (assuming it made it into some leader's log). You will need to retry the command on timeout. If exactly-once processing is important to you, you need some kind of deduplication in your state machine: each command carries a unique identifier, and the state machine keeps a buffer of recently processed identifiers so it can tell whether an incoming command has already been processed.

Cheers,
Karl

On Mon, 8 Jul 2019 at 11:44, Roman Sakno <roman...@gmail.com> wrote:
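A minimal sketch of the deduplicating state machine described above (the capacity and names are illustrative; a real implementation would bound the buffer per client session):

```python
from collections import OrderedDict

class DedupStateMachine:
    """Each command carries a client-chosen unique id; the state machine
    remembers recent ids and returns the cached result for duplicates, so
    a retried command is applied exactly once."""

    def __init__(self, capacity: int = 1024):
        self.applied = OrderedDict()  # command id -> cached result
        self.capacity = capacity
        self.counter = 0              # stands in for real state

    def apply(self, command_id: str):
        if command_id in self.applied:        # duplicate: replay cached result
            return self.applied[command_id]
        self.counter += 1                     # the actual state transition
        self.applied[command_id] = self.counter
        if len(self.applied) > self.capacity:
            self.applied.popitem(last=False)  # evict the oldest id
        return self.counter

sm = DedupStateMachine()
print(sm.apply("cmd-1"))  # 1
print(sm.apply("cmd-1"))  # 1: retried command, applied only once
print(sm.apply("cmd-2"))  # 2
```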
On Monday, July 8, 2019 at 1:31:36 PM UTC+3, Ayende Rahien wrote:
Cmd applied on the leader
That's what I'm looking for. A command can be considered applied (committed) on the leader if it has been replicated to a majority of nodes. But the actual implementation of such behavior is hard, because I need to synchronize on that event and provide guarantees to the client. For instance, the client sends a command to the leader and waits for the commit event. The commit happens, but at the same moment the network connection breaks down. The client assumes its command was not committed to the cluster and retries the request, but in reality the command has already been committed by the leader and a majority of nodes.
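The client side of the deduplication scheme Karl describes can be sketched like this: the unique id is generated once per logical command and reused across retries, so a command that committed just before the connection dropped cannot be applied twice. The `submit` callable is a hypothetical stand-in for the real transport:

```python
import uuid

def send_with_retry(submit, max_attempts: int = 3):
    """Generate the command id ONCE, then reuse it on every retry. The
    server-side dedup buffer turns a retried-but-already-committed command
    into a cached-result replay instead of a second application."""
    command_id = str(uuid.uuid4())  # one id per logical command
    for _ in range(max_attempts):
        try:
            return submit(command_id)
        except ConnectionError:
            continue                # retry with the SAME id
    raise TimeoutError("command outcome unknown after retries")

# Simulated transport: the first attempt commits but the reply is lost.
applied = []
def flaky_submit(cid):
    applied.append(cid)
    if len(applied) == 1:
        raise ConnectionError       # reply lost after commit
    return cid

send_with_retry(flaky_submit)
print(len(set(applied)))  # 1: the cluster only ever saw one command id
```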