Raft on highly unreliable networks and Android devices


Luca Lovagnini

unread,
Feb 13, 2016, 5:46:14 AM2/13/16
to raft-dev
Hello to the whole Raft community.

I'm working on a project where we have Android devices connected over WiFi, and we want to elect a leader and replicate data between the devices currently connected to the local network. The set of devices is "highly unreliable": since we are talking about mobile devices on a wireless connection, new devices can join or leave the network with high frequency (and the probability of creating a partition is higher than in a classic wired network).

Raft seems a perfect candidate because it implements leader election and a replication mechanism in a fantastic way, and in addition it supports configuration changes. But do you think it is actually suitable for such a scenario? These are our two main concerns:
  1. Raft was designed for a configuration that changes occasionally, for example when a server fails or when we want to add new nodes, so something like once per day. In our case we're talking about once every 10 seconds. Obviously how well Raft handles this is implementation-dependent, but what do you think about this problem?
  2. Since Android is designed to run Java (and other languages like C or C++, but with more difficulty), which Java implementation would you suggest for such a case?
Before concluding this post, I want to thank this community for the support it will give us. I implemented a version of Raft in the past for a school project, and you helped me so much that time; that's one of the reasons I proposed Raft to my team (the other being, obviously, that it works so well). I hope you will help us this time as well!

Thanks in advance for any answer. 

Oren Eini (Ayende Rahien)

unread,
Feb 13, 2016, 8:56:42 AM2/13/16
to raft...@googlegroups.com
How many nodes do you intend to have in your cluster?

Hibernating Rhinos Ltd

Oren Eini | CEO | Mobile: +972-52-548-6969

Office: +972-4-622-7811 | Fax: +972-153-4-622-7811


--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Luca Lovagnini

unread,
Feb 13, 2016, 10:35:10 AM2/13/16
to raft...@googlegroups.com

It's not easy to answer such a question, but we can imagine a use case like a shop or a restaurant, so at most 100 nodes with frequent departures and joins.


Oren Eini (Ayende Rahien)

unread,
Feb 13, 2016, 10:36:42 AM2/13/16
to raft...@googlegroups.com
That is going to be a problem.
With 100 nodes joining / leaving, that means that any decision needs to go to at least 51 nodes.
That is a REALLY high number to get consensus from.
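To make the quorum arithmetic concrete: a Raft majority of n voting members is floor(n/2) + 1 nodes, so 100 voters means 51 acknowledgements per decision. A trivial, purely illustrative helper:

```java
public final class Quorum {
    // A majority quorum is more than half of the voting members: floor(n/2) + 1.
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }
}
```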

What are you actually trying to achieve here?

Luca Lovagnini

unread,
Feb 13, 2016, 10:43:11 AM2/13/16
to raft...@googlegroups.com

Yeah, I know, but I also know that consensus usually means distributed, which usually means cloud... and in cloud clusters we have thousands of nodes! Maybe the problem you're talking about is the high join/departure rate? By the way, I can't go into too much detail, but as I said initially, our goal for now is to find a good leader election system and a replication system. We want to create a cloud cluster formed by mobile devices.

Oren Eini (Ayende Rahien)

unread,
Feb 13, 2016, 10:50:27 AM2/13/16
to raft...@googlegroups.com
You might want to read this post: https://ayende.com/blog/169090/large-scale-distributed-consensus-approaches-computing-with-a-hundred-node-cluster

You can't get to a consensus with thousands of nodes. It is just not practical.

What data are you keeping in the cluster? What kind of consistency guarantees do you need?

dr-dr xp

unread,
Feb 13, 2016, 11:05:55 AM2/13/16
to raft-dev
In my opinion you do not need all of the nodes to participate in the election. Choose 20 of them as voting candidates, and keep all the other nodes as non-voting followers!

If you can guarantee that these 20 nodes will not be removed from the cluster, this approach would be OK.

Archie Cobbs

unread,
Feb 13, 2016, 12:47:24 PM2/13/16
to raft-dev
On Saturday, February 13, 2016 at 4:46:14 AM UTC-6, Luca Lovagnini wrote:
Raft was designed for a configuration that changes occasionally, for example when a server fails or when we want to add new nodes, so something like once per day. In our case we're talking about once every 10 seconds. Obviously how well Raft handles this is implementation-dependent, but what do you think about this problem?

Raft guarantees the same linearizable semantics for membership changes as for regular consensus operations, so "in theory" there's nothing special about them and it shouldn't be a problem.

One detail that differs is that Raft can only have one such operation occurring at a time - i.e., they are serialized. Depending on your implementation, this may mean slower overall throughput for membership operations vs. regular operations.
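As an illustration of that serialization constraint, a leader might queue membership changes so only one is ever in flight (a minimal sketch; all names are hypothetical, not from any particular Raft implementation):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of a leader serializing membership changes: Raft allows only one
// configuration change in flight at a time, so later requests queue up.
final class MembershipChanger {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> committed = new ArrayList<>();
    private String inFlight; // the single change currently going through the log

    synchronized void request(String change) {
        pending.add(change);
        maybeStartNext();
    }

    // Invoked when the in-flight configuration entry commits in the Raft log.
    synchronized void onCommitted() {
        committed.add(inFlight);
        inFlight = null;
        maybeStartNext();
    }

    private void maybeStartNext() {
        if (inFlight == null && !pending.isEmpty()) {
            inFlight = pending.poll();
            // A real server would append the new configuration to its log and
            // wait for a quorum to commit it; this sketch commits immediately.
            onCommitted();
        }
    }

    synchronized List<String> committed() {
        return new ArrayList<>(committed);
    }
}
```

Because changes queue behind one another, a burst of joins/leaves drains one commit at a time, which is the throughput cost mentioned above.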

Since Android is designed to run Java (and other languages like C or C++, but with more difficulty), which Java implementation would you suggest for such a case?

I'm totally biased but I'd recommend mine of course :) It implements a transactional key/value store (not just a state machine) and comes with a Java persistence framework as an optional bonus. In exchange for testing and feedback I'll offer free support :)

http://archiecobbs.github.io/jsimpledb/publish/reports/javadoc/index.html?org/jsimpledb/kv/raft/RaftKVDatabase.html
https://github.com/archiecobbs/jsimpledb/

-Archie


Jordan Halterman

unread,
Feb 14, 2016, 2:33:29 AM2/14/16
to raft...@googlegroups.com
Hey Luca,

As others have mentioned, you don't necessarily need the entire cluster to participate in the Raft algorithm. I'm a long-time contributor to the Vert.x project (http://vertx.io). Vert.x has long relied on Hazelcast for cluster management, and Copycat/Atomix were originally developed for safer cluster management in Vert.x. Vert.x clusters are not fixed to a certain size, so we needed the flexibility to arbitrarily add and remove nodes without impacting the performance and availability of the cluster. Our goal in Atomix was precisely that.

We built on the Raft algorithm in Copycat/Atomix to balance the cluster membership: maintain at least n Raft voting members, replicate asynchronously to some set of non-voting members, and/or heartbeat some set of stateless members. Availability is managed through commits to the Raft log, and the leader is responsible for replacing unavailable voting members with available non-voting members. This allows servers - including a majority of voting members - to crash without losing the availability of the cluster, so long as a majority of the voting members are not lost within a short time period.

Asynchronous replication - replicating from followers to asynchronous nodes - allows voting members to be replaced with non-voting members more quickly, decreasing the risk of losing availability while the new member is being caught up. Promotion and demotion of members is done in a way that ensures the number of voting members never drops below the desired size.
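A rough sketch of that balancing idea (the names below are mine, not Atomix's actual API): the leader keeps the voting set at a target size by promoting passive members whenever a voter is marked unavailable:

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch: new members join as passive (non-voting) replicas; when the leader
// marks a voting member unavailable, a passive member is promoted so the
// voting set stays at the target size. In a real system each promotion or
// demotion would itself be a configuration change committed to the Raft log.
final class ClusterBalancer {
    private final int targetVoters;
    private final Set<String> voters = new LinkedHashSet<>();
    private final Set<String> passive = new LinkedHashSet<>();

    ClusterBalancer(int targetVoters) {
        this.targetVoters = targetVoters;
    }

    // New members join passive and are promoted only if the voter set is short.
    void join(String node) {
        if (!voters.contains(node)) {
            passive.add(node);
        }
        rebalance();
    }

    // The leader marks a follower unavailable after repeated failed heartbeats.
    void markUnavailable(String node) {
        voters.remove(node);
        passive.remove(node);
        rebalance();
    }

    private void rebalance() {
        Iterator<String> it = passive.iterator();
        while (voters.size() < targetVoters && it.hasNext()) {
            voters.add(it.next());
            it.remove();
        }
    }

    Set<String> voters() {
        return voters;
    }
}
```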

The problem in Raft certainly is that membership changes are more expensive than other operations since they're serialized. But a membership change every 10 seconds shouldn't significantly impact the performance of your cluster, particularly if that cluster doesn't include all 100 nodes. Still, it feels like there may be some better alternatives to consensus, but that depends on your consistency requirements.

The documentation for the membership change algorithms in Copycat/Atomix is one of the more sparse areas ATM, but it does discuss the basic roles of each type of member and how availability is determined. The documentation of the membership change algorithm and various other algorithms (events, log compaction) is here:
http://atomix.io/copycat/docs/internals#membership-changes

And Copycat and Atomix:

Neither has been tested on Android AFAIK. I'm not sure which project I would recommend as I haven't done significant experimentation with any of the other Java Raft implementations, so I can only really comment on my own work. But feel free to steal my ideas and code if you have to go it alone :-)

Luca Lovagnini

unread,
Apr 4, 2016, 5:11:54 AM4/4/16
to raft-dev
I would like to resume this topic. First of all, thanks to all of you who answered!

Now I'm considering a much smaller network, something like 20 mobile, unreliable devices with high join/leave rates, so the cluster in this scenario is not as big as before (where we contemplated even 100 smartphones), making things simpler.

I have some questions and comments based on this new scenario and your previous comments: 
  1. What are the possible applications/use cases for a mobile version of Raft, or of any strongly consistent consensus algorithm?
  2. This comment raised my interest:
     Still, it feels like there may be some better alternatives to consensus, but that depends on your consistency requirements.
     What are the possible alternatives to consensus?
  3. In such a scenario, what would be the most challenging aspects of implementing Raft, and what are the possible cons of using this algorithm?
  4. Do you know of any existing papers/libraries about consensus on mobile, unreliable devices?
Again, thanks to all of you for your help!   

jordan.h...@gmail.com

unread,
Apr 4, 2016, 4:56:21 PM4/4/16
to raft...@googlegroups.com
I started answering inline but I'll elaborate on all the questions here instead.

I just think the unreliable nature of the network may justify a high-availability option. Even if you do what was done in Atomix - where some subset of the nodes are active voting Raft members and a cluster manager scales the Raft portion of the cluster by replacing failed nodes with standby nodes - there's still an increased risk that the cluster becomes entirely unavailable. For example, if two of the nodes in a three-node Raft cluster simultaneously disconnect, the cluster manager can't commit a configuration change to promote one of the standby nodes, and so the entire cluster will become unavailable until the disconnected nodes return. This is exacerbated when you have to increase the failure detection timeout to account for Raft nodes that temporarily disconnect. In Copycat, the leader will mark a follower unavailable after a few failed heartbeats, but in practice that may be too short a span of time. Short-lived network partitions can result in expensive operations when the cluster manager replaces a Raft node and the leader has to catch up a new follower. But increasing the amount of time after which the leader marks an unreachable follower unavailable will increase the odds that a majority of the Raft cluster becomes unavailable without any of the nodes being replaced by the cluster manager.

Of course, this is somewhat mitigated by simply using a larger Raft cluster that can tolerate more failures. But at the same time, a larger Raft cluster will require more nodes at all times. Is that acceptable? Also, if nodes may frequently fail and never return to the cluster, that poses issues as well. If the majority of the Raft cluster fails and never returns, writes may be lost and some administrative intervention may be required to add nodes back to the cluster. It just seems to me that the trade-offs are too great for highly unreliable networks. If you can consistently expect the same 20 nodes in your cluster, I suppose that mitigates the availability issues a bit.

I suppose the alternatives are many, but they may be highly dependent on your use case. If consistency is critical - which it appears to be - perhaps CRDTs and consistent hashing could be explored as more consistent forms of high availability.
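As a concrete taste of the CRDT route, a grow-only counter (G-Counter) is about the simplest example; this is an illustrative sketch, not tied to any particular library:

```java
import java.util.HashMap;
import java.util.Map;

// A grow-only counter (G-Counter), one of the simplest CRDTs: each node
// increments only its own slot, and replicas merge by taking per-node
// maxima. Merges commute and are idempotent, so replicas converge without
// a leader, a quorum, or any coordination at all.
final class GCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Each device increments under its own node id.
    void increment(String nodeId) {
        counts.merge(nodeId, 1L, Long::sum);
    }

    // The counter's value is the sum over all nodes' slots.
    long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    // Merge another replica's state by taking the per-node maximum.
    void merge(GCounter other) {
        other.counts.forEach((node, n) -> counts.merge(node, n, Math::max));
    }
}
```

The trade-off is exactly the one under discussion: every replica stays writable during a partition, but you get eventual convergence rather than linearizable reads.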

Luca Lovagnini

unread,
Apr 4, 2016, 9:12:10 PM4/4/16
to raft...@googlegroups.com

Thanks for your answer; it was very detailed and helpful, and it confirmed some of my doubts (like "what happens if one node leaves the network forever?").

I still don't know if consistency is a mandatory requirement, since there are many interesting use cases for each kind of consistency, but could you please point me to some document/book/article where the different kinds of consistency (with their typical applications and the algorithms that implement them) are described? For now, the only algorithms I know of are Paxos and Raft (for strong consistency) and Dynamo (which is not an algorithm but a system, for weak consistency). I'm asking because I've heard of CRDTs and of distributed applications of consistent hashing.

Again, thanks so much for all your help!

Best regards,
Luca Lovagnini


jordan.h...@gmail.com

unread,
Apr 4, 2016, 10:27:03 PM4/4/16
to raft...@googlegroups.com
Dynamo was what I was thinking about when I mentioned consistent hashing. It uses consistent hashing to map partitions to nodes IIRC (I haven't studied it extensively). It's really designed for scalability, but the same concepts (consistent hashing) can be applied without partitions to achieve high availability. Of course, in terms of fault tolerance the same caveat applies here as to having only a portion of your cluster participate in the Raft algorithm: if you're mapping keys to partitions, and each partition is hashed to a primary and n replicas, the simultaneous failure (disconnection) of those n replicas will result in the loss of state. IIRC the Dynamo paper mentions they had an administrator intervene to rebalance the cluster when a node crashed forever, but this process can also be done automatically. I think Hazelcast may do something like that.
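The key-to-replicas mapping described above can be sketched with a toy consistent-hash ring (no virtual nodes, and String.hashCode stands in for a real hash function; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy Dynamo-style consistent-hash ring: nodes and keys hash onto the same
// ring; a key is owned by the first node clockwise from its hash, and the
// next distinct nodes clockwise hold the replicas.
final class HashRing {
    private final NavigableMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        ring.put(hash(node), node);
    }

    void removeNode(String node) {
        ring.remove(hash(node));
    }

    // Returns the primary plus (replicas - 1) distinct replica nodes.
    List<String> replicasFor(String key, int replicas) {
        List<String> owners = new ArrayList<>();
        if (ring.isEmpty()) {
            return owners;
        }
        Integer start = ring.ceilingKey(hash(key));
        if (start == null) {
            start = ring.firstKey(); // wrap around the ring
        }
        for (Map.Entry<Integer, String> e : walkFrom(start)) {
            if (!owners.contains(e.getValue())) {
                owners.add(e.getValue());
            }
            if (owners.size() == replicas) {
                break;
            }
        }
        return owners;
    }

    // Visit every ring position once, starting clockwise from 'start'.
    private List<Map.Entry<Integer, String>> walkFrom(int start) {
        List<Map.Entry<Integer, String>> all = new ArrayList<>();
        all.addAll(ring.tailMap(start, true).entrySet());
        all.addAll(ring.headMap(start, false).entrySet());
        return all;
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }
}
```

The appeal for an unreliable network is that adding or removing a node only remaps the keys adjacent to it on the ring, with no cluster-wide consensus round.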

But you can also just implement a gossip protocol with some form of conflict resolution. Gossip protocols are comparatively easy to implement, and depending on your requirements conflict resolution can be easy or may be difficult. For example, a gossip protocol with vector clocks for partial ordering of updates can be fairly easy to implement and result in a highly available system. The simplest form of conflict resolution would be to simply write updates with timestamps and use a last-write-wins policy. In that case, the gossip protocol would simply share the latest update/timestamp tuple and the node receiving it would choose the greatest timestamp. But there are all sorts of issues with using timestamps in distributed systems. Lamport clocks or vector clocks would do effectively the same thing, but with a mechanism that doesn't require clock synchronization (which I assume you can't rely on anyways).
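A last-write-wins register keyed by a Lamport clock rather than wall-clock time might look like this (illustrative sketch; ties on the counter are broken by node id so merges stay deterministic):

```java
// Last-write-wins register stamped with a Lamport counter instead of a wall
// clock, so convergence does not depend on synchronized device clocks.
final class LwwRegister {
    private long counter;       // Lamport counter of the current value
    private String nodeId = ""; // id of the writer, used as a tie-breaker
    private String value;

    // Local write: advance the local Lamport counter.
    synchronized void set(String value, String nodeId) {
        counter += 1;
        this.nodeId = nodeId;
        this.value = value;
    }

    // Gossip merge: keep the update with the higher (counter, nodeId) pair,
    // and advance our counter past anything we have seen (the Lamport rule).
    synchronized void merge(long otherCounter, String otherNode, String otherValue) {
        if (otherCounter > counter
                || (otherCounter == counter && otherNode.compareTo(nodeId) > 0)) {
            value = otherValue;
            nodeId = otherNode;
        }
        counter = Math.max(counter, otherCounter);
    }

    synchronized String get() { return value; }
    synchronized long counter() { return counter; }
    synchronized String node() { return nodeId; }
}
```

Two replicas that gossip their (counter, nodeId, value) tuples to each other end up holding the same value regardless of delivery order, which is the convergence property the paragraph above relies on.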

This Quora answer covers a lot of these topics and might give you some ideas: https://www.quora.com/What-are-the-seminal-papers-in-distributed-systems-Why

Lamport's paper "Time, Clocks, and the Ordering of Events in a Distributed System" is one of the foundational papers in distributed systems, is easy to understand, and provides the foundation for vector clocks. Aside from that, perhaps the reading list can help you explore.

Kyle Kingsbury's blog post on consistency models is a great explanation of some of the most common strong consistency models, but weak consistency models don't seem to be covered:

Luca Lovagnini

unread,
Apr 6, 2016, 12:14:22 AM4/6/16
to raft...@googlegroups.com
Jordan, you already helped me when I tried to implement a Raft version in Java for my school project, and now you've helped me so much with this answer. What can I say? Thanks so much! I'll update this post with any further questions.

Best Regards,
Luca Lovagnini

jordan.h...@gmail.com

unread,
Apr 6, 2016, 12:30:10 AM4/6/16
to raft...@googlegroups.com
Happy to help :-)

Suman Bhunia

unread,
Jul 17, 2019, 12:06:21 PM7/17/19
to raft-dev
Hi Luca,

I saw your post while searching for a suitable Raft implementation for mobile phones. I hope you have already found a suitable solution. Which one did you use? In my project I want to run consensus on Android devices (20-30 nodes at a time). These nodes can get disconnected from the main cluster. Can you suggest a good implementation of Raft for this?

Thanks,
Suman
