Cluster failure tolerance


Kai Yu

Apr 26, 2015, 6:21:13 AM
to akka...@googlegroups.com
Hi group,

I am working on an Akka cluster with four nodes. Each node has different functionality, but all four are peers in the cluster: their configuration is nearly identical (apart from host and port), and every node is a seed node. My configuration file looks something like this:

cluster-conf.akka {
    log-dead-letters-during-shutdown = false

    actor.provider = "akka.cluster.ClusterActorRefProvider"

    remote {
        netty.tcp {
            hostname = ${rep.ep.httpd-game-1.int_ip}
        }

        watch-failure-detector.acceptable-heartbeat-pause = 15 s
    }

    cluster {
        seed-nodes = [
            "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.httpd-game-1.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
            "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.httpd-game-2.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
            "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.httpd-sso.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
            "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.misc.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
        ]

        auto-down-unreachable-after = 10s

        metrics.native-library-extract-folder=${user.dir}/target/native
    }
}


Under normal conditions it works fine. My goal now is maximum failure tolerance. Specifically, I want a node to rejoin the cluster automatically when:

1. it crashes and is restarted automatically by a daemon program,
2. the network fails for a short time and then recovers (for example, the NIC used for clustering with the other nodes fails).

For 1, my setup works when a node crashes and rejoins the cluster after a while. But if it crashes and restarts too quickly (before auto-down-unreachable-after elapses), the cluster somehow scatters, i.e. all nodes are removed from the cluster and become isolated. Am I doing something wrong? How can I fix that other than by adding a delay to my daemon program?

For 2, if the network recovers only after auto-down-unreachable-after has elapsed, the node can no longer rejoin without manual intervention. Can someone shed some light on how to make the cluster automatically down the node when it tries to rejoin in such a situation?

Any other suggestions regarding fault tolerance in a cluster setup are welcome. Thanks in advance.
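
For reference, the timing settings discussed above live in two places. As far as I understand the Akka 2.3 reference configuration, reachability of cluster members is decided by akka.cluster.failure-detector, whereas the remote.watch-failure-detector block in the configuration above applies to remote death watch. A minimal sketch, assuming the same cluster-conf.akka root; the values are illustrative assumptions, not recommendations:

```
cluster-conf.akka {
    cluster {
        failure-detector {
            # Assumed value: tolerate longer heartbeat gaps (GC pauses, brief
            # network hiccups) before a member is marked unreachable.
            acceptable-heartbeat-pause = 5 s
        }

        # Assumed value: how long a member may stay unreachable before the
        # leader downs it automatically.
        auto-down-unreachable-after = 30s
    }
}
```

A longer auto-down window gives a briefly disconnected node more time to become reachable again, at the cost of slower removal of nodes that are really gone.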

Kai Yu

Apr 26, 2015, 11:25:30 PM
to akka...@googlegroups.com
If I choose not to auto-down unreachable nodes, the cluster seems to cope well with case 2. However, if a node crashes and is brought up again, the cluster ignores its attempt to rejoin because it thinks the member still exists, and the docs say that a node must be removed before it can join the cluster again. What events can the cluster use to detect a rejoin like that?
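
The variant without automatic downing could be written as the fragment below. This is a sketch that assumes the same cluster-conf.akka root as the original configuration; off is, to my knowledge, the default in Akka 2.3, so simply omitting the setting should have the same effect.

```
cluster-conf.akka {
    cluster {
        # Never down unreachable members automatically; an unreachable
        # member stays in the cluster until it becomes reachable again
        # or is downed explicitly.
        auto-down-unreachable-after = off
    }
}
```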

Kai Yu

Apr 28, 2015, 11:43:48 PM
to akka...@googlegroups.com
* BUMP *

Kai Yu

Apr 30, 2015, 3:36:09 AM
to akka...@googlegroups.com
It turned out that the issues I encountered are specific to version 2.3.8. Everything works as expected in 2.3.10.

Konrad Malawski

Apr 30, 2015, 3:48:57 AM
to akka...@googlegroups.com, Kai Yu
Thanks for reporting!
Yes, we did include some bug fixes in those areas in the 2.3.9 and 2.3.10 releases. Glad it helps!

-- 
Cheers,
Konrad 'ktoso' Malawski
Akka @ Typesafe
