Cluster inconsistency when using AWS EC2 autoscale groups ('No route to host' not triggering unreachable state)


molo...@ajax.systems

Sep 27, 2016, 5:52:48 AM
to Akka User List
I have some stateless actors running on nodes in an AWS EC2 autoscaling group. When load rises, a new node is spawned; when the load subsides, the extra servers are terminated. The termination is not graceful, though: it is like switching the power off.
In this situation the cluster leader can't detect that the node is unreachable. For some reason, an error like


Association with remote system [akka.tcp://Cluste...@10.10.23.240:2551] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://Cluste...@10.10.23.240:2551]] Caused by: [No route to host]
does not count as evidence of unreachability.
I am using auto-down-unreachable-after = 20s. Normally, even if I kill the Java process, the leader detects the member as unreachable.
I can't detect the problem from inside the application, because the member list looks fine:

VISIBLE MEMBERS: TreeSet(Member(address = akka.tcp://Cluste...@10.10.21.190:2551, status = WeaklyUp), Member(address = akka.tcp://Cluste...@10.10.22.96:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.145:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.180:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.181:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.182:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.240:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.241:2551, status = WeaklyUp), Member(address = akka.tcp://Cluste...@10.10.24.62:2551, status = Up))

The node that is actually unreachable, 10.10.23.240, which was terminated, is still in the 'Up' state.
So I have a cluster that can't accept new members because it contains gated members. I don't want to rely on the 'WeaklyUp' state, because it solves only part of the problem: new members would never be moved to 'Up'.

What should I do in this situation? The only option I see is to restart the whole cluster, but that is the last thing I want to do. I need a more permanent solution than restarting the cluster every time.



molo...@ajax.systems

Sep 27, 2016, 7:20:02 AM
to Akka User List
Akka version:
        <dependency>
            <groupId>com.typesafe.akka</groupId>
            <artifactId>akka-cluster_2.11</artifactId>
            <version>2.4.7</version>
        </dependency>

molo...@ajax.systems

Sep 27, 2016, 9:24:52 AM
to Akka User List
I also overrode these values in the failure detector:
failure-detector {
  threshold = 16
  acceptable-heartbeat-pause = 10s
  heartbeat-interval = 3s
  # expected-response-after = 20s
  # (this is the time allowed for receiving the first gossip from anyone; otherwise the node is detected as broken,
  #  see https://groups.google.com/forum/#!topic/akka-user/2JKtDw7dcJs)
} # failure-detector
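
For readers trying to see how these three settings interact, here is a rough sketch of a phi accrual failure detector, the heartbeat-based kind that Akka Cluster documents for detecting unreachable members. It is not Akka's code: the class name PhiSketch, the helper secondsUntilSuspected, the logistic approximation of the normal distribution, and the fixed 100 ms standard deviation are all assumptions made for illustration. Only the values 3s, 10s and 16 come from the post above.

```java
// Simplified, illustrative phi accrual calculation. Names and constants are
// assumptions for this sketch; this is not the library's implementation.
final class PhiSketch {

    /**
     * Suspicion level for a node whose last heartbeat arrived elapsedMillis ago.
     * meanMillis is the expected heartbeat interval plus the acceptable pause.
     * Uses a logistic approximation of the normal distribution's tail.
     */
    static double phi(long elapsedMillis, double meanMillis, double stdDevMillis) {
        double y = (elapsedMillis - meanMillis) / stdDevMillis;
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        if (elapsedMillis > meanMillis) {
            return -Math.log10(e / (1.0 + e));
        }
        return -Math.log10(1.0 - 1.0 / (1.0 + e));
    }

    /**
     * First whole second of heartbeat silence at which phi exceeds the
     * threshold, or -1 if that never happens within limitSeconds.
     */
    static long secondsUntilSuspected(double threshold, double meanMillis,
                                      double stdDevMillis, long limitSeconds) {
        for (long s = 0; s <= limitSeconds; s++) {
            if (phi(s * 1000L, meanMillis, stdDevMillis) > threshold) {
                return s;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double heartbeatIntervalMillis = 3_000;  // heartbeat-interval = 3s (from the post)
        double acceptablePauseMillis = 10_000;   // acceptable-heartbeat-pause = 10s (from the post)
        double threshold = 16;                   // threshold = 16 (from the post)
        double stdDevMillis = 100;               // assumed: heartbeats arrive very regularly

        double mean = heartbeatIntervalMillis + acceptablePauseMillis;
        for (long s = 10; s <= 16; s += 2) {
            System.out.printf("silence %2ds -> phi %.2f%n", s, phi(s * 1000L, mean, stdDevMillis));
        }
        System.out.println("suspected after (s): "
                + secondsUntilSuspected(threshold, mean, stdDevMillis, 60));
    }
}
```

Under these assumptions the suspicion value stays near zero until the silence exceeds the heartbeat interval plus the acceptable pause, then climbs steeply past the threshold. A longer pause therefore delays detection of a dead node roughly one-for-one, and the auto-down timer only starts after that.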

Patrik Nordwall

Oct 2, 2016, 2:40:46 AM
to akka...@googlegroups.com
The cluster is expected to detect crashed/stopped nodes as unreachable. Once the unreachable nodes have been downed and removed, the cluster can move joining nodes to Up. I'm not sure I understand the question/problem.

Also, don't change heartbeat-interval. That will not make anything better.
acceptable-heartbeat-pause is what you should increase if needed.

/Patrik
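
A minimal sketch of what that advice could look like in configuration. It assumes the failure-detector block sits under akka.cluster, since the original post does not show the enclosing path. The threshold and pause values are simply carried over from the earlier message, not recommendations.

```
akka.cluster {
  failure-detector {
    # heartbeat-interval is deliberately not set, so the library default applies
    threshold = 16                     # carried over from the earlier post
    acceptable-heartbeat-pause = 10s   # the setting to raise if needed
  }
}
```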
