I have some stateless actors running on nodes in an AWS EC2 Auto Scaling group. That means that when the load rises, a new node is spawned, and when
the load subsides, the extra servers are terminated. The termination is not graceful, though; it is like switching the power off.
In this situation the cluster leader can't detect that the node is unreachable. For some reason, an error like
Association with remote system [akka.tcp://Cluste...@10.10.23.240:2551] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://Cluste...@10.10.23.240:2551]] Caused by: [No route to host]
does not count as evidence of unreachability.
I am using auto-down-unreachable-after = 20s, and normally, even if I kill the Java process, the leader detects that the member is unreachable.
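For reference, this is roughly where that setting sits in the configuration. It is a minimal sketch: only the 20s value is taken from my setup, the nesting under akka.cluster is the standard location for this key, and all other settings are omitted here.

```
akka {
  cluster {
    # Automatically mark an unreachable member as Down after this period.
    # The 20s value is the one mentioned above; other settings are omitted.
    auto-down-unreachable-after = 20s
  }
}
```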
I can't detect the problem from inside the application, because the member list looks fine:
VISIBLE MEMBERS: TreeSet(Member(address = akka.tcp://Cluste...@10.10.21.190:2551, status = WeaklyUp), Member(address = akka.tcp://Cluste...@10.10.22.96:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.145:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.180:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.181:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.182:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.240:2551, status = Up), Member(address = akka.tcp://Cluste...@10.10.23.241:2551, status = WeaklyUp), Member(address = akka.tcp://Cluste...@10.10.24.62:2551, status = Up))
The node 10.10.23.240, which was terminated and is actually unreachable, is still in the 'Up' state.
So I have a cluster that can't accept new members because it contains gated members. I don't want to rely on the 'Weakly up' state, because that would solve only part of the problem: new members would never be moved to the 'Up' state.
What should I do in this situation? The only option I see is to restart the whole cluster, but that is the last thing I want to do. I need a more permanent solution instead of restarting the cluster every time.