Akka Cluster ⇒ AssociationError Error [Invalid address]


Eugene Dzhurinsky

Jul 16, 2015, 11:56:10 AM
to akka...@googlegroups.com
Hello!

Recently I updated Akka from 2.3.9 to 2.3.11, and for some reason my cluster started to fall apart. From time to time I'm getting errors like this:

INFO   | jvm 1    | 2015/07/16 11:45:39 | 2015-07-16 16:45:39,369 ERROR  [EndpointWriter] AssociationError [akka.tcp://HttpC...@192.168.0.203:2551] -> [akka.tcp://HttpC...@192.168.0.200:2551]: Error [Invalid address: akka.tcp://HttpC...@192.168.0.200:2551] [
INFO   | jvm 1    | 2015/07/16 11:45:39 | akka.remote.InvalidAssociation: Invalid address: akka.tcp://HttpC...@192.168.0.200:2551
INFO   | jvm 1    | 2015/07/16 11:45:39 | Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.
INFO   | jvm 1    | 2015/07/16 11:45:40 | 2015-07-16 16:45:40,526 WARN   [ReliableDeliverySupervisor] Association with remote system [akka.tcp://HttpCluster@192.168.0.202:2551] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
INFO   | jvm 1    | 2015/07/16 11:45:40 | 2015-07-16 16:45:40,543 WARN   [EndpointWriter] AssociationError [akka.tcp://HttpC...@192.168.0.203:2551] -> [akka.tcp://HttpC...@192.168.0.204:2551]: Error [Invalid address: akka.tcp://HttpC...@192.168.0.204:2551]

I don't see any suspicious activity in the logs, like connection resets or other network errors; it just fails. The cluster-specific configuration looks like this:

    cluster {
        auto-down-unreachable-after = 10s

        failure-detector {
            threshold = 10
            heartbeat-interval = 10s
            acceptable-heartbeat-pause = 30 s
        }

        role {
            scheduler.min-nr-of-members = 1
            chunk.min-nr-of-members = 1
            http.min-nr-of-members = 1
        }
    }



Can somebody please advise how I can troubleshoot this problem? Or at least, how can I intercept that cluster error and restart the cluster node that failed?

Thank you!

Eugene Dzhurinsky

Jul 17, 2015, 9:54:46 PM
to akka...@googlegroups.com
I did some experiments, varying the number of nodes in the cluster, and realized that this error always happens with 6 nodes under heavy load. It seems that if a single node cannot communicate with another node in the cluster, that may lead to the "unreachable" state, and then the node falls off the cluster.

I increased the heartbeat interval to 60 seconds, but that didn't help.

I enabled some GC logging, and I haven't seen any major pauses there; even a full GC takes around 1 second at most.

Is it possible that there are network errors? How can I find out what Invalid address: akka.tcp://HttpC...@192.168.0.205:2551 actually means?

Thanks!

Viktor Klang

Jul 18, 2015, 2:57:39 AM
to Akka User List

Hi Eugene,

I assume you've read the following, but in case you didn't:

http://doc.akka.io/docs/akka/2.3.12/scala/cluster-usage.html#Failure_Detector

--
Cheers,


Eugene Dzhurinsky

Jul 18, 2015, 12:52:30 PM
to akka...@googlegroups.com
Yes, I've read that, and I now think I may be facing network issues. I decreased the number of actors on each node to 5 with a round-robin pool, and that seems to solve the problem - since last night no node has been marked as failed.
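
For reference, the kind of deployment I mean looks roughly like this (a sketch; the /workers path and the exact limits are illustrative, not my literal config):

    akka.actor.deployment {
      # illustrative path and limits; the real names differ
      /workers {
        router = round-robin-pool
        nr-of-instances = 5
        cluster {
          enabled = on
          max-nr-of-instances-per-node = 5
          allow-local-routees = on
        }
      }
    }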

The cluster is deployed on 7 nodes (3 Raspberry Pi 2 ARMv7 and 4 Raspberry Pi B ARMv6), so there definitely could be glitches in the network stack.

To mitigate the problem when the cluster loses 40% of its nodes to some network error: is there any way to detect that the current node has been kicked out of the cluster? Any event to listen on?

I'd like to be able to restart the actor system on such an event. My nodes are totally stateless, so it does no harm to restart them as many times as needed.
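
Something like this rough sketch is what I have in mind (untested, and I'm assuming the removed node still receives the MemberRemoved event for its own address before it is shut out; the actor name is made up):

    import akka.actor.Actor
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent.MemberRemoved

    // Sketch: shut down the ActorSystem when this node is removed from
    // the cluster, so an external supervisor (e.g. the service wrapper)
    // can start a fresh JVM that rejoins.
    class ClusterExitWatcher extends Actor {
      val cluster = Cluster(context.system)

      override def preStart(): Unit = cluster.subscribe(self, classOf[MemberRemoved])
      override def postStop(): Unit = cluster.unsubscribe(self)

      def receive = {
        case MemberRemoved(member, _) if member.address == cluster.selfAddress =>
          // Once the rest of the cluster has removed (or quarantined) us,
          // restarting the process is the only way back in.
          context.system.shutdown()
        case _ => // ignore CurrentClusterState and other members' events
      }
    }

It would be started once per node, e.g. system.actorOf(Props[ClusterExitWatcher], "exit-watcher").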

Please advise.

Thanks!

Tom Pantelis

Jul 24, 2015, 3:34:09 AM
to Akka User List, jdev...@gmail.com
I just posted a similar question. I want to know in code when a node is quarantined, so we can auto-restart.

The node gets quarantined due to auto-down, so you can bump up auto-down-unreachable-after or just disable it. If your cluster is mostly static and you don't commonly add new nodes, then disabling it is probably fine.
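
For example (a sketch of the relevant setting; off is the default and disables automatic downing entirely):

    akka.cluster {
      # give unreachable nodes more time to recover before downing them...
      auto-down-unreachable-after = 120s

      # ...or turn automatic downing off altogether (the default):
      # auto-down-unreachable-after = off
    }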

Sandeep Raja Rao

Apr 23, 2019, 8:53:05 AM
to Akka User List
Even with the auto-down configuration disabled, the same issue occurred recently in a three-node cluster.

2019-04-04 22:44:33,205 | ult-dispatcher-3 | Remoting                         | Tried to associate with unreachable remote address [akka.tcp://opendaylight...@x.x.x.155:2550]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]

(.154, .155 and .156 were the last octets of the cluster node IPs.) The above was observed on .156; until we restarted .156, node .155 was not able to join the cluster.

Since it is not recommended to enable auto-down (per a few other posts), I wanted to know if there are any other configuration settings worth looking into.

The current parallelism configuration is as follows:

     fork-join-executor {
           parallelism-min = 2
           parallelism-max = 4
     }
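
For context, that block sits under a dispatcher configuration, presumably something like the following (the surrounding keys are my reconstruction, not the exact file):

    akka.actor.default-dispatcher {
      executor = "fork-join-executor"
      fork-join-executor {
        parallelism-min = 2
        parallelism-max = 4
      }
    }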
