Node quarantined

Benjamin Black

Mar 22, 2016, 1:34:23 PM
to Akka User List
Hello,

I'm trying to understand what causes nodes to be quarantined and how to fix it. I'm using Akka 2.3.11. On the quarantined node I see this logging:

12:45:44.204 ERROR [geyser-akka.remote.default-remote-dispatcher-6] a.r.EndpointWriter - AssociationError [akka.tcp://gey...@172.16.120.174:7000] <- [akka.tcp://gey...@172.17.100.105:7000]: Error [Invalid address: akka.tcp://gey...@172.17.100.105:7000] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://gey...@172.17.100.105:7000
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.
]
12:45:44.205 WARN  [geyser-akka.remote.default-remote-dispatcher-25] Remoting - Tried to associate with unreachable remote address [akka.tcp://gey...@172.17.100.105:7000]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]

And on the node that caused the other node to be quarantined, I see this logging:

12:45:44.194 WARN  [geyser-akka.remote.default-remote-dispatcher-6] Remoting - Association to [akka.tcp://gey...@172.16.120.174:7000] having UID [-450748474] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
12:45:44.202 WARN  [geyser-akka.remote.default-remote-dispatcher-7] a.r.EndpointWriter - AssociationError [akka.tcp://gey...@172.17.100.105:7000] -> [akka.tcp://gey...@172.16.120.174:7000]: Error [Invalid address: akka.tcp://gey...@172.16.120.174:7000] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://gey...@172.16.120.174:7000
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has a UID that has been quarantined. Association aborted.
]
12:45:44.203 WARN  [geyser-akka.remote.default-remote-dispatcher-7] Remoting - Tried to associate with unreachable remote address [akka.tcp://gey...@172.16.120.174:7000]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.]
12:45:44.221 ERROR [geyser-akka.remote.default-remote-dispatcher-7] Remoting - Association to [akka.tcp://gey...@172.16.120.174:7000] with UID [-450748474] irrecoverably failed. Quarantining address.
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [-1 {}] ack: ACK[6, {}]
        at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:288) ~[geyser.jar:1.1.17-SNAPSHOT]
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467) ~[geyser.jar:1.1.17-SNAPSHOT]
Caused by: java.lang.IllegalArgumentException: Highest SEQ so far was -1 but cumulative ACK is 6
        at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:103) ~[geyser.jar:1.1.17-SNAPSHOT]
        at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:284) ~[geyser.jar:1.1.17-SNAPSHOT]
        ... 11 common frames omitted
12:45:44.221 WARN  [geyser-akka.remote.default-remote-dispatcher-7] Remoting - Association to [akka.tcp://gey...@172.16.120.174:7000] having UID [-450748474] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

Quite a lot of data passes between the nodes (~200 Mb/sec), so the system may be hitting a capacity limit, although I don't see any problem with CPU or memory. I noticed that the default-remote-dispatcher has only two threads. Are these the threads used to send the data? If so, should I try increasing the thread count? Are there any other settings I could play with, or things I could look for in the logs, that might highlight what is wrong?

Thanks,
Ben
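
For reference, the thread pool asked about above is defined in the remoting section of Akka's configuration. The fragment below is a minimal sketch of enlarging it, assuming the Akka 2.3 layout in which `akka.remote.default-remote-dispatcher` is a fork-join dispatcher capped at two threads. The value 4 is purely illustrative, and nothing in this thread establishes that a larger pool prevents quarantining.

```
akka.remote.default-remote-dispatcher {
  fork-join-executor {
    # Illustrative values only; the 2.3 defaults are assumed to be 2 and 2.
    parallelism-min = 4
    parallelism-max = 4
  }
}
```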

Guido Medina

Mar 22, 2016, 2:00:15 PM
to Akka User List
To eliminate noise, please update to 2.3.14, which has some cluster fixes relative to 2.3.11. There are also several fixes in Scala 2.11.8 (not related).

I don't know whether it will help; I just make a habit of keeping my libraries up to date.

HTH,

Guido.

Benjamin Black

Mar 22, 2016, 5:22:00 PM
to Akka User List
I see the same issue with 2.3.14.

Guido Medina

Mar 22, 2016, 6:23:14 PM
to Akka User List
Hi Benjamin,

You have nodes with predefined ports. What eliminates that problem for my nodes is that only my seed node(s) have the port set; the rest get a dynamically assigned free port, so each one comes back on a different port after a rolling restart.

I suspect you are doing a rolling restart, right? If so, you need to wait for the node with that address to leave the cluster completely (I'm doing that as well): basically, you terminate your system when you receive the MemberRemoved message for the _self_ address.

I think I saw a discussion about nodes being quarantined when they rejoin under the same address; I'm not sure whether it was here or in an actual GitHub ticket.

HTH,

Guido.
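
The "terminate only once your own removal is confirmed" rule described above can be modelled in plain Java. This is a toy sketch, not Akka code: the class name, the string addresses and the `Runnable` shutdown hook are all hypothetical. In a real Akka application the notification would be the cluster's MemberRemoved event and the action would stop the actor system.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Toy model of "shut down only after the cluster confirms our own removal".
 * All names are hypothetical; this does not use or mirror any Akka API.
 */
final class SelfRemovalWatcher {
    private final String selfAddress;
    private final Runnable shutdownAction;
    private final AtomicBoolean shutDown = new AtomicBoolean(false);

    SelfRemovalWatcher(String selfAddress, Runnable shutdownAction) {
        this.selfAddress = selfAddress;
        this.shutdownAction = shutdownAction;
    }

    /** Call for every "member removed" notification; returns true if it triggered shutdown. */
    boolean onMemberRemoved(String removedAddress) {
        // Ignore removals of other members and repeated notifications about ourselves.
        if (selfAddress.equals(removedAddress) && shutDown.compareAndSet(false, true)) {
            shutdownAction.run();
            return true;
        }
        return false;
    }

    boolean isShutDown() {
        return shutDown.get();
    }
}
```

The point of the guard is that a node leaving during a rolling restart keeps running until the rest of the cluster has agreed it is gone, so a replacement on the same host and port does not race with it.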

Benjamin Black

Mar 22, 2016, 6:27:26 PM
to Akka User List
Hi Guido, yes, I'm aware of the leaving-cluster conversation, as I started it :-) This is a separate issue. I am observing this behavior while the cluster seems stable, with no nodes being added or removed. I suspect the issue first appeared when I upgraded a different library that brought in a new version of the Netty library.

Guido Medina

Mar 22, 2016, 6:38:56 PM
to Akka User List
Yeah, sorry, I thought it was related to a rolling restart.

As for Netty, I'm using a not-yet-published Netty build with the following fixes:

You can just get it from Git and:

$ git checkout 3.10
$ mvn versions:set -DnewVersion=3.10.6.Final -DgenerateBackupPoms=false
$ mvn clean install

And see if your problem goes away,

Guido.

Patrik Nordwall

Mar 23, 2016, 2:08:10 AM
to Akka User List
We have fixed the issue that shows up as
"Error encountered while processing system message acknowledgement buffer: [-1 {}] ack: ACK[6, {}]"

https://github.com/akka/akka/pull/20093

It will be released in 2.4.3 and 2.3.15, probably by end of next week.

/Patrik
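
The exception text in the logs points at a consistency check between the highest sequence number a sender has buffered and the cumulative ACK it receives. The Java class below is a toy model of such a check, as a sketch only: it is hypothetical, not Akka's implementation, and it says nothing about why the mismatch occurred (the linked pull request is the place to look for that). It reproduces the logged message for a fresh buffer, whose highest SEQ is -1, receiving ACK 6.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Toy send buffer with cumulative acknowledgements. Hypothetical class, not
 * Akka's AckedSendBuffer; it only mirrors the sanity check seen in the log.
 */
final class ToyAckedSendBuffer {
    private final Deque<Long> unacked = new ArrayDeque<Long>();
    private long maxSeq = -1L; // -1 means nothing has been buffered yet

    /** Buffer an outgoing message; sequence numbers must strictly increase. */
    void buffer(long seq) {
        if (seq <= maxSeq) {
            throw new IllegalArgumentException("Sequence number must increase: " + seq);
        }
        unacked.addLast(seq);
        maxSeq = seq;
    }

    /** Apply a cumulative ACK: everything up to and including cumulativeAck is confirmed. */
    void acknowledge(long cumulativeAck) {
        if (cumulativeAck > maxSeq) {
            // The receiver is confirming something this buffer never sent.
            throw new IllegalArgumentException(
                "Highest SEQ so far was " + maxSeq + " but cumulative ACK is " + cumulativeAck);
        }
        while (!unacked.isEmpty() && unacked.peekFirst() <= cumulativeAck) {
            unacked.removeFirst();
        }
    }

    /** Number of buffered messages not yet acknowledged. */
    int pending() {
        return unacked.size();
    }
}
```

In this model the error can only occur when the sender's and receiver's views of the sequence have diverged, for example when an ACK reaches a buffer that was created after the acknowledged messages were sent.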

Guido Medina

Mar 23, 2016, 5:38:44 AM
to Akka User List
Hi Benjamin,

From what I could understand of the issue, this happens only to nodes that rejoin
the cluster under the same address (host and port), so I believe that setting

akka.remote.netty.tcp.port = 0

should solve the problem in the meantime,

Cheers,

Guido.

Guido Medina

Mar 23, 2016, 5:39:38 AM
to Akka User List
Correction: Set that only for non-seed nodes.
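
Put together, the suggestion amounts to a per-role configuration along these lines. This is a sketch only: the system name `my-system` and the host `seed-host` are placeholders, and port 7000 is taken from the addresses in the logs above.

```
# Non-seed nodes: let the OS pick a free port on every start
akka.remote.netty.tcp.port = 0

# Seed node(s) only: keep a fixed, well-known port instead
# akka.remote.netty.tcp.port = 7000

# All nodes: point at the seed node(s); name and host are placeholders
akka.cluster.seed-nodes = [
  "akka.tcp://my-system@seed-host:7000"
]
```

With a dynamic port, a restarted non-seed node comes back under a new address, so it does not rejoin under the same host and port as its previous incarnation.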

Benjamin Black

Mar 23, 2016, 9:33:02 AM
to Akka User List
I look forward to trying out the new version. I'm not totally sure it is the same issue, since I'm seeing this happen on a cluster where no node is being restarted. I shall continue to investigate what has changed on my side, because I wasn't seeing this before I upgraded other libraries.

Benjamin Black

Apr 28, 2016, 2:22:42 PM
to Akka User List
I'm following up on this topic after upgrading to Akka 2.3.15. I'm reasonably confident that the issue is the result of using Akka along with another library that causes the Netty dependency to be upgraded from 3.9.2.Final to 3.10.0.Final. For now I have removed the dependency on the newer version of Netty, but I thought I'd report what I was seeing in the logs. I ran five nodes for a few hours with no issue, and then two nodes fell out of the cluster. Here are the logs from each node:

IP: 160
13:59:57.252 INFO  [geyser-akka.actor.default-dispatcher-6] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://gey...@172.16.119.42:7000, status = Up)|Size:4)
13:59:58.541 INFO  [geyser-akka.actor.default-dispatcher-306] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://gey...@172.16.125.13:7000, status = Up)|Size:3)
14:00:11.540 INFO  [geyser-akka.actor.default-dispatcher-282] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://gey...@172.16.119.42:7000, status = Removed)|Size:3)
14:00:11.541 INFO  [geyser-akka.actor.default-dispatcher-282] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://gey...@172.16.125.13:7000, status = Removed)|Size:3)
14:00:11.545 WARN  [geyser-akka.remote.default-remote-dispatcher-8] Remoting - Association to [akka.tcp://gey...@172.16.119.42:7000] having UID [-477546934] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:11.546 WARN  [geyser-akka.remote.default-remote-dispatcher-8] Remoting - Association to [akka.tcp://gey...@172.16.125.13:7000] having UID [-1471771858] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

IP: 42
13:59:57.326 WARN  [geyser-cluster-dispatcher-15] a.c.ClusterCoreDaemon - Cluster Node [akka.tcp://gey...@172.16.119.42:7000] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://gey...@172.16.125.13:7000, status = Up)]
13:59:57.328 INFO  [geyser-akka.actor.default-dispatcher-46] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://gey...@172.16.125.13:7000, status = Up)|Size:4)
14:00:07.345 INFO  [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.42:7000] - Leader is auto-downing unreachable node [akka.tcp://gey...@172.16.125.13:7000]
14:00:07.346 INFO  [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.42:7000] - Marking unreachable node [akka.tcp://gey...@172.16.125.13:7000] as [Down]
14:00:07.694 INFO  [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.42:7000] - Shutting down...
14:00:07.695 INFO  [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.42:7000] - Successfully shut down
14:00:07.703 WARN  [geyser-akka.remote.default-remote-dispatcher-27] Remoting - Association to [akka.tcp://gey...@172.16.125.13:7000] having UID [-1471771858] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:10.360 WARN  [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://gey...@172.16.119.46:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:11.361 WARN  [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://gey...@172.17.110.139:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:11.544 WARN  [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://gey...@172.16.120.160:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

IP: 13
13:59:57.244 WARN  [geyser-cluster-dispatcher-17] a.c.ClusterCoreDaemon - Cluster Node [akka.tcp://gey...@172.16.125.13:7000] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://gey...@172.16.119.42:7000, status = Up)]
13:59:57.245 INFO  [geyser-akka.actor.default-dispatcher-61] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://gey...@172.16.119.42:7000, status = Up)|Size:4)
13:59:57.326 INFO  [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.125.13:7000] - Ignoring received gossip status from unreachable [UniqueAddress(akka.tcp://gey...@172.16.119.42:7000,-477546934)]
14:00:07.711 WARN  [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://gey...@172.16.119.42:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:09.243 INFO  [geyser-cluster-dispatcher-17] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.125.13:7000] - Shutting down...
14:00:09.246 INFO  [geyser-cluster-dispatcher-17] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.125.13:7000] - Successfully shut down
14:00:09.253 WARN  [geyser-akka.remote.default-remote-dispatcher-7] Remoting - Association to [akka.tcp://gey...@172.16.119.42:7000] having UID [-477546934] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:10.361 WARN  [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://gey...@172.16.119.46:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:10.394 ERROR [geyser-akka.remote.default-remote-dispatcher-26] a.r.EndpointWriter - AssociationError [akka.tcp://gey...@172.16.125.13:7000] <- [akka.tcp://gey...@172.16.119.46:7000]: Error [Invalid address: akka.tcp://gey...@172.16.119.46:7000] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://gey...@172.16.119.46:7000
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.
]
14:00:10.394 WARN  [geyser-akka.remote.default-remote-dispatcher-26] Remoting - Tried to associate with unreachable remote address [akka.tcp://gey...@172.16.119.46:7000]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
14:00:11.364 WARN  [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://gey...@172.17.110.139:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:11.546 WARN  [geyser-akka.remote.default-remote-dispatcher-26] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://gey...@172.16.120.160:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]


IP: 46
13:59:57.358 INFO  [geyser-akka.actor.default-dispatcher-2] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://gey...@172.16.119.42:7000, status = Up)|Size:4)
13:59:58.329 INFO  [geyser-akka.actor.default-dispatcher-7] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://gey...@172.16.125.13:7000, status = Up)|Size:3)
14:00:07.372 INFO  [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.46:7000] - Leader is auto-downing unreachable node [akka.tcp://gey...@172.16.119.42:7000]
14:00:07.373 INFO  [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.46:7000] - Marking unreachable node [akka.tcp://gey...@172.16.119.42:7000] as [Down]
14:00:08.342 INFO  [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.46:7000] - Leader is auto-downing unreachable node [akka.tcp://gey...@172.16.125.13:7000]
14:00:08.342 INFO  [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.46:7000] - Marking unreachable node [akka.tcp://gey...@172.16.125.13:7000] as [Down]
14:00:10.352 INFO  [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.46:7000] - Leader is removing unreachable node [akka.tcp://gey...@172.16.125.13:7000]
14:00:10.353 INFO  [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://gey...@172.16.119.46:7000] - Leader is removing unreachable node [akka.tcp://gey...@172.16.119.42:7000]
14:00:10.353 INFO  [geyser-akka.actor.default-dispatcher-2] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://gey...@172.16.119.42:7000, status = Removed)|Size:3)
14:00:10.353 INFO  [geyser-akka.actor.default-dispatcher-2] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://gey...@172.16.125.13:7000, status = Removed)|Size:3)
14:00:10.353 INFO  [geyser-akka.actor.default-dispatcher-5] a.c.p.ClusterSingletonManager - Member removed [akka.tcp://gey...@172.16.119.42:7000]
14:00:10.354 INFO  [geyser-akka.actor.default-dispatcher-5] a.c.p.ClusterSingletonManager - Member removed [akka.tcp://gey...@172.16.125.13:7000]
14:00:10.356 WARN  [geyser-akka.remote.default-remote-dispatcher-9] Remoting - Association to [akka.tcp://gey...@172.16.119.42:7000] having UID [-477546934] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:10.356 WARN  [geyser-akka.remote.default-remote-dispatcher-9] Remoting - Association to [akka.tcp://gey...@172.16.125.13:7000] having UID [-1471771858] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:10.385 WARN  [geyser-akka.remote.default-remote-dispatcher-10] a.r.EndpointWriter - AssociationError [akka.tcp://gey...@172.16.119.46:7000] -> [akka.tcp://gey...@172.16.125.13:7000]: Error [Invalid address: akka.tcp://gey...@172.16.125.13:7000] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://gey...@172.16.125.13:7000
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has a UID that has been quarantined. Association aborted.
]
14:00:10.386 INFO  [geyser-akka.remote.default-remote-dispatcher-27] Remoting - Quarantined address [akka.tcp://gey...@172.16.125.13:7000] is still unreachable or has not been restarted. Keeping it quarantined.


IP: 139
13:59:57.544 INFO  [geyser-akka.actor.default-dispatcher-187] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://gey...@172.16.119.42:7000, status = Up)|Size:4)
13:59:58.359 INFO  [geyser-akka.actor.default-dispatcher-178] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://gey...@172.16.125.13:7000, status = Up)|Size:3)
14:00:11.358 INFO  [geyser-akka.actor.default-dispatcher-32] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://gey...@172.16.119.42:7000, status = Removed)|Size:3)
14:00:11.359 INFO  [geyser-akka.actor.default-dispatcher-32] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://gey...@172.16.125.13:7000, status = Removed)|Size:3)
14:00:11.361 WARN  [geyser-akka.remote.default-remote-dispatcher-27] Remoting - Association to [akka.tcp://gey...@172.16.119.42:7000] having UID [-477546934] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:11.361 WARN  [geyser-akka.remote.default-remote-dispatcher-27] Remoting - Association to [akka.tcp://gey...@172.16.125.13:7000] having UID [-1471771858] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

Is there anything abnormal in the logs?

Regards,
Ben

Guido Medina

Apr 28, 2016, 3:13:20 PM
to Akka User List
Hi Ben,

In my experience, Netty 3 doesn't get much love and issues are barely fixed.
As I mentioned before, I'm running my own internally built Netty 3.10.6. Also, 3.10.0 is not even a good version;
if you want, force your version to 3.10.5.Final until they release 3.10.6.Final, which has nice fixes.

or

you could get my branch, set the version to whatever is comfortable for you, and build your own Netty,

plus some minor fixes I added myself. Of interest: there is a race condition fixed in 3.10.6, and
I saw another one between 3.10.0 and 3.10.5, which might be causing the issue you are experiencing.

HTH,

Guido.
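
Forcing a specific Netty 3 version over whatever the transitive dependencies pull in could look like the fragment below. It assumes a Maven build, which the thread does not confirm for Ben's project; other build tools have their own override mechanism. Netty 3.x is published under `io.netty:netty`.

```xml
<!-- Sketch for a Maven pom.xml; pins the transitive Netty 3 version -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
      <version>3.10.5.Final</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```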

Patrik Nordwall

Apr 29, 2016, 3:18:55 PM
to Akka User List
There can be several reasons, but a good start is to use the latest Akka version.

Benjamin Black

Apr 29, 2016, 4:59:49 PM
to Akka User List
This is the latest version of Akka for Java 7.