Hello,
I'm trying to understand why nodes are being quarantined and how to fix it. I'm using Akka 2.3.11. On the quarantined node I see this logging:
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.
]
12:45:44.205 WARN [geyser-akka.remote.default-remote-dispatcher-25] Remoting - Tried to associate with unreachable remote address [akka.tcp://gey...@172.17.100.105:7000]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
And on the node that caused the other node to be quarantined I see this logging:
12:45:44.194 WARN [geyser-akka.remote.default-remote-dispatcher-6] Remoting - Association to [akka.tcp://gey...@172.16.120.174:7000] having UID [-450748474] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has a UID that has been quarantined. Association aborted.
]
12:45:44.203 WARN [geyser-akka.remote.default-remote-dispatcher-7] Remoting - Tried to associate with unreachable remote address [akka.tcp://gey...@172.16.120.174:7000]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.]
12:45:44.221 ERROR [geyser-akka.remote.default-remote-dispatcher-7] Remoting - Association to [akka.tcp://gey...@172.16.120.174:7000] with UID [-450748474] irrecoverably failed. Quarantining address.
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [-1 {}] ack: ACK[6, {}]
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:288) ~[geyser.jar:1.1.17-SNAPSHOT]
at akka.actor.Actor$class.aroundReceive(Actor.scala:467) ~[geyser.jar:1.1.17-SNAPSHOT]
Caused by: java.lang.IllegalArgumentException: Highest SEQ so far was -1 but cumulative ACK is 6
at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:103) ~[geyser.jar:1.1.17-SNAPSHOT]
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:284) ~[geyser.jar:1.1.17-SNAPSHOT]
... 11 common frames omitted
12:45:44.221 WARN [geyser-akka.remote.default-remote-dispatcher-7] Remoting - Association to [akka.tcp://gey...@172.16.120.174:7000] having UID [-450748474] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
Quite a bit of data is passed between the nodes (~200 Mb/sec), so the system may be hitting a capacity limit, although I don't see any issue with CPU or memory. I noticed that the default-remote-dispatcher only has two threads. Are these threads used to send the data? If so, should I try increasing the thread count? Are there any other settings I could adjust, or things I can look for in the logs, that might highlight what is wrong?
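For concreteness, here is a sketch of the kind of override I mean. It assumes the dispatcher is defined under akka.remote.default-remote-dispatcher with a fork-join-executor, as I believe it is in the stock reference.conf for 2.3.x. The value 4 is an arbitrary illustration, not a tested recommendation, and I don't know whether it would have any effect on the quarantining:

```
# application.conf (sketch only, values are assumptions)
akka.remote.default-remote-dispatcher {
  fork-join-executor {
    # The default appears to be 2 for both, which would explain
    # why only two dispatcher threads show up.
    parallelism-min = 4
    parallelism-max = 4
  }
}
```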
Thanks,
Ben