Hi! I've had some problems handling temporary network partitions and
communication problems that cause nodes to get quarantined.
Earlier I found out that I can hit the system message buffer size limit
(https://groups.google.com/d/topic/akka-user/NGLi9GTZ42o/discussion), which causes quarantine. In that case it was due to mass death/creation of remotely watched actors.
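For reference, the buffer mentioned above is controlled by a remoting setting. The fragment below is only a sketch: the value shown is illustrative, not the configuration from the attachment and not a recommendation.

```
akka.remote {
  # Upper bound on buffered, not yet acknowledged system messages per
  # remote association. Hitting this limit is what led to quarantine
  # in the earlier thread. The value here is illustrative only.
  system-message-buffer-size = 1000
}
```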
Now I have run into another case where I get quarantined nodes, this
time not because of excessive load but because of a network partition
combined with the use of publish-subscribe. My use case may not be the
most typical, but let me try to explain it anyway:
I have one node running my own Java Akka process (ServerA) and one node
running Play Framework (ServerB). In ServerA the actor system is named
"myactorsystem" (ActorSystemA). At first I tried to make Play
Framework's internal actor system (ActorSystemB) join the same cluster
as ActorSystemA, but that didn't work because Play's actor system is
apparently named "application" and the name can't be changed. I ended up
creating a separate actor system (ActorSystemC) in my Play application
with the name "myactorsystem" and connecting that to the cluster, which
seemed to work fine. I use WebSockets in Play, and each WebSocket is
represented by an actor in ActorSystemB. This WebSocket actor is sent,
through a cluster-aware router in ActorSystemC, to an actor in
ActorSystemA, which subscribes it to a topic using Publish-Subscribe.
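A minimal sketch of what the configuration for a second, cluster-joined actor system like ActorSystemC could look like is shown below. It is not the attached config: the `myactorsystem` wrapper section, the host names, and the ports are all placeholders made up for illustration.

```
# Hypothetical section in the Play application's application.conf,
# kept separate from Play's own "application" actor system settings.
myactorsystem {
  akka {
    # Cluster-aware actor ref provider (Akka 2.3 class name).
    actor.provider = "akka.cluster.ClusterActorRefProvider"
    remote.netty.tcp {
      hostname = "serverB-host"   # placeholder
      port = 2552                 # placeholder
    }
    cluster.seed-nodes = [
      "akka.tcp://myactorsystem@serverA-host:2551"  # placeholder address of ServerA
    ]
  }
}
```

Under these assumptions the extra system would be started with something like `ActorSystem.create("myactorsystem", ConfigFactory.load().getConfig("myactorsystem"))`, so that its name matches the cluster's actor system name.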
It all runs fine until I introduce a network partition between the
nodes for a few seconds. When I remove the partition, ActorSystemA and
ActorSystemB reconnect to each other fine, but ActorSystemC gets
quarantined by ActorSystemA. I get the following in the log:
[ERROR] [02/11/2015 09:14:33.059] [myactorsystem-akka.remote.default-remote-dispatcher-7] [Remoting] Association to [akka.tcp://application@ip:port] with UID [-1690961114] irrecoverably failed. Quarantining address.
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [-1 {}] ack: ACK[3, {}]
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:287)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:188)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.IllegalArgumentException: Highest SEQ so far was -1 but cumulative ACK is 3
at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:103)
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:283)
... 12 more
[INFO] [02/11/2015 09:14:33.740] [myactorsystem-akka.remote.default-remote-dispatcher-7] [Remoting] Quarantined address [akka.tcp://application@ip2:port2] is still unreachable or has not been restarted. Keeping it quarantined.
Attached is the scaled-down version, with configs, that I used to
reproduce the problem. I used Akka 2.3.9, deployed it on Amazon EC2, and
simulated a network partition using iptables.
In the full system I have also seen the following:
17:45:13.524UTC ERROR[reactor-akka.actor.default-dispatcher-2] Remoting - Association to [akka.tcp://myactorsystem@ip:port] with UID [1330505507] irrecoverably failed. Quarantining address.
java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.
This doesn't seem right. It's fair that I get a buffer overflow when I
put excessive load on the system, but being quarantined after a few
seconds of network partition isn't. This doesn't happen if I don't use
publish-subscribe, which leads me to think that publish-subscribe uses
system messages as well. Is that right? I've also run into quarantine
problems quite a few times now, so I'm having a hard time knowing how I
should tackle them. First I would need to get the system into a state
where quarantine never happens under reasonably normal circumstances.
For the rare cases where it does happen, is there an event I can listen
for so that I can shut down properly or take other appropriate action?
Right now all I get are some log messages.
So far I know that remote death watch, remote actor deployment, and
publish-subscribe use system messages and can therefore cause
quarantine. What else is there?
I'm happy to file a bug report if you think that's appropriate for this case.