akka-remote (non-clustered) quarantines and general best practices

Steven Scott

May 23, 2014, 2:27:46 PM

to akka...@googlegroups.com

I've been slowly migrating an application to Akka since Scala 2.10 came out and pushed me away from Scala actors; sorry for any stupid questions, as I'm always learning.

As a general picture, the application consists of multiple long-lived JVMs communicating over ActiveMQ. The standard deployment is to a single machine, but with multiple services communicating over AMQ for the ability to move specific pieces of functionality to other boxes. As the migration and component rewrites have progressed, I'm solely left with actors communicating with each other over AMQ using akka-camel. The natural next step was to explore akka-remote.

My questions started out as "is this an abuse/unintended usage of akka-remote? Is akka-remote meant to be used outside of akka-cluster? Is it useful for communicating to local JVMs? What about network hiccups for remote JVMs?"

I did as much reading as I could and found that Viktor Klang has said it's useful for transient networks: http://stackoverflow.com/questions/6401500/is-akka-suitable-for-systems-with-transient-network-coverage, and the smoking gun for same-machine inter-JVM communication being an expected use-case was Dr. Kuhn's comment here: http://stackoverflow.com/questions/10268613/whats-the-equivalent-of-akka/11787971#comment13246146_10268748

I went ahead and implemented a decent amount of code for using akka-remote to talk to one of the services after bumping our akka version to 2.3.3, and have to say I'm pleased, especially when comparing to ActiveMQ. Local machine communication is flawless, but once I started testing with remote machines and doing "ifdown eth0; sleep 20; ifup eth0" network disruption tests, I'm left with questions about how to handle quarantines. I looked at reference.conf and heeded the admonition to NOT change the quarantine timeout from 5 days - restarting one of the actor systems is the only alternative.

So - what're the best practices concerning restarting the ActorSystem?

- I'm not clustering - these are a few long-lived "heavy" services, not just nodes spinning up to do small processing tasks

- Our general deployment is not HA, we don't usually have standbys waiting

- Restarting the JVM isn't optimal

* since the services are fairly substantial and there's a non-trivial amount of initialization including database hits to pre-fill caches, restarting the JVM is a possibility (less time than the remoting gate time), but isn't the first route I'd choose

* we (very rarely) run on non-linux platforms and so tend to try to keep stuff in the JVM instead of relying on upstart/launchd/windows services/etc

My only other thought is to run an additional ActorSystem for remoting.

- allows programmatic configuration (our runtime configuration system could change remoting settings and restart the remoting ActorSystem with the new settings)

- a quarantine situation would just require the remoting ActorSystem to be recreated, not a restart of the whole JVM

However, one of the very earliest entries in the Akka documentation states "An ActorSystem is a heavyweight structure that will allocate 1…N Threads, so create one per logical application." I know creating multiple dispatchers in the same ActorSystem is fine, and sometimes (at least historically) a dedicated dispatcher was recommended for some remoting cases; I also know starting a new ActorSystem takes some amount of time to create dispatchers, parse configs, etc; so I'm thinking that the big yellow warning in the documentation is a general guideline for getting started with Akka, not a hard and fast rule.

Sorry for the long post, can anybody give me some guidance on the situation?

Roland Kuhn

Jun 3, 2014, 1:16:17 AM

to akka-user

Hi Steven,

thanks for this write-up, your analysis is thorough and correct on all counts.

Remoting needs to use a simplistic approach to the coroner problem (i.e. when to declare another system “dead”—and zombies are not tolerated), which is mostly just a timeout that you should set high enough to avoid false positives given your expected outages (network and GC). Using a dedicated (minimal) ActorSystem for the remoting should be the optimal solution for your use-case, the overhead is a few hundred milliseconds for starting it up plus its default dispatcher (which then should run the remoting etc.), saving the remoting dispatcher on the heavy local ActorSystem behind it. The added benefit you note is that you can then reconfigure the remoting part of the application at runtime.
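A minimal sketch of such a dedicated remoting system, assuming Akka 2.3.x with the Netty TCP transport. The object name, system name, host, port, and the 30 s heartbeat pause are placeholders chosen for illustration (the pause is picked to outlast the 20 s outage test mentioned earlier), and `watch-failure-detector.acceptable-heartbeat-pause` is only one of the timeouts that may need adjusting:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object RemotingSystem {
  // Only the small system created here enables remoting; the heavy local
  // ActorSystem keeps its default (local) provider.
  val remotingConfig = ConfigFactory.parseString("""
    akka {
      actor.provider = "akka.remote.RemoteActorRefProvider"
      remote {
        enabled-transports = ["akka.remote.netty.tcp"]
        netty.tcp {
          hostname = "127.0.0.1"   # placeholder
          port = 2552              # placeholder
        }
        # Assumed value: tolerate heartbeat gaps longer than the expected outage
        watch-failure-detector.acceptable-heartbeat-pause = 30 s
      }
    }
  """).withFallback(ConfigFactory.load())

  // Hypothetical helper: builds a fresh remoting system from the config above.
  def start(): ActorSystem = ActorSystem("Remoting", remotingConfig)
}
```

Because the settings are built programmatically, a runtime configuration system could produce a different `Config` before each call to `start()`.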

One thing to watch out for is that you don’t accidentally share an ActorRef from the local system in or with a remote message—including sender()—because that can of course not work if the originating system does not have remoting enabled. The symptom would be dropped messages.

Regards,

Roland

Dr. Roland Kuhn
Akka Tech Lead
Typesafe – Reactive apps on the JVM.
twitter: @rolandkuhn

Azad Bolour

Aug 27, 2014, 12:18:03 PM

to akka...@googlegroups.com

I am wondering how one peer in Akka remoting can detect that the other peer has been quarantined, so that it can restart the actor system used for remoting to clear the quarantine. Is there an API call for finding out that Akka has quarantined a peer? Or does the application need to use, say, heartbeats to detect that the peer is unreachable and deduce that it has been quarantined?

Many thanks.

Azad

Akka Team

Aug 28, 2014, 5:50:36 AM

to Akka User List

Hi Azad,

On Wed, Aug 27, 2014 at 6:18 PM, Azad Bolour <azadb...@bolour.com> wrote:

I am wondering how one peer in Akka remoting can detect that the other peer has been quarantined, so that it can restart the actor system used for remoting to clear the quarantine. Is there an API call for finding out that Akka has quarantined a peer? Or does the application need to use, say, heartbeats to detect that the peer is unreachable and deduce that it has been quarantined?

The easiest way is just to use DeathWatch and watch one of the actors on the remote machine. If the other host goes away and gets quarantined you will get a Terminated event. If this actor never stops otherwise (i.e. it only stops when the actor system goes away) then it basically does what you need. If you use clustering though you can just listen to cluster membership events, since that handles all these things for you.
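A minimal sketch of the DeathWatch approach, assuming Akka 2.3.x. `RemoteWatcher`, `PeerLost`, and the `notify` target are hypothetical names; the remote actor is resolved through `actorSelection` plus `Identify` and then watched:

```scala
import akka.actor.{Actor, ActorIdentity, ActorLogging, ActorRef, Identify, Terminated}

// Hypothetical notification sent once the watched remote actor is gone.
case class PeerLost(path: String)

class RemoteWatcher(remotePath: String, notify: ActorRef) extends Actor with ActorLogging {
  // Resolve the remote actor; the path is reused as the correlation id.
  override def preStart(): Unit =
    context.actorSelection(remotePath) ! Identify(remotePath)

  def receive = {
    case ActorIdentity(`remotePath`, Some(ref)) =>
      context.watch(ref) // Terminated arrives if the actor stops or its host is declared dead
    case ActorIdentity(`remotePath`, None) =>
      log.warning("No actor found at {}", remotePath)
    case Terminated(ref) =>
      log.warning("Lost {}", ref.path)
      notify ! PeerLost(remotePath)
  }
}
```

If the remote host is unreachable when the watcher starts, no `ActorIdentity` reply arrives at all, so a real implementation would probably retry the `Identify` on a timer.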

-Endre

--

Akka Team

Typesafe - The software stack for applications that scale

Blog: letitcrash.com
Twitter: @akkateam

Steven Scott

Aug 28, 2014, 9:42:23 AM

to akka...@googlegroups.com

There's also the Remote Events section of http://doc.akka.io/docs/akka/2.3.4/scala/remoting.html; I subscribe to akka.remote.QuarantinedEvent events on the remoting ActorSystem.
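A minimal sketch of such a subscription, assuming Akka 2.3.x. `QuarantineListener` and its `notify` target are hypothetical names; the listener has to be created in the ActorSystem that has remoting enabled, because that is the event stream on which the event is published:

```scala
import akka.actor.{Actor, ActorLogging, ActorRef}
import akka.remote.QuarantinedEvent

class QuarantineListener(notify: ActorRef) extends Actor with ActorLogging {
  // Subscribe on the event stream of the system this actor lives in.
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[QuarantinedEvent])

  override def postStop(): Unit =
    context.system.eventStream.unsubscribe(self)

  def receive = {
    case q: QuarantinedEvent =>
      log.warning("Remote system {} has been quarantined", q.address)
      notify ! q // let whoever owns the remoting system decide what to do
  }
}
```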

Azad Bolour

Aug 28, 2014, 5:36:24 PM

to akka...@googlegroups.com

Thank you Endre and Steven for your responses.

A follow-up question. Do we have to set up remote death watch or subscribe to the QuarantinedEvent in both peers? Or do we get the terminated message and the quarantined event on one side no matter which side has quarantined the other? It would be a little simpler if we could just do this on one side, and recycle the actor system only on that side to re-establish the link.

Thanks again.

Azad

Akka Team

Aug 29, 2014, 6:04:59 AM

to Akka User List

Hi Azad,

The quarantine will likely happen on both sides (though it might take a long time, depending on timeout settings), but you will get Terminated messages only in those actors that are watching a remote actor.

So if you have nodes A and B, actor A1 on A watches actor B1 on B, and the link between A and B goes away, then you will eventually (it might take long) see Quarantined on both A and B, and A1 will receive a Terminated for B1. Since B1 did not watch anything on A, it of course does not receive any Terminated messages.

-Endre

Azad Bolour

Aug 29, 2014, 6:29:51 PM

to akka...@googlegroups.com

Thanks Endre.

My take from this is that the QuarantinedEvent will eventually be fired on both sides, so if I subscribe to it on one side only and recycle the actor system on that side I should be good. The only issue is that it might take a long time to get notified of the QuarantinedEvent, and for that I have to go study the timeout settings and adjust them accordingly.

Azad

Steven Scott

Aug 29, 2014, 7:29:12 PM

to akka...@googlegroups.com

In my setup for the issue in question at the time of my first post, I have 2 long-lived services, SrvA & SrvB - SrvB is a "server"-type service (SrvA initiates requests, as needed, to SrvB, and expects replies to those requests), so in my case it was easiest to create one "Remoting" ActorSystem on SrvA; SrvB only runs one ActorSystem, ever. Theoretically we may run many copies of SrvA, all connecting to SrvB, which made it more imperative that SrvB never restart its ActorSystem. SrvA subscribes to the QuarantinedEvent, and if it receives one then a guardian actor restarts the "Remoting" ActorSystem (which is able to re-connect to SrvB since it has a new UID).
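A rough sketch of that restart logic, assuming Akka 2.3.x (`shutdown()` was replaced by `terminate()` in later versions). It uses a plain manager class rather than a guardian actor, and `RemotingSystemManager`, `RestartOnQuarantine`, and `loadConfig` are hypothetical names, not the code actually running on SrvA:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.remote.QuarantinedEvent
import com.typesafe.config.Config

// Hypothetical listener: runs the callback on the first QuarantinedEvent, then
// goes quiet so that a burst of events triggers only one restart.
class RestartOnQuarantine(onQuarantine: () => Unit) extends Actor {
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[QuarantinedEvent])

  def receive = {
    case _: QuarantinedEvent =>
      context.system.eventStream.unsubscribe(self)
      context.become(Actor.emptyBehavior)
      onQuarantine()
  }
}

// Hypothetical owner of the dedicated "Remoting" ActorSystem.
class RemotingSystemManager(name: String, loadConfig: () => Config) {
  @volatile private var current: ActorSystem = start()

  // Always fetch the system through this accessor; refs into an old system go stale.
  def system: ActorSystem = current

  private def start(): ActorSystem = {
    val sys = ActorSystem(name, loadConfig())
    sys.actorOf(Props(new RestartOnQuarantine(() => restart(sys))), "quarantine-listener")
    sys
  }

  private def restart(old: ActorSystem): Unit = {
    // Create the replacement only after the old system has fully terminated,
    // so the remoting port is free again. The callback avoids blocking on
    // termination from inside an actor of the system being shut down.
    old.registerOnTermination { current = start() }
    old.shutdown() // Akka 2.3.x API
  }
}
```

The replacement system gets a new UID, which is what lets it reconnect to SrvB. Any ActorRef obtained from the old remoting system has to be looked up again after a restart.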

As a kind-of-related aside, in the issue my first post was about, SrvB also pushes messages to SrvA, after SrvA sends a SendThisActorPushMessages(receiver: ActorRef) message. SrvB subscribes to <receiver>'s DeathWatch (context watch <receiver>), and when it receives a Terminated(<receiver>) message it simply removes <receiver> from its list of push destinations. So SrvB doesn't really care about Quarantines, but it does care about Terminated/DeathWatch. In the Akka Remoting case, any time a remote is Quarantined you get a Terminated message for any of its actors that you're DeathWatch'ing.
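A minimal sketch of the push registration on the SrvB side, assuming Akka 2.3.x. `SendThisActorPushMessages` is the message named above; the `Push` payload type and the `PushRegistry` actor name are hypothetical:

```scala
import akka.actor.{Actor, ActorRef, Terminated}

case class SendThisActorPushMessages(receiver: ActorRef)
case class Push(payload: String) // hypothetical payload type

class PushRegistry extends Actor {
  private var receivers = Set.empty[ActorRef]

  def receive = {
    case SendThisActorPushMessages(receiver) =>
      context.watch(receiver) // DeathWatch also fires when the remote system is quarantined
      receivers += receiver
    case Terminated(receiver) =>
      receivers -= receiver   // stop pushing to a receiver that is gone
    case push: Push =>
      receivers.foreach(_ ! push)
  }
}
```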

Azad Bolour

Aug 30, 2014, 8:00:03 PM

to akka...@googlegroups.com

Thanks Steven.

We have an application setup that is similar to yours and we are also using death watch as you described.

Do your SrvA actors have state that they don't want to lose in case of a quarantine? If so, in your remoting actor system on SrvA do you just use forwarding actors to pass messages between application actors in SrvA and actors in SrvB? I am thinking this strategy may need special care if messages can have embedded actor references to SrvA's application actors.