Recovering from the Quarantined state

Mark Kegel

Jan 26, 2015, 11:20:54 AM

to akka...@googlegroups.com

We are using Akka in a clustered configuration at work. It's a very simple cluster with just three node types: an admin node, "live" nodes, and "preview" nodes. The admin node manages nodes of the other two types and asks them for things like status and uptime. Every so often one of the live/preview nodes becomes unresponsive to requests from the admin node. The only way we've been able to fix this is to restart the node.

From reading the Akka docs, this seems to correspond to the node becoming Quarantined. While I appreciate that this state is necessary to maintain consistency, I'm at a loss to find docs that show how to respond in code when this happens. On our admin node we'll know that some other live/preview node has failed and will require a restart, but what would work best is a service watching locally on the failed live/preview node that could force a restart of that node's JVM.

Is there any kind of exception that bubbles back to user code, or a cluster state message that I can receive, when my local Akka instance can't rejoin the cluster?

Is there any way a supervisor hierarchy can help solve this problem?

If someone can point me to code that is able to respond to and recover from such failures intelligently, using Akka-approved idioms, that would be most appreciated.

Mark

Patrik Nordwall

Feb 3, 2015, 7:32:20 AM

to akka...@googlegroups.com

What version of Akka are you using? We fixed some issues related to quarantining in 2.3.9.

/Patrik

--

Patrik Nordwall
Typesafe - Reactive apps on the JVM
Twitter: @patriknw

Mark Kegel

Feb 3, 2015, 11:13:15 AM

to akka...@googlegroups.com

We are using Akka 2.3.4, but I don't think this is an issue with a specific version of Akka. In fact, the docs explicitly state that you have to restart the Akka node after it's been Quarantined.

I'm looking for some way to detect that my node has been quarantined so that I can either force an exit (so that our Puppet system can restart it) or restart the Akka system programmatically without exiting the process. This seems like basic error handling and recovery, but I see nothing in the docs on how a person is supposed to handle this, or how they can even be notified of the issue.

Is there any kind of exception that bubbles back to user code, or a cluster state message that I can receive, when my local Akka instance can't rejoin the cluster?

Is there any way a supervisor hierarchy can help solve this problem?

If someone can point me to code that is able to respond to and recover from such failures intelligently, using Akka-approved idioms, that would be most appreciated.

Mark

Akka Team

Feb 6, 2015, 4:59:21 AM

to Akka User List

Hi Mark,

On Tue, Feb 3, 2015 at 5:13 PM, Mark Kegel <mark....@gmail.com> wrote:

We are using Akka 2.3.4, but I don't think this is an issue with a specific version of Akka. In fact, the docs explicitly state that you have to restart the Akka node after it's been Quarantined.

I'm looking for some way to detect that my node has been quarantined so that I can either force an exit (so that our Puppet system can restart it) or restart the Akka system programmatically without exiting the process. This seems like basic error handling and recovery, but I see nothing in the docs on how a person is supposed to handle this, or how they can even be notified of the issue.

I agree that we can improve the documentation around this. Remoting publishes lifecycle events to the actor system's event stream, and you can subscribe to them:

http://doc.akka.io/docs/akka/2.3.9/scala/remoting.html#Remote_Events

One of those published events notifies of quarantine: http://doc.akka.io/api/akka/2.3.9/#akka.remote.QuarantinedEvent
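
A minimal sketch of such a subscription (not taken from the thread) might look like the following. The `QuarantineListener` name, its callback parameter, and the reaction in the usage comment are hypothetical; only `QuarantinedEvent` and the event stream come from the linked docs. It is worth verifying on which side the event fires in your Akka version: as far as I understand, in 2.3.x it is published on the system that does the quarantining, which may not be the quarantined node itself.

```scala
import akka.actor.{Actor, ActorLogging, Props}
import akka.remote.QuarantinedEvent

// Hypothetical listener: the class name and the callback are illustrative
// assumptions, not part of Akka or of the original thread.
class QuarantineListener(onQuarantine: QuarantinedEvent => Unit)
    extends Actor with ActorLogging {

  // Ask the local event stream to deliver quarantine notifications to this actor.
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[QuarantinedEvent])

  // Remove all of this actor's subscriptions when it stops.
  override def postStop(): Unit =
    context.system.eventStream.unsubscribe(self)

  def receive = {
    case event: QuarantinedEvent =>
      log.warning("Remote system {} has been quarantined", event.address)
      onQuarantine(event)
  }
}

object QuarantineListener {
  def props(onQuarantine: QuarantinedEvent => Unit): Props =
    Props(new QuarantineListener(onQuarantine))
}

// Possible usage, assuming an existing ActorSystem named `system` and an
// external supervisor (such as the Puppet setup mentioned above) that
// restarts the JVM after it exits:
//
//   system.actorOf(QuarantineListener.props(_ => sys.exit(1)), "quarantine-listener")
```

Passing the reaction in as a function keeps the listener itself free of policy, so the same actor could log, alert, or exit the process depending on the node type.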

-Endre

--

Akka Team

Typesafe - The software stack for applications that scale

Blog: letitcrash.com
Twitter: @akkateam

Patrik Nordwall

Feb 6, 2015, 5:15:16 AM

to akka...@googlegroups.com

You should probably also look into why they are quarantined.

There are two possible reasons:

1) The nodes are removed from the cluster. This happens if failure detection triggers, you use auto-downing, and the nodes don't become reachable again within the configured akka.cluster.auto-down-unreachable-after timeout. You might want to increase the auto-down timeout.

2) The system message delivery buffer overflows because of many remote watches or remote deployments. You might want to increase akka.remote.system-message-buffer-size, or adjust your design.
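
As a rough illustration of the two settings named above, they could be overridden when the actor system is created. The values, the `TunedSystem` object, and the "ClusterSystem" name below are placeholders chosen for the example, not recommendations; the same two keys can equally be set in application.conf.

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Hypothetical helper object; the name and the values are illustrative only.
object TunedSystem {
  // Give unreachable nodes longer to recover before they are auto-downed,
  // and allow more undelivered system messages per remote association.
  val overrides = ConfigFactory.parseString(
    """
    akka.cluster.auto-down-unreachable-after = 120s
    akka.remote.system-message-buffer-size = 5000
    """)

  // The overrides win over application.conf and the library defaults.
  val config = overrides.withFallback(ConfigFactory.load())

  // "ClusterSystem" is a placeholder; use your cluster's actual system name.
  def start(): ActorSystem = ActorSystem("ClusterSystem", config)
}
```

Raising either value only buys time; if nodes are quarantined regularly, the underlying cause (long pauses, network partitions, or heavy use of remote watch) still needs to be investigated.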