How to detect sharding start failures and Singleton start failures?


Marek Żebrowski

Feb 17, 2016, 3:00:54 AM
to Akka User List
We observe problems with both cluster sharding and cluster singletons.
With sharding, the usual problem is a corrupted journal that prevents the shard coordinator from starting. In our situation the easiest fix is to delete all data from the journal and restart, but I can't find a way to detect that situation other than by watching the logs; I can't find any way to detect such a failure from code. It should be pretty easy, since `akka.cluster.sharding.ShardCoordinator.State` usually throws a "requirement failed" exception, but I can find no way to react to that: there is no easy way to set a `supervisorStrategy` for the shard coordinator, and no other way to detect its state.
We can't use `ddata` mode, as the current implementation does not work in our environment, where we need to scale nodes up and down: it requires a majority of nodes to respond, so it fails even in the simplest cases of scaling down a small cluster.

A similar situation applies to cluster singletons: if a cluster singleton is stuck, there is no way to detect that from code, only by observing the logs.

Does anybody have experience in handling such situations?
I'm trying to implement some external monitoring for both, basically by sending an `Identify` message to the actors that are supposed to exist (the singletons), but it looks like rewriting code that already exists inside Akka.
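A minimal sketch of such an external probe, in Scala with classic Akka actors. It is not part of Akka: `IdentifyProbe`, `TargetAlive` and `TargetUnresponsive` are hypothetical names, and the caller is assumed to supply an `ActorSelection` pointing at the path where the singleton is expected to run. `Identify` is answered automatically by whichever actor receives it, so the selection should resolve to the singleton itself rather than to an intermediary.

```scala
import akka.actor.{Actor, ActorIdentity, ActorRef, ActorSelection, Cancellable, Identify, Props}
import scala.concurrent.duration.FiniteDuration

// Hypothetical events, published on the actor system's event stream.
final case class TargetAlive(ref: ActorRef)
final case class TargetUnresponsive(target: String)

object IdentifyProbe {
  private case object Tick
  def props(target: ActorSelection, interval: FiniteDuration): Props =
    Props(new IdentifyProbe(target, interval))
}

// Sends Identify to `target` once per `interval` and reports the outcome.
class IdentifyProbe(target: ActorSelection, interval: FiniteDuration) extends Actor {
  import IdentifyProbe.Tick
  import context.dispatcher

  private var awaiting: Option[Long] = None // correlation id of the unanswered probe
  private var lastId = 0L
  private var nextTick: Option[Cancellable] = None

  override def preStart(): Unit = self ! Tick
  override def postStop(): Unit = nextTick.foreach(_.cancel())

  private def publish(event: AnyRef): Unit = context.system.eventStream.publish(event)

  def receive: Receive = {
    case Tick =>
      // The previous probe got no reply within one interval: report it.
      if (awaiting.isDefined) publish(TargetUnresponsive(target.toString))
      lastId += 1
      awaiting = Some(lastId)
      target ! Identify(lastId)
      nextTick = Some(context.system.scheduler.scheduleOnce(interval, self, Tick))

    case ActorIdentity(id, Some(ref)) if awaiting.contains(id) =>
      awaiting = None
      publish(TargetAlive(ref))

    case ActorIdentity(id, None) if awaiting.contains(id) =>
      // The selection matched no actor at all.
      awaiting = None
      publish(TargetUnresponsive(target.toString))
  }
}
```

Working out which node currently hosts the singleton, and therefore which path to select, is left to the caller here, and that is exactly the part that duplicates Akka's own logic.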

Maybe adding such failure detection capabilities to Akka (publishing to the event bus, adding the ability to set a supervisor strategy) is a better approach?

Patrik Nordwall

Feb 18, 2016, 4:17:16 AM
to akka...@googlegroups.com
Sounds good to publish this on the event bus. It's a fatal error that will not repair itself, so one could claim that we should stop the actor system, but that might be too harsh.
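As a sketch of how an application could consume such a notification if it were published on the event stream: the event class `ShardCoordinatorStartFailed` and the `FailureListener` actor below are hypothetical and not defined by Akka; only `eventStream.subscribe`, `unsubscribe` and `publish` are existing API.

```scala
import akka.actor.{Actor, ActorLogging}

// Hypothetical event; Akka does not define it.
final case class ShardCoordinatorStartFailed(typeName: String, cause: Throwable)

// Subscribes to the hypothetical event and reacts to it from application code.
class FailureListener extends Actor with ActorLogging {
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[ShardCoordinatorStartFailed])

  override def postStop(): Unit =
    context.system.eventStream.unsubscribe(self)

  def receive: Receive = {
    case ShardCoordinatorStartFailed(typeName, cause) =>
      log.error(cause, "Shard coordinator for [{}] failed to start", typeName)
      // Application-specific reaction goes here, for example cleaning the
      // journal and restarting, or shutting the node down.
  }
}
```

The publishing side would then be a single call such as `system.eventStream.publish(ShardCoordinatorStartFailed("orders", cause))`, where `"orders"` stands in for an entity type name.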

Please open an issue, and a pull request would also be very welcome.

Thanks,
Patrik

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at https://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.



--

Patrik Nordwall
Typesafe Reactive apps on the JVM
Twitter: @patriknw

Filippo De Luca

Feb 18, 2016, 4:24:46 AM
to akka...@googlegroups.com
Hi Marek,
What I do in my scenario is send an `Identify` message to one of the sharded actors (using a proper message wrapper) and wait for the reply. I do that when the node starts and at a fixed interval, to check whether the shard is up and running.

I do the same for the singleton proxy. It is a little bit hackish but works for now.
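A minimal sketch of what such a message wrapper could look like, assuming classic cluster sharding. `EntityEnvelope`, `ProbeRouting` and the shard count of 10 are illustrative choices, not taken from this thread; `ShardRegion.ExtractEntityId` and `ShardRegion.ExtractShardId` are the extractor types that sharding expects.

```scala
import akka.actor.Identify
import akka.cluster.sharding.ShardRegion

// Hypothetical wrapper: carries any payload to the entity with the given id.
final case class EntityEnvelope(entityId: String, payload: Any)

object ProbeRouting {
  // Assumed shard count, for illustration only.
  val numberOfShards = 10

  // Unwraps the envelope so that the entity receives the bare payload.
  val extractEntityId: ShardRegion.ExtractEntityId = {
    case EntityEnvelope(id, payload) => (id, payload)
  }

  // Maps the entity id to one of `numberOfShards` shards.
  val extractShardId: ShardRegion.ExtractShardId = {
    case EntityEnvelope(id, _) => math.abs(id.hashCode % numberOfShards).toString
  }

  // A probe message addressed to a dedicated, hypothetical entity id.
  def probe(correlationId: Any): EntityEnvelope =
    EntityEnvelope("health-check", Identify(correlationId))
}
```

Sending `ProbeRouting.probe(1)` to the shard region delivers `Identify(1)` to the entity, which replies to the sender with an `ActorIdentity`. No reply within a timeout suggests that the shard could not be allocated. Note that the probe starts an entity for the chosen id if none is running.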

Marek Żebrowski

Feb 18, 2016, 4:33:14 AM
to akka...@googlegroups.com
I implemented exactly the same mechanism: an `Identify` message sent to the place where I expect to find the shard coordinator, which is a cluster singleton. I don't like that approach though, as it rewrites parts of the `akka.cluster.singleton.ClusterSingletonManager` and `akka.cluster.sharding.ShardCoordinator` internals in application code.
There is also an ideological reason for such a requirement: Akka and the actor model are about resilience, failure handling and so on, and this particular area needs improvement in singletons and cluster sharding :)





--
Marek Żebrowski

Filippo De Luca

Feb 18, 2016, 4:42:17 AM
to akka...@googlegroups.com
I agree with you. I think a message on the event bus will solve it.

What about ddata? You say it does not allow scaling up or down, is that correct?

Marek Żebrowski

Feb 18, 2016, 4:50:01 AM
to Akka User List
In our situation, with a small cluster, it does not allow scaling down gracefully, because it requires a majority read for shard allocations. If the cluster scales down from 4 to 2 nodes, for example, or is in a rolling restart, that requirement can't be met and sharding is stuck.
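The arithmetic behind this, as a small Scala sketch. It assumes that a majority means more than half of the nodes currently counted as cluster members, and that the departing nodes are still counted when the read is attempted. `MajorityCheck` and its methods are illustrative names, not Akka API.

```scala
object MajorityCheck {
  // Smallest number of nodes that is more than half of `members`.
  def majority(members: Int): Int = members / 2 + 1

  // True if the reachable nodes are enough to answer a majority read.
  def canReachMajority(members: Int, reachable: Int): Boolean =
    reachable >= majority(members)
}

object MajorityCheckExample extends App {
  // Four known members, two already gone: 2 reachable < majority of 3.
  println(MajorityCheck.canReachMajority(members = 4, reachable = 2))
  // Four known members, one gone: 3 reachable meets the majority of 3.
  println(MajorityCheck.canReachMajority(members = 4, reachable = 3))
}
```

Under the same assumption, removing one node and waiting for the membership to settle before removing the next keeps a majority reachable at each step.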

Filippo De Luca

Feb 18, 2016, 5:14:43 AM
to akka...@googlegroups.com
I see,
so if you scale down one node at a time, it should work correctly, or am I wrong?

Marek Żebrowski

Feb 18, 2016, 5:27:00 AM
to Akka User List
Probably yes - we didn't investigate very thoroughly which conditions are OK.

Filippo De Luca

Feb 18, 2016, 5:29:05 AM
to akka...@googlegroups.com
Thanks Marek.

