Active restore ClusterShard

136 views
Skip to first unread message

Jeroen Gordijn

unread,
May 15, 2014, 4:09:36 PM5/15/14
to akka...@googlegroups.com
Hi,

When rebalancing a Shard, the old shard is stopped and a new shard is started on another node, after which all messages for that Shard will be send to the new node. When a message is received, the actor will be created. When Akka-persistence is used the Actor will reload all its events and restore state before processing the new message. But if no message is sent, the actor will not be created. This can be problematic when the actor is has some active state with retry mechanisme or timeout. Is my understanding correct?

Is there a way to actively restore the Shard state when the shard is moved to another node? One problem I can see with this is when going back to less nodes. This means that the shards will be rebalanced, but potentially giving memory problems. This will cause rebalancing and memory problems on the next node and eventually putting the whole cluster down. Starting the cluster will be also problematic for the same reason. 

Cheers,
Jeroen

Patrik Nordwall

unread,
May 16, 2014, 3:25:05 AM5/16/14
to akka...@googlegroups.com
Hi Jeroen,


On Thu, May 15, 2014 at 10:09 PM, Jeroen Gordijn <jeroen....@gmail.com> wrote:
Hi,

When rebalancing a Shard, the old shard is stopped and a new shard is started on another node, after which all messages for that Shard will be send to the new node. When a message is received, the actor will be created. When Akka-persistence is used the Actor will reload all its events and restore state before processing the new message. But if no message is sent, the actor will not be created. This can be problematic when the actor is has some active state with retry mechanisme or timeout. Is my understanding correct?

Your reasoning is correct. I think you can implement that by letting the actor schedule a keep-alive message to itself, but via the ShardRegion. Normally that will be local only message roundtrip via the scheduler and local ShardRegion, but after rebalancing it will delegate the message to the new node and thereby wake up the actor again.

What this doesn't solve is when a node crashes. An actor living on that node will not be automatically started somewhere else, until a message is sent to it. To solve that I think you have to let the actor register itself to a a few (for redundancy) watchdog actors, which watch the actor and know how to send the wake-up message via ClusterSharding.

Does that make sense?

Cheers,
Patrik
 

Is there a way to actively restore the Shard state when the shard is moved to another node? One problem I can see with this is when going back to less nodes. This means that the shards will be rebalanced, but potentially giving memory problems. This will cause rebalancing and memory problems on the next node and eventually putting the whole cluster down. Starting the cluster will be also problematic for the same reason. 

Cheers,
Jeroen

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.



--

Patrik Nordwall
Typesafe Reactive apps on the JVM
Twitter: @patriknw

Heiko Seeberger

unread,
May 16, 2014, 3:31:28 AM5/16/14
to Akka User List
Couldn't we add a watchdog feature to Cluster Sharding itself?

Heiko

Heiko Seeberger
Twitter: @hseeberger

Patrik Nordwall

unread,
May 16, 2014, 3:43:33 AM5/16/14
to akka...@googlegroups.com
On Fri, May 16, 2014 at 9:31 AM, Heiko Seeberger <heiko.s...@gmail.com> wrote:
Couldn't we add a watchdog feature to Cluster Sharding itself?

I don't think this is a core feature of Cluster Sharding and I think it can be better implemented at the application level. For example, the application knows how to send a message to a specific actor, since it is the application that defines the id extractor.

/Patrik

Jeroen Gordijn

unread,
May 17, 2014, 5:01:01 PM5/17/14
to akka...@googlegroups.com


Op vrijdag 16 mei 2014 09:43:33 UTC+2 schreef Patrik Nordwall:



On Fri, May 16, 2014 at 9:31 AM, Heiko Seeberger <heiko.s...@gmail.com> wrote:
Couldn't we add a watchdog feature to Cluster Sharding itself?

I don't think this is a core feature of Cluster Sharding and I think it can be better implemented at the application level. For example, the application knows how to send a message to a specific actor, since it is the application that defines the id extractor.

/Patrik

I think it would be great from the developers perspective to be able to mark an actor to survive rebalancing (or crashes), so that the actor itself is re-created on the new node when the Shard is moved. Why do you think that this feature should not be part of the Cluster Sharding? What makes it application level? In my mind sharding is just configuration and not part of the application itself.

It still leaves me wondering how I should prevent my whole cluster from going down when a node crashes and moving the shard results in an out-of-memory on the new node.

To be clear, I don't have a real problem yet, but I'm trying to get my head around the cluster sharding concepts.

Cheers,
Jeroen

delasoul

unread,
May 19, 2014, 12:18:11 PM5/19/14
to akka...@googlegroups.com
Hello,

good catch - we are running in this situation in the moment as well - some of our sharded actors register to a Cluster Pub-Sub topic
when started - when getting rebalanced or when a node crashes the sharded actor is not restarted and hence not subscribed anymore...
First thought would also be to just extend ClusterSharding.start with a keepAlive field to restart actors automatically?
For now, I think we will solve this by implementing a Watchdog as ClusterSingleton.

michael

Roland Kuhn

unread,
May 20, 2014, 12:37:42 PM5/20/14
to akka-user
19 maj 2014 kl. 18:18 skrev delasoul <michael...@gmx.at>:

Hello,

good catch - we are running in this situation in the moment as well - some of our sharded actors register to a Cluster Pub-Sub topic
when started - when getting rebalanced or when a node crashes the sharded actor is not restarted and hence not subscribed anymore...
First thought would also be to just extend ClusterSharding.start with a keepAlive field to restart actors automatically?

This would not solve the node crash scenario, since the only information that we can practically keep replicated is which shards it had (and not every entity within them).

For now, I think we will solve this by implementing a Watchdog as ClusterSingleton.

Yes, I think this needs to be handled specially; one thing I’m currently asking myself is why a sharded entity would need to keep itself actively scheduled when there is no other actor that is interested in it (read: sends it messages). If something needs to happen periodically, then either it is observable (i.e. needs to send messages elsewhere, which means that that other actor can use DeathWatch) or it is not (without external querying). In the latter case the activity is more of book-keeping quality and can be executed in batch later when the entity is rehydrated upon request.

Regards,

Roland



Dr. Roland Kuhn
Akka Tech Lead
Typesafe – Reactive apps on the JVM.
twitter: @rolandkuhn


Reply all
Reply to author
Forward
0 new messages