Shard coordinator persisting actor addresses containing the node IP


Johannes Berg

Feb 25, 2015, 2:06:51 AM2/25/15
to akka...@googlegroups.com
Hi!

We recently made a mistake by having two separate clusters use the same journal DB, which obviously doesn't work, but it led to some questions about how cluster sharding works.

We detected our problem by seeing a stack trace containing lines of code that didn't exist anywhere in the code deployed in that cluster. After triple-checking that we were running the correct code in all places in the cluster, we finally concluded that it must in fact be talking to an actor outside of the cluster. Looking at what the cluster sharding coordinator persists in the journal, we found an IP address of a node in the other cluster.

Granted, the data in the journal table is pretty much invalid, but this still raises the question of why the shard coordinator could start using nodes that aren't part of the cluster based purely on historical persisted info in the journal table. For example, what happens if you run sharding over two nodes and decide to move the second node to another cluster? You take down the system, change the config files, and then start the two systems again. Could the sharding coordinator on the first node, under any circumstances, still use the second node based on the info in the journal table, even though it's no longer part of the same cluster?
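What we should have had, of course, is a separate journal store per cluster. A minimal sketch of that setup, assuming the bundled LevelDB journal plugin; the system name and directory are made up for illustration:

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory

    // Sketch only: each cluster gets its own journal directory, so the
    // shard coordinators can never replay each other's persisted state.
    val configA = ConfigFactory.parseString("""
      akka.persistence.journal.plugin = "akka.persistence.journal.leveldb"
      akka.persistence.journal.leveldb.dir = "target/journal-cluster-a"
    """).withFallback(ConfigFactory.load())

    val systemA = ActorSystem("ClusterA", configA)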

Brice Figureau

Mar 2, 2015, 5:04:58 AM3/2/15
to akka...@googlegroups.com
On Tue, 2015-02-24 at 23:06 -0800, Johannes Berg wrote:
> Hi!
>
> We recently made a mistake by having two separate clusters use the
> same journal DB, which obviously doesn't work, but it led to some
> questions about how cluster sharding works.
>
> We detected our problem by seeing a stack trace containing lines of
> code that didn't exist anywhere in the code deployed in that cluster.
> After triple-checking that we were running the correct code in all
> places in the cluster, we finally concluded that it must in fact be
> talking to an actor outside of the cluster. Looking at what the
> cluster sharding coordinator persists in the journal, we found an IP
> address of a node in the other cluster.

What a coincidence! I observed the exact same issue a couple of days
after you.
See here:
https://groups.google.com/d/msg/akka-user/OJKGEgKBn2g/BQLRWkoK8zcJ

So I tested a bit more and found that the system would heal by itself (a
bit later): if there's no ActorSystem running on those other nodes, the
ShardCoordinator would eventually receive a Terminated message (at least
under Akka 2.3.9) and allocate the region somewhere else. Granted, if
the node had moved to another cluster using the same ActorSystem name,
it might be more problematic.
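The healing is just ordinary deathwatch. A minimal sketch of the idea,
with a made-up watcher actor, not the actual ShardCoordinator code:

    import akka.actor.{ Actor, ActorRef, Terminated }

    // Sketch of the deathwatch behaviour described above: once a region
    // ActorRef is watched, a dead node eventually yields a Terminated
    // message, at which point its shards can be allocated elsewhere.
    class RegionWatcher extends Actor {
      def receive = {
        case region: ActorRef =>
          context.watch(region)
        case Terminated(region) =>
          // The real coordinator reallocates the region's shards here.
          println(s"Region $region terminated, reallocating its shards")
      }
    }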

Still, this looks a bit strange to me, like something that was
overlooked during the design.

> Granted, the data in the journal table is pretty much invalid, but
> this still raises the question of why the shard coordinator could
> start using nodes that aren't part of the cluster based purely on
> historical persisted info in the journal table. For example, what
> happens if you run sharding over two nodes and decide to move the
> second node to another cluster? You take down the system, change the
> config files, and then start the two systems again. Could the
> sharding coordinator on the first node, under any circumstances,
> still use the second node based on the info in the journal table,
> even though it's no longer part of the same cluster?

Yes, this isn't practical at all in an elastic environment. I would have
preferred a mechanism where the recovering ShardCoordinator allocates
the region locally and then, at the end of recovery, rebalances the
shards over the discovered regions.

I think I'll file a bug report for this problem.

--
Brice Figureau <bric...@daysofwonder.com>

Johannes Berg

Mar 2, 2015, 5:39:53 AM3/2/15
to akka...@googlegroups.com
Hi Brice! Interesting that you've run into the same thing. I haven't looked closely at the details of the shard coordinator, but I do find it quite strange that it would blindly use persisted IPs and ports.

It will be interesting to see what the Akka team thinks about this.



Patrik Nordwall

Mar 2, 2015, 6:38:59 AM3/2/15
to akka...@googlegroups.com
This is an oversight. Please create an issue. For now I would recommend that you use unique names (actor system names) for the clusters to avoid accidental cross-talk like this.
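Something like this, with made-up names; the point is only that the two ActorSystem names differ:

    import akka.actor.ActorSystem

    // Hypothetical example: distinct system names mean persisted refs
    // like akka.tcp://OrdersCluster@host:port/... can never be resolved
    // by the other cluster, even if a journal is accidentally shared.
    // (In practice these would of course run on different nodes.)
    val orders  = ActorSystem("OrdersCluster")
    val billing = ActorSystem("BillingCluster")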

Thanks for reporting!

Regards,
Patrik

--

Patrik Nordwall
Typesafe Reactive apps on the JVM
Twitter: @patriknw

Patrik Nordwall

Mar 2, 2015, 6:58:45 AM3/2/15
to akka...@googlegroups.com
Thinking some more about this: it can actually only happen if you split an existing cluster into two separate clusters, typically via auto-down. The coordinator is not only using host:port; it is using the full ActorRef of the shard region, and that includes the uid of the actor incarnation. So if you shut down the system, the persisted ref will not match the new incarnation, and this will not be an issue.
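You can see the uid by printing a ref's serialization format; a throwaway sketch (names are made up):

    import akka.actor.{ Actor, ActorSystem, Props }

    class Echo extends Actor {
      def receive = { case msg => sender() ! msg }
    }

    object UidDemo extends App {
      val system = ActorSystem("UidDemo")
      val region = system.actorOf(Props[Echo], "region")
      // Prints something like akka://UidDemo/user/region#1129398201;
      // the fragment after '#' is the incarnation uid that the persisted
      // ActorRef carries in addition to host:port. A restarted system
      // yields a new uid, so an old persisted ref cannot match it.
      println(region.path.toSerializationFormat)
      system.shutdown()
    }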

/Patrik

Johannes Berg

Mar 5, 2015, 8:49:49 AM3/5/15
to akka...@googlegroups.com
Okay, so following the other discussion as well, I guess there's no need for a bug report then, since the uid is part of the address. The problem we saw only occurred because we had the clusters wrongly configured, and it shouldn't happen if you shut down and move the node to another cluster.

Thanks for clearing this up.

Johannes

Patrik Nordwall

Mar 5, 2015, 9:03:46 AM3/5/15
to akka...@googlegroups.com
On Thu, Mar 5, 2015 at 2:49 PM, Johannes Berg <jber...@gmail.com> wrote:
Okay, so following the other discussion as well, I guess there's no need for a bug report then, since the uid is part of the address. The problem we saw only occurred because we had the clusters wrongly configured, and it shouldn't happen if you shut down and move the node to another cluster.

Thanks for clearing this up.

You're welcome. Thanks for sharing your observations. I was also thinking it was a bug for a moment.
/Patrik