Cluster Singleton duplicated if primary seed restarted


Jem Mawson

Apr 2, 2014, 9:37:10 PM
to akka...@googlegroups.com
Hello. 

If I have a cluster singleton active and I restart the primary seed node, the singleton becomes active on two nodes. Does that imply that the primary seed does not rejoin the cluster?

Thanks
Jem

Björn Antonsson

Apr 3, 2014, 2:37:47 AM
to akka...@googlegroups.com, Jem Mawson
Hi Jem,

What version of Akka are you running? There is a regression in Akka 2.3.1 that can cause issues with seed node joining.

It's also kind of hard to answer your question without some more information, like a log printout.

B/
--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.
-- 
Björn Antonsson
Typesafe – Reactive Apps on the JVM
twitter: @bantonsson

Jem

Apr 6, 2014, 9:47:19 PM
to akka...@googlegroups.com
OK, I just looked at that regression issue and it doesn't fit what I am seeing. It may be that what I'm experiencing is by design, but I think it isn't.

The source is at https://github.com/Synesso/scratch-akka-cluster-singleton. It's basically the activator template sample with some modifications.

The scenario is:
  1. Start primary seed
  2. Start secondary seed
  3. Start additional nodes
  4. Cluster singleton is running on the primary seed. Kill the primary seed JVM.
  5. Cluster singleton begins running on the additional node. Restart the primary seed.
  6. Cluster singleton begins running on the primary seed, and it is still running on the additional node.
In the following output you can see the restarted primary seed running the cluster singleton again at 11:40:01, whilst the additional node is still running it at 11:40:07. Left alone, they continue to run concurrently. Are they in the same cluster?
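For context, the seed node list in my config is essentially the template's - something like the following (a sketch; the addresses are the ones shown in the logs, and 2551 is the "primary" seed only in the sense that it is listed first):

```
akka.cluster {
  seed-nodes = [
    "akka.tcp://ClusterSystem@192.168.161.139:2551",
    "akka.tcp://ClusterSystem@192.168.161.139:2552"]
}
```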

Logs:

Primary seed:

apdmmac-39:akka-sample-cluster-scala mawsonj$  sbt "run-main sample.cluster.simple.SimpleClusterApp 2551"

[info] Loading global plugins from /Users/mawsonj/.sbt/0.13/plugins

[info] Loading project definition from /Users/mawsonj/projects/akka-sample-cluster-scala/project

[info] Set current project to akka-sample-cluster-scala (in build file:/Users/mawsonj/projects/akka-sample-cluster-scala/)

[info] Compiling 5 Scala sources to /Users/mawsonj/projects/akka-sample-cluster-scala/target/scala-2.10/classes...

[info] Running sample.cluster.simple.SimpleClusterApp 2551

[info] [INFO] [04/07/2014 11:38:59.681] [main] [Remoting] Starting remoting

[info] [INFO] [04/07/2014 11:38:59.846] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Cluste...@192.168.161.139:2551]

[info] [INFO] [04/07/2014 11:38:59.859] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Starting up...

[info] [INFO] [04/07/2014 11:38:59.942] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Registered cluster JMX MBean [akka:type=Cluster]

[info] [INFO] [04/07/2014 11:38:59.942] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Started up successfully

[info] [INFO] [04/07/2014 11:38:59.984] [ClusterSystem-akka.actor.default-dispatcher-4] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Metrics collection has started successfully

[info] [WARN] [04/07/2014 11:39:00.051] [ClusterSystem-akka.remote.default-remote-dispatcher-5] [akka.tcp://Cluste...@192.168.161.139:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2552-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2552] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://Cluste...@192.168.161.139:2552]].

[info] [INFO] [04/07/2014 11:39:04.973] [ClusterSystem-akka.actor.default-dispatcher-17] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Node [akka.tcp://Cluste...@192.168.161.139:2551] is JOINING, roles []

[info] [INFO] [04/07/2014 11:39:05.966] [ClusterSystem-akka.actor.default-dispatcher-4] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Leader is moving node [akka.tcp://Cluste...@192.168.161.139:2551] to [Up]

[info] [INFO] [04/07/2014 11:39:05.967] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterListener] Member is Up: akka.tcp://Cluste...@192.168.161.139:2551

[info] [INFO] [04/07/2014 11:39:05.971] [ClusterSystem-akka.actor.default-dispatcher-14] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton] Singleton manager [akka.tcp://Cluste...@192.168.161.139:2551] starting singleton actor

[info] [INFO] [04/07/2014 11:39:05.972] [ClusterSystem-akka.actor.default-dispatcher-14] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton] ClusterSingletonManager state change [Start -> Oldest]

[info] [INFO] [04/07/2014 11:39:05.976] [ClusterSystem-akka.actor.default-dispatcher-4] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton/pinger-ponger] Ping?

[info] [INFO] [04/07/2014 11:39:05.978] [ClusterSystem-akka.actor.default-dispatcher-14] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton/pinger-ponger/$b] Pong!

[info] [INFO] [04/07/2014 11:39:15.978] [ClusterSystem-akka.actor.default-dispatcher-17] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton/pinger-ponger] Ping?

[info] [INFO] [04/07/2014 11:39:15.978] [ClusterSystem-akka.actor.default-dispatcher-14] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton/pinger-ponger/$k] Pong!

[info] [INFO] [04/07/2014 11:39:16.954] [ClusterSystem-akka.actor.default-dispatcher-14] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Node [akka.tcp://Cluste...@192.168.161.139:57939] is JOINING, roles []

[info] [INFO] [04/07/2014 11:39:16.957] [ClusterSystem-akka.actor.default-dispatcher-19] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Leader is moving node [akka.tcp://Cluste...@192.168.161.139:57939] to [Up]

[info] [INFO] [04/07/2014 11:39:16.958] [ClusterSystem-akka.actor.default-dispatcher-15] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterListener] Member is Up: akka.tcp://Cluste...@192.168.161.139:57939

^C

apdmmac-39:akka-sample-cluster-scala mawsonj$  sbt "run-main sample.cluster.simple.SimpleClusterApp 2551"

[info] Loading global plugins from /Users/mawsonj/.sbt/0.13/plugins

[info] Loading project definition from /Users/mawsonj/projects/akka-sample-cluster-scala/project

[info] Set current project to akka-sample-cluster-scala (in build file:/Users/mawsonj/projects/akka-sample-cluster-scala/)

[info] Running sample.cluster.simple.SimpleClusterApp 2551

[info] [INFO] [04/07/2014 11:39:49.763] [main] [Remoting] Starting remoting

[info] [INFO] [04/07/2014 11:39:49.913] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Cluste...@192.168.161.139:2551]

[info] [INFO] [04/07/2014 11:39:49.925] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Starting up...

[info] [INFO] [04/07/2014 11:39:50.000] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Registered cluster JMX MBean [akka:type=Cluster]

[info] [INFO] [04/07/2014 11:39:50.001] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Started up successfully

[info] [INFO] [04/07/2014 11:39:50.030] [ClusterSystem-akka.actor.default-dispatcher-3] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Metrics collection has started successfully

[info] [INFO] [04/07/2014 11:39:50.136] [ClusterSystem-akka.actor.default-dispatcher-3] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Node [akka.tcp://Cluste...@192.168.161.139:2551] is JOINING, roles []

[info] [INFO] [04/07/2014 11:39:51.030] [ClusterSystem-akka.actor.default-dispatcher-14] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Leader is moving node [akka.tcp://Cluste...@192.168.161.139:2551] to [Up]

[info] [INFO] [04/07/2014 11:39:51.031] [ClusterSystem-akka.actor.default-dispatcher-3] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterListener] Member is Up: akka.tcp://Cluste...@192.168.161.139:2551

[info] [INFO] [04/07/2014 11:39:51.035] [ClusterSystem-akka.actor.default-dispatcher-15] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton] Singleton manager [akka.tcp://Cluste...@192.168.161.139:2551] starting singleton actor

[info] [INFO] [04/07/2014 11:39:51.036] [ClusterSystem-akka.actor.default-dispatcher-15] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton] ClusterSingletonManager state change [Start -> Oldest]

[info] [INFO] [04/07/2014 11:39:51.040] [ClusterSystem-akka.actor.default-dispatcher-20] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton/pinger-ponger] Ping?

[info] [INFO] [04/07/2014 11:39:51.042] [ClusterSystem-akka.actor.default-dispatcher-17] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton/pinger-ponger/$c] Pong!

[info] [INFO] [04/07/2014 11:39:52.664] [ClusterSystem-akka.actor.default-dispatcher-4] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Node [akka.tcp://Cluste...@192.168.161.139:2552] is JOINING, roles []

[info] [INFO] [04/07/2014 11:39:53.024] [ClusterSystem-akka.actor.default-dispatcher-21] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2551] - Leader is moving node [akka.tcp://Cluste...@192.168.161.139:2552] to [Up]

[info] [INFO] [04/07/2014 11:39:53.025] [ClusterSystem-akka.actor.default-dispatcher-14] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterListener] Member is Up: akka.tcp://Cluste...@192.168.161.139:2552

[info] [INFO] [04/07/2014 11:40:01.042] [ClusterSystem-akka.actor.default-dispatcher-20] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton/pinger-ponger] Ping?

[info] [INFO] [04/07/2014 11:40:01.043] [ClusterSystem-akka.actor.default-dispatcher-18] [akka.tcp://Cluste...@192.168.161.139:2551/user/clusterSingleton/pinger-ponger/$k] Pong!


Secondary seed:

apdmmac-39:akka-sample-cluster-scala mawsonj$ sbt "run-main sample.cluster.simple.SimpleClusterApp 2552"

[info] Loading global plugins from /Users/mawsonj/.sbt/0.13/plugins

[info] Loading project definition from /Users/mawsonj/projects/akka-sample-cluster-scala/project

[info] Set current project to akka-sample-cluster-scala (in build file:/Users/mawsonj/projects/akka-sample-cluster-scala/)

[info] Compiling 5 Scala sources to /Users/mawsonj/projects/akka-sample-cluster-scala/target/scala-2.10/classes...

[info] Running sample.cluster.simple.SimpleClusterApp 2552

[info] [INFO] [04/07/2014 11:39:02.185] [main] [Remoting] Starting remoting

[info] [INFO] [04/07/2014 11:39:02.337] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Cluste...@192.168.161.139:2552]

[info] [INFO] [04/07/2014 11:39:02.350] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2552] - Starting up...

[info] [INFO] [04/07/2014 11:39:02.423] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2552] - Registered cluster JMX MBean [akka:type=Cluster]

[info] [INFO] [04/07/2014 11:39:02.423] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2552] - Started up successfully

[info] [INFO] [04/07/2014 11:39:02.452] [ClusterSystem-akka.actor.default-dispatcher-2] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2552] - Metrics collection has started successfully

[info] [WARN] [04/07/2014 11:39:21.497] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://Cluste...@192.168.161.139:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

[info] [WARN] [04/07/2014 11:39:27.544] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://Cluste...@192.168.161.139:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://Cluste...@192.168.161.139:2551]].

[info] [WARN] [04/07/2014 11:39:32.562] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://Cluste...@192.168.161.139:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://Cluste...@192.168.161.139:2551]].

[info] [WARN] [04/07/2014 11:39:37.582] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://Cluste...@192.168.161.139:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://Cluste...@192.168.161.139:2551]].

[info] [WARN] [04/07/2014 11:39:42.602] [ClusterSystem-akka.remote.default-remote-dispatcher-5] [akka.tcp://Cluste...@192.168.161.139:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://Cluste...@192.168.161.139:2551]].

[info] [WARN] [04/07/2014 11:39:47.621] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://Cluste...@192.168.161.139:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://Cluste...@192.168.161.139:2551]].

[info] [INFO] [04/07/2014 11:39:52.744] [ClusterSystem-akka.actor.default-dispatcher-2] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:2552] - Welcome from [akka.tcp://Cluste...@192.168.161.139:2551]

[info] [INFO] [04/07/2014 11:39:52.747] [ClusterSystem-akka.actor.default-dispatcher-15] [akka.tcp://Cluste...@192.168.161.139:2552/user/clusterListener] Member is Up: akka.tcp://Cluste...@192.168.161.139:2551

[info] [INFO] [04/07/2014 11:39:53.036] [ClusterSystem-akka.actor.default-dispatcher-18] [akka.tcp://Cluste...@192.168.161.139:2552/user/clusterListener] Member is Up: akka.tcp://Cluste...@192.168.161.139:2552

[info] [INFO] [04/07/2014 11:39:53.040] [ClusterSystem-akka.actor.default-dispatcher-21] [akka.tcp://Cluste...@192.168.161.139:2552/user/clusterSingleton] ClusterSingletonManager state change [Start -> Younger]

[info] [WARN] [04/07/2014 11:40:10.478] [ClusterSystem-akka.remote.default-remote-dispatcher-5] [akka.tcp://Cluste...@192.168.161.139:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-2] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].


Additional node:

apdmmac-39:akka-sample-cluster-scala mawsonj$   sbt "run-main sample.cluster.simple.SimpleClusterApp"

[info] Loading global plugins from /Users/mawsonj/.sbt/0.13/plugins

[info] Loading project definition from /Users/mawsonj/projects/akka-sample-cluster-scala/project

[info] Set current project to akka-sample-cluster-scala (in build file:/Users/mawsonj/projects/akka-sample-cluster-scala/)

[info] Running sample.cluster.simple.SimpleClusterApp 

[info] [INFO] [04/07/2014 11:39:16.546] [main] [Remoting] Starting remoting

[info] [INFO] [04/07/2014 11:39:16.690] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Cluste...@192.168.161.139:57939]

[info] [INFO] [04/07/2014 11:39:16.702] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:57939] - Starting up...

[info] [INFO] [04/07/2014 11:39:16.777] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:57939] - Registered cluster JMX MBean [akka:type=Cluster]

[info] [INFO] [04/07/2014 11:39:16.777] [main] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:57939] - Started up successfully

[info] [INFO] [04/07/2014 11:39:16.806] [ClusterSystem-akka.actor.default-dispatcher-4] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:57939] - Metrics collection has started successfully

[info] [INFO] [04/07/2014 11:39:17.041] [ClusterSystem-akka.actor.default-dispatcher-15] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:57939] - Welcome from [akka.tcp://Cluste...@192.168.161.139:2551]

[info] [INFO] [04/07/2014 11:39:17.045] [ClusterSystem-akka.actor.default-dispatcher-19] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterListener] Member is Up: akka.tcp://Cluste...@192.168.161.139:2551

[info] [INFO] [04/07/2014 11:39:17.051] [ClusterSystem-akka.actor.default-dispatcher-4] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterListener] Member is Up: akka.tcp://Cluste...@192.168.161.139:57939

[info] [INFO] [04/07/2014 11:39:17.055] [ClusterSystem-akka.actor.default-dispatcher-19] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton] ClusterSingletonManager state change [Start -> Younger]

[info] [WARN] [04/07/2014 11:39:21.494] [ClusterSystem-akka.remote.default-remote-dispatcher-5] [akka.tcp://Cluste...@192.168.161.139:57939/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

[info] [WARN] [04/07/2014 11:39:26.798] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://Cluste...@192.168.161.139:57939/system/cluster/core/daemon] Cluster Node [akka.tcp://Cluste...@192.168.161.139:57939] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://Cluste...@192.168.161.139:2551, status = Up)]

[info] [INFO] [04/07/2014 11:39:26.800] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterListener] Member detected as unreachable: Member(address = akka.tcp://Cluste...@192.168.161.139:2551, status = Up)

[info] [WARN] [04/07/2014 11:39:26.803] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://Cluste...@192.168.161.139:57939/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://Cluste...@192.168.161.139:2551]].

[info] [WARN] [04/07/2014 11:39:32.069] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://Cluste...@192.168.161.139:57939/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.161.139%3A2551-0] Association with remote system [akka.tcp://Cluste...@192.168.161.139:2551] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://Cluste...@192.168.161.139:2551]].

[info] [INFO] [04/07/2014 11:39:36.813] [ClusterSystem-akka.actor.default-dispatcher-16] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:57939] - Leader is auto-downing unreachable node [akka.tcp://Cluste...@192.168.161.139:2551]

[info] [INFO] [04/07/2014 11:39:36.816] [ClusterSystem-akka.actor.default-dispatcher-16] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:57939] - Marking unreachable node [akka.tcp://Cluste...@192.168.161.139:2551] as [Down]

[info] [INFO] [04/07/2014 11:39:37.798] [ClusterSystem-akka.actor.default-dispatcher-17] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://Cluste...@192.168.161.139:57939] - Leader is removing unreachable node [akka.tcp://Cluste...@192.168.161.139:2551]

[info] [INFO] [04/07/2014 11:39:37.799] [ClusterSystem-akka.actor.default-dispatcher-16] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterListener] Member is Removed: akka.tcp://Cluste...@192.168.161.139:2551 after Down

[info] [INFO] [04/07/2014 11:39:37.800] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton] Previous oldest removed [akka.tcp://Cluste...@192.168.161.139:2551]

[info] [INFO] [04/07/2014 11:39:37.801] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton] Younger observed OldestChanged: [None -> myself]

[info] [INFO] [04/07/2014 11:39:37.801] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton] Singleton manager [akka.tcp://Cluste...@192.168.161.139:57939] starting singleton actor

[info] [INFO] [04/07/2014 11:39:37.802] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton] ClusterSingletonManager state change [Younger -> Oldest]

[info] [INFO] [04/07/2014 11:39:37.807] [ClusterSystem-akka.actor.default-dispatcher-17] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton/pinger-ponger] Ping?

[info] [INFO] [04/07/2014 11:39:37.809] [ClusterSystem-akka.actor.default-dispatcher-19] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton/pinger-ponger/$a] Pong!

[info] [INFO] [04/07/2014 11:39:47.812] [ClusterSystem-akka.actor.default-dispatcher-23] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton/pinger-ponger] Ping?

[info] [INFO] [04/07/2014 11:39:47.813] [ClusterSystem-akka.actor.default-dispatcher-14] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton/pinger-ponger/$k] Pong!

[info] [INFO] [04/07/2014 11:39:57.823] [ClusterSystem-akka.actor.default-dispatcher-20] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton/pinger-ponger] Ping?

[info] [INFO] [04/07/2014 11:39:57.823] [ClusterSystem-akka.actor.default-dispatcher-31] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton/pinger-ponger/$u] Pong!

[info] [INFO] [04/07/2014 11:40:07.822] [ClusterSystem-akka.actor.default-dispatcher-30] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton/pinger-ponger] Ping?

[info] [INFO] [04/07/2014 11:40:07.823] [ClusterSystem-akka.actor.default-dispatcher-15] [akka.tcp://Cluste...@192.168.161.139:57939/user/clusterSingleton/pinger-ponger/$E] Pong!


Konrad Malawski

Apr 8, 2014, 9:19:57 AM
to akka...@googlegroups.com

Hello Jem,
We looked deeper into this, and it seems that it's working as mandated by the current design (I'll explain in detail below), but there is also a way of forcing your desired behaviour (which totally makes sense in some scenarios).

Analysis:
First let’s dissect your log and see what’s happening:

Note 1: Seed nodes are nothing very magical. It's only a list of nodes that a joining node will try to talk to when trying to join a cluster.
Note 2: Joining "self" is normal and expected.

Ok, so let’s look at the above logs and write up what’s happening:

// seed nodes = [51, 52]
// other node = [39]

> 51 starts; 52 not started yet, 39 not started yet
> 51 joins self, this is fine. This is the beginning of clusterA.
> 39 starts
> 39 contacts 51, joins its cluster
> cluster singleton started on 39 or 51
> 52 starts
> 51 stops
// 52 never talked to 51 at this point (that's the root of the problem!), it didn't make it in time before 51 died
|| if singleton was running on 51 the manager notices this, and it will start it on 39
|| if singleton was running on 39, it stays there
> 52 tries to join the cluster; seed nodes are 51, 52; 51 just died
> 52 joins self, this is the beginning of clusterB! A new cluster has emerged.
> 52 has no idea about 39. No one told it to contact 39, so it won't. (We do not have magical auto-discovery.)
> 52 starts the singleton (!).
// the singleton is running twice among our apps, but not "twice in the same cluster" - because 52 has no way of knowing that there is some 39 node running "somewhere".
> 51 comes back up, it has 52 in seed nodes, so it will join it; 
> 51 notices that 52 has the singleton, and will not do anything to it.

OK… So we know why this happens. Is this “valid” behaviour? Well… It’s “expected” - effectively this shows that two clusters have arisen, not one.

The seed nodes never had the chance to talk to each other about “that new guy” who joined, so its address is unknown to 52 - which therefore forms a completely new cluster, which the new 51 instance then joins.

I could just say “this is fine” of course, and for some applications it might be. But I definitely see good use cases for really guaranteeing a single instance of this singleton.

Suggestions:
Here are a few ways to increase its resilience:

1) We can leverage roles in order to keep the cluster singleton from starting until more seed nodes know about each other.
This way we don’t lose information about the 39 node if 51 goes down, because 52 will also be aware of it.

Basically the idea here is that “there must always be at least one seed node that knows about the singletons”.
This way you can increase the resilience of the system (how strong a guarantee we get about the singleton not suddenly becoming a doubleton ;-)) by increasing the number of seed nodes.
Graphically speaking: A B C X Y Z, where A B C are seed nodes and X Y Z joined later, means that we can afford to lose 2 of A B C at the same time, and the remaining one will keep track of the singletons running on the
X Y Z nodes. So even when B and C re-join (as new instances of the apps), they will learn that the singleton is already running on “some node called X”, whose address the rejoining nodes would otherwise not know (which is what caused the problem in the above example).

Code wise, it’s very simple to implement, and I’ve prepared a pull request with a sample for you: https://github.com/Synesso/scratch-akka-cluster-singleton/pull/1/files
We just mark all seed nodes with a special seed role. This means that we won’t start the cluster until the seed nodes have been contacted. By increasing their number you get more resilience against failure (and against getting a doubleton when restarting these services, because they will not form a new cluster, but re-join the “last man standing” seed node).
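The core of that sample can be sketched as below. This mirrors the Akka 2.3 akka-contrib API; `PingPonger` is a placeholder for the sample's actual singleton actor, and the `"seed"` role name is an assumption:

```scala
import akka.actor.{Actor, ActorSystem, PoisonPill, Props}
import akka.contrib.pattern.ClusterSingletonManager

// Placeholder for the sample's singleton actor.
class PingPonger extends Actor {
  def receive = { case _ => }
}

object SeedRoleSingleton {
  // With role = Some("seed"), the ClusterSingletonManager only runs on
  // nodes carrying the "seed" role (set akka.cluster.roles = ["seed"] in
  // each seed node's config), so at least one seed node always knows
  // where the singleton lives.
  def start(system: ActorSystem): Unit =
    system.actorOf(
      ClusterSingletonManager.props(
        singletonProps     = Props[PingPonger],
        singletonName      = "pinger-ponger",
        terminationMessage = PoisonPill,
        role               = Some("seed")),
      name = "clusterSingleton")
}
```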

2) You could try to stop using seed-nodes, because they’re static, and thus… tricky. 

And instead use a “global” service registry, where each ActorSystem would register itself when running.
Then when joining the cluster, you’d ask that service “hey, who is online now?”. The difference from seed nodes is that the initial contact points can be updated, whereas seed-nodes are hardcoded in the config.
I’ve implemented such systems using ZooKeeper in the past. You would have paths like /akka/clusters/banana-cluster/node-*, and do an “ls” on the parent directory to find out about the existing nodes (and their addresses)…

This is probably a good idea if you really need to be sure about everything in your cluster.
We currently do not provide cluster auto-discovery which would solve this for you “magically” :-)
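The shape of such a registry can be sketched in plain Scala. This is an in-memory stand-in only; `ServiceRegistry` and the `/akka/clusters/...` paths are illustrative, not an Akka or ZooKeeper API:

```scala
import scala.collection.concurrent.TrieMap

// In-memory sketch of a "who is online?" registry: each node registers
// its address under the cluster's path, and a joining node lists that
// path to discover current contact points (the "ls" described above).
final class ServiceRegistry {
  private val entries = TrieMap.empty[String, String]

  // Register this node's address under the cluster's path.
  def register(clusterPath: String, node: String, address: String): Unit =
    entries.put(s"$clusterPath/$node", address)

  // Remove a node that has shut down (ZooKeeper's ephemeral znodes would
  // do this automatically when the session dies).
  def deregister(clusterPath: String, node: String): Unit =
    entries.remove(s"$clusterPath/$node")

  // The "ls" on the parent directory: addresses of nodes online right now.
  def listNodes(clusterPath: String): List[String] =
    entries.collect {
      case (path, addr) if path.startsWith(clusterPath + "/") => addr
    }.toList.sorted
}
```

A real implementation would back this with ephemeral ZooKeeper znodes, so a crashed node disappears from the listing automatically rather than needing an explicit deregister.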

I hope this makes sense! Please let me know if more explanation is required.
I have also opened an issue ( https://www.assembla.com/spaces/akka/simple_planner#/ticket:3986 ) around this and will improve the docs to include these patterns.
Not sure how much we can “automagically guarantee” in the future here - we would need to implement cluster discovery (not sure if it’s on the roadmap, will check).

Note3: For the suggested solution (above), please downgrade to Akka 2.3.0. We have introduced (and already fixed) a bug in the cluster in 2.3.1 which prevents the suggested solution from working (nodes won’t join).

// Whew, quite long email!


--
Cheers,
Konrad 'ktoso' Malawski
hAkker - Typesafe, Inc

Patrik Nordwall

Apr 8, 2014, 10:05:52 AM
to akka...@googlegroups.com
Excellent analysis and suggestions, Konrad. One thing this question highlights is the importance of using auto-down with care, especially when it is important to have only one singleton instance. If auto-down had not been used, 39 would not have started the singleton before there was an informed decision to let 39 form its own cluster.
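For reference, the setting in question looks roughly like this in Akka 2.3 (a sketch; `off` is the default, and with auto-down disabled an unreachable node must be downed explicitly, e.g. via JMX or `Cluster(system).down(address)`):

```
akka.cluster {
  # off (the default) disables auto-down; node 39 would then have kept
  # 2551 marked as unreachable instead of downing it, removing it, and
  # taking over the singleton on its own
  auto-down-unreachable-after = off
}
```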

Cheers,
Patrik





--

Patrik Nordwall
Typesafe Reactive apps on the JVM
Twitter: @patriknw

Jem

Apr 9, 2014, 1:46:38 AM
to akka...@googlegroups.com
Thank you Konrad. I'm really impressed by the depth of investigation and your explanation. 



anil chalil

May 23, 2014, 12:46:54 PM
to akka...@googlegroups.com
Hello

Can this situation occur even if we have a small cluster of, say, 3 instances, with all instances in the seed list?

Konrad Malawski

May 26, 2014, 5:43:09 PM
to Akka User List
Hello Anil,
The size of the cluster does not really influence this, but downing its members and partitions do.

In theory this problem can appear whenever there is a "split brain" partition, meaning: whenever 2 clusters are formed - which is what happens in the above example.
It would also happen if we replaced the scenario above with a network partition that forces the cluster to split into 2 separate clusters - ergo, again 2 cluster singletons would be running.
Auto-downing is the primary suspect for causing these unexpectedly, but there are other scenarios (I just imagined a horrible edge case of "a partition between the seed nodes during cluster startup"...).

In general, simply not using auto-downing with the cluster singleton will keep you out of trouble.

You could also set akka.cluster.min-nr-of-members = N, where N is the number of your nodes (3), so a restarted, lonely, partitioned seed node won't decide to become a cluster on its own.
This is described here: http://doc.akka.io/docs/akka/snapshot/scala/cluster-usage.html#How_To_Startup_when_Cluster_Size_Reached - it makes sure that the cluster won't move members to "Up"
until all members have joined, which saves us from spinning up the cluster singleton too early (and if a partition happens, you can decide how to handle it).
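Concretely, for Anil's 3-node cluster that would be (a sketch; the setting goes in each node's application.conf):

```
akka.cluster {
  # do not move any member to Up until at least 3 members have joined;
  # a lone restarted seed node will then sit in Joining instead of
  # forming its own cluster and starting a second singleton
  min-nr-of-members = 3
}
```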


I personally think the ClusterSingleton is a nice tool, but it may be a little too tempting to use it and then forget about the problematic cases.
I've added some docs explaining this problem: https://github.com/akka/akka/commit/08fd4c93faac2d2aa625458090978723717a1864 but I just noticed this is only on the master branch, so it's not in the published 2.3.3 docs (will fix this for 2.3.4).

Jeroen Gordijn

May 26, 2014, 6:15:52 PM
to akka...@googlegroups.com
Hi,

The docs also advise against auto-downing. However, I don't really get the alternative. Manual downing would be unworkable, because it could leave your application unavailable for too long. So should I implement some strategy in my Akka solution, or in some external monitoring system?

How are people using this in production?

Cheers,
Jeroen

Konrad Malawski

May 27, 2014, 5:09:32 AM
to Akka User List
Hello Jeroen,
Indeed, the options you listed are the available (and thus recommended) ones. Truth be told, keeping a large cluster running does require some more tooling, ops skills, or people to take care of it.
In practice you could hook into monitoring services like Nagios or Zabbix to get more information about your cluster and react to it, for example by downing a certain set of nodes because of some SLA rule etc.
It's hard to give completely generic advice here; and in some apps, where you value consistency over everything else, manual downing may well be a feasible option. It all depends on your apps and use cases.

I'd highly recommend reading the thread "quorum based split brain resolution" on akka-user, where Roland goes over the options you have available.
It's worth keeping in mind that "everything is a trade-off"™. You will basically be trading off either availability or consistency in one way or another.