The keyword here is "auto". Autodowning is an *incredibly braindead* algorithm for dealing with nodes coming out of service, and if you use it in production you more or less guarantee disaster, because that algorithm can't cope with cluster partitions. You *do* need to deal with downing, but you have to use something smarter than that.

Frankly, if you're already hooking into AWS, I *suspect* the best approach is to leverage that -- when a node goes offline, you have some code that detects it through the ECS APIs, reacts to it, and manually downs that node. (I'm planning on something along those lines for my system, but haven't actually tried it yet.) But whether you do that or something else, you've got to add *something* that does downing.

I believe the official party line is "buy a Lightbend Subscription", through which you can get their Split Brain Resolver, a fairly battle-hardened module for dealing with this problem. That's not strictly necessary, but you *do* need a reliable solution...
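To make that concrete, here is a rough, untested sketch of the "detect it through ECS and manually down it" idea. The ECS-polling part is passed in as a function, since that's application-specific and not something from this thread; the downing call itself is the standard Cluster(system).down API:

    import akka.actor.{ActorSystem, Address}
    import akka.cluster.Cluster
    import scala.concurrent.duration._

    object ManualDowner {
      // liveContainerAddresses is supplied by whatever code polls the ECS APIs;
      // it should return the akka addresses of the containers ECS still considers alive.
      def start(system: ActorSystem, liveContainerAddresses: () => Set[Address]): Unit = {
        val cluster = Cluster(system)
        import system.dispatcher

        // Periodically compare the cluster's unreachable members against ECS's view.
        // Anything ECS no longer knows about is really gone, so down it explicitly.
        system.scheduler.schedule(30.seconds, 30.seconds) {
          val live = liveContainerAddresses()
          cluster.state.unreachable
            .filterNot(member => live.contains(member.address))
            .foreach(member => cluster.down(member.address))
        }
      }
    }

The 30-second interval is arbitrary; the point is that the downing decision is driven by an external source of truth (ECS) rather than by a timeout.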
On Wed, Aug 3, 2016 at 8:42 PM, Eric Swenson <er...@swenson.org> wrote:
We have an akka-cluster/sharding application deployed on AWS/ECS, where each instance of the application is a Docker container. An ECS service launches N instances of the application based on configuration data. It is not possible to know, for certain, the IP addresses of the cluster members. Upon startup, before the ActorSystem is created, the code polls AWS and determines the IP addresses of all the Docker hosts (which could potentially run the akka application). It sets these IP addresses as the seed nodes before bringing up the akka cluster system.

The configuration for these has, up until yesterday, always included the akka.cluster.auto-down-unreachable-after setting, and it has always worked. Furthermore, it supports two very critical requirements:

a) an instance of the application can be removed at any time, due to scaling or rolling updates
b) an instance of the application can be added at any time, due to scaling or rolling updates

On the advice of an Akka expert on the Gitter channel, I removed the auto-down-unreachable-after setting, which, as documented, is dangerous for production. As a result, the system no longer supports rolling updates. A rolling update occurs thus: a new version of the application is deployed (a new ECS task definition is created with a new Docker image). The ECS service launches a new task (a Docker container running on an available host) and, once that container becomes stable, it kills one of the remaining instances (cluster members) to bring the number of instances down to some configured value.

When this happens, akka-cluster becomes very unhappy and unresponsive. Without the auto-down-unreachable-after setting, it keeps trying to talk to the old cluster member, which is no longer present. It appears NOT to recover from this. There is a constant barrage of messages of the form:

[DEBUG] [08/04/2016 00:19:27.126] [ClusterSystem-cassandra-plugin-default-dispatcher-27] [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path sequence [/system/sharding/ExperimentInstance#-389574371] failed
[DEBUG] [08/04/2016 00:19:27.140] [ClusterSystem-cassandra-plugin-default-dispatcher-27] [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path sequence [/system/sharding/ExperimentInstance#-389574371] failed
[DEBUG] [08/04/2016 00:19:27.142] [ClusterSystem-cassandra-plugin-default-dispatcher-27] [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path sequence [/system/sharding/ExperimentInstance#-389574371] failed
[DEBUG] [08/04/2016 00:19:27.143] [ClusterSystem-cassandra-plugin-default-dispatcher-27] [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path sequence [/system/sharding/ExperimentInstance#-389574371] failed
[DEBUG] [08/04/2016 00:19:27.143] [ClusterSystem-cassandra-plugin-default-dispatcher-27] [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path sequence [/system/sharding/ExperimentInstance#-389574371] failed

and of the form:
[WARN] [08/04/2016 00:19:16.787] [ClusterSystem-akka.actor.default-dispatcher-9] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstance] Retry request for shard [5] homes from coordinator at [Actor[akka.tcp://ClusterSystem...@10.0.3.100:2552/system/sharding/ExperimentInstanceCoordinator/singleton/coordinator#1679517511]]. [1] buffered messages.
[WARN] [08/04/2016 00:19:18.787] [ClusterSystem-akka.actor.default-dispatcher-9] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstance] Retry request for shard [23] homes from coordinator at [Actor[akka.tcp://ClusterSystem...@10.0.3.100:2552/system/sharding/ExperimentInstanceCoordinator/singleton/coordinator#1679517511]]. [1] buffered messages.
[WARN] [08/04/2016 00:19:18.787] [ClusterSystem-akka.actor.default-dispatcher-9] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstance] Retry request for shard [1] homes from coordinator at [Actor[akka.tcp://ClusterSystem...@10.0.3.100:2552/system/sharding/ExperimentInstanceCoordinator/singleton/coordinator#1679517511]]. [1] buffered messages.
[WARN] [08/04/2016 00:19:18.787] [ClusterSystem-akka.actor.default-dispatcher-9] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstance] Retry request for shard [14] homes from coordinator at [Actor[akka.tcp://ClusterSystem...@10.0.3.100:2552/system/sharding/ExperimentInstanceCoordinator/singleton/coordinator#1679517511]]. [1] buffered messages.
[WARN] [08/04/2016 00:19:18.787] [ClusterSystem-akka.actor.default-dispatcher-9] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstance] Retry request for shard [5] homes from coordinator at [Actor[akka.tcp://ClusterSystem...@10.0.3.100:2552/system/sharding/ExperimentInstanceCoordinator/singleton/coordinator#1679517511]]. [1] buffered messages.
and then a message like this:
[WARN] [08/03/2016 23:50:34.690] [ClusterSystem-akka.remote.default-remote-dispatcher-11] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.3.100%3A2552-0] Association with remote system [akka.tcp://ClusterSystem@10.0.3.100:2552] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://ClusterSystem@10.0.3.100:2552]] Caused by: [Connection refused: /10.0.3.100:2552]
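For readers landing on this thread: a minimal sketch of the startup sequence described above (programmatically discovered seed nodes plus the auto-down setting being removed) could look roughly like this. It is an illustration, not Eric's actual code; the akka.tcp transport and port 2552 are assumed from the logs, and dockerHostIps stands in for whatever the AWS polling returns:

    import akka.actor.{ActorSystem, Address}
    import akka.cluster.Cluster
    import com.typesafe.config.ConfigFactory

    object ClusterBootstrap {
      def start(dockerHostIps: Seq[String]): ActorSystem = {
        // The setting under discussion; removing it means something else must
        // take over the downing decision.
        val config = ConfigFactory
          .parseString("akka.cluster.auto-down-unreachable-after = 10s")
          .withFallback(ConfigFactory.load())

        val system = ActorSystem("ClusterSystem", config)

        // The Docker host IPs discovered via the AWS/ECS APIs become the seed nodes.
        val seedNodes = dockerHostIps.toList
          .map(ip => Address("akka.tcp", "ClusterSystem", ip, 2552))
        Cluster(system).joinSeedNodes(seedNodes)
        system
      }
    }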
It does do reassignment -- but it has to know to do that. Keep in mind that "down" is the master switch here: until the node is downed, the rest of the system doesn't *know* that NodeA should be avoided. I haven't dug into that particular code, but I assume from what you're saying that the allocation algorithm doesn't take unreachability into account when choosing where to allocate the shard, just up/down. I suspect that unreachability is too local and transient to use as the basis for these allocations.

Keep in mind that you're looking at this from a relatively all-knowing global perspective, but each node is working from a very localized and imperfect one. All it knows is "I can't currently reach NodeA". It has no a priori way of knowing whether NodeA has been taken offline (so it should be avoided), or there's simply been a transient network glitch between here and there (so things are *mostly* business as usual). Downing is how you tell it, "No, really, stop using this node"; until then, most of the code assumes that the more-common transient situation is the case. It's *probably* possible to take unreachability into account in the case you're describing, but it doesn't surprise me if that's not true.

Also, keep in mind that, IIRC, there are a few cluster singletons involved here, at least behind the scenes. If NodeA currently owns one of the key singletons (such as the ShardCoordinator), and it hasn't been downed, I imagine the rest of the cluster is going to *quickly* lock up, because the result is that nobody is authorized to make these sorts of allocation decisions.

All that said -- keep in mind, I'm just a user of this stuff, and am talking at the edges of my knowledge. Konrad's the actual expert...
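One way to see the distinction being drawn here is to subscribe to cluster membership events. The listener below (my own illustration, not code from the thread) logs the difference between a node that is merely unreachable -- a local, possibly transient suspicion -- and one that has actually been downed and removed from the cluster:

    import akka.actor.{Actor, ActorLogging, Props}
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent._

    class MembershipListener extends Actor with ActorLogging {
      private val cluster = Cluster(context.system)

      override def preStart(): Unit =
        cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
          classOf[UnreachableMember], classOf[ReachableMember], classOf[MemberRemoved])

      override def postStop(): Unit = cluster.unsubscribe(self)

      def receive: Receive = {
        case UnreachableMember(m) =>
          // Only a local suspicion: maybe the node is gone, maybe it's a network glitch.
          log.warning("{} is unreachable", m.address)
        case ReachableMember(m) =>
          log.info("{} is reachable again -- it was just a glitch", m.address)
        case MemberRemoved(m, previousStatus) =>
          // The cluster-wide consequence of an actual downing decision.
          log.info("{} was removed (previous status: {})", m.address, previousStatus)
      }
    }

    object MembershipListener {
      def props: Props = Props(new MembershipListener)
    }

Until something (auto-downing, a human, an ECS-driven downer, or the Split Brain Resolver) actually downs the node, the listener will only ever see the first kind of event, which matches the locked-up state described above.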
No, you are doing a great job explaining these! Maybe a guest blog post? (wink, wink, nudge, nudge ;) )