Sharding coordinator comms issues running on Kubernetes

228 views
Skip to first unread message

Stephen Kennedy

unread,
Aug 1, 2017, 9:11:07 AM8/1/17
to Akka User List
Hi,

Seeing some strange issues running Cluster Sharded actors in a kubernetes environment.

We are currently running a 3 node akka cluster, with our app running within a docker container within a kubernetes stateful set (similar to the akka-seed set described here). The nodes are all akka-http API servers, which run behind a HTTP load balancer, and use akka-persistence for our domain entities. So we use cluster sharding to ensure that each entity can only live on a single node at once.

So the cluster config of our akka nodes looks something like this:

akka {
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = ${POD_NAME}.api
      port = 2551
    }
  }

  cluster {
    seed-nodes = [
      "akka.tcp://actor-...@api-0.api:2551",
      "akka.tcp://actor-...@api-1.api:2551",
      "akka.tcp://actor-...@api-2.api:2551"
    ]
  }
}

Where "api" is the kubernetes service (which provides DNS mapping) and "api-0/1/2" are the consistent pod names that using a stateful set gives us. Using the default sharding config.

And within the code we have a number of calls to ClusterSharding.start - for each type of our sharded entity actors. We then only fire a message to these actors when an appropriate API call comes in.

Now when the nodes come up they all consistently connect to the cluster properly, and I see gossip messages suggesting they all know about each other, but we are then seeing communication issues on the Cluster Sharded actors.

As far as I can tell, if the first API request for a particular type of entity comes into api-1 or api-2, it often fails because that node is unable to communicate with the coordinator - which it seems to think is on api-0.

From api-1:
2017-08-01 12:22:15.253 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:23:05.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:09.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]] after [15000 ms]. Sender[null] sent message of type "com.goodlord.server.domain.envelope.EnvelopeAggregate$Protocol$Get".
2017-08-01 12:23:10.344 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - 500 Internal Server Error

The trying to register log lines are then repeated forever and if more API calls come into this node (for this shard) the buffered count just goes up. Logging into the 

However, if we subsequently hit the API and the load balancer chooses api-0, then it works and also seems to initialize the coordinator (which then triggers api-1 to register which clears it's backlog):

From api-0:
2017-08-01 12:25:05.902 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:25:05.932 INFO  a.c.s.ClusterSingletonManager - Singleton manager starting singleton actor [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton]
2017-08-01 12:25:05.933 DEBUG akka.cluster.ddata.Replicator - Received Get for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.338 DEBUG a.c.sharding.DDataShardCoordinator - ShardRegion registered: [Actor[akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.339 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.341 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardRegionRegistered(Actor[akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])
2017-08-01 12:25:07.341 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton/coordinator -> akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard]
2017-08-01 12:25:07.345 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.348 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardHomeAllocated(22,Actor[akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])

From api-1:

2017-08-01 12:25:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:25:07.342 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShard -> akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator]
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [22] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [13] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [16] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [6] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [29] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [20] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [21] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [10] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.353 DEBUG akka.cluster.sharding.ShardRegion - Shard [22] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.354 DEBUG akka.cluster.sharding.ShardRegion - Shard [13] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.362 DEBUG akka.cluster.sharding.ShardRegion - Shard [16] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]

So it seems to me that the initial call to ClusterSharding.start doesn't actually do the registration of all shard regions on the cluster and that it is only when we subsequently fire a message to the shard region that the initialization occurs. 

Is this expected behaviour? Or have we got something wrong with the initialization of our app? 

Sure enough if I hack together some code to fire a random message to the shard on startup then all our problems go away.

Thanks in advance for any help here - don't have a great deal of experience with akka clustering so I'm sure we've done something stupid somewhere.

Cheers!
Stephen

Stephen Kennedy

unread,
Aug 1, 2017, 9:39:06 AM8/1/17
to Akka User List
As a follow up, I've now just ruled out any kubernetes networking issue by recreating the same behaviour running 3 nodes via docker-compose instead - again it only works if an API call hits the first node in seed-nodes list.


On Tuesday, 1 August 2017 14:11:07 UTC+1, Stephen Kennedy wrote:
Hi,

Seeing some strange issues running Cluster Sharded actors in a kubernetes environment.

We are currently running a 3 node akka cluster, with our app running within a docker container within a kubernetes stateful set (similar to the akka-seed set described here). The nodes are all akka-http API servers, which run behind a HTTP load balancer, and use akka-persistence for our domain entities. So we use cluster sharding to ensure that each entity can only live on a single node at once.

So the cluster config of our akka nodes looks something like this:

akka {
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = ${POD_NAME}.api
      port = 2551
    }
  }

  cluster {
    seed-nodes = [
      "akka.tcp://actor-system@api-0.api:2551",
      "akka.tcp://actor-system@api-1.api:2551",
      "akka.tcp://actor-system@api-2.api:2551"
    ]
  }
}

Where "api" is the kubernetes service (which provides DNS mapping) and "api-0/1/2" are the consistent pod names that using a stateful set gives us. Using the default sharding config.

And within the code we have a number of calls to ClusterSharding.start - for each type of our sharded entity actors. We then only fire a message to these actors when an appropriate API call comes in.

Now when the nodes come up they all consistently connect to the cluster properly, and I see gossip messages suggesting they all know about each other, but we are then seeing communication issues on the Cluster Sharded actors.

As far as I can tell, if the first API request for a particular type of entity comes into api-1 or api-2, it often fails because that node is unable to communicate with the coordinator - which it seems to think is on api-0.

From api-1:
2017-08-01 12:22:15.253 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:23:05.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-system@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-system@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:09.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-system@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]] after [15000 ms]. Sender[null] sent message of type "com.goodlord.server.domain.envelope.EnvelopeAggregate$Protocol$Get".
2017-08-01 12:23:10.344 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - 500 Internal Server Error
The trying to register log lines are then repeated forever and if more API calls come into this node (for this shard) the buffered count just goes up. Logging into the 

However, if we subsequently hit the API and the load balancer chooses api-0, then it works and also seems to initialize the coordinator (which then triggers api-1 to register which clears it's backlog):

From api-0:
2017-08-01 12:25:05.902 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:25:05.932 INFO  a.c.s.ClusterSingletonManager - Singleton manager starting singleton actor [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton]
2017-08-01 12:25:05.933 DEBUG akka.cluster.ddata.Replicator - Received Get for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.338 DEBUG a.c.sharding.DDataShardCoordinator - ShardRegion registered: [Actor[akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.339 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.341 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardRegionRegistered(Actor[akka.tcp://actor-system@api-1.api:2551/system/sharding/EnvelopeShard#92546893])
2017-08-01 12:25:07.341 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton/coordinator -> akka.tcp://actor-system@api-1.api:2551/system/sharding/EnvelopeShard]
2017-08-01 12:25:07.345 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.348 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardHomeAllocated(22,Actor[akka.tcp://actor-system@api-1.api:2551/system/sharding/EnvelopeShard#92546893])

From api-1:

2017-08-01 12:25:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-system@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:25:07.342 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShard -> akka.tcp://actor-system@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator]
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [22] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [13] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [16] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [6] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [29] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [20] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [21] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [10] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.353 DEBUG akka.cluster.sharding.ShardRegion - Shard [22] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.354 DEBUG akka.cluster.sharding.ShardRegion - Shard [13] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.362 DEBUG akka.cluster.sharding.ShardRegion - Shard [16] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]

Patrik Nordwall

unread,
Aug 1, 2017, 9:56:33 AM8/1/17
to Akka User List
Looks like ClusterSharding.start is missing at startup of api-0. Are you sure that you do that when the ActorSystem has been started and not lazily on first request?

Enable debug level logging and you should see what it is doing. You should see log entries in api-0 at startup.

/Patrik
tis 1 aug. 2017 kl. 15:39 skrev Stephen Kennedy <sk4...@gmail.com>:
As a follow up, I've now just ruled out any kubernetes networking issue by recreating the same behaviour running 3 nodes via docker-compose instead - again it only works if an API call hits the first node in seed-nodes list.


On Tuesday, 1 August 2017 14:11:07 UTC+1, Stephen Kennedy wrote:
Hi,

Seeing some strange issues running Cluster Sharded actors in a kubernetes environment.

We are currently running a 3 node akka cluster, with our app running within a docker container within a kubernetes stateful set (similar to the akka-seed set described here). The nodes are all akka-http API servers, which run behind a HTTP load balancer, and use akka-persistence for our domain entities. So we use cluster sharding to ensure that each entity can only live on a single node at once.

So the cluster config of our akka nodes looks something like this:

akka {
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = ${POD_NAME}.api
      port = 2551
    }
  }

  cluster {
    seed-nodes = [
      "akka.tcp://actor-...@api-0.api:2551",
      "akka.tcp://actor-...@api-1.api:2551",
      "akka.tcp://actor-...@api-2.api:2551"
    ]
  }
}

Where "api" is the kubernetes service (which provides DNS mapping) and "api-0/1/2" are the consistent pod names that using a stateful set gives us. Using the default sharding config.

And within the code we have a number of calls to ClusterSharding.start - for each type of our sharded entity actors. We then only fire a message to these actors when an appropriate API call comes in.

Now when the nodes come up they all consistently connect to the cluster properly, and I see gossip messages suggesting they all know about each other, but we are then seeing communication issues on the Cluster Sharded actors.

As far as I can tell, if the first API request for a particular type of entity comes into api-1 or api-2, it often fails because that node is unable to communicate with the coordinator - which it seems to think is on api-0.

From api-1:
2017-08-01 12:22:15.253 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:23:05.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:09.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]] after [15000 ms]. Sender[null] sent message of type "com.goodlord.server.domain.envelope.EnvelopeAggregate$Protocol$Get".
2017-08-01 12:23:10.344 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - 500 Internal Server Error
The trying to register log lines are then repeated forever and if more API calls come into this node (for this shard) the buffered count just goes up. Logging into the 

However, if we subsequently hit the API and the load balancer chooses api-0, then it works and also seems to initialize the coordinator (which then triggers api-1 to register which clears it's backlog):

From api-0:
2017-08-01 12:25:05.902 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:25:05.932 INFO  a.c.s.ClusterSingletonManager - Singleton manager starting singleton actor [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton]
2017-08-01 12:25:05.933 DEBUG akka.cluster.ddata.Replicator - Received Get for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.338 DEBUG a.c.sharding.DDataShardCoordinator - ShardRegion registered: [Actor[akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.339 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.341 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardRegionRegistered(Actor[akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])
2017-08-01 12:25:07.341 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton/coordinator -> akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard]
2017-08-01 12:25:07.345 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.348 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardHomeAllocated(22,Actor[akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])

From api-1:

2017-08-01 12:25:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:25:07.342 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShard -> akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator]
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [22] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [13] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [16] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [6] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [29] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [20] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [21] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [10] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.353 DEBUG akka.cluster.sharding.ShardRegion - Shard [22] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.354 DEBUG akka.cluster.sharding.ShardRegion - Shard [13] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.362 DEBUG akka.cluster.sharding.ShardRegion - Shard [16] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
So it seems to me that the initial call to ClusterSharding.start doesn't actually do the registration of all shard regions on the cluster and that it is only when we subsequently fire a message to the shard region that the initialization occurs. 

Is this expected behaviour? Or have we got something wrong with the initialization of our app? 

Sure enough if I hack together some code to fire a random message to the shard on startup then all our problems go away.

Thanks in advance for any help here - don't have a great deal of experience with akka clustering so I'm sure we've done something stupid somewhere.

Cheers!
Stephen

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at https://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.

Stephen Kennedy

unread,
Aug 1, 2017, 10:45:02 AM8/1/17
to Akka User List
Thanks for the prompt reply. As hoped it has instantly made me realize the error of our ways.

Our application bootstrap code was over using lazy val's, so as you suggest we weren't actually calling ClusterSharding.start until that node needed the ActorRef.

So I can further my understanding, could you explain the logic regarding api-1 / api-2 assuming the coordinator is on api-0 - even though it hasn't registered a ShardRegion yet? 

Is it that all nodes in the cluster are required to call ClusterSharding.start? Reading docs a bit more it sounds like that is the case, as if we wanted to only run on a sub-group of nodes we should use roles.

Thanks again for helping out. 

On Tuesday, 1 August 2017 14:56:33 UTC+1, Patrik Nordwall wrote:
Looks like ClusterSharding.start is missing at startup of api-0. Are you sure that you do that when the ActorSystem has been started and not lazily on first request?

Enable debug level logging and you should see what it is doing. You should see log entries in api-0 at startup.

/Patrik
tis 1 aug. 2017 kl. 15:39 skrev Stephen Kennedy <sk4...@gmail.com>:
As a follow up, I've now just ruled out any kubernetes networking issue by recreating the same behaviour running 3 nodes via docker-compose instead - again it only works if an API call hits the first node in seed-nodes list.


On Tuesday, 1 August 2017 14:11:07 UTC+1, Stephen Kennedy wrote:
Hi,

Seeing some strange issues running Cluster Sharded actors in a kubernetes environment.

We are currently running a 3 node akka cluster, with our app running within a docker container within a kubernetes stateful set (similar to the akka-seed set described here). The nodes are all akka-http API servers, which run behind a HTTP load balancer, and use akka-persistence for our domain entities. So we use cluster sharding to ensure that each entity can only live on a single node at once.

So the cluster config of our akka nodes looks something like this:

akka {
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = ${POD_NAME}.api
      port = 2551
    }
  }

  cluster {
    seed-nodes = [
      "akka.tcp://actor-system@api-0.api:2551",
      "akka.tcp://actor-system@api-1.api:2551",
      "akka.tcp://actor-system@api-2.api:2551"
    ]
  }
}

Where "api" is the kubernetes service (which provides DNS mapping) and "api-0/1/2" are the consistent pod names that using a stateful set gives us. Using the default sharding config.

And within the code we have a number of calls to ClusterSharding.start - for each type of our sharded entity actors. We then only fire a message to these actors when an appropriate API call comes in.

Now when the nodes come up they all consistently connect to the cluster properly, and I see gossip messages suggesting they all know about each other, but we are then seeing communication issues on the Cluster Sharded actors.

As far as I can tell, if the first API request for a particular type of entity comes into api-1 or api-2, it often fails because that node is unable to communicate with the coordinator - which it seems to think is on api-0.

From api-1:
2017-08-01 12:22:15.253 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:23:05.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-system@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-system@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:09.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-system@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]] after [15000 ms]. Sender[null] sent message of type "com.goodlord.server.domain.envelope.EnvelopeAggregate$Protocol$Get".
2017-08-01 12:23:10.344 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - 500 Internal Server Error
The trying to register log lines are then repeated forever and if more API calls come into this node (for this shard) the buffered count just goes up. Logging into the 

However, if we subsequently hit the API and the load balancer chooses api-0, then it works and also seems to initialize the coordinator (which then triggers api-1 to register which clears it's backlog):

From api-0:
2017-08-01 12:25:05.902 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:25:05.932 INFO  a.c.s.ClusterSingletonManager - Singleton manager starting singleton actor [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton]
2017-08-01 12:25:05.933 DEBUG akka.cluster.ddata.Replicator - Received Get for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.338 DEBUG a.c.sharding.DDataShardCoordinator - ShardRegion registered: [Actor[akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.339 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.341 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardRegionRegistered(Actor[akka.tcp://actor-system@api-1.api:2551/system/sharding/EnvelopeShard#92546893])
2017-08-01 12:25:07.341 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton/coordinator -> akka.tcp://actor-system@api-1.api:2551/system/sharding/EnvelopeShard]
2017-08-01 12:25:07.345 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.348 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardHomeAllocated(22,Actor[akka.tcp://actor-system@api-1.api:2551/system/sharding/EnvelopeShard#92546893])

From api-1:

2017-08-01 12:25:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-system@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:25:07.342 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShard -> akka.tcp://actor-system@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator]
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [22] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [13] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [16] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [6] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [29] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [20] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [21] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [10] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.353 DEBUG akka.cluster.sharding.ShardRegion - Shard [22] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.354 DEBUG akka.cluster.sharding.ShardRegion - Shard [13] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.362 DEBUG akka.cluster.sharding.ShardRegion - Shard [16] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
So it seems to me that the initial call to ClusterSharding.start doesn't actually do the registration of all shard regions on the cluster and that it is only when we subsequently fire a message to the shard region that the initialization occurs. 

Is this expected behaviour? Or have we got something wrong with the initialization of our app? 

Sure enough if I hack together some code to fire a random message to the shard on startup then all our problems go away.

Thanks in advance for any help here - don't have a great deal of experience with akka clustering so I'm sure we've done something stupid somewhere.

Cheers!
Stephen

Patrik Nordwall

unread,
Aug 1, 2017, 3:36:22 PM8/1/17
to Akka User List
You're welcome. Yes, you are supposed to register (start) the sharding on all nodes (or nodes with a certain role). The coordinator is a Cluster Singleton, i.e. running on oldest node (when you have started it). Other nodes try to register to the coordinator at the oldest node.

/Patrik
      "akka.tcp://actor-...@api-0.api:2551",
      "akka.tcp://actor-...@api-1.api:2551",
      "akka.tcp://actor-...@api-2.api:2551"
    ]
  }
}

Where "api" is the kubernetes service (which provides DNS mapping) and "api-0/1/2" are the consistent pod names that using a stateful set gives us. Using the default sharding config.

And within the code we have a number of calls to ClusterSharding.start - for each type of our sharded entity actors. We then only fire a message to these actors when an appropriate API call comes in.

Now when the nodes come up they all consistently connect to the cluster properly, and I see gossip messages suggesting they all know about each other, but we are then seeing communication issues on the Cluster Sharded actors.

As far as I can tell, if the first API request for a particular type of entity comes into api-1 or api-2, it often fails because that node is unable to communicate with the coordinator - which it seems to think is on api-0.

From api-1:
2017-08-01 12:22:15.253 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:23:05.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:09.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]] after [15000 ms]. Sender[null] sent message of type "com.goodlord.server.domain.envelope.EnvelopeAggregate$Protocol$Get".
2017-08-01 12:23:10.344 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - 500 Internal Server Error
The trying to register log lines are then repeated forever and if more API calls come into this node (for this shard) the buffered count just goes up. Logging into the 

However, if we subsequently hit the API and the load balancer chooses api-0, then it works and also seems to initialize the coordinator (which then triggers api-1 to register which clears it's backlog):

From api-0:
2017-08-01 12:25:05.902 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:25:05.932 INFO  a.c.s.ClusterSingletonManager - Singleton manager starting singleton actor [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton]
2017-08-01 12:25:05.933 DEBUG akka.cluster.ddata.Replicator - Received Get for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.338 DEBUG a.c.sharding.DDataShardCoordinator - ShardRegion registered: [Actor[akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.339 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.341 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardRegionRegistered(Actor[akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])
2017-08-01 12:25:07.341 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton/coordinator -> akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard]
2017-08-01 12:25:07.345 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.348 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardHomeAllocated(22,Actor[akka.tcp://actor-...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])

From api-1:

2017-08-01 12:25:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:25:07.342 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShard -> akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator]
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [22] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [13] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [16] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [6] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [29] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [20] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [21] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry request for shard [10] homes from coordinator at [Actor[akka.tcp://actor-...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
Reply all
Reply to author
Forward
0 new messages