We are hitting an issue with shard allocation now that we have upgraded to 1.4. We are still using persistence mode for shard coordination:
java.lang.IllegalArgumentException: requirement failed: Shard [83] already allocated: State(Map
lagom.defaults.cluster.join-self is set to off. This is off by default for production.
On 20 February 2018 at 20:55:36, Zachary Marois (zma...@cimpress.com) wrote:
The static seed list [ A, B, C ] could cause A to establish a second cluster if B and C do not respond in time (either because they are also restarting, or they are unhealthy, or there is a network partition).
A static list won't allow a second cluster to form if B and C are restarting. If they are restarting, they are not part of the cluster anymore and they will wait for A to form the cluster.
The other situation is indeed true. If there is a network partition when redeploying A, it won't 'see' B or C and will decide that it's safe to create a cluster. That's why it's important to have good control of re-deployments.
I guess that you are first starting the cluster by deploying the first node alone, waiting for it to be up, and then adding the others with a seed-node list excluding their own address. Is that correct?
That was my point about the static list: you don't need to exclude B's address from B's seed-node list if B is not the first on the list. B won't form a cluster alone if it's not the first one. It will wait until one of the others acknowledges that it's safe to join.
Am I misunderstanding the situation in which a new cluster could form? Daniel's explanation makes me worry that 3 clusters could form purely based on who responds first even if 3 clusters didn't already exist.
No, that's not how it happens. Let's consider the following situation (again A, B and C). You start the three nodes each with a static list [A, B, C].
What will happen is that B and C will be running, but not yet joined to the cluster. A (the first one) will ping B and C to see if they are already part of a cluster. They will answer that they are not yet, so A will form the cluster. B and C continue to ping the other nodes (A, C or A, B) in an attempt to join the cluster. By that point, A has already created the cluster and will give B and C permission to join as well.
It's possible that B joins first and later C pings A and B and gets an acknowledgment from B that it's ok to join, because B is now part of the cluster. In that sense, B is the seed for C, but only after it has joined A in the cluster. So you can't have 3 clusters forming just because you deployed them all.
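The scenario above corresponds to configuring the same static seed list, in the same order, on every node. A minimal sketch of what that configuration might look like, assuming classic remoting on port 2552, an actor system named MyService, and placeholder hosts A, B, C:

```hocon
akka.cluster.seed-nodes = [
  "akka.tcp://MyService@A:2552",
  "akka.tcp://MyService@B:2552",
  "akka.tcp://MyService@C:2552"
]
```

Because A is first in the list, only A will self-join when no existing cluster is found; B and C keep retrying the other entries until one of them answers from inside a cluster.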
Regardless of seed node information, could existing akka-persistence sharding data introduce the problem if it does not match the current cluster state?
I don't think so. When your sharding data is corrupted (because of a previous split), it will fail to assign the shard and you will get timeouts instead.
During our rolling upgrade from 1.3.10 to 1.4.0, we had 8 nodes and deployed the upgrade to them in batches of 2 in-place (the infrastructure was not changed, just the docker container we are running on them). The seeds were configured as described above: every node had 7 seeds, all but themselves, specified by invoking Cluster.get(actorSystem).joinSeedNodes(...).
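For reference, the "all but themselves" seed-list construction described above can be sketched like this. The host names are placeholders; in the real bootstrap logic the entries would be full akka.tcp addresses handed to Cluster.joinSeedNodes:

```java
import java.util.List;
import java.util.stream.Collectors;

public class SeedList {
    // Every node lists all known cluster members except itself, as
    // described above: 8 nodes, 7 seeds each.
    static List<String> seedsExcludingSelf(List<String> allNodes, String self) {
        return allNodes.stream()
                .filter(node -> !node.equals(self))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("A", "B", "C");
        System.out.println(seedsExcludingSelf(nodes, "B")); // [A, C]
    }
}
```

Note that this yields a different seed list (and a different first seed) on every node, which is exactly the property Renato's static-list suggestion avoids.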
Are you calling joinSeedNodes(...) yourself? Lagom does the cluster formation automatically; there is no need to do it yourself.
We have not yet attempted a full cluster shutdown with blowing away sharding data (either deleting the journal entries or swapping to ddata). We could, but that would require the 1.4.0 upgrade, and as you point out, our problem does not seem 1.4.0/1.3.10 related. We finally (as of yesterday) have backups of our Cassandra database, so I should be able to attempt this upgrade with replicated production data (we couldn't reproduce with any other test data) in a safe place. I think this will be the first thing I try, because it's the only reproducible evidence I have.
I indeed believe that the problem comes from the bootstrap logic, so it’s best to first sort that out before upgrading to 1.4.0 and moving to ddata.
Considering both of you are skeptical of my seed logic, I could also adjust it by:
- Swapping it out for something like ConstructR (which I did not know about when we built our logic)
- Changing it to only use nodes that are in a cluster right now
- Changing it to use the same list of seed nodes, in the same order, on all instances.
You may be interested in this project. Make sure to go through the cluster bootstrap and service discovery sections, especially the one about AWS.
https://developer.lightbend.com/docs/akka-management/current/index.html
https://developer.lightbend.com/docs/akka-management/current/bootstrap.html
https://developer.lightbend.com/docs/akka-management/current/discovery.html
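As a rough illustration of what Akka Cluster Bootstrap replaces the static list with, the configuration looks roughly like the fragment below. The key names are taken from the akka-management docs linked above and have moved between releases, and the service name is a placeholder, so treat this as a sketch and verify against the docs for the version you use:

```hocon
# Discover contact points via EC2 tags instead of a static seed list
akka.discovery.method = aws-api-ec2-tag-based

akka.management.cluster.bootstrap.contact-point-discovery {
  service-name = "my-service"   # placeholder: whatever identifies your service's instances
}
```

The bootstrap process then uses the discovered contact points to decide safely which node forms the cluster, instead of each node computing its own seed list.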
Cheers,
Renato
Lagom will join the cluster if there are seed-nodes configured. That happens automatically when you create an instance of Cluster and there are seed-nodes declared. Sorry, I didn't mean to say that Lagom will call joinSeedNodes, but that it forms the cluster automatically.
Your way of forming a cluster won't conflict with Lagom because Lagom, or rather Akka, will try to form the cluster first. You may see a warning that no seed-nodes were found and that you need to join manually. Later, your own code will do the job.
There is a Lagom class, JoinClusterImpl, that is used on bootstrap to create the cluster whenever needed.
persistence_id = "/sharding/{MyEntityName}Coordinator" AND partition_nr = 0;
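That WHERE-clause fragment targets the sharding coordinator's events in the akka-persistence-cassandra journal. A hedged example of inspecting them, assuming the plugin's default keyspace akka and table messages (adjust both to your configuration, and substitute your actual entity type name for {MyEntityName}):

```sql
-- Inspect the shard coordinator's persisted events before deleting anything
SELECT persistence_id, sequence_nr
FROM akka.messages
WHERE persistence_id = '/sharding/{MyEntityName}Coordinator'
  AND partition_nr = 0;
```

Deleting these rows (with the whole cluster stopped) is what "blowing away sharding data" above refers to; the coordinator state is rebuilt when the cluster next forms.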