Avoiding Split Brain Scenarios


Zachary Marois

Feb 19, 2018, 10:24:50 AM
to Lagom Framework Users
I originally posted this to the Lagom Gitter. I didn't see a reply there, and feel it is better represented as a thread, so I'm moving it here. I think I'll post a similar question to the Akka user list, as it probably isn't Lagom-specific.

I'm on the team with @acchaulk, who reported to the gitter on 1/31:

We are having an issue with shard allocation right now when we upgraded to 1.4. We are still using persistence mode for shard coordination
java.lang.IllegalArgumentException: requirement failed: Shard [83] already allocated: State(Map

We resolved our scenario on 1/31 by rolling back to Lagom 1.3.10. We attempted the upgrade again on 2/7, hit the same problem, and rolled back again.

I was drafting a deeper description of what we saw and how our service is configured when more "Invalid replayed event..." occurrences began during a subsequent software deployment while staying on 1.3.10, which made us realize we have a broader misunderstanding of how this can occur.

Can you help us understand how we could possibly have multiple writers to an entity? Specifically, I'd think it has to be a Split Brain Scenario (as @renatocaval pointed out), but I don't understand how that could occur given our configuration. While the shard coordinator problems could be faster-failing in 1.4, our problems with split brain even in 1.3.10 seem like they could be the same root cause.

Our understanding is that a Split Brain Scenario can occur in two ways:
  1. During a network partition, nodes on either side are downed (either automatically or manually) while the other side remains up. This is pretty well identified by the docs on auto downing
  2. At startup, if a node refers to itself as a seed as the first in the seed list and cannot contact other seeds, it will form a new cluster. This is well described in the second paragraph under the link to ConstructR here.
We use neither a split brain resolver nor the built-in auto-down; per the recommendation referred to here, we deal with unreachable nodes manually (for now), so I cannot see how option 1 could occur.
Our cluster is configured to use all nodes except the current node as seeds. Only before the cluster is established, on our first deploy, do we let a node set itself as a seed. So I can't see how option 2 could occur.

We never identified a time at which the cluster member list on some nodes contained only a subset of the nodes while the missing nodes reported their own disjoint set of UP members. It could have happened, but we never observed it.

Are we missing some mechanism that could cause a split brain scenario? Is there any other way that we could have multiple concurrent writers to an entity without a split brain scenario?

Daniel Stoner

Feb 20, 2018, 8:32:39 AM
to Lagom Framework Users
If you walk through your persisted data you will see that it identifies where Shard [83] was allocated at any given moment. This may help you pin down precisely when the shard ended up allocated to 2 servers - and from there you can start to go back through precisely what those 2 servers thought the state of the world was directly before and after that point.

For our systems, when we started hitting this exact same issue, we made sure to log the cluster state from each of our servers at a regular (5-second) interval. That way it became a simple investigation of 'Oh, message 2032 allocated Shard[83] to Server Y but it was already on Server X as well - and this happened at 8:22'; from there, look at the 5 seconds after for the telltale sign of 'Server Y is in a different cluster to Server X' and look 'some time back' for when that started.
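Something along these lines was enough for the periodic logging - a minimal sketch in Java against the Akka 2.5 cluster API (the class name and log wording are mine, not from any library):

    import akka.actor.ActorSystem;
    import akka.cluster.Cluster;
    import scala.concurrent.duration.Duration;
    import java.util.concurrent.TimeUnit;

    public class ClusterStateLogger {
        // Logs this node's view of the cluster every 5 seconds so that the logs
        // from different servers can be lined up when investigating a split.
        public static void start(ActorSystem system) {
            Cluster cluster = Cluster.get(system);
            system.scheduler().schedule(
                Duration.create(0, TimeUnit.SECONDS),
                Duration.create(5, TimeUnit.SECONDS),
                () -> system.log().info("Cluster state seen from {}: {}",
                    cluster.selfAddress(), cluster.state()),
                system.dispatcher());
        }
    }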

Your seed choice logic seems rather open to errors. Bear in mind that even if you specify 10 seed nodes, it is 'the first to respond' which becomes your seed. Hence it is quite easy to imagine that multiple servers all pointing at the same 3 seed nodes actually turn into 3 clusters.
Here is the basic approach we took (after a lot of trial and error):
  • A node should only be chosen as a seed if it is already in a cluster.
  • A node should only start a cluster (seed node = itself) if it has the earliest deployment time of all current servers (this is at least one thing all our servers can agree on, and no newer server can ever 'become' the oldest instance). A rough sketch of this selection policy follows.
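In rough Java, the selection looked something like the sketch below. Everything here is illustrative: Peer, its launchTime and inCluster fields, and how they get populated (for us, from the cloud provider's instance metadata) are our own, not any Akka or Lagom API.

    import java.time.Instant;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical view of one running server, built from instance metadata.
    class Peer {
        final String address;      // e.g. "akka.tcp://MyService@10.0.0.1:2552" (format assumed)
        final Instant launchTime;  // when that instance was deployed/launched
        final boolean inCluster;   // does it already report being part of a cluster?
        Peer(String address, Instant launchTime, boolean inCluster) {
            this.address = address;
            this.launchTime = launchTime;
            this.inCluster = inCluster;
        }
    }

    class SeedSelection {
        // Rule 1: only nodes that are already in a cluster are usable as seeds.
        // Rule 2: if nobody is in a cluster yet, only the oldest instance may
        //         seed itself; everyone else returns nothing and retries later.
        static List<String> chooseSeeds(Peer self, List<Peer> all) {
            List<String> alreadyJoined = all.stream()
                .filter(p -> p.inCluster)
                .map(p -> p.address)
                .collect(Collectors.toList());
            if (!alreadyJoined.isEmpty()) {
                return alreadyJoined;
            }
            Instant oldest = all.stream()
                .map(p -> p.launchTime)
                .min(Comparator.naturalOrder())
                .orElse(self.launchTime);
            return self.launchTime.equals(oldest)
                ? Collections.singletonList(self.address) // oldest instance starts the cluster
                : Collections.emptyList();                // not the oldest: wait and retry
        }
    }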

Renato Cavalcanti

Feb 20, 2018, 9:29:03 AM
to Lagom Framework Users

Hi Zachary,

You can only have two writers to an entity if you create two clusters or if you instantiate the entities yourself. The latter is very unlikely. You can't do it by accident; the Lagom API is pretty much in your way and makes it non-obvious.

I'm starting to believe that there are a few things mixed up here. Basically, it's not possible for this message to occur when using Lagom 1.4.0 but not when using Lagom 1.3.10. Somehow the bootstrap of your cluster is creating two cluster islands.

The way you describe your cluster bootstrap is also not clear to me. It seems a little bit hacky and unnecessary.

You don't need to remove the current node from the seed-nodes list. Here is an example:

Consider three nodes: A, B and C, where A is the first seed node. Each node can be configured with all addresses in the list of seed-nodes.

config for A: seed-nodes = [ A, B, C ]
config for B: seed-nodes = [ A, B, C ]
config for C: seed-nodes = [ A, B, C ]

The only important thing is that all nodes have node A as the first in the list.
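In application.conf that is just the same static list everywhere (actor system name, hosts and port here are made up):

    akka.cluster.seed-nodes = [
      "akka.tcp://MyService@hostA:2552",
      "akka.tcp://MyService@hostB:2552",
      "akka.tcp://MyService@hostC:2552"
    ]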

Node A will behave differently because it's the first node. It's the only one allowed to join itself, but only after trying to reach the other nodes.
The other nodes won't ack that A can form a cluster with them because they are not part of a cluster yet, so after some time, node A will decide to join itself and start the cluster. Later, B and C will join A.

You should also make sure that Lagom's property lagom.defaults.cluster.join-self is set to off. This is off by default for production.

On Gitter, @acchaulk has mentioned that you were performing a rolling update from 1.3.10 to 1.4.0. How exactly was this being done, and how were the seed-nodes configured?

Have you tried to perform a full cluster shutdown, switch to ddata and re-boot the cluster?

Cheers,

Renato

Zachary Marois

Feb 20, 2018, 2:55:35 PM
to Lagom Framework Users
Thanks for the replies.

So first, I'll try to better explain our seed node configuration:

I think I understand how a simple, static seed node list would work. A complication for us is that we are running on AWS Elastic Beanstalk, which means I do not know prior to our first deploy what our nodes are, and they can change rather easily (if we perform some types of upgrades or scale up/down the cluster). So we wanted a mechanism that would find the seed nodes at startup by querying AWS for the instances that are allocated for my application. I *could have* made that seed list the first three such nodes, sorted by IP, but I felt it was safer to exclude the self node to actually prevent split brain scenarios. My reasoning is that a cluster cannot form if a node does not point to itself as the first seed node on bootstrap (and join-self is set to false). I consider this approach similar to the kubernetes suggestion, but at startup instead of via the Kubernetes deployment orchestrator, and with the self-node removed. The static seed list [ A, B, C ] could cause A to establish a second cluster if B and C do not respond in time (either because they are restarting also, or they are unhealthy, or there is a network partition).
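For context, the shape of our seed provider is roughly the following. This is only a sketch: describeRunningPeerAddresses stands in for our EC2 lookup and is not a real API; the joinSeedNodes call is the one described below.

    import akka.actor.ActorSystem;
    import akka.actor.Address;
    import akka.cluster.Cluster;
    import java.util.List;
    import java.util.stream.Collectors;

    public class AwsSeedProvider {
        // Hypothetical: query AWS for the instances allocated to this application
        // and return their Akka addresses, sorted by IP. Implementation elided.
        static List<Address> describeRunningPeerAddresses() {
            throw new UnsupportedOperationException("EC2 DescribeInstances lookup elided");
        }

        public static void joinCluster(ActorSystem system) {
            Cluster cluster = Cluster.get(system);
            Address self = cluster.selfAddress();
            // Exclude ourselves so that a node can never bootstrap a cluster alone.
            List<Address> seeds = describeRunningPeerAddresses().stream()
                .filter(address -> !address.equals(self))
                .collect(Collectors.toList());
            cluster.joinSeedNodes(seeds);
        }
    }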

Am I misunderstanding the situation in which a new cluster could form? Daniel's explanation makes me worry that 3 clusters could form purely based on who responds first even if 3 clusters didn't already exist. Regardless of seed node information, could existing akka-persistence sharding data, if it does not match the current cluster state, introduce it?

During our rolling upgrade from 1.3.10 to 1.4.0, we had 8 nodes and deployed the upgrade to them in batches of 2 in-place (the infrastructure was not changed, just the docker container we are running on them). The seeds were configured as described above: every node had 7 seeds, all but themselves, specified by invoking Cluster.get(actorSystem).joinSeedNodes(...). We actually do log some cluster state on a regular basis: on each node, once a minute, we log the node's address, its upNumber, its status, the number of hosts it thinks should be in the cluster, the number of unreachable hosts, and the leader. During our upgrade/revert, we never recorded a time when fewer than 6 nodes were in the cluster. We did see 4 occurrences of unreachable nodes, but never that those unreachable nodes had formed a smaller cluster. It's possible we briefly formed a split-brain scenario during those 1-minute intervals, so I could turn up the check rate, but I'm skeptical that we could have established a new cluster briefly, then rejoined the main cluster.

We have not yet attempted a full cluster shutdown with blowing away sharding data (either deleting the journal entries or swapping to ddata). We could, but that would require the 1.4.0 upgrade, and as you point out, our problem does not seem 1.4.0/1.3.10 related. We finally (as of yesterday) have backups of our Cassandra database, so I should be able to attempt this upgrade with replicated production data (we couldn't reproduce with any other test data) in a safe place. I think this will be the first thing I try, because it's the only reproducible evidence I have.

Considering both of you are skeptical of my seed logic, I could also adjust it by:
  • Swapping it out for something like ConstructR (which I did not know about when we built our logic)
  • Changing it to only use nodes that are in a cluster right now
  • Changing it to use the same list of seed nodes, in the same order, on all instances.
Thanks for the ideas
Zach

Daniel Stoner

Feb 21, 2018, 5:22:28 AM
to Zachary Marois, Lagom Framework Users
Renato most likely has the more up-to-date awareness of the precise logic behind cluster joining (my experience comes from tackling this situation in a much earlier version of Akka clustering, and we also deployed onto AWS [that was a number of months of fun before we had a stable cluster!]).

I cannot recommend enough that you really walk through your persisted events and identify what happened and precisely when.
Akka is persisting lots of cluster-state-related things all of the time so that new members don't try to steal a shard from another instance of their own cluster.

When you go through that and see Shard[83] being allocated a second time, you will have the time at which this occurred and the instance on which it occurred to within a minute or two. You can then investigate what the state of your servers was at the time and identify what caused the issue.

Thanks kindly,
Daniel Stoner




--
Daniel Stoner | Senior Software Engineer BSSCS | Ocado Technology
Buildings One & Two, Trident Place, Mosquito Way, Hatfield, Hertfordshire, AL10 9UL



Renato

Feb 21, 2018, 5:35:29 AM
to Lagom Framework Users
Hi Zachary,

see answers in-line...

On 20 February 2018 at 20:55:36, Zachary Marois (zma...@cimpress.com) wrote:

The static seed list [ A, B, C ] could cause A to establish a second cluster if B and C do not respond in time (either because they are restarting also, or they are unhealthy, or there is a network partition).

A static list won't allow a second cluster to form if B and C are restarting. If they are restarting, they are not part of the cluster anymore and they will wait for A to form the cluster.

The other situation is indeed true. If there is a network partition when redeploying A, it won't 'see' B or C and will decide that it's safe to create a cluster. That's why it's important to have good control over re-deployments.

I guess that you are first starting the cluster by deploying the first node alone, waiting for it to be up, and then adding the others with a seed-node list excluding their own address. Is that correct?

That was my point about the static list: you don't need to exclude B's address from B's seed-node list if B is not the first in the list. B won't form a cluster alone if it's not the first one. It will wait until one of the others acknowledges that it's safe to join.



Am I misunderstanding the situation in which a new cluster could form? Daniel's explanation makes me worry that 3 clusters could form purely based on who responds first even if 3 clusters didn't already exist. 

No, that's not how it happens. Let's consider the following situation (again A, B and C). You start the three nodes each with a static list [A, B, C].

What will happen is that B and C will be running, but not yet joining the cluster. A (the first one) will ping B and C to see if they are already part of a cluster. They will answer that they are not yet, so A will form the cluster. B and C continue to ping the other nodes (A, C or A, B) in an attempt to join the cluster. At that point, A has already created the cluster and will give permission to B and C to join as well.

It's possible that B joins first and later C will ping A and B and get an acknowledgment from B that it's ok to join, because B is now part of the cluster. In that sense, B is the seed for C, but only after it has joined A in the cluster. So, you can't have 3 clusters forming just because you deploy them all.


Regardless of seed node information, could existing akka-persistence sharding data, if it does not match the current cluster state, introduce it?

I don't think so. When your sharding data is corrupted (because of a previous split) it will fail to assign the shard and you will get timeouts instead.



During our rolling upgrade from 1.3.10 to 1.4.0, we had 8 nodes and deployed the upgrade to them in batches of 2 in-place (the infrastructure was not changed, just the docker container we are running on them). The seeds were configured as described above: every node had 7 seeds, all but themselves, specified by invoking Cluster.get(actorSystem).joinSeedNodes(...). 

Are you calling joinSeedNodes(..) yourself? Lagom does the cluster formation automatically. No need to do it yourself. 

We have not yet attempted a full cluster shutdown with blowing away sharding data (either deleting the journal entries or swapping to ddata). We could, but that would require the 1.4.0 upgrade, and as you point out, our problem does not seem 1.4.0/1.3.10 related. We finally (as of yesterday) have backups of our Cassandra database, so I should be able to attempt this upgrade with replicated production data (we couldn't reproduce with any other test data) in a safe place. I think this will be the first thing I try, because it's the only reproducible evidence I have.

I indeed believe that the problem comes from the bootstrap logic, so it’s best to first sort that out before upgrading to 1.4.0 and moving to ddata.

Considering both of you are skeptical of my seed logic, I could also adjust it by:
  • Swapping it out for something like ConstructR (which I did not know about when we built our logic)
  • Changing it to only use nodes that are in a cluster right now
  • Changing it to use the same list of seed nodes, in the same order, on all instances.

You may be interested in this project. Make sure to go through the cluster bootstrap and service discovery sections, especially the one about AWS.

https://developer.lightbend.com/docs/akka-management/current/index.html

https://developer.lightbend.com/docs/akka-management/current/bootstrap.html

https://developer.lightbend.com/docs/akka-management/current/discovery.html
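For reference, once the discovery side is configured, wiring Akka Cluster Bootstrap in is essentially two start-up calls. A minimal sketch (package names are taken from a recent akka-management release and may differ slightly between versions):

    import akka.actor.ActorSystem;
    import akka.management.javadsl.AkkaManagement;
    import akka.management.cluster.bootstrap.ClusterBootstrap;

    public class BootstrapStarter {
        public static void start(ActorSystem system) {
            // Exposes the HTTP management endpoint that bootstrap relies on.
            AkkaManagement.get(system).start();
            // Discovers peers (e.g. via the AWS discovery module) and forms or
            // joins the cluster without a static seed-node list.
            ClusterBootstrap.get(system).start();
        }
    }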

Cheers,

Renato


Zachary Marois

Feb 21, 2018, 10:05:21 AM
to Lagom Framework Users
Great deduction on our mechanism for establishing a cluster; we do deploy a single node alone when first establishing the cluster. We acknowledge that we have improvements to make with this pattern, and your suggestion about not excluding a self node if it isn't the first entry in the list (or just moving it out of the first position) is a great idea for this.

We are calling joinSeedNodes(...) ourselves and explicitly not setting any config value for seeds. Without setting the seed nodes in the config, I'm not sure how Lagom could form the cluster. We do this because we don't have the seeds until we start the Java process. We could have implemented this seed lookup prior to starting the Java process, determining our seeds on the command line instead of in the Java process and setting them via system properties; we set akka.remote.netty.tcp.hostname this way. Do you think the mechanism Lagom uses to form a cluster could conflict with us calling joinSeedNodes? I don't see any Lagom code invoking joinSeedNodes, so I don't quite know where to look to analyze its logic for forming a cluster to determine this.

Thanks for the reinforcement of the independence of our problem from 1.4.0. We still might upgrade to 1.4.0, which seems to at least make the problem more reproducible.

We will look into Akka Management. One less thing to manage ourselves if possible.

Thanks a ton,

Zach

Zachary Marois

Feb 21, 2018, 10:21:25 AM
to Lagom Framework Users
And thanks, Daniel, for prodding us to look harder at the sharding data. We had looked at it, but didn't quite know how to interpret it. We will continue to look at it. If you know of any documentation on how that data is structured, let us know. We also found it odd that the shard allocation errors we saw varied per deployment: on one it would be shards for some read side processors, on the next it would be the persistent entity. I don't think we reported this situation before, and I don't have enough hard evidence to describe or support it now. I can add such evidence if we see it in future tests.

Renato

Feb 21, 2018, 10:29:54 AM
to Zachary Marois, Lagom Framework Users

Lagom will join the cluster if there are seed-nodes configured. That happens automatically when you create an instance of Cluster and there are seed-nodes declared. Sorry, I didn't mean to say that Lagom will call joinSeedNodes, but that it forms the cluster automatically.

Your way of forming a cluster won't conflict with Lagom because first Lagom, or actually Akka, will try to form the cluster. You may see a warning that no seed-nodes were found and that you need to do it manually. Later, your own code will do the job.

There is a Lagom class, JoinClusterImpl, that is used on bootstrap to create the cluster whenever needed.


Daniel Stoner

Feb 21, 2018, 10:59:43 AM
to Zachary Marois, Lagom Framework Users
Re: documentation about the structure of the persisted data - not that I know of, and it's inherently journal-implementation dependent (e.g. Cassandra, DynamoDB, JDBC are all different).
I would clone your production event datastore and try to start up a single instance locally pointing to it.

It should fail pretty much instantly with Shard[83] is already allocated, and you can then put breakpoints into your persistence implementation code (I assume you use a community journal implementation for your AWS datastore?) wherever it is reading - and see what those items look like. If it's easier, put a log statement in the right place to log every item read from the journal and you'll find exactly the one which breaks things - then go back from there and work out how you end up with 2 allocations to different IPs.
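For the Cassandra journal specifically, something like this pulls the coordinator's rows out directly - a rough sketch with the DataStax Java driver, assuming the default akka-persistence-cassandra keyspace/table names ('akka'/'messages'; Lagom typically overrides the keyspace per service) and the /sharding/<name>Coordinator persistence id pattern:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class ShardCoordinatorDump {
        public static void main(String[] args) {
            try (Cluster cassandra = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cassandra.connect()) {
                // Keyspace, table and persistence_id are assumptions; adjust them
                // to your own journal configuration and entity name.
                for (Row row : session.execute(
                        "SELECT persistence_id, partition_nr, sequence_nr "
                      + "FROM akka.messages "
                      + "WHERE persistence_id = '/sharding/MyEntityCoordinator' "
                      + "AND partition_nr = 0")) {
                    System.out.println(row);
                }
            }
        }
    }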

We implemented a JSON serializer for both our entities and Akka's internal entities (this was far from trivial), and so it was made a little easier in that we could just read the rows from our store.


--
Daniel Stoner | Senior Software Engineer BSSCS | Ocado Technology
Buildings One & Two, Trident Place, Mosquito Way, Hatfield, Hertfordshire, AL10 9UL

Zachary Marois

Feb 21, 2018, 11:27:55 AM
to Lagom Framework Users
Yeah, we are using (via Lagom) the Cassandra journal implementation. I'm working on cloning our production Cassandra cluster now for testing this (either locally, or in a safe place).

Thanks.




Zachary Marois

Feb 27, 2018, 11:19:20 AM
to Lagom Framework Users
Alright, I have some test results. Some background on the state in which I was running my tests:

I was able to get my production data restored to a safe place for testing. I was always running against this data store with all my entity journal entries, as well as any data my read side processors write, but in some tests I deleted the entity and RSP sharding journal entries and snapshots. During all my tests, I was running load at 10 requests per second - 2 writes per entity, no reads - which is roughly our total production load, but a lower ratio of writes per entity than we generally have, and a greater write/read ratio than we generally have. This pattern likely didn't write enough data to identify a split brain scenario in the entity data alone, nor did I read entities after passivation to see the replay filter's warnings of a past split-brain scenario. Any "reproductions" I had were purely that I saw a "shard already allocated" error and/or had a significant number of failing requests. The initial 1.4 upgrade didn't seem to introduce any issue, but some subsequent (sometimes first, sometimes second) application restart would. I was running with 8 nodes in my clusters (at full capacity; I'd drop lower during deploys).

I was able to reproduce the "shard already allocated" issue along with significant request failures consistently on Lagom 1.4 using Persistence sharding state store.

I was also able to reproduce on Lagom 1.4 with a statically-set seed list (via config properties) and my seed provider not used. To do this, I used my seed provider while running 1.3.10, got a cluster of 8, then deployed a new version with my seed provider stripped out and a hard-coded list of seeds, then upgraded to 1.4 with the same seed list.

I was also able to reproduce by shutting down my cluster, deleting sharding data (snapshots and messages for the entity coordinator and my read side processor coordinators), and starting up a fresh cluster. One peculiarity with this attempt: before deleting the sharding data, I had journal entries up to sequence_nr 28377. After shutting down, blowing away the shard data, and starting a new cluster, the new cluster started journal data at 27000, which is suspiciously close to the end of where I deleted. It seems like I could have missed some indication to the cluster of where the journal data got to, but I can't find where that would be, unless there is more journal data than what is keyed by
persistence_id = "/sharding/{MyEntityName}Coordinator" AND partition_nr = 0;

Next, I tried DData. I shut down my cluster completely and started it up per the config described in the 1.4 Migration Guide. I then restarted my app nine separate times. During these restarts, I did not see any indication that there were misallocated shards (I'd expect the log to be different than before, as it isn't reading the same data structure, but still to be a warning or error), nor did I see systemic errors reported to my load test (I saw ~30 errors during these nine restarts, which still isn't perfect, but doesn't seem to indicate a systemic sharding issue).

Some other tests I haven't run enough are:
  • To run my load tests with cases that should catch some multiple-writers to entities, introduce a split brain scenarios to confirm that they occur, and confirm they catch no such scenarios when using DData.
  • To introduce akka cluster management instead of our seed provider or the static seeds
  • To reproduce without our prod data (I have trouble seeing how a large amount of data in the same keyspace, but not for the entities I am running load against, could be affecting this)
  • To reproduce with a local cluster so I can debug or at least see the sharding data better
  • To strip out as much of my app as possible and still reproduce, with the goal of figuring out what we are doing that could interact with the shard settings, and potentially to show how it could be reproduced with a sample application.
Those are roughly in the order of how likely we are to actually test them. If I can confirm we are stable with DData, we'll probably upgrade to that without the rest, although I'm still concerned that I'm misunderstanding something, even if it is just how our configuration interacts with the Persistence sharding state store.

Thanks for all the help and ideas.

Zach

Zachary Marois

Mar 15, 2018, 12:00:10 PM
to Lagom Framework Users [deprecated]
So, we ran some more tests, namely bullets 1 (make sure we can accurately identify a reproduction of multiple writers, and test with DData) and 3 (reproduce without existing data) above. We were also able to isolate the issue to when we were deploying multiple servers roughly concurrently, namely, that we couldn't reproduce with 4 servers deployed at 25% at a time nor 8 servers deployed at 12.5% at a time, but we could reproduce with 4 servers at 50% or 8 servers at 25% (our original scenario).

While we now have more info about when this is happening, and that DData at least improves it, we aren't any closer to why. At this point, we are just planning to move to DData.

Daniel Stoner

Mar 15, 2018, 12:54:27 PM
to Zachary Marois, Lagom Framework Users [deprecated]
It is possible to subscribe to 'Cluster Events' e.g. joining, leaving, etc.
If you subscribe to all the events - every node will report what it thinks the other nodes are doing at any given moment in time. You will see what node A thinks the cluster looks like and can compare with what node B thinks the cluster looks like.

Check out the documentation here:
https://doc.akka.io/docs/akka/2.5/cluster-usage.html
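A minimal listener along the lines of that page (a Java sketch; the actor and the log wording are ours):

    import akka.actor.AbstractLoggingActor;
    import akka.cluster.Cluster;
    import akka.cluster.ClusterEvent;
    import akka.cluster.ClusterEvent.MemberEvent;
    import akka.cluster.ClusterEvent.ReachabilityEvent;

    public class ClusterListener extends AbstractLoggingActor {
        private final Cluster cluster = Cluster.get(getContext().getSystem());

        @Override
        public void preStart() {
            // Receive membership and reachability events, replaying the current
            // state as events on subscription so nothing is missed.
            cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(),
                MemberEvent.class, ReachabilityEvent.class);
        }

        @Override
        public void postStop() {
            cluster.unsubscribe(getSelf());
        }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(MemberEvent.class, e ->
                    log().info("{} observed {}", cluster.selfAddress(), e))
                .match(ReachabilityEvent.class, e ->
                    log().info("{} observed {}", cluster.selfAddress(), e))
                .build();
        }
    }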

Given you can re-create the situation, I suspect you will find that when you deploy 2 instances at the same moment, they do not agree about who they are in a cluster with.
That then allows you to drill into 'Oh so they thought that was the situation' and work from there - if it is something that your curiosity plans to fix :)

Switching to DData may perhaps just hide the fact you have the problem - but you may become very aware of it in other situations later down the development journey.




--
Daniel Stoner | Senior Software Engineer BSSCS | Ocado Technology
Buildings One & Two, Trident Place, Mosquito Way, Hatfield, Hertfordshire, AL10 9UL

Zachary Marois

Mar 15, 2018, 1:43:08 PM
to Lagom Framework Users [deprecated]
Yeah, we are subscribing to and logging those events. We are also logging the cluster state every minute, but we aren't logging the cluster state on every one of those events, so that would be a good improvement. We are also only logging counts of nodes that are up/down, not which nodes. All useful improvements.

For what it is worth, during all our tests, we actually did identify a single entity that had a single invalid event, so there was still a split brain. It was just significantly less likely, resolved itself, and did not block the sharding coordinator for all future requests.



