Hi,
I believe I am hitting the corner case described on the weighted quorum documentation page:
Note
Warning: If a group partitions at the moment when
the weight change message is delivered, all partitioned components that
deliver weight change messages in the transitional view will become
non-primary components. Partitions that deliver messages in the regular
view will go through quorum computation with the applied weight when
the following transitional view is delivered.
In other words, there is a corner case where the entire
cluster can become non-primary component, if the weight changing message
is sent at the moment when partitioning takes place. Recovering from
such a situation should be done either by waiting for a re-merge or by
inspecting which partition is most advanced and by bootstrapping it as a
new Primary Component.
I believe I am hitting this case when networking issues between AWS regions knock 2 of my nodes out of a 5-node cluster. No network issues are reported between the remaining three nodes, but I sometimes lose the primary component/quorum when the partitioning occurs.
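For reference, this is how I understand the normal quorum arithmetic for my setup. It is only a rough sketch: I am assuming every node has the default weight of 1 and that no node left gracefully, and the function name and node names are made up for illustration. The rule, as I read the documentation, is that a component stays primary when its weight is strictly greater than half the weight of the previous primary component, not counting nodes that left gracefully.

```python
def has_quorum(prev_weights, left_gracefully, present):
    """Return True if the `present` nodes keep quorum.

    prev_weights: dict of node name -> weight for the previous primary component.
    left_gracefully: set of node names that shut down cleanly.
    present: set of node names in the component being checked.
    """
    prev_total = sum(prev_weights.values())
    left_total = sum(prev_weights[n] for n in left_gracefully)
    present_total = sum(prev_weights[n] for n in present)
    # Strict majority of the weight that did not leave gracefully.
    return present_total > (prev_total - left_total) / 2


# Assumption: 5 nodes, each with weight 1 (hypothetical node names).
weights = {f"node{i}": 1 for i in range(1, 6)}
surviving = {"node1", "node2", "node3"}  # the three nodes that still see each other
cut_off = {"node4", "node5"}             # the two nodes lost to the network issue

print(has_quorum(weights, set(), surviving))  # 3 > 2.5
print(has_quorum(weights, set(), cut_off))    # 2 > 2.5 does not hold
```

By this arithmetic the three remaining nodes should stay primary, which is why I suspect the corner case above rather than an ordinary loss of quorum.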
If I am indeed hitting this corner case, is there anything else I can do to prevent it from occurring during partitioning? Or is the only way to avoid this scenario to avoid partitioning in the first place? I already increased my timeout values according to this page; do I just need to make those timeouts higher?