On Mon, Oct 22, 2012 at 9:03 PM, Ilias Bertsimas <
awar...@gmail.com> wrote:
> I had some network issues last week and lots of cluster negotiations and
> configuration changes, maybe something happened there that made the cluster
> inconsistent.
Possibly. One thing I forgot to mention: you could start up the
failed nodes in isolated mode (i.e. not connected to the cluster, and
with the application prevented from writing to them) to investigate
how they really differ. Is one row missing? Is the whole table
missing? Are the rows there but the primary keys out of sync? Have
other tables diverged as well, or just this one?
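
To make those questions concrete, here is a rough sketch in Python of
how the comparison could be organized. It assumes you have already
dumped the same table from the good node and from a failed node into
dicts keyed by primary key. The function name, the dict layout and the
sample data are all made up for illustration; nothing here is a Galera
or MySQL API.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory dump of one table: primary key -> row values.
Table = Dict[int, Tuple]


def classify_divergence(good: Optional[Table], failed: Optional[Table]) -> dict:
    """Compare the same table as dumped from two nodes.

    Pass None for a node where the table does not exist at all.
    Primary keys are assumed to be sortable (e.g. integers).
    """
    if good is None or failed is None:
        missing = [name for name, table in (("good", good), ("failed", failed))
                   if table is None]
        return {"table_missing_on": missing}

    good_keys, failed_keys = set(good), set(failed)
    return {
        "table_missing_on": [],
        # Rows that exist on one node only.
        "only_on_good": sorted(good_keys - failed_keys),
        "only_on_failed": sorted(failed_keys - good_keys),
        # Same primary key on both nodes, but different column values.
        "differing_rows": sorted(k for k in good_keys & failed_keys
                                 if good[k] != failed[k]),
    }


# Made-up sample data: row 3 is missing on the failed node, row 2 differs.
good = {1: ("alice",), 2: ("bob",), 3: ("carol",)}
failed = {1: ("alice",), 2: ("bobby",)}
report = classify_divergence(good, failed)
```

Running the same comparison over every table would also tell you
whether only this one table has diverged. Entries in both
"only_on_good" and "only_on_failed" at the same time may be a hint
that the rows are there but the primary keys are out of sync.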
> Another question is why they did not try to perform an SST and just aborted
> altogether. Does it have to do with the fact that only 1 node was left and
> lost PRIMARY mode because of that?
No, this is the intended behavior whether it happens to one node or
all slaves. The point is just to shut down the failed node and leave
it to you to decide what to do. For instance, you may want to look at
the nodes to investigate what has happened, and you may want to
manually salvage some data. Note that Galera only knows that the
databases have diverged; it cannot know which database is "better",
since that is a subjective judgement only you can make.
If you then decide that the single node that is still alive has the
"good" database, you can simply restart the failed nodes. They will
wipe out their current database and do a full SST, after which they
should be identical to the donor node and everything will be fine
again.
Hint: You may want to take a copy of the failed nodes' databases
before restarting them, and keep the copies for some months. If a user
complains that their data has gone missing, you can go into your saved
copies to see if it is there. This could provide clues to what
happened and when, or at least allow you to manually "copy-paste" the
lost data into the running cluster.
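
As a rough illustration of that lookup, the Python sketch below lists
the rows that exist in a saved copy but not in the running cluster.
It assumes each saved copy and the live table have been loaded into
dicts keyed by primary key. The function name, node names and data
are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical in-memory dump of one table: primary key -> row values.
Table = Dict[int, Tuple]


def rows_missing_from_live(saved_copies: Dict[str, Table], live: Table) -> Dict[str, Table]:
    """For each saved copy, return the rows whose primary key is absent from live."""
    found = {}
    for node, table in saved_copies.items():
        missing = {pk: row for pk, row in table.items() if pk not in live}
        if missing:
            found[node] = missing
    return found


# Made-up example: the copy saved from "node2" still has row 4.
saved_copies = {
    "node2": {1: ("alice",), 4: ("dave",)},
    "node3": {1: ("alice",)},
}
live = {1: ("alice",), 2: ("bob",)}
candidates = rows_missing_from_live(saved_copies, live)
```

Whether a row found this way should really go back into the cluster
is still your call. The sketch only tells you where to look.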
henrik