On Wed, May 28, 2014 at 3:26 PM, Andrei Lukovenko <
al...@cordeo.ru> wrote:
> Hello,
>
> First of all, thank you for the response.
>
> Regarding the definition of the split-brain I am still not convinced. In
> my example both instances A and B consider themselves masters. Both of them
> are able to serve clients, including writes. If it is not a split-brain,
> then what is?..
Split-brain conditions must be evaluated from the point of view of
whoever is the source of authority in the distributed system.
In this case that is the set of Sentinel instances. As long as there
is no split-brain condition among the Sentinels themselves, the split
brain you see in the Redis instances is not a problem, because
Sentinel always (with some delay) applies its logical configuration
to the instances.
> The sequence described above is not imaginary. I've actually seen this
> exact situation during my tests, it is very real, and what I really want is
> to find a way to prevent repeating this in production.
Probably what you observed is what I described in the previous email?
That's definitely possible.
1) A failover starts.
2) The Sentinel sends SLAVEOF NO ONE to the slave.
3) The Sentinel gets killed before getting the acknowledgment.
4) The Sentinel restarts with the old config (which is correct since
the previous failover was not technically finished, and the Sentinel
never advertised the new master).
At this point you have two masters if you check the instances, but for
Sentinel the master is still the old one.
After some time (8 seconds, that is, four times the configuration
broadcasting period) it should detect that one of the slaves is
misconfigured and reconfigure it accordingly. If this does not happen,
there is a bug.
All this, of course, in Sentinel >= 2.8.
The Sentinel shipped with 2.6 is broken and deprecated. In fact, in
the latest 2.6 branch it is a dummy binary that warns you to use 2.8.
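To make the 8-second rule above concrete, here is a minimal Python
sketch of the check, not the actual Sentinel code. The 2-second
broadcasting period is inferred from "8 seconds = four times the
period"; RoleReport and instances_to_demote are hypothetical names
made up for this illustration.

```python
from dataclasses import dataclass
from typing import List

# Assumption: broadcasting period of 2 seconds, inferred from
# "8 seconds, that is, four times the configuration broadcasting period".
PUBLISH_PERIOD_S = 2.0
FIX_DELAY_S = 4 * PUBLISH_PERIOD_S


@dataclass
class RoleReport:
    """Hypothetical record of what an instance says about itself."""
    name: str
    reported_role: str       # "master" or "slave"
    role_reported_at: float  # time the instance first reported this role


def instances_to_demote(logical_master: str,
                        reports: List[RoleReport],
                        now: float) -> List[str]:
    """Return instances that claim to be master, are not the logical
    master, and have claimed so for longer than the fix delay."""
    return [
        r.name
        for r in reports
        if r.name != logical_master
        and r.reported_role == "master"
        and now - r.role_reported_at > FIX_DELAY_S
    ]


# The scenario from this thread: the Sentinel still considers A the
# master, but B was promoted with SLAVEOF NO ONE just before the crash.
reports = [RoleReport("A", "master", 0.0), RoleReport("B", "master", 1.0)]
too_early = instances_to_demote("A", reports, now=5.0)
after_delay = instances_to_demote("A", reports, now=10.0)
```

In this sketch B is left alone while its report is younger than the
delay, and is listed for reconfiguration (back to a slave of A) once
the delay has passed.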
> So far it seems that sentinel is able to change (and actually save on
> disk) configuration of an instance (master or slave), but does not change
> its own configuration. Is that correct?
Yes and no. It deliberately does not save the new configuration,
because it has not yet received the acknowledgment.
What is interesting here is that it always saves the updated
configuration (with fsync) *before* advertising the new
configuration to clients and other Sentinels.
If it is not able to get the ack, it will reconfigure the new master
back to a slave.
If this does not happen, then there is a bug in the implementation.
The designed semantics are very clear; the problem is only if you find
a case where, because of an implementation bug, things do not work as
expected.
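The "save with fsync before advertising" ordering described above can
be sketched in Python as follows. This is only an illustration of the
ordering under stated assumptions: Sentinel uses its own config file
format rather than JSON, the temp-file-plus-rename step is a choice
made for this sketch, and persist_then_advertise and the advertise
callback are hypothetical names.

```python
import json
import os
import tempfile
from typing import Callable


def persist_then_advertise(config: dict,
                           path: str,
                           advertise: Callable[[dict], None]) -> None:
    """Durably write config to path, and only then call advertise."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f)
            f.flush()
            os.fsync(f.fileno())   # force the data to disk first
        os.replace(tmp_path, path)  # then swap the new file into place
    except BaseException:
        # Nothing was advertised, so the old config on disk still holds.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Only reached once the new configuration is safely on disk.
    advertise(config)
```

With this ordering, a crash before the write completes leaves the old
configuration on disk and nothing advertised, which matches step 4 of
the sequence earlier in this message.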
I'm trying to reproduce it right now. Thanks for posting, it is vital
that we remove all the bugs in order to end up with a system that
behaves as the specification claims.
Salvatore