On Thu, Apr 2, 2015 at 6:33 PM, Felix Gallo <felix...@gmail.com> wrote:
> Hi Salvatore --
>
> Thanks for the condensation, it was helpful. So if I am understanding the
> spec right:
>
> 1. Starting at time of network partition, any sequence of client reads
> could be false from a cluster-majority perspective until the partition is
> healed -- either 'stale' (progress has been made on the other side of the
> split brain) or 'false future' (doomed progress has been made in a
> minority).
Exactly. An important thing is that this behavior is time bound: as
soon as the minority side detects it is isolated (node-timeout time
elapsed without contact with the majority), it stops accepting writes.
Moreover, if the partition heals before node-timeout elapses, no
stale read or lost write happens.
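As a sketch (the port and key are hypothetical), a query sent to a
node on the minority side after node-timeout is refused with an error
along these lines:

    $ redis-cli -p 7002 GET mykey
    (error) CLUSTERDOWN The cluster is down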
> 2. Similarly, during a partition, any sequence of client writes could be
> destined to be ignored (doomed progress being made in a minority).
Yes, this is basically like "1". You can lose writes even if they are
sent to the majority partition, in the case of a very complex sequence
of successive partitions.
For example, suppose you are in a partition with the majority of
masters, including master M2, whose slaves S2a and S2b are partitioned
into the minority.
A client writes to M2; then the network changes so that M2 is left
unreachable in the minority, while S2a and S2b are back in the majority.
It is possible to use "WAIT" in order to make sure a write is
propagated to at least, for example, one slave; however, my feeling is
that if you can't tolerate this kind of issue, Redis is not for you.
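As a sketch of WAIT usage (port, key, and value are hypothetical; the
second argument is the number of slaves required, the third a timeout
in milliseconds):

    redis-cli -p 7000 SET mykey somevalue
    redis-cli -p 7000 WAIT 1 100

WAIT blocks until at least one slave has acknowledged the previous
writes, or until 100 milliseconds have elapsed, and replies with the
number of slaves that acknowledged.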
Before Cluster, the same happened in this way:
1. Client is writing to a master.
2. The master-slave link breaks.
3. More writes are received by the master...
4. The master is no longer available.
5. Sentinel (or any other system) fails over the master to the slave,
which misses some writes.
Basically there are a number of different sequences of partitions you
can invent that will violate write safety, because of the use of
asynchronous replication and last-failover-wins.
Those failure modes can be made a lot less likely using a combination
of Redis Cluster features and plain Redis features: an appropriate
node timeout, the Redis option to stop accepting writes if not
enough slaves are online, WAIT, and so forth.
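As a sketch, these map to redis.conf options like the following
(values are illustrative, not recommendations):

    cluster-node-timeout 15000   # ms after which an unreachable node is considered failing
    min-slaves-to-write 1        # refuse writes if fewer than 1 slave is connected
    min-slaves-max-lag 10        # ...or if the connected slaves lag more than 10 seconds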
> 3. At partition time, it's further possible that some extra set of writes
> are lost by dying or minority-partitioned masters, because they acknowledge
> local writes to clients before they replicate (for speed).
Yes, this is always true even without partitions, just considering
single master failures. With async replication it is always possible
that the client gets the acknowledgment before replication propagates
the write.
In practical terms it is very hard to trigger this in real-world
simulations, since once the event loop is re-entered, the write
reaches the sockets of both the slaves and the client (for the ACK),
so the window is usually small.
> 4. Recovery can take a long time with large keys and stop progress for the
> duration of key transfer.
No, recovery is pretty immediate. It's resharding that may have issues
with large keys, but this is, at least currently, a user-triggered
operation.
On failures you'll see the slave immediately impersonating the master
once the FAIL state is reached for the master.
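This can be observed (a sketch; the port is hypothetical) by
inspecting the cluster view of any reachable node:

    redis-cli -p 7000 CLUSTER NODES

Nodes flagged "fail" are in the FAIL state, and the promoted slave
will be listed as a master.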
> And so from a mental model / reasoning perspective, I end up with:
>
> A. Redis-cluster comes with the possibility that some arbitrarily large
> subset of your clients could enter into an arbitrarily long false alternate
> data universe.
It's more time bound than arbitrarily long, because of two features:
1. The minority side stops accepting queries.
2. Slaves can be configured not to fail over after a given amount of
disconnection time from the master (see the sketch below).
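Point "2" corresponds, as a sketch, to this redis.conf option (the
value is illustrative):

    # Roughly: a slave does not attempt a failover if its replication link
    # with the master was down for longer than node-timeout times this factor.
    cluster-slave-validity-factor 10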
> B. The lifetime of that universe, while expected to be very short in
> practice, does depend on network bandwidth and reliability, suggesting
> operationally that Redis-cluster should be deployed in the same data center
> to minimize risk.
See above: you can bound it a lot, but I agree with your conclusion anyway.
What you can actually do is to have a multiple-DC setup where the
slaves in the other data center are only there in order to be
mass-promoted using "CLUSTER FAILOVER TAKEOVER" in the event of a
disaster, but usually would not be used at all.
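As a sketch (the port is hypothetical), this would be run against each
slave in the surviving data center:

    redis-cli -p 7005 CLUSTER FAILOVER TAKEOVER

The TAKEOVER variant promotes the slave to master immediately, without
requesting any agreement from the rest of the cluster.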
> C. Recovering from that universe could take an arbitrarily long time.
If we also count being able to talk with an available cluster as part
of recovering, then yes: for example, clients in the minority
partition will be unable to talk with the nodes for all the time the
partition lasts.
However, the time during which they'll experience stale reads or
acknowledged-but-lost writes is at most node-timeout.
Note that there is nothing preventing us from entering protection
mode, on the minority side, even before node-timeout. There is no
option for this currently, but it would be possible to say: for the
sake of failure detection the unavailability time is 30 seconds, but
the minority side enters protection after 2 seconds and stops
accepting queries. It looks useful, and I was tempted to add it
multiple times.
> I'm having a tough time coming up with a safe/usable use case in which the
> cluster is accepting any kind of causal writes and not acting as a straight
> cache. I feel like I have to be missing something.
TLDR: from this point of view, it is exactly like Redis master-slave,
but safer, because there is a majority to check in order to enter
protection mode, and to better orchestrate failovers.
Using WAIT it is possible to ensure that at least N slaves have a copy
of the data, which makes the write a lot safer. It is still possible
to find an attacker-chosen sequence of partitions that will fail over
to the wrong slave, but in practical terms it is way safer. However,
with WAIT you are using synchronous replication, which does not give
the usual Redis-level performance, so I would use it only at the
specific points in the application where a given write is particularly
important.
Cheers,
Salvatore