Redis Cluster status & ETA


Salvatore Sanfilippo

Jan 9, 2014, 12:36:34 PM1/9/14
to Redis DB
Dear Redis community,

2014 has just started, and this was supposed to be the time when Redis
Cluster reached stability. While a stable release is still not planned
for the coming days, at this point the process of taking Redis Cluster
through increasing stages of stability, to the point where we'll be
confident calling it a production-ready product, has started. This
email summarizes what is missing from Redis Cluster in order to be
complete, and the next steps that will happen in the coming weeks to
reach 3.0.0-stable.
Note that production-ready will be a moving target as usual, and
different users may start using Redis Cluster at different times. For
this reason the roadmap that the Redis 3.0 release will follow is a
bit different compared to the one of previous releases.
This is indeed the first topic addressed by this email.

Cluster roadmap
===

Usually the Redis roadmap for a new release is the following:

1) Freeze the development of new features, unless they are considered
to have low impact on the rest of the code base.
2) Wait for the release to get more and more stable, publishing
Release Candidate versions of Redis.
3) When critical bug reports start arriving very rarely, and then stop
arriving at all for weeks, so that the next RC version provides only
minor fixes, I call the release stable.

This time, instead of development -> freeze -> RCs -> stable, we'll
add a new step, betas, so it will look like this:

development -> freeze -> betas -> RCs -> stable.

In this way we'll publish betas as usable and upgradable (when the
next beta is available) releases of Redis ASAP, to favor early
adoption of Redis Cluster in environments where it was tested and
considered stable enough for the task at hand.

The first beta will be version 2.9.51; the first RC, as usual, will be 2.9.101.

Cluster ETA
===

In the rest of this email you'll find a detailed list of what is
currently missing. However, before enumerating all the missing
features and boring you to death, I'll get to the point: when will
Redis Cluster be stable enough to be used in your production environment?

This is a hard question, since software does not follow a
pre-scheduled plan to get stable ;-) But as explained above, there will
be different levels of stability available to you ASAP.
To start, I'll release a new beta of Redis Cluster every month. This
is the plan:

10 Feb: all the missing points listed below already implemented,
beta-1 released.
1 March: beta-2
1 April: beta-3 or RC1

After this point a beta (or RC, based on feedback and bug reports)
will be released monthly, until no critical bug has been reported for
a few weeks. Then we'll call it 3.0.0. This is likely to happen before June.

Cluster clients!
===

The real critical item in the Cluster ETA is IMHO client libraries.
Redis Cluster needs some more work, but at this point it is an
incremental process that will go forward easily. The client landscape,
instead, is a bit lacking at this point.
I suggest organizations interested in Redis Cluster invest some
money / development time, as donations to the developer(s) of their
client of choice, in order to speed up the process.

Having a Redis Cluster working well server side will not help without
good clients. While Redis clients tend not to be super complex, Redis
Cluster clients are a bit more articulated and require some testing
and care.

And now a list of issues that are work in progress:

Enhancement: read-only access to slave nodes.
Issue: https://github.com/antirez/redis/issues/1406
===

One thing missing (but fortunately a few lines of code away) is
reading from slaves.
As detailed in the issue linked above, this will be addressed with an
additional READONLY command that tells the node that we no longer want
read-after-write consistency for this session and are happy to
potentially read stale data.
The client will only be redirected if it issues a read for a hash slot
not served by the node, and in that specific case the redirection
message will not list just a single ip / port pair, but the master and
all the slaves for this hash slot.
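To make the client side concrete, here is a minimal Python sketch
(purely illustrative, names are made up) of parsing the MOVED / ASK
redirection errors the cluster already sends today:

```python
def parse_redirect(err: str):
    """Parse a Redis Cluster redirection error such as
    'MOVED 3999 127.0.0.1:6381' or 'ASK 3999 127.0.0.1:6381'.
    Returns (kind, slot, host, port), or None if the error
    is not a redirection."""
    parts = err.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    # rpartition handles the host:port split from the right,
    # so a host containing ':' would not break the port parse.
    host, _, port = parts[2].rpartition(":")
    return (parts[0], int(parts[1]), host, int(port))
```

A client would call this on every error reply, update its slot map on
MOVED, and retry against the indicated node.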

Cluster-wide backups/restores.
===

The ability to take snapshots of the whole cluster, in the form of RDB
files with attached information about the served hash slots, is
critical for safe operations.
This way it will be possible to restore the cluster at a later time
if needed, in any setup where there is at least the same number of
masters.
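As a sketch of the idea (the format here is purely hypothetical,
nothing like this exists yet), a backup tool could write a small
manifest associating every node's RDB file with the hash slots it
serves, so a later restore can reassign slots to a new set of masters:

```python
import json

def backup_manifest(nodes):
    """Build a hypothetical cluster backup manifest.

    nodes maps a node name to the list of [start, end] hash slot
    ranges it serves; the manifest pairs each node's RDB file name
    with those ranges."""
    return json.dumps(
        [{"rdb": f"{name}.rdb", "slots": ranges}
         for name, ranges in nodes.items()],
        sort_keys=True,
    )
```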

Migration tool
===

This is another important concern for everybody migrating from a
different sharding setup (for example client-based sharding or
Twemproxy) to Redis Cluster.
The migration tool should take the address of the old Redis instances
and the address of the cluster, and copy every key from the old
instances to the cluster via SCAN + MIGRATE.
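The copy loop itself is trivial; here is a minimal Python sketch with
the SCAN iterator and the per-key MIGRATE call injected as parameters
(both names are illustrative), so the same logic can be exercised
without a live server:

```python
def migrate_keys(scan_iter, migrate_one):
    """Copy every key from an old instance into the cluster.

    scan_iter yields keys, as a SCAN cursor loop would;
    migrate_one issues the MIGRATE for a single key, e.g. the raw
    command MIGRATE <host> <port> <key> 0 <timeout> against the
    old instance. Returns the number of keys moved."""
    moved = 0
    for key in scan_iter:
        migrate_one(key)
        moved += 1
    return moved
```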

Redis-trib resharding / check / fix enhancements
===

Redis-trib is currently able to do things like checking the cluster
and trying to fix it when inconsistencies are found. This must be
improved.
For example, when instances restart and find keys that mismatch the
assigned slots, the instances mark the slots as 'migrating'.
Redis-trib is already able to fix some cases of slots/keys
inconsistencies, but it should be able to deal with any possible mess.

This is an example of what it is already able to do:

$ ./redis-trib.rb fix 127.0.0.1:7001

[snip of long output]

[OK] All nodes agree about slots configuration.
>>> Check for open slots...
[WARNING] Node 127.0.0.1:7001 has slots in importing state.
[WARNING] Node 127.0.0.1:7003 has slots in migrating state.
[WARNING] The following slots are open: 16
>>> Fixing open slot 16
Set as migrating in: 127.0.0.1:7003
Set as importing in: 127.0.0.1:7001
Moving slot 16 from 127.0.0.1:7003 to 127.0.0.1:7001:
...........................
>>> Check slots coverage...
[OK] All 16384 slots covered.

Here I created this issue by stopping a resharding that was in
progress. redis-trib fix was able to fix the problem correctly in this
simple case, but it should deal with more complex problems as well.

Failover user-tunable settings and other improvements
https://github.com/antirez/redis/issues/1411
===

Redis Cluster failover does not yet use replication offsets in order
to promote the replica that has likely diverged least from the master
(fewer chances of data loss during failures), and this must be
addressed.
Here the trick is just to delay the failover attempt of all the slaves
but the one with the greatest replication offset by a few hundred
milliseconds. Note that the other slaves will still eventually try to
get promoted anyway.
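The delay trick can be sketched like this (Python, purely
illustrative; the 500 ms rank step is an arbitrary example, not a
decided value):

```python
def failover_delays(offsets, rank_step_ms=500):
    """Given the replication offsets of the slaves of a failed
    master, return the extra delay each slave adds before its
    failover attempt: the most up-to-date slave goes first with no
    extra delay, and every further rank waits rank_step_ms more,
    so the others still try eventually if the best one fails."""
    order = sorted(range(len(offsets)), key=lambda i: -offsets[i])
    delays = [0] * len(offsets)
    for rank, i in enumerate(order):
        delays[i] = rank * rank_step_ms
    return delays
```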

Another related issue is that currently a slave will not try to get
promoted at all if its data is older than NODE_TIMEOUT*10
milliseconds. This means that after a master fails and all its slaves
are, for example, restarted, the cluster may be unable to continue
without human intervention, since no slave is considered fresh enough
to get promoted. This should be changed into a user-configurable
parameter, so that the user can even specify "0" as the maximum
disconnection time beyond which a slave still attempts promotion,
favoring availability over limiting divergence.

Replicas migration
===

This is a very simple but extremely powerful concept that I would love
to implement before the stable release, even if it is actually an
incremental improvement that can be provided later.
The idea is the following: in a master-slave setup, the actual ability
to survive failures while still preserving every hash slot is limited
by the number of replicas you have for a given hash slot.

So if you have just 1 slave for every master, and you are unlucky
enough that both the master and the slave serving the same hash slots
fail, the cluster will not be able to continue at all.
However, in practice instances don't always fail at the same time, so
you may configure a Redis Cluster with 10 masters, with one slave each
for the first 9 masters, but assign 3 slaves to master number 10.
Now what happens if master number 1 or its slave fails? A promotion
will happen if needed, and the cluster will continue with just 1
instance left for a set of hash slots.
However, with "replicas migration" the slaves of master number 10 will
notice this, and one replica will migrate from master number 10 to
master number 1, providing more safety.
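A sketch of the decision logic (Python; names and the threshold are
illustrative, this is not the actual implementation):

```python
def plan_replica_migration(slaves_per_master, min_slaves=1):
    """Replica migration sketch: if some master is left with fewer
    than min_slaves replicas while another master has spares, move
    one replica from the richest donor to the orphaned master.
    Returns a (donor, recipient) pair, or None if no move is needed
    or possible."""
    orphans = [m for m, n in slaves_per_master.items() if n < min_slaves]
    donors = [m for m, n in slaves_per_master.items() if n > min_slaves]
    if not orphans or not donors:
        return None
    donor = max(donors, key=lambda m: slaves_per_master[m])
    return (donor, orphans[0])
```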

Testing
===

Needless to say, on top of this we need more testing: I'll do a lot of
manual testing, as I'm already doing, but no stable release of Redis
Cluster will happen before we have automated unit tests for Redis
Cluster.
Probably I'll use a different test infrastructure / code to test Redis
Cluster, and the normal test suite will not run the cluster tests by
default unless explicitly requested.
However, we need basic cluster testing for sure.

Here the idea is to avoid spawning instances on-demand like the
current test suite does, and instead to spawn N instances at startup
that are easily addressable by instance number from the testing code,
and to set up different scenarios to run different tests.

Ok, that's all for now. If you need more clarifications, please reply
to this message and I'll do my best to address your concerns.
I understand Redis Cluster was announced too early, but the actual
amount of work that went into it was not big until recent times...
this is why it took so much time.
However, now that I play with it, I believe it will have a profound
impact on the Redis ecosystem, if we are able to provide a solid
implementation. I'll do my best to deliver something solid, and with
such a goal I can't rush too much. However, with the betas -> RCs ->
stable phases we'll try to have different stability levels at
different times, suitable for different kinds of users and use cases.

Thanks for your patience and help,
Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

To "attack a straw man" is to create the illusion of having refuted a
proposition by replacing it with a superficially similar yet
unequivalent proposition (the "straw man"), and to refute it
— Wikipedia (Straw man page)

Andy McCurdy

Jan 9, 2014, 2:07:35 PM1/9/14
to redi...@googlegroups.com
I haven't been following the cluster development closely. Has anyone written docs indicating how a good client lib will interact with cluster yet?

-andy

Salvatore Sanfilippo

Jan 9, 2014, 6:39:19 PM1/9/14
to Redis DB
Hello Andy,

the best we currently have for a Redis Cluster client library developer is:

1) Read the cluster specification.
2) With this knowledge in mind, read the source code of redis-rb-cluster.

This is not optimal, but for this very reason redis-rb-cluster is
designed to be easy to understand, and only includes (and will only
include) the basic semantics that every client should implement.
Note that actually just reading "1" should be enough to implement a
client, but IMHO "2" is a great companion for verifying the
implementation.

Btw, about the cluster specification: it is not necessary to
understand all of it, just the part about the interaction with clients
(slots and redirections).
However, the specification is simple to understand in its entirety.

Ideally, what should we have? "1" is fine. "2" should be replaced by a
description of what it does in English + pseudo code, in order to work
as a general guide to writing a Redis Cluster client.
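In that spirit, here is a pseudo-code-style sketch of the core loop
every cluster client needs (Python; the send function is injected and
all names are made up, so this is a guide, not a real client):

```python
def cluster_call(key_slot, slot_map, send, command, max_redirects=5):
    """Core Redis Cluster client loop: look up the node serving the
    key's slot, send the command, and on a MOVED reply learn the new
    slot -> node mapping and retry. 'send' is an injected
    (node, command) -> reply function; a real client would turn
    error replies into exceptions rather than strings."""
    node = slot_map.get(key_slot)
    for _ in range(max_redirects):
        reply = send(node, command)
        if isinstance(reply, str) and reply.startswith("MOVED "):
            _, slot, node = reply.split()
            slot_map[int(slot)] = node  # remember the new owner
            continue
        return reply
    raise RuntimeError("too many redirections")
```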

Cheers,
Salvatore

Gaetano Mendola

Jan 11, 2014, 7:46:19 AM1/11/14
to redi...@googlegroups.com
From what I have read in the documentation, the key partitioning strategy is not clear to me.

The example partitioning is:

  • Node A contains hash slots from 0 to 5500.
  • Node B contains hash slots from 5501 to 11000.
  • Node C contains hash slots from 11001 to 16383.

and the documentation states that if a node D is added, then hash slots will be moved from nodes A, B and C to D.

So I guess the new partitioning is not going to be:

  • Node A contains hash slots from 0 to 4100.
  • Node B contains hash slots from 4101 to 8200.
  • Node C contains hash slots from 8201 to 12300.
  • Node D contains hash slots from 12301 to 16383.

because this way, as you can see, some hash slots from C would be moved to D,
some from B to C, and some from A to B.

If the partitioning policy is not this one, is the partitioning "schema"
something shared between the nodes?

Gaetano

Salvatore Sanfilippo

Jan 28, 2014, 9:49:15 AM1/28/14
to Redis DB
Hello Gaetano, sorry for the delay in the reply,

On Sat, Jan 11, 2014 at 1:46 PM, Gaetano Mendola <men...@gmail.com> wrote:
> From what I have read in the documentation I do not have clear the key
> partition strategy.

Ok, I'll try to clarify

> The example partitioning is:
>
> Node A contains hash slots from 0 to 5500.
> Node B contains hash slots from 5501 to 11000.
> Node C contains hash slots from 11001 to 16384.
>
> and the documentation states that if a Node D is added then hash will be
> moved from nodes A, B and C to D.
>
> So I guess the new partitioning is not going to be:
>
> Node A contains hash slots from 0 to 4100.
> Node B contains hash slots from 4101 to 8200.
> Node C contains hash slots from 8201 to 12300.
> Node D contains hash slots from 12301 to 16384.

Your guess is wrong! ;-) (And I believe you did not read the
documentation carefully.)

The partitioning does not need to be a contiguous split; just a few
hash slots from A, B, C will be moved to D.

So D may get 4000-5500 + 6000-7000 + 12000-13500 (just an example).

It is as simple as that: the whole key space is divided into keys
having 16k different "colors" (aka hash slots).
Every node can serve any subset of those colors. A node may be
responsible for just hash slots 1, 1000, 10001, and 2222.
No contiguity is needed at all.
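For reference, the key -> color mapping is just CRC16 modulo 16384,
plus the {hash tag} rule from the cluster specification; a Python
sketch:

```python
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM (polynomial 0x1021, initial value 0), the
    checksum Redis Cluster uses to map keys to hash slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """HASH_SLOT(key) = CRC16(key) mod 16384. If the key contains a
    non-empty {hash tag}, only the tag is hashed, so related keys
    can be forced onto the same slot."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```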

Resharding works by moving a hash slot from one server to another, so
you can go from a given setup A to a given setup B by moving hash
slots one after the other between any set of servers.

Cheers,
Salvatore
