Redis Cluster: an interesting problem about moving hash slots around

Salvatore Sanfilippo

Oct 6, 2011, 8:57:19 AM
to Redis DB
Hi all!

Today I found an interesting race condition in Redis Cluster.
If you are familiar with the design you know that the key space is
split into 4096 parts.
Every part is called a "hash slot", and every node has a routing
table mapping every hash slot to a cluster node.

This way if a client sends a query to a node that is not responsible
for the keys mentioned in the query, it gets a -MOVED message
redirecting it to the right node.

However we also have the ability to reconfigure the cluster while it
is running. So for instance, say hash slot 100 is assigned to node A
and I want to move it to node B.
This is accomplished (and redis-trib is already able to do it for you
automatically) with the following steps; a rough sketch of the whole
process in code follows the list.

1) Node A hash slot 100 is marked as "Migrating to B" (using the
CLUSTER SETSLOT <slot> MIGRATING <node> command).
2) Node B hash slot 100 is marked as "Importing from A" (using the
CLUSTER SETSLOT <slot> IMPORTING <node> command).
3) An external client, usually redis-trib, uses the CLUSTER
GETKEYSINSLOT and MIGRATE commands to atomically move keys
from A to B.
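
For example, the whole loop looks roughly like this (a sketch only, not
the real redis-trib, which is written in Ruby; it assumes a redis-py-style
client whose execute_command() passes raw commands through, made-up hosts
and node IDs, and it guesses the exact consolidation subcommand):

    import redis

    def move_slot(slot, src, dst, src_id, dst_id, batch=10, timeout_ms=5000):
        a = redis.Redis(host=src[0], port=src[1])   # migrating node "A"
        b = redis.Redis(host=dst[0], port=dst[1])   # importing node "B"

        # Steps 1 and 2: mark the slot on both sides.
        b.execute_command('CLUSTER', 'SETSLOT', slot, 'IMPORTING', src_id)
        a.execute_command('CLUSTER', 'SETSLOT', slot, 'MIGRATING', dst_id)

        # Step 3: move the keys in small batches until the slot is empty on A.
        while True:
            keys = a.execute_command('CLUSTER', 'GETKEYSINSLOT', slot, batch)
            if not keys:
                break
            for key in keys:
                # MIGRATE atomically serializes, transfers and deletes the key.
                a.execute_command('MIGRATE', dst[0], dst[1], key, 0, timeout_ms)

        # Finally consolidate the slot configuration on both nodes
        # (the exact SETSLOT subcommand used here is an assumption).
        for node in (a, b):
            node.execute_command('CLUSTER', 'SETSLOT', slot, 'NODE', dst_id)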

What is interesting is that while the hash slot is marked as "Migrating
to B", node A will keep replying to requests about keys in this hash
slot that are *still* present in its key space,
but if a request is about a key that belongs to hash slot 100 and is not
found inside the key space, it generates a "-ASK" error, which is like
"-MOVED" but means: please ask only this exact query to the specified
node, and don't update your table because of it. Ask new queries about
hash slot 100 to me again.

This way all the new keys for hash slot 100 are created directly in
B, while A handles all the queries about keys that are still in A.
At the same time redis-trib moves keys from A to B. Eventually all the
keys are moved and the hash slot configuration is consolidated to the
new one, using other CLUSTER SETSLOT subcommands.

So far this is pretty cool. But there is a subtle problem with this.

The Problem
===

When the cluster is stable, that is, when there is no resharding in
progress, a client may send a query to a random node.
There is only one node that will reply to queries related to a
specific hash slot. All the other nodes will redirect the client to
this node.

However when rehashing is in progress there are two nodes that will
reply to queries for a given hash slot, that is, the MIGRATING node
and the IMPORTING node.
If the client is a "smart" client with an internal routing table, it
starts every connection to the cluster by asking for the slot->node map,
and makes sure to update the table when -MOVED messages are received.
But there are also clients that are not smart, without a table, or
even clients that are smart but have not updated their table for a long
time because they were idle, while the cluster has moved a lot of hash
slots recently. But to make things simpler let's just focus on the
stupid client that has no internal map. It just sends queries to a
random node among a list of configured nodes, expecting to get
redirected if the wrong node was selected.

Such a simple client is only able to deal with -MOVED and -ASK
redirections, and it handles the two messages in the same way, that
is, by just asking the node specified in the redirection message.
It is easy to see how this client may create a race condition, like this:

1) We are migrating slot 100 from A to B.
2) Node A will only accept queries about slot 100 that are related to
keys already in the key space. Otherwise it will reply with -ASK.
3) Node B instead will accept all queries about hash slot 100.
4) Our stupid client needs to perform an LPUSH against a key in hash
slot 100. It picks a random node.
5) If it picks "C" or "D" (two other nodes of the cluster) it will be
redirected to "A". "A" in turn will redirect it to "B" with -ASK if the
key is not present in A's key space.
6) If it picks "B" directly, the query will be accepted by "B", but
what if "A" already has that key?

RACE!

The cure
===

How to fix that problem? There are different ways.

Way 1:

1) Node B is importing hash slot 100.
2) Node B receives a query about key "foo" in hash slot 100. If it
already has "foo" in the data space, the query is served. If "foo" is
not present, the cluster message bus is used to ask "A" whether the key
is there or not (and the client is blocked in the meantime). If the key
is in node A then node B replies with -MOVED to redirect to "A".
Otherwise it just serves the client. Note that there is no race here:
even if the key is migrated during the redirection, A will again
point the client to B, and eventually the query will be executed.

This works and allows us to still have completely stupid clients.
However we are violating one of the main design principles of Redis
Cluster: don't proxy stuff for clients.
All the rest of the cluster works by just letting the client follow the
route indicated by the nodes. This time instead a node needs to ask
another node for information.
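
In pseudo-Python, the check node B performs under Way 1 would be ordered
roughly like this (a sketch only; the two boolean arguments stand in for
node internals, none of this is actual Redis source):

    def way1_decision(key_is_local, key_is_on_a):
        """What node B does with a query for a key in a slot it is importing.

        key_is_local: the key has already been migrated to (or created on) B.
        key_is_on_a:  answer obtained over the cluster bus, while the client
                      stays blocked.
        """
        if key_is_local:
            return 'serve'        # B already owns the key
        if key_is_on_a:
            return 'redirect'     # reply -MOVED, A still owns the key
        return 'serve'            # brand new key: create it directly on B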

Way 2 (my preferred one):

1) Node B is importing hash slot 100.
2) Node B receives a query about key "foo" in hash slot 100. If it
already has "foo" the query is served. Otherwise it issues a "-MOVED"
to redirect the client to A.
3) Node B however will serve the query if the client started the
conversation with the "ASKING" command, which indicates that this query
was issued after being redirected by a -ASK message. If the client
comes from a -ASK redirection we are sure we can serve it.

So in the case above, all the smart clients will have no problems:
after a -ASK redirection they will send:

ASKING
LPUSH foo bar

ASKING sets a flag that is cleared after the command is executed.

If a client is dumb (no internal routing table caching) but is still
able to remember that after a -ASK redirection it should start the
next query with ASKING, everything is fine as well.

A completely stupid client that is not able to start the chat with
ASKING will simply ping/pong from A to B until the hash slot migration
is completed, and will finally be served.
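
To make the client side concrete, here is a sketch of a redirect-following
call, assuming one redis-py-style connection per node and redirection error
strings shaped like "MOVED <slot> <host:port>" / "ASK <slot> <host:port>"
(the exact reply format is an assumption here):

    import redis
    from redis.exceptions import ResponseError

    def cluster_command(connections, node, command, max_redirects=5):
        """Send `command` (e.g. ('LPUSH', 'foo', 'bar')) to `node`
        ('host:port'), following -MOVED and -ASK redirections."""
        asking = False
        for _ in range(max_redirects):
            conn = connections[node]       # one redis.Redis per node, pre-opened
            try:
                if asking:
                    conn.execute_command('ASKING')  # one-shot flag, cleared by B
                    asking = False
                return conn.execute_command(*command)
            except ResponseError as err:
                parts = str(err).split()
                if parts[0] == 'MOVED':
                    node = parts[-1]   # permanent: also update any cached table
                elif parts[0] == 'ASK':
                    node = parts[-1]   # only for this query: don't touch the table
                    asking = True
                else:
                    raise
        raise RuntimeError('too many redirections')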

The reasons I like solution #2 more, even if it makes clients a
bit more complicated, are:

1) It is faster. During rehashing there is already the "ask A,
get -ASK, ask B" routing, so it would be bad to pay the RTT again
because node "B" has to query "A" again.
2) It does not violate our design principle of letting clients
follow their own way.
3) The implementation is simpler.
4) Eventually all the clients will be smart, or at least smart enough
to use the ASKING command.

However this is an "interesting" problem and I would love to hear your
feedback and ideas.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele

Andrea Campi

Oct 6, 2011, 9:41:00 AM
to redi...@googlegroups.com

On Oct 6, 2011, at 2:57 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:

> A completely stupid client that is not able to start the chat with
> ASKING will simply ping/pong from A to B until the hash slot migration
> is completed, and will finally be served.
>
>

> 4) Eventually all the clients will be smart, or at least smart enough
> to use the ASKING command.

FWIW I strongly agree. The incentive for clients to become smart enough to send ASKING is big; natural selection will make sure of that :)

Or you can just declare that clients that don't send ASKING are not 100% conformant.

Andrea

catwell

Oct 6, 2011, 9:58:06 AM
to Redis DB
I like solution 2 but I think there is something simpler: what if you
sort the keys in a hash slot before migrating them? This way node B
could easily know if it is supposed to have the key or not by
comparing it to the latest key it received.

(PS: I sent this message by email earlier but apparently it didn't go
through...)

catwell

Oct 6, 2011, 9:59:04 AM
to Redis DB
> (PS: I sent this message by email earlier but apparently it didn't go
> through...)

Please disregard this, I know why now. It's f***ing Google's fault.

Salvatore Sanfilippo

Oct 6, 2011, 10:15:34 AM
to redi...@googlegroups.com
On Thu, Oct 6, 2011 at 3:58 PM, catwell <catwell...@catwell.info> wrote:
> I like solution 2 but I think there is something simpler: what if you
> sort the keys in a hash slot before migrating them? This way node B
> could easily know if it is supposed to have the key or not by
> comparing it to the latest key it received.

I like this idea in principle, especially since we already happen to
have a structure with keys indexed by hash slot (in order to
implement CLUSTER GETKEYSINSLOT) and sub-ordered lexicographically
(basically we are using a sorted set internally).

But I don't understand how this would work.

For instance, say we have keys A, B, C, ..., Z.

We already got key B. We receive a query about H: while we are sure
that we have not received H yet, we don't know whether H is in A or not.

So I think this solution can't be applied to this problem, or am I
missing something here?

Thanks for the idea anyway,

catwell

Oct 6, 2011, 10:29:24 AM
to Redis DB
This is true, I didn't see it that way.

What I was thinking was:

1) sort the keys on A
2) send the *keys* to B (keep the clients working on A)
3) send the *values* to B in order

With that scheme you always know which key (or rather the
related value) belongs to which node. But it requires
separating the transmission of the keys from the transmission
of the values.

Pedro Melo

Oct 6, 2011, 10:31:29 AM
to redi...@googlegroups.com
Hi,

On Thu, Oct 6, 2011 at 1:57 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
[...]


> The reasons I like solution #2 more, even if it makes clients a
> bit more complicated, are:
>
> 1) It is faster. During rehashing there is already the "ask A,
> get -ASK, ask B" routing, so it would be bad to pay the RTT again
> because node "B" has to query "A" again.
> 2) It does not violate our design principle of letting clients
> follow their own way.
> 3) The implementation is simpler.
> 4) Eventually all the clients will be smart, or at least smart enough
> to use the ASKING command.
>
> However this is an "interesting" problem and I would love to hear your
> feedback and ideas.

Instead of recommending way #2, I'll just say that any solution where
the server has to proxy requests to other nodes is clearly inferior.

I don't know if #2 is the best solution. It does seem to solve this
current problem though.

One question: this ASKING stuff, this would be a new command that
would set a flag in the client connection structure that would be
cleared on the next command, right? Would you consider adding generic
support for prefix flags like

ASKING LPUSH aaa bbb

instead?

It would allow for other flags to come up in the future (BLOCKING
would be my favorite).

Bye,
--
Pedro Melo
@pedromelo
http://www.simplicidade.org/
http://about.me/melo
xmpp:me...@simplicidade.org
mailto:me...@simplicidade.org

Salvatore Sanfilippo

Oct 6, 2011, 10:31:52 AM
to redi...@googlegroups.com
Oh, I get it now. Sending the keys to B would require blocking. I like
it more as an incremental process completely performed by redis-trib.

Actually redis-trib gets keys in groups of 10 (or 100, it is easily
configurable) using CLUSTER GETKEYSINSLOT <slotid> <count>, so the
process is very incremental and key-oriented instead of
slot-oriented.

Cheers,
Salvatore


Salvatore Sanfilippo

Oct 6, 2011, 10:38:43 AM
to redi...@googlegroups.com
On Thu, Oct 6, 2011 at 4:31 PM, Pedro Melo <me...@simplicidade.org> wrote:

> Instead of recommending way #2, I'll just say that any solution where
> the server has to proxy requests to other nodes is clearly inferior.

Hi Pedro, the above is a good way to look at it I think.

> I don't know if #2 is the best solution. It does seem to solve this
> current problem though.

In general anything that tells the node we are coming from a -ASK
redirection will work, but this new command seems the simplest way.

> One question: this ASKING stuff, this would be a new command that
> would set a flag in the client connection structure that would be
> cleared on the next command, right? Would you consider adding generic
> support for prefix flags like
>
> ASKING LPUSH aaa bbb

The reason I don't much like the idea of the prefix is that it
requires changing the client more in order to support the cluster.
For instance, if you want to write a cluster client as a wrapper
around an already existing client, you can just trap the -ASK error,
reconnect, send the "ASKING" command, and then call the usual client
again to resend the command.
Implementing it as a prefix instead changes the structure of the
client, which has to support prefixing in some way.
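
A minimal sketch of such a wrapper, assuming a redis-py-style
execute_command() and an -ASK error text shaped like "ASK <slot> <host:port>"
(client_for() is an illustrative helper returning a plain connection
for a 'host:port' string):

    from redis.exceptions import ResponseError

    def with_ask_retry(client_for, node, *command):
        try:
            return client_for(node).execute_command(*command)
        except ResponseError as err:
            parts = str(err).split()
            if parts[0] != 'ASK':
                raise
            target = client_for(parts[-1])      # the node -ASK pointed us to
            target.execute_command('ASKING')    # flag cleared after next command
            return target.execute_command(*command)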

> instead?
>
> It would allow for other flags to come up in the future (BLOCKING
> would be my favorite).

The idea is to keep the command semantics very simple, as they are
currently, and move all the advanced stuff into Lua, so that we could
have a Redis.block_for_keys("ciao") in Lua that will reissue the
script once the key is available. Otherwise I fear we'd end up with
some kind of ad-hoc language, as prefixing, after all, is the first
step toward building a language ;)

catwell

Oct 6, 2011, 10:42:39 AM
to Redis DB
On Oct 6, 4:31 pm, Salvatore Sanfilippo <anti...@gmail.com> wrote:

> Oh, I get it now. Sending the keys to B would require blocking. I like
> it more as an incremental process completely performed by redis-trib.

I have not looked enough into Redis Cluster to be very helpful here
but I don't think what I propose would require blocking clients. It
just means you have to transfer all the keys before you transfer
all the values.

The "sorting" part was confusing. The basic idea is that if you want
B to know what keys A has in slot S, A just has to give the list of
the keys in slot S to B *before* starting the migration.

Jeremy Zawodny

Oct 6, 2011, 11:05:28 AM
to redi...@googlegroups.com
On Thu, Oct 6, 2011 at 5:57 AM, Salvatore Sanfilippo <ant...@gmail.com> wrote:

> Way 1:
>
> 1) Node B is importing hash slot 100.
> 2) Node B receives a query about key "foo" in hash slot 100. If it
> already has "foo" in the data space, the query is served. If "foo" is
> not present, the cluster message bus is used to ask "A" whether the key
> is there or not (and the client is blocked in the meantime). If the key
> is in node A then node B replies with -MOVED to redirect to "A".
> Otherwise it just serves the client. Note that there is no race here:
> even if the key is migrated during the redirection, A will again
> point the client to B, and eventually the query will be executed.
>
> This works and allows us to still have completely stupid clients.
> However we are violating one of the main design principles of Redis
> Cluster: don't proxy stuff for clients.
> All the rest of the cluster works by just letting the client follow the
> route indicated by the nodes. This time instead a node needs to ask
> another node for information.

What I like about that solution is:

(1) It doesn't rely on the client being "smart" in any way.  The smarts are in the cluster.

(2) We get to think of the cluster as a unified system and focus on the data.

(3) In the case where the redis cluster nodes are "near" each other but some network clients may be farther away, the clients don't pay the round-trip penalty for fetching data that is being migrated.

(4) In the case of a very popular key in the hash slot, node B could probably migrate it the first time a client asks for it, making it faster for many other clients to ask for it w/out having to re-proxy for each of them.

Of those, the most important point to me is #1.
 
> Way 2 (my preferred one):
>
> 1) Node B is importing hash slot 100.
> 2) Node B receives a query about key "foo" in hash slot 100. If it
> already has "foo" the query is served. Otherwise it issues a "-MOVED"
> to redirect the client to A.
> 3) Node B however will serve the query if the client started the
> conversation with the "ASKING" command, which indicates that this query
> was issued after being redirected by a -ASK message. If the client
> comes from a -ASK redirection we are sure we can serve it.

What I like about this solution is:

(1) The cluster code can remain a bit more simple.

What I worry about in this solution is:

(1) I've seen "redirect loops" in web browsers. To guard against them, most browsers have a limit on how many times they'll follow circular redirects.  Every client or app developer will have to deal with that issue.  Do they set a limit?  Add a delay?  Exponential backoff?  How should this be exposed in the client APIs?

(2) In the case of a popular key in slot 100 and a time when node A and node B may experience a bit of trouble communicating, clients could "storm" both of them needlessly.

I'm not arguing for either at this point, just listing reactions to both ideas while the discussion continues.

Jeremy

Jak Sprats

Oct 6, 2011, 11:06:33 AM
to Redis DB
Hi Salvatore,

The proxy solution seems like the worse option, the ASKING solution
seems pretty simple for clients to implement.

I have a semi unrelated question, something I have never understood
about the cluster design. Migrating a key from one node to another is
blocking (to avoid race conditions). If someone has a huge ZSET (1
million entries) does that mean the node will block while it sends 1
million keys (several million bytes) over the wire? Is that correct?
Could it be an issue?

- jak

On Oct 6, 8:31 am, Salvatore Sanfilippo <anti...@gmail.com> wrote:
> Oh, I get it now. Sending the keys to B would require blocking. I like
> it more as an incremental process completely performed by redis-trib.
>
> Actually redis-trib gets keys in groups of 10 (or 100, it is easily
> configurable) using CLUSTER GETKEYSINSLOT <slotid> <count>, so the
> process is very incremental and key-oriented instead of slot-oriented.
>
> Cheers,
> Salvatore
>
> On Thu, Oct 6, 2011 at 4:29 PM, catwell <catwell-goo...@catwell.info> wrote:
> > This is true, I didn't see it that way.
>
> > What I was thinking was:
>
> > 1) sort the keys on A
> > 2) send the *keys* to B (keep the clients working on A)
> > 3) send the *values* to B in order
>
> > With that scheme you always know which key (or rather the
> > related value) belongs to which node. But it requires
> > separating the transmission of the keys from the transmission
> > of the values.
>

Javier Guerra Giraldez

Oct 6, 2011, 11:10:19 AM
to redi...@googlegroups.com
On Thu, Oct 6, 2011 at 10:06 AM, Jak Sprats <jaks...@gmail.com> wrote:
> If someone has a huge ZSET (1
> million entries) does that mean the node will block while it sends 1
> million keys (several million bytes) over the wire?

AFAIK, just setting a big value currently stalls other queries, since
Redis is single-threaded and all operations are atomic.

--
Javier

Salvatore Sanfilippo

Oct 6, 2011, 4:23:24 PM
to redi...@googlegroups.com
On Thu, Oct 6, 2011 at 4:42 PM, catwell <catwell...@catwell.info> wrote:

> I have not looked enough into Redis Cluster to be very helpful here
> but I don't think what I propose would require blocking clients. It
> just means you have to transfer all the keys before you transfer
> all the values.

Sorry, I was not clear: you don't need to block clients, but you need
to block the server to send the keys, as Redis is single threaded, and
you can't do this easily in an incremental way since the dataset is
changing continuously.

That's why I like the idea of a migration process that is key-by-key,
so that the only blocking operation is MIGRATE (more about this in
another email replying to Jak's concerns).

Salvatore

Salvatore Sanfilippo

Oct 6, 2011, 4:28:28 PM
to redi...@googlegroups.com
On Thu, Oct 6, 2011 at 5:06 PM, Jak Sprats <jaks...@gmail.com> wrote:
> Hi Salvatore,
>
> The proxy solution seems like the worse option, the ASKING solution
> seems pretty simple for clients to implement.
>
> I have a semi unrelated question, something I have never understood
> about the cluster design. Migrating a key from one node to another is
> blocking (to avoid race conditions). If someone has a huge ZSET (1
> million entries) does that mean the node will block while it sends 1
> million keys (several million bytes) over the wire? Is that correct?
> Could it be an issue?

Hello Jak,

If we have a big sorted set, this is going to be a blocking operation indeed.
Now a key with 1 million entries is still ok in *some* applications, as it
will just take about 3 seconds or less to move (already tested), but that
is around the acceptable limit, or past it if you have many keys of
that size.

Now what we already plan to improve here is to block just
that *single* key; the serialization process must still be done in
a blocking way (we can do the I/O in the background using our usual
event-driven way while the key is blocked, that is, if other clients
arrive asking for this key we'll block just those clients instead
of the whole instance).

That said, the main use case with this design is clearly the "many many
keys, relatively small keys" one.
If the application also needs a few very big objects, it is much
better to put those big objects on a different, non-clustered server.
Another possibility could be to reserve a hash slot for big keys,
like: keys with a given prefix will always hash to a given hash slot.
This could be interesting in theory as one could rehash everything
without touching that hash slot.

Basically I think we'll not care too much about the moving-big-keys
use case for now, but this is something that could be cool to address
in the next releases of Redis Cluster.

Cheers,
Salvatore

Jan Oberst

Oct 6, 2011, 5:31:43 PM
to redi...@googlegroups.com
> If we have a big sorted set, this is going to be a blocking operation indeed.
> Now a key with 1 million entries is still ok in *some* applications, as it
> will just take about 3 seconds or less to move (already tested), but that
> is around the acceptable limit, or past it if you have many keys of
> that size.
>
> Now what we already plan to improve here is to block just
> that *single* key; the serialization process must still be done in
> a blocking way (we can do the I/O in the background using our usual
> event-driven way while the key is blocked, that is, if other clients
> arrive asking for this key we'll block just those clients instead
> of the whole instance).
>
> That said, the main use case with this design is clearly the "many many
> keys, relatively small keys" one.
> If the application also needs a few very big objects, it is much
> better to put those big objects on a different, non-clustered server.
> Another possibility could be to reserve a hash slot for big keys,
> like: keys with a given prefix will always hash to a given hash slot.
> This could be interesting in theory as one could rehash everything
> without touching that hash slot.

I think your second solution sounds good and makes sense. Proxying calls around gets very complicated and inefficient.

How about using a clever version of SYNC for master/slave replication to do the MIGRATE from machine A to machine B? In my experience, setting and syncing a slave is really quite fast, and at no point does the master block writing or reading keys (or does it? not quite sure about that, but it's still pretty fast). You guys have already discussed the topic on this list last summer, but I couldn't really figure out what happened to it.

The MIGRATE command could just use a version of that replication code to slowly bring machine B up to speed. Machine B issues the SYNC command to machine A, and once B is "live" and up to speed, all that's left to do is for machine A to reply with -MOVED to requests for the keys we just moved and point them to machine B. After this is finished, it is announced that machine B has the slot and all traffic starts going to machine B.

This approach would still avoid race conditions, because no two machines will ever serve the same key at the same time. Once the slot is finally switched from A to B, both machines will have the exact same data set. The "switching" takes virtually no time because all the migration work already happened in the background.

Do you think that would be a viable option at all? I haven't spent too much time thinking about migration concepts like this, so forgive me if this tactic has already been eliminated.


Besides that, I'm trying to estimate how well the "smart client" concept works in certain cases. We're using Redis on PHP quite heavily and the shared state between two calls is limited to the persistent connection to the Redis server. We do thousands of very short requests per second and our PHP processes don't live all that long, so every few minutes or seconds they would have to figure out the whole cluster. I'm not sure how efficient it would be to implement a 'smart' client for this kind of application. I guess a PHP plug-in could save some state between PHP requests, but in general it complicates things quite a bit.

Best,
Jan 

Bohu TANG

Oct 7, 2011, 12:14:27 AM
to redi...@googlegroups.com
Way #2 seems similar to the solution we use in our data center; we call it "User-Driven and Data-Auto-Drift", but we use "Range-Hash" for the global routing table. The routes look like:
<0-127>@slot0
<128-255>@slot1

After using the "CLUSTER SETSLOT" command, the routing table is:
---------------------------------------------------------
<0-63>@slot0
---------------------------------------------------------
<64-127>@slot0,Migrating to slot1
---------------------------------------------------------
<128-255>@slot1

We use 2 bits to mark the "migrating range hash"; the "1bit status-map" looks like this:
-----------
64,status
-----------
65,status
-----------
... ...
-----------
127,status
-----------

The status values are:
0, will be migrated
1, has been migrated

1) A client queries 'A' (which at first is present in slot0), but now the route is "<64-127>@slot0,Migrating to slot1".
2) We check the "1bit status-map": the first query finds status "0", so we migrate the key and update its status to "1".
3) The next query for 'A' finds status "1", which means it has already been migrated.
4) When every record in the "1bit status-map" has status "1", the new route for <64-127> takes effect.
5) Now the global routing table is:
---------------------------------------------------------
<0-63>@slot0
---------------------------------------------------------
<64-127>@slot1
---------------------------------------------------------
<128-255>@slot1

Then we free the "1bit status-map" struct, and all the data has automatically drifted to completion!
No smart client is needed, but it does require a global routing table.
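
A rough sketch of the 1-bit status-map as described above, with one bit per
hash value in the migrating range (illustrative code only, not something
inside Redis):

    class RangeStatusMap:
        """1 bit per migrating hash: 0 = will be migrated, 1 = has been migrated."""

        def __init__(self, first, last):
            self.first, self.last = first, last
            self.bits = bytearray((last - first) // 8 + 1)

        def mark_migrated(self, h):
            i = h - self.first
            self.bits[i // 8] |= 1 << (i % 8)

        def migrated(self, h):
            i = h - self.first
            return bool(self.bits[i // 8] & (1 << (i % 8)))

        def complete(self):
            # When every bit is set, the new route <64-127>@slot1 takes effect.
            return all(self.migrated(h) for h in range(self.first, self.last + 1))

For the example above the map covers RangeStatusMap(64, 127): a query for
hash 70 that finds migrated(70) false triggers the migration of that record
and then calls mark_migrated(70).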

Bohu TANG

Oct 7, 2011, 12:39:37 AM
to redi...@googlegroups.com
Sorry, NOTE: that should read "using 1 bit to mark the migrating range hash"; the "1bit status-map" is as shown above.

Josiah Carlson

Oct 7, 2011, 3:22:01 AM
to redi...@googlegroups.com
Between the two, I'd say that I prefer #2.

Regards,
- Josiah

On Thu, Oct 6, 2011 at 5:57 AM, Salvatore Sanfilippo <ant...@gmail.com> wrote:

Salvatore Sanfilippo

Oct 7, 2011, 3:36:02 AM
to redi...@googlegroups.com
On Thu, Oct 6, 2011 at 11:31 PM, Jan Oberst <ma...@janoberst.com> wrote:

> The MIGRATE command could just use a version of that replication code to
> slowly bring machine B up to speed. Machine B issues the SYNC command to
> machine A, and once B is "live" and up to speed, all that's left to do is
> for machine A to reply with -MOVED to requests for the keys we just moved
> and point them to machine B. After this is finished, it is announced that
> machine B has the slot and all traffic starts going to machine B.

Pieter and I tried to think about this approach as well; the problem is
that SYNC will transfer a point-in-time (in the past) version of the
dataset. This means that we need to accumulate writes against keys in
A in the meantime. It is very easy to end up with an access pattern
that makes moving the slot a never-ending process.

The current approach instead has the good property that once you start
migrating a slot, all the new data is accumulated in the receiving
node, and data that is sent to the original node (for instance an
LPUSH against an already existing key) does not need any tracking:
we'll just move it when we reach that key while iterating the key space.

About speed, MIGRATE, even if not in the background (yet, but we can
move it to the background, blocking just one key at a time), uses the
same format as SYNC, that is the Redis RDB format, so we expect it to
be very fast as well.

> Besides that, I'm trying to estimate how well the "smart client" concept
> works in certain cases. We're using Redis on PHP quite heavily and the
> shared state between two calls is limited to the persistent connection to
> the Redis server. We do thousands of very short requests per second and
> our PHP processes don't live all that long, so every few minutes or seconds
> they would have to figure out the whole cluster. I'm not sure how efficient
> it would be to implement a 'smart' client for this kind of application. I
> guess a PHP plug-in could save some state between PHP requests, but in
> general it complicates things quite a bit.

There are two things that can be done in this case.

1) Well the obvious way is to ask for "cluster nodes" every time a new
client is created. This is the preferred approach.

2) If you want to avoid incurring the additional RTT needed to call
cluster nodes, just serialize the previous client's routing table in a
session var or something like that, and use something like

$redis->restoreClusterTable($my_old_table).

Make sure to update it from time to time...

Thanks for the feedback,
Salvatore

> Best,
> Jan



Salvatore Sanfilippo

Oct 7, 2011, 3:53:14 AM
to redi...@googlegroups.com
On Thu, Oct 6, 2011 at 5:05 PM, Jeremy Zawodny <Jer...@zawodny.com> wrote:

> What I like about that solution is:

> (1) It doesn't rely on the client being "smart" in any way.  The smarts are
> in the cluster.

This is definitely an approach, the Cassandra one for instance. I'm
not debating whether this approach or the reverse is better, but the
problem I see is that Redis Cluster is like the "microkernel" of
clusters, that is, it tries to do just the core functions that can't
be done outside efficiently (that is, to keep state) and lets the
client do all the work to retrieve the data.
So this would be contrary to our current design, even if not bad per se.

> (2) We get to think of the cluster as a unified system and focus on the
> data.

You can think of Redis Cluster as a unified system from the point of
view of coherent routes, but not as a system that will do the work of
getting the data for the client. But there are good things about that
as well. It makes the range of tradeoffs that you can make client-side
more interesting: Redis Cluster is one server-side implementation, but
different clients may stress one feature or the other to reach
different speed/safety tradeoffs.

> (3) In the case where the redis cluster nodes are "near" each other but some
> network clients may be farther away, the clients don't pay the round-trip
> penalty for fetching data that is being migrated.

Unfortunately this is not the case, as the client is required to first
ask the node authoritative for that hash slot (A), and then ask (B).
But the good thing is that this only happens during rehashing, and for
one 4096th of the key space at a time.

I see it this way: either every cluster node should be able to proxy
full requests, or it is better if we don't do it at all. This is the
main reason why I'm biased toward solution #2. The good thing is that
we already have a working and fast binary-protocol link between nodes,
so it will be easy for us to revert to solution #1 if #2 does not work
well enough.

Thanks for the feedback,

Adrien Arculeo

Oct 7, 2011, 5:10:30 AM
to redi...@googlegroups.com
Salvatore,

As I understand it A sends a batch of keys of its choice to B that B migrates. Why not B sending a request for keys on the cluster bus to A that get sent to B during the next batch? In the mean time, B blocks the client.

What might be wrong with this behaviour?

Best regards

Adrien

From: "Salvatore Sanfilippo" <ant...@gmail.com>
To: redi...@googlegroups.com
Sent: Friday, 7 October, 2011 9:53:14 AM
Subject: Re: Redis Cluster: an interesting problem about moving hash slots around

Matthew Ranney

Oct 11, 2011, 9:00:10 PM
to redi...@googlegroups.com
On Thu, Oct 6, 2011 at 9:53 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
>> (1) It doesn't rely on the client being "smart" in any way.  The smarts are
>> in the cluster.
>
> This is definitely an approach, the Cassandra one for instance. I'm
> not debating whether this approach or the reverse is better, but the
> problem I see is that Redis Cluster is like the "microkernel" of
> clusters, that is, it tries to do just the core functions that can't
> be done outside efficiently (that is, to keep state) and lets the
> client do all the work to retrieve the data.

My perspective as a client author is that I think we'll all be happier if the cluster can proxy things around if it needs to.  It's hard enough getting a client to correctly handle all of the issues around reconnection, waiting for the db to load, etc.  If all clients are now REQUIRED to handle redirects properly in order to use Cluster, then I think this makes everything less reliable.  If you proxy inside the cluster, then there is only one place that this has to work properly.

Of course, I'll strive to implement all of the "smart" client behaviors as quickly and reliably as possible.  The more work we push back to the clients, the harder this is going to be to get right.

Salvatore Sanfilippo

Oct 12, 2011, 5:38:32 AM
to redi...@googlegroups.com
On Wed, Oct 12, 2011 at 3:00 AM, Matthew Ranney <matt....@gmail.com> wrote:
> My perspective as a client author is that I think we'll all be happier if
> the cluster can proxy things around if it needs to.  It's hard enough
> getting a client to correctly handle all of the issues around reconnection,
> waiting for the db to load, etc.  If all clients are now REQUIRED to handle
> redirects properly in order to use Cluster, then I think this makes
> everything less reliable.  If you proxy inside the cluster, then there is
> only one place that this has to work properly.
> Of course, I'll strive to implement all of the "smart" client behaviors as
> quickly and reliably as possible.  The more work we push back to the
> clients, the harder this is going to be to get right.

Hello Matthew,

actually the proposed design where the clients handle part of the
protocol is absolutely reliable: the cluster will never accept a
misplaced query. IMHO looking at this only from the point of view of a
Redis client author is not the right angle, as what we want in the end
is a system that works well, and the system is composed of client and
server. For this system to work at its best we need to exploit the
parallelism of the cluster and make sure that clients try hard to hit
the right server. If we do that using a proxy, clients will not bother
to hit the right node, the implementation will be much more complex,
and cluster performance will be poor if the client is not good. Also
the client in this case must use some special way to get updates about
cluster configuration changes.

The redirect solution instead allows the client config to be updated
incrementally, but it still does not require the config to be updated,
and it simplifies the overall structure a lot, since proxying queries
in an event-driven server is a lot more complexity than it is for a
client to be redirected to the right node.

However this part of the cluster is no longer something that I'm
currently designing; it is written in stone *unless* experience
proves that it does not work well.

Cheers,

Salvatore Sanfilippo

Oct 12, 2011, 5:40:13 AM
to redi...@googlegroups.com
On Fri, Oct 7, 2011 at 11:10 AM, Adrien Arculeo <aarc...@1024degres.com> wrote:
> As I understand it A sends a batch of keys of its choice to B that B
> migrates. Why not B sending a request for keys on the cluster bus to A that
> get sent to B during the next batch? In the mean time, B blocks the client.

Hi Adrien,

This is not how it works. Resharding is performed by asking node A for
a few key names in the hash slot we want to move, and then calling
MIGRATE against those keys.

I hope to be able to write some documentation about the design of
Redis Cluster, covering the part that is already implemented in a
detailed way and the part still to be implemented in a more generic
way, so that users have access to the design specification.

Cheers,

Jak Sprats

Oct 12, 2011, 1:18:27 PM
to Redis DB
Hi Salvatore,

> If we have a big sorted set, this is going to be a blocking operation indeed.
> Now a key with 1 million entries is still ok in *some* applications, as it
> will just take about 3 seconds or less to move (already tested), but that
> is around the acceptable limit, or past it if you have many keys of
> that size.

> Now what we already plan to improve here is to block just
> that *single* key; the serialization process must still be done in
> a blocking way (we can do the I/O in the background using our usual
> event-driven way while the key is blocked, that is, if other clients
> arrive asking for this key we'll block just those clients instead
> of the whole instance).

OK, that makes a lot of sense, so the big keys will be in a transition
state for a pretty predictable time span (e.g. 1million keys=>3 secs)
and the rest of the system is not blocked at all.

So that will work for a rev1 for sure. Pretty simple solution, makes
lots of sense, thanks for clearing it up.

- jak