Today I found an interesting race condition in Redis Cluster.
If you are familiar with the design you know that the key space is
split into 4096 parts.
Every part is called a "hash slot", and every node has a routing
table mapping every hash slot to a cluster node.
This way if a client sends a query to a node that is not responsible
for the keys mentioned in the query, it gets a -MOVED message
redirecting it to the right node.
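The thread doesn't spell out the hash function, but released Redis Cluster computes the slot as CRC16 (XMODEM variant) of the key modulo the slot count; a minimal sketch of that key->slot mapping, using the 4096 slots mentioned above, might look like:

```python
def crc16(data: bytes) -> int:
    # CRC-16/XMODEM (poly 0x1021, init 0), the checksum Redis Cluster
    # uses for key hashing; bit-by-bit, MSB-first implementation.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

NUM_SLOTS = 4096  # the slot count described in this thread

def hash_slot(key: bytes) -> int:
    return crc16(key) % NUM_SLOTS
```

Because the mapping is a pure function of the key, any client can compute which node to contact locally once it knows the slot->node table.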
However we also have the ability to reconfigure the cluster while it
is running. Say, for instance, that hash slot 100 is assigned to
node A, and I want to move it to node B.
This is accomplished (and redis-trib is already able to do it for you
automatically) with the following steps.
1) Node A hash slot 100 is marked as "Migrating to B" (using the
CLUSTER SETSLOT <slot> MIGRATING <node> command).
2) Node B hash slot 100 is marked as "Importing from A" (using the
CLUSTER SETSLOT <slot> IMPORTING <node> command).
3) An external client, usually redis-trib, uses the CLUSTER
GETKEYSINSLOT and MIGRATE commands to atomically move keys
from A to B.
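The three steps above can be sketched as a driver loop. Here `src` is a hypothetical client object whose `getkeysinslot` and `migrate` methods stand in for the CLUSTER GETKEYSINSLOT and MIGRATE commands; the FakeNode stub exists only to make the sketch self-contained and runnable:

```python
def move_slot_keys(src, dst_host, dst_port, slot, batch=10):
    # Pull key names belonging to the slot and migrate them, batch by
    # batch, until the source node reports no keys left in the slot.
    moved = 0
    while True:
        keys = src.getkeysinslot(slot, batch)   # CLUSTER GETKEYSINSLOT <slot> <count>
        if not keys:
            return moved
        for key in keys:
            src.migrate(dst_host, dst_port, key)  # MIGRATE: atomic per-key move
            moved += 1

class FakeNode:
    # Minimal in-memory stand-in so the loop can run without a cluster.
    def __init__(self, keys_by_slot):
        self.keys_by_slot = keys_by_slot
        self.migrated = []
    def getkeysinslot(self, slot, count):
        return self.keys_by_slot.get(slot, [])[:count]
    def migrate(self, host, port, key):
        for keys in self.keys_by_slot.values():
            if key in keys:
                keys.remove(key)
                self.migrated.append(key)
                return
```

The loop terminates naturally once GETKEYSINSLOT returns an empty batch, which is the point where the slot ownership can be consolidated.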
What is interesting is that while the hash slot is marked as
"Migrating to B", node A will keep replying to requests about keys of
this hash slot that are *still* present in its key space.
But if a request is about a key that belongs to hash slot 100 and is
not found in the key space, node A generates a "-ASK" error, which is
like "-MOVED" but means: please ask only this exact query to the
specified node, and don't update your routing table. Ask new queries
about hash slot 100 to me again.
This way all the new keys of hash slot 100 are created directly in
B, while A handles all the queries about keys that are still in A.
At the same time redis-trib moves keys from A to B. Eventually all the
keys are moved and the hash slot configuration is consolidated to the
new one, using other CLUSTER SETSLOT subcommands.
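Putting the redirection rules described so far in one place, a toy model of the serving node's decision might look like this (the function name, arguments, and return values are illustrative, not Redis internals):

```python
def route(owned_slots, migrating, local_keys, slot, key):
    """What a node replies for a query on `key` belonging to `slot`.
    owned_slots: set of slots this node serves
    migrating:   dict slot -> target node, for slots being moved away
    local_keys:  set of keys present in this node's key space
    """
    if slot not in owned_slots:
        return ("-MOVED", None)            # wrong node entirely: hard redirect
    if slot in migrating and key not in local_keys:
        return ("-ASK", migrating[slot])   # slot leaving, key absent: one-shot redirect
    return ("SERVE", None)                 # the key (or slot) is ours: answer directly
```

With slot 100 migrating from A to B, A serves the keys it still holds and replies -ASK for the rest, which is exactly the behavior described above.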
So far this is pretty cool. But there is a subtle problem with this.
The Problem
===
When the cluster is stable, that is, when no resharding is in
progress, a client may send a query to a random node.
There is only one node that will reply to queries related to a
specific hash slot. All the other nodes will redirect the client to
this node.
However when rehashing is in progress there are two nodes that will
reply to queries for a given hash slot, that is, the MIGRATING node
and the IMPORTING node.
If the client is a "smart" client with an internal routing table, it
starts every connection to a cluster by asking for the slot->node map,
and makes sure to update the table when -MOVED messages are received.
But there are also clients that are not smart, without a table, or
even smart clients that haven't updated their table for a long time
because they were idle while the cluster recently moved a lot of hash
slots. To make things simpler let's just focus on the stupid client
that has no internal map. It just sends queries to a random node among
a list of configured nodes, expecting to get redirected if the wrong
node was selected.
Such a simple client is only able to deal with -MOVED and -ASK
redirections. The two messages are handled in the same way, that is,
by reissuing the query to the node specified in the redirection
message.
It is easy to see how this client may create a race condition, like this:
1) We are migrating slot 100 from A to B.
2) Node A will only accept queries about slot 100 that are related to
keys already in the key space. Otherwise it will reply with -ASK.
3) Node B instead will accept all queries about hash slot 100.
4) Our stupid client needs to perform an LPUSH against a key in hash
slot 100. It picks a random node.
5) If it picks some other node, say "C" or "D", it will be redirected
to "A". "A" in turn will redirect it to "B" with -ASK if the key is
not present in A's key space.
6) If it picks "B" directly, the query will be accepted by "B". But
what if "A" already had that key?
RACE!
The cure
===
How to fix this problem? There are different ways.
Way 1:
1) Node B is importing hash slot 100.
2) Node B receives a query about key "foo" in hash slot 100. If it
already has "foo" in its data space, the query is served. If "foo" is
not present, the cluster message bus is used to ask "A" whether the
key is there or not (and the client is blocked in the meanwhile). If
the key is in node A then node B replies with -MOVED to redirect to
"A". Otherwise it just serves the client. Note that there is no race
here: even if the key is migrated during the redirection, A will again
point the client to B, and eventually the query will be executed.
This works and allows us to still have completely stupid clients.
However we are violating one of the main design principles of Redis
Cluster: don't proxy stuff for clients.
All the rest of the cluster works by just letting the client follow
the right way as indicated by nodes. This time instead a node needs to
ask other nodes for information.
Way 2 (my preferred one):
1) Node B is importing hash slot 100.
2) Node B receives a query about key "foo" in hash slot 100. If it
already has "foo" the query is served. Otherwise it issues a "-MOVED"
to redirect the client to A.
3) Node B however will serve the query if the client started the chat
using the command "ASKING", which indicates that this query was issued
after being redirected by a -ASK message. If the client comes from a
-ASK redirection we are sure we can serve it.
So in the case above what happens is that all the smart clients will
have no problems, after a -ASK redirection they will send:
ASKING
LPUSH foo bar
ASKING sets a flag that is cleared after the command is executed.
If a client is dummy (no internal routing table caching) but is still
able to remember that after a -ASK redirection it should start the
next query with ASKING, everything is fine as well.
A completely stupid client that is not able to start the chat with
ASKING will simply ping/pong from A to B until the hash slot migration
is completed, and will finally be served.
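The client-side behavior just described can be sketched as a small redirect-chasing loop. Here `send(node, *args)` is a hypothetical transport that returns either a result value or a redirection tuple like ("MOVED", node) / ("ASK", node); a real client would parse these from the -MOVED/-ASK error replies:

```python
def execute(send, node, *command, max_redirects=16):
    # Chase -MOVED / -ASK redirections, sending ASKING before an -ASK
    # retry; bound the loop so a broken cluster can't spin us forever.
    for _ in range(max_redirects):
        reply = send(node, *command)
        if isinstance(reply, tuple) and reply[0] == "MOVED":
            node = reply[1]               # permanent: also update the routing table
        elif isinstance(reply, tuple) and reply[0] == "ASK":
            node = reply[1]
            send(node, "ASKING")          # one-shot flag, cleared after next command
        else:
            return reply
    raise RuntimeError("too many redirections")
```

Note that after an -ASK the target is used for this one query only; the client keeps directing new queries for the slot at the original node, as the thread requires.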
The reasons I like solution #2 more, even if it will make clients a
bit more complicated, are:
1) It is faster. During rehashing there is already the "ask A, get
-ASK, ask B" routing; it would be bad to pay the RTT again by having
node "B" query "A" as well.
2) It does not violate our design principle of letting clients follow
their own way.
3) The implementation is simpler.
4) Eventually all the clients will be smart, or at least smart enough
to use the ASKING command.
However this is an "interesting" problem and I would love to hear
your feedback and ideas.
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele
On Oct 6, 2011, at 2:57 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
> A completely stupid client that is not able to start the chat with
> ASKING will simply ping/pong from A to B until the hash slot migration
> is completed, and will finally be served.
>
>
> 4) Eventually all the clients will be smart, or at least smart enough
> to use the ASKING command.
FWIW I strongly agree. The incentive is big for clients to become smart enough to send ASKING, natural selection will make sure of that :)
Or you can just declare that clients that don't send ASKING are not 100% conformant.
Andrea
I like this idea in principle, especially since we already happen to
have a structure with keys indexed by hash slot (in order to implement
CLUSTER GETKEYSINSLOT) but sub-ordered lexicographically (basically we
are using a sorted set internally).
But I don't understand how this would work.
For instance, suppose we have keys A, B, C, ..., Z.
We already got key B. Now we receive a query about H: while we are
sure that we did not receive H yet, we don't know whether H is in A or
not.
So I think this solution can't be applied to this problem, or am I
missing something here?
Thanks for the idea anyway,
On Thu, Oct 6, 2011 at 1:57 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
[...]
> The reasons I like solution #2 more, even if it will make clients a
> bit more complicated, are:
>
> 1) It is faster. During rehashing there is already the "ask A, get
> -ASK, ask B" routing; it would be bad to pay the RTT again by having
> node "B" query "A" as well.
> 2) It does not violate our design principle of letting clients follow
> their own way.
> 3) The implementation is simpler.
> 4) Eventually all the clients will be smart, or at least smart enough
> to use the ASKING command.
>
> However this is an "interesting" problem and I would love to hear
> your feedback and ideas.
Instead of recommending way #2, I'll just say that any solution where
the server has to proxy requests to other nodes is clearly inferior.
I don't know if #2 is the best solution. It does seem to solve this
current problem though.
One question: this ASKING stuff, this would be a new command that
would set a flag in the client connection structure that would be
cleared on the next command, right? Would you consider adding generic
support for prefix flags like
ASKING LPUSH aaa bbb
instead?
It would allow for other flags to come up in the future (BLOCKING
would be my favorite).
Bye,
--
Pedro Melo
@pedromelo
http://www.simplicidade.org/
http://about.me/melo
xmpp:me...@simplicidade.org
mailto:me...@simplicidade.org
Actually redis-trib gets keys in groups of 10 (or 100, it is easily
configurable) using CLUSTER GETKEYSINSLOT <slotid> <count>, so the
process is very incremental and key-oriented instead of
slot-oriented.
Cheers,
Salvatore
> --
> You received this message because you are subscribed to the Google Groups "Redis DB" group.
> To post to this group, send email to redi...@googlegroups.com.
> To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.
> Instead of recommending way #2, I'll just say that any solution where
> the server has to proxy requests to other nodes is clearly inferior.
Hi Pedro, the above is a good way to look at it, I think.
> I don't know if #2 is the best solution. It does seem to solve this
> current problem though.
In general anything that tells the node that we come from a -ASK
redirection will work, but this new command seems the simplest way.
> One question: this ASKING stuff, this would be a new command that
> would set a flag in the client connection structure that would be
> cleared on the next command, right? Would you consider adding generic
> support for prefix flags like
>
> ASKING LPUSH aaa bbb
The reason I don't like the prefix idea too much is that it requires
adding more state to the client in order to support the cluster.
For instance, if you want to write a cluster client as a wrapper
around an already existing client, you can just trap the -ASK error,
reconnect, send the "ASKING" command, and then call the usual client
again to resend the command.
Implementing it as a prefix would change the structure of the client,
which would need to support prefixing in some way.
> instead?
>
> It would allow for other flags to come up in the future (BLOCKING
> would be my favorite).
The idea is to keep the command semantics very simple as they
currently are and move all the advanced stuff into Lua, so that we
could have a Redis.block_for_keys("ciao") in Lua that will reissue the
script once the key is available. Otherwise I fear we'd end up with
some kind of ad-hoc language, as prefixing after all is the first step
towards building a language ;)
> Way 1:
> 1) Node B is importing hash slot 100.
> 2) Node B receives a query about key "foo" in hash slot 100. If it
> already has "foo" in the data space, the query is served. If "foo" is
> not present the cluster message bus is used to ask "A" if the key is
> there or not (and the client is blocked in the meanwhile). If the key
> is in node A then node B replies with -MOVED to redirect to "A".
> Otherwise it just serves the client. Note that there is no race here,
> even if the key is migrated during the redirection, then A will again
> point the client to B, and eventually the query will be executed.
> This works and allows us to still have completely stupid clients.
> However we are violating one of the main design principles of Redis
> Cluster: don't proxy stuff for clients.
> All the rest of the cluster works just letting the client follow the
> right way as indicated by nodes. This time instead we need to ask
> nodes for information.
> Way 2 (my preferred one):
> 1) Node B is importing hash slot 100.
> 2) Node B receives a query about key "foo" in hash slot 100. If it
> already has "foo" the query is served. Otherwise it issues a "-MOVED"
> to redirect the client to A.
> 3) Node B however will serve the query if the client started the chat
> using the command "ASKING", which indicates that this query was issued
> after being redirected by a -ASK message. If the client comes from a
> -ASK redirection we are sure we can serve the client.
AFAIK, just setting a big value currently stalls other queries, since
Redis is single-threaded and all operations are atomic.
--
Javier
> I have not looked enough into Redis Cluster to be very helpful here
> but I don't think what I propose would require blocking clients. It
> just means you have to transfer all the keys before you transfer
> all the values.
Sorry, I was not clear: you don't need to block clients, but you need
to block the server to send the keys, as Redis is single threaded, and
you can't easily do this in an incremental way since the dataset is
changing continuously.
That's why I like the idea of a migration process that is key-by-key,
so that the only blocking operation is MIGRATE (more about this in
another email to reply to Jak's concerns).
Salvatore
Hello Jak,
If we have a big sorted set, this is going to be a blocking operation indeed.
Now, 1 million elements is still ok in *some* applications, as this
will just take about 3 seconds or less (already tested), but that is
about the acceptable limit, or past the limit if you have many keys of
that size.
What we already planned to improve here is to block just that
*single* key. The serialization process must still be done in a
blocking way, but we can do the I/O in the background using our usual
event-driven way while the key is blocked; that is, if other clients
arrive asking for this key we'll block just those clients instead of
the whole instance.
That said, the main use case with this design is clearly "many many
keys, relatively small keys".
If the application also needs a few very big objects, it is much
better to put these big objects on a different non-clustered server.
Another possibility could be to reserve a hash slot for big keys,
like: keys with a given prefix will always hash to a given hash slot.
This could be interesting in theory as one could rehash everything
without touching hash slot 1.
Basically I think we'll not care too much about the big-key-moving
use case for now, but this is something that could be cool to address
in the next releases of Redis Cluster.
Cheers,
Salvatore
> If we have a big sorted set, this is going to be a blocking operation indeed.
> Now, 1 million elements is still ok in *some* applications, as this
> will just take about 3 seconds or less (already tested), but that is
> about the acceptable limit, or past the limit if you have many keys
> of that size.
> What we already planned to improve here is to block just that
> *single* key. The serialization process must still be done in a
> blocking way, but we can do the I/O in the background using our usual
> event-driven way while the key is blocked; that is, if other clients
> arrive asking for this key we'll block just those clients instead of
> the whole instance.
> That said, the main use case with this design is clearly "many many
> keys, relatively small keys".
> If the application also needs a few very big objects, it is much
> better to put these big objects on a different non-clustered server.
> Another possibility could be to reserve a hash slot for big keys,
> like: keys with a given prefix will always hash to a given hash slot.
> This could be interesting in theory as one could rehash everything
> without touching hash slot 1.
Regards,
- Josiah
On Thu, Oct 6, 2011 at 5:57 AM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
> The MIGRATE command could just use a version of that replication code to
> slowly bring machine B up to speed. Machine B issues the SYNC command to
> machine A, and once B is "live" and up to speed, all that's left to do is
> for machine A to reply with -MOVED to requests for the keys we just moved
> and point them to machine B. After this is finished, it is announced that
> machine B has the slot and all traffic starts going to machine B.
Pieter and I thought about this approach as well. The problem is
that SYNC will transfer a point-in-time (in the past) version of the
dataset. This means that we need to accumulate writes against keys in
A in the meantime. It is very easy to end up with an access pattern
that makes moving the slot a never ending process.
The current approach instead has the good property that once you
start migrating a slot, all the new data is accumulated in the
receiving node, and data that is sent to the original node (for
instance an LPUSH against an already existing key) does not need any
tracking; we'll just move it when we reach that key while iterating
the key space.
About the speed: MIGRATE, even if not in the background yet (we can
move it to the background, blocking just one key at a time), uses the
same format as SYNC, that is the Redis RDB format, so we expect it to
be very fast as well.
> Besides that, I'm trying to estimate how well the "smart client" concept
> works in certain cases. We're using Redis on PHP quite heavily and the
> shared state between two calls is limited to the persistent connection to
> the Redis server. We do thousands of very short requests per second and
> our PHP processes don't live all that long, so every few minutes or seconds
> they would have to figure out the whole cluster. I'm not sure how efficient
> it would be to implement a 'smart' client for this kind of application. I
> guess a PHP plug-in could save some state between PHP requests, but in
> general it complicates things quite a bit.
There are two things that can be done in this case.
1) Well the obvious way is to ask for "cluster nodes" every time a new
client is created. This is the preferred approach.
2) If you want to avoid the additional RTT needed to call cluster
nodes, just serialize the previous client routing table in a session
var or something like that, and use something like
$redis->restoreClusterTable($my_old_table).
Make sure to update it from time to time...
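For the short-lived-worker case, option 2 can be sketched as a tiny freshness-checked cache of the routing table. The file path, the `max_age` cutoff, and the JSON layout are all illustrative choices, not anything Redis prescribes:

```python
import json
import time

def save_table(path, table):
    # Persist the last known slot->node map together with a timestamp.
    with open(path, "w") as f:
        json.dump({"at": time.time(), "table": table}, f)

def load_table(path, max_age=30.0):
    # Return the cached map if present and fresh, else None (in which
    # case the caller falls back to asking the cluster for its nodes).
    try:
        with open(path) as f:
            blob = json.load(f)
    except (OSError, ValueError):
        return None
    if time.time() - blob["at"] > max_age:
        return None
    return blob["table"]
```

Even a stale table is usable, since a wrong guess just costs one -MOVED redirection, which is exactly the degradation mode the cluster design intends.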
Thanks for the feedback,
Salvatore
> Best,
> Jan
>
> What I like about that solution is:
> (1) It doesn't rely on the client being "smart" in any way. The smarts are
> in the cluster.
This is definitely an approach, the Cassandra one for instance. I'm
not debating whether this approach or the reverse is better, but the
problem I see is that Redis Cluster is like the "microkernel" of
clusters, that is, it tries to do just the core functions that can't
be done efficiently outside (that is, to keep state) and lets the
client do all the work to retrieve the data.
So this would be contrary to our current design, even if not bad per se.
> (2) We get to think of the cluster as a unified system and focus on the
> data.
You can think of Redis Cluster as a unified system from the point of
view of coherent routes, but not as a system that will do the work of
getting the data for the client. But there are good things about that
as well. This makes the range of tradeoffs you can make client side
more interesting: Redis Cluster is one server-side implementation, but
different clients may stress one feature or another to reach different
speed/safety tradeoffs.
> (3) In the case where the redis cluster nodes are "near" each other but some
> network clients may be farther away, the clients don't pay the round-trip
> penalty for fetching data that is being migrated.
Unfortunately this is not the case, as the client is required to
start by asking the node authoritative for that hash slot (A), and
then ask (B).
But the good thing is that this only happens during rehashing, and for
1 part out of 4096 of the key space at a time.
I see it this way: either every cluster node should be able to proxy
full requests, or it is better if we don't do it at all; this is the
main reason why I'm biased towards solution #2. The good thing is that
we already have a working and fast binary-protocol-speaking link
between nodes, so it will be easy for us to revert to solution #1 if
#2 does not work well enough.
Thanks for the feedback,
Hello Matthew,
actually the proposed design where the clients handle part of the
protocol is absolutely reliable: the cluster will never accept a
misplaced query. IMHO it is not a good point of view to look at this
as a Redis client author, as what we want in the end is a system that
works well, and a system is composed of client and server. For this
system to work at its best we need to exploit the parallelism of the
cluster, and make sure that clients will try hard to hit the right
server. If we did this with a proxy instead, clients would have no
reason to try to hit the good node, the implementation would be much
more complex, and cluster performance would be poor if the client is
not good. Also, the client in this case must use some special way to
get updates about cluster configuration changes.
The redirect solution instead allows the client config to be updated
incrementally, but still does not require the config to be updated,
and it simplifies the overall structure a lot, since proxying queries
in an event-driven server is a lot more complexity than it is for a
client to be redirected to the right node.
However this part of the cluster is no longer something that I'm
currently designing; it is written in stone *unless* experience proves
that it does not work well.
Cheers,
Hi Adrien,
This is not how it works. Resharding is performed by asking node A
for a few key names in the hash slot we want to move, and then calling
MIGRATE against these keys.
I hope to be able to write some documentation about the design of
Redis Cluster, covering the part that is already implemented in a
detailed way and the part still to be implemented in a more generic
way, so that users have access to the design specification.
Cheers,