Deterministic SCAN


Sebastian Ertel

unread,
Jan 9, 2014, 7:55:40 AM
to redi...@googlegroups.com
Hi Redisers,

I have a question regarding the SCAN function:
What is the purpose of its random nature?
Is it because the keys in redis are stored as a set and there it is not possible to create a deterministic SCAN order?

I use Redis in a research project as intermediate storage. The key space is quite big (> 2 million keys), and what I want is for N processes (on different machines) to each retrieve a certain part of the keys for further processing. I exploited Redis's sequential request handling and created the following Lua script, which performs partial scans and stores the state (cursor) on the Redis server:

local db = ARGV[1] or 11
local count = ARGV[2] or 100
local s=0
redis.call('SELECT', '1')
if redis.call('EXISTS', 'scan_cursor') == 1 then
  s=redis.call('GET', 'scan_cursor') 
end
redis.call('SET', 'scan_cursor', s+count) 
redis.call('SELECT', db)
if s+count > redis.call('DBSIZE') then
  return nil
else 
  local result = redis.call('SCAN', s, 'COUNT', count) 
  return redis.call('MGET', unpack(result[2]))
end

Any process can continue the scan which is fine by my application.
However, as the documentation of the SCAN function says, a SCAN can return items twice. This renders the SCAN function unusable for me. My current solution takes all keys and puts them into a separate list to achieve the same effect. This has the drawback, though, that I need to copy the whole key space, which accounts for most of the space in the database.

Is there any other way to accomplish what I want? 
If nobody alters the state in the DB during the scan operation then it would be great to have a more deterministic response.

I would be happy to hear some thoughts on that.

Cheers,
Sebastian

Josiah Carlson

unread,
Jan 9, 2014, 11:27:05 AM
to redi...@googlegroups.com
Replies inline,

On Thu, Jan 9, 2014 at 4:55 AM, Sebastian Ertel <sebasti...@gmail.com> wrote:
Hi Redisers,

I have a question regarding the SCAN function:
What is the purpose of its random nature?
Is it because the keys in redis are stored as a set and there it is not possible to create a deterministic SCAN order?

The keys are stored in a hash table. There *is* a trivially deterministic SCAN order, but it requires that the keyspace not change during evaluation. Without potentially substantial overhead for limited discernible gain, it isn't really possible to guarantee no repeats when incrementally iterating over a subset of the keyspace. If you are having difficulty understanding this, imagine your client waiting hours between SCAN steps.
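For the curious, the cursor order comes from the way Redis's dictScan walks the hash-table buckets: reverse-binary-increment ordering, which keeps the cursor valid across table resizes. A minimal Python sketch of just the bucket ordering (not the rehashing logic, and the function names are my own):

```python
def rev_bits(v, bits=32):
    # Reverse the lowest `bits` bits of v.
    r = 0
    for _ in range(bits):
        r = (r << 1) | (v & 1)
        v >>= 1
    return r

def scan_order(table_bits):
    # Visit every bucket of a 2**table_bits hash table in the
    # reverse-binary-increment order used by Redis's dictScan.
    # Iteration ends when the cursor wraps back to 0.
    mask = (1 << table_bits) - 1
    order, v = [], 0
    while True:
        order.append(v & mask)
        v |= ~mask & 0xFFFFFFFF   # set the high bits so the carry propagates
        v = rev_bits(v)
        v = (v + 1) & 0xFFFFFFFF
        v = rev_bits(v)
        if v == 0:
            break
    return order
```

For a size-8 table this yields 0, 4, 2, 6, 1, 5, 3, 7: every bucket exactly once, in an order that stays consistent when the table doubles in size.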

I use Redis in a research project as intermediate storage. The key space is quite big (> 2 million keys), and what I want is for N processes (on different machines) to each retrieve a certain part of the keys for further processing. I exploited Redis's sequential request handling and created the following Lua script, which performs partial scans and stores the state (cursor) on the Redis server:

local db = ARGV[1] or 11
local count = ARGV[2] or 100
local s=0
redis.call('SELECT', '1')
if redis.call('EXISTS', 'scan_cursor') == 1 then
  s=redis.call('GET', 'scan_cursor') 
end
redis.call('SET', 'scan_cursor', s+count) 
redis.call('SELECT', db)
if s+count > redis.call('DBSIZE') then
  return nil
else 
  local result = redis.call('SCAN', s, 'COUNT', count) 
  return redis.call('MGET', unpack(result[2]))
end

Any process can continue the scan which is fine by my application.
However, as the documentation of the SCAN function says, a SCAN can return items twice. This renders the SCAN function unusable for me. My current solution takes all keys and puts them into a separate list to achieve the same effect. This has the drawback, though, that I need to copy the whole key space, which accounts for most of the space in the database.

Is there any other way to accomplish what I want? 
If nobody alters the state in the DB during the scan operation then it would be great to have a more deterministic response.

If no one is manipulating the DB during the scan operation, and Redis is not in the process of rehashing (due to previous growth/shrinking of the main keyspace), then SCAN should not return repeated keys. The repeated keys are primarily due to rehashing during main hash table resizing.

Alternatively, in this situation you could just run "KEYS *" on one of your clients, then handle work distribution by RPUSHing packed blocks of your keys back to a Redis list for consumption by your BLPOP-executing clients. There could also be other faster solutions that can reduce the size of your SCAN or KEYS result, but still get you blocks of keys... are your keys of a certain pattern? Can you show us a few example keys?
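The KEYS-plus-list distribution could look roughly like this in Python with redis-py (the list name 'work:blocks', the block size, and the separator byte are all invented for illustration); the chunking itself is plain list slicing:

```python
def chunk(keys, size):
    # Split a flat list of keys into fixed-size blocks for distribution.
    return [keys[i:i + size] for i in range(0, len(keys), size)]

# Hypothetical producer/worker sketch (assumes a running Redis and redis-py):
#
#   import redis
#   r = redis.Redis()
#   # Producer: one KEYS call, then push packed blocks onto a work list.
#   for block in chunk(r.keys('*'), 1000):
#       r.rpush('work:blocks', b'\x00'.join(block))
#
#   # Worker: block until a packed block is available, then process its keys.
#   _, packed = r.blpop('work:blocks')
#   for key in packed.split(b'\x00'):
#       value = r.get(key)
```

Each list entry carries a whole block, so a worker claims a thousand keys with a single BLPOP instead of contending once per key.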

 - Josiah

I would be happy to hear some thoughts on that.

Cheers,
Sebastian

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/groups/opt_out.

Sebastian Ertel

unread,
Jan 9, 2014, 3:30:19 PM
to redi...@googlegroups.com
No alterations are performed while scanning, so it might be the rehashing effects that I'm seeing.
Here is the simple test program (Jedis client) that I'm running:

Jedis jedis = new Jedis("127.0.0.1");
jedis.select(11);
jedis.flushAll();
jedis.scriptFlush();
String scanScript = "local db = ARGV[1] or 11 "
    + "local count = ARGV[2] or 100 "
    + "local s=0 "
    + "redis.call('SELECT', '1') "
    + "if redis.call('EXISTS', 'scan_cursor') == 1 then "
    + "s=redis.call('GET', 'scan_cursor') "
    + "end "
    + "redis.call('SET', 'scan_cursor', s+count) "
    + "redis.call('SELECT', db) "
    + "if s+count > redis.call('DBSIZE') then "
    + "return nil "
    + "else "
    + "local result = redis.call('SCAN', s, 'COUNT', count) "
    + "return redis.call('MGET', unpack(result[2])) "
    + "end";
String scriptID = jedis.scriptLoad(scanScript);
for (int i = 0; i < 100; i++) jedis.set("key" + i, "value" + i);
Object result = jedis.evalsha(scriptID, 0, "11", "20");
while (result != null) {
    System.out.println(result);
    System.out.println(((List) result).size());
    result = jedis.evalsha(scriptID, 0, "11", "20");
}
jedis.disconnect();


Sample output looks like so:

[value82, value97, value83, value25, value43, value45, value84, value75, value11, value59, value85, value15, value50, value32, value40, value96, value94, value44, value86, value23]

[value44, value86, value23, value3, value37, value58, value92, value26, value8, value38, value89, value60, value6, value29, value28, value81, value21, value1, value69, value52]

[value85, value15, value50, value32, value40, value96, value94, value44, value86, value23, value3, value37, value58, value92, value26, value8, value38, value89, value60, value6]

[value89, value60, value6, value29, value28, value81, value21, value1, value69, value52, value17, value55, value70, value93, value13, value16, value47, value4, value65, value64]

[value25, value43, value45, value84, value75, value11, value59, value85, value15, value50, value32, value40, value96, value94, value44, value86, value23, value3, value37, value58]

The order seems deterministic, but the starting point does not. I'm also pretty sure that SRANDMEMBER is being used during the SCAN, because SET calls no longer work afterwards.

- Sebastian

Josiah Carlson

unread,
Jan 9, 2014, 6:42:30 PM
to redi...@googlegroups.com
SRANDMEMBER is not being called by your script at all. It's not there, and Redis doesn't perform arbitrary operations just to mess with you. SET doesn't work in your script because you are performing an operation (SCAN) that doesn't necessarily have consistent output. So when you try to write, Redis detects a write after a non-deterministic read and prevents it, because it would break Lua scripting assumptions. For the reasons why, read the first paragraph of my post here: https://groups.google.com/d/msg/redis-db/UwBfuOQQN_4/j3Y-mwMvg5YJ . It's about ZSCAN, but the implications are the same for SCAN, ZSCAN, SSCAN, and HSCAN.

 - Josiah

Sebastian Ertel

unread,
Jan 10, 2014, 3:41:18 AM
to redi...@googlegroups.com
I did not say that it is called by my script, but some non-deterministic operation is used by SCAN, which is why a write operation fails afterwards: the scripting assumptions are violated (which I know). I thought I had read somewhere that SCAN internally calls SRANDMEMBER. If it doesn't, then never mind.

All I'm saying is that this non-deterministic operation might be the reason for the non-deterministic output that I'm seeing. Is that correct? What is this non-deterministic operation? Does it have to do with the rehashing that you mentioned? Why would Redis rehash if no new entries are being created? Can you please explain the internals using the simple example I gave above? The program only inserts 100 keys, so there should not be too much for Redis to do, and the internal state should just stay the same.

I'm really thankful for your comments but I just can't figure out a use case for a non-deterministic SCAN function. I do understand that a single SCAN over some keys can be useful but then why would I want this cursor concept? I could just start another SCAN without any cursor and hope to get different values this time.

- Sebastian

Josiah Carlson

unread,
Jan 10, 2014, 2:41:52 PM
to redi...@googlegroups.com
Replies inline.

On Fri, Jan 10, 2014 at 12:41 AM, Sebastian Ertel <sebasti...@gmail.com> wrote:
I did not say that it is called by my script, but some non-deterministic operation is used by SCAN, which is why a write operation fails afterwards: the scripting assumptions are violated (which I know). I thought I had read somewhere that SCAN internally calls SRANDMEMBER. If it doesn't, then never mind.

I'm not seeing it in the docs. And a quick scan of the source in db.c and dict.c shows that there is no call to any random source at all.

All I'm saying is that this non-deterministic operation might be the reason for the non-deterministic output that I'm seeing. Is that correct?

No.
 
What is this non-deterministic operation?

SCAN. As I said in the other thread, the order of results returned by SCAN is a function of several things, none of which are necessarily consistent on a fully up-to-date slave, a master replaying an AOF, or even two masters identically configured that were started up with two identical RDBs. So you *can't* make *any* assumptions about the results returned by SCAN, other than: barring certain types of writes, you should pick up essentially every key from the keyspace that matches the optional pattern.
 
Does it have to do with the rehashing that you mentioned?

Maybe. Maybe not.
 
Why would redis rehash if no new entries are being created?

It could be a holdover from a previous rehashing that hadn't finished. Rehashing in Redis occurs over time.
 
Can you please explain the internals using the simple example I gave above? The program only inserts 100 keys, so there should not be too much for Redis to do, and the internal state should just stay the same.

For the simplicity of implementation and explanation of SCAN (and other related functions), SCAN is considered a random operation for the several reasons I've already explained. It won't allow you to write after reading, because Redis is trying to prevent you from doing things that will be inconsistent. Why are you getting different orders when you call? I can think of several possible reasons, but I can't tell you for sure, because I don't know all of the things that Redis does with hashes.

I'm really thankful for your comments but I just can't figure out a use case for a non-deterministic SCAN function.

It replaces KEYS for people who want to examine their keyspace (find high-memory keys, keys that should be deleted but weren't, ...), but who don't want all other commands to be delayed by seconds or minutes in some cases in order to return the list of keys. RANDOMKEY was a way of sampling the keyspace, but it's not nearly good enough for people who really need to examine the keyspace.
 
I do understand that a single SCAN over some keys can be useful but then why would I want this cursor concept?

To try to examine all keys. SCAN is a replacement (in some cases) for KEYS, like SSCAN is a replacement for SMEMBERS, HSCAN is a replacement for HGETALL, and ZSCAN is a replacement for certain ZRANGE calls. This is the incremental reading of certain data that may be large, and thus heavy to read completely. That's why the commands were introduced.
 
I could just start another SCAN without any cursor and hope to get different values this time.

Hoping is not the same as guaranteeing. SCAN and friends have certain guarantees if called with cursors. Those guarantees are listed in the documentation for SCAN - primarily that if a key is consistently in or out of the keyspace during a full iteration, the key will be returned at least once. If the key is inconsistently present, then it may or may not show up. Calling SCAN repeatedly without a cursor is likely to return one of a group of keys at the start of the hash table for most calls. 
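Those guarantees translate directly into client code: run one complete cursor iteration, and de-duplicate on the client side since stable keys are delivered at least once. A sketch with redis-py (the server interaction is assumed; the de-duplication is plain Python):

```python
def full_scan(r, count=100):
    # Run one complete SCAN iteration: keep feeding the returned
    # cursor back in until the server hands back cursor 0.
    cursor = 0
    while True:
        cursor, batch = r.scan(cursor=cursor, count=count)
        yield batch
        if cursor == 0:
            break

def dedupe(batches):
    # SCAN guarantees at-least-once delivery for keys that stay in the
    # keyspace, so duplicates must be filtered on the client.
    seen = set()
    for batch in batches:
        for key in batch:
            if key not in seen:
                seen.add(key)
                yield key

# Against a live server this would be, hypothetically:
#   unique_keys = list(dedupe(full_scan(redis.Redis())))
```

The `seen` set costs client memory proportional to the keyspace, which is the price of turning at-least-once delivery into exactly-once.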

Sebastian Ertel

unread,
Jan 13, 2014, 3:18:48 AM
to redi...@googlegroups.com
I think what I do not understand hides behind this line of yours:
"SCAN is a function of several things, none of which are necessarily consistent on a fully up-to-date slave, a master replaying an AOF, or even two masters identically configured that were started up with two identical RDBs"

I don't have any distributed Redis setup; I just have a single Redis server running. Therefore, it ought to be possible in that case to provide some more guarantees. However, I do understand that it might be too much to explain every detail here. Maybe I'll find some time to implement what I envision in Redis and share it with you.

Thanks for the discussion,
Sebastian

Josiah Carlson

unread,
Jan 13, 2014, 1:54:19 PM
to redi...@googlegroups.com
On Mon, Jan 13, 2014 at 12:18 AM, Sebastian Ertel <sebasti...@gmail.com> wrote:
I think what I do not understand hides behind this line of yours:
"SCAN is a function of several things, none of which are necessarily consistent on a fully up-to-date slave, a master replaying an AOF, or even two masters identically configured that were started up with two identical RDBs"

I don't have any distributed Redis setup; I just have a single Redis server running. Therefore, it ought to be possible in that case to provide some more guarantees.

Actually, no. If you were to execute BGSAVE on your server and copy the resulting RDB to another server, then even if you executed your script on both servers at the same time, you could get different results. This is part of what the current semantics are intended to prevent.
 
However, I do understand that it might be too much to explain every detail here. Maybe I'll find some time to implement what I envision in Redis and share it with you.

If you are talking about making writing after a SCAN call allowed, I can guarantee that Salvatore is going to reject your proposal for exactly the reason I mentioned earlier in this email.

Why do you want to call SCAN? What do you hope to do with the results?

 - Josiah

Sebastian Ertel

unread,
Jan 14, 2014, 3:26:31 AM
to redi...@googlegroups.com
On Jan 13, 2014, at 7:54 PM, Josiah Carlson <josiah....@gmail.com> wrote:


On Mon, Jan 13, 2014 at 12:18 AM, Sebastian Ertel <sebasti...@gmail.com> wrote:
I think what I do not understand hides behind this line of yours:
"SCAN is a function of several things, none of which are necessarily consistent on a fully up-to-date slave, a master replaying an AOF, or even two masters identically configured that were started up with two identical RDBs"

I don't have any distributed Redis setup; I just have a single Redis server running. Therefore, it ought to be possible in that case to provide some more guarantees.

Actually, no. If you were to execute BGSAVE on your server and copy the resulting RDB to another server, then even if you executed your script on both servers at the same time, you could get different results. This is part of what the current semantics are intended to prevent.

Again you turn my setup into a distributed system, but my setup is totally trivial and involves no copying of any kind. There is only a single instance on one single node that I use as a cache.

Just for the sake of this conversation: I'm familiar with consistency problems in distributed systems (I know that it almost always requires some sort of synchronization which would violate the asynchronous nature of most of Redis' calls).

But let me refer to your example:
If I save a database in state S, running on server s1, to disk via BGSAVE, then copy that state S to another server s2 and start another Redis instance there that loads S, then, assuming there were no further modifications to the keyspace on s1 after the BGSAVE command, a deterministic SCAN could be started on both instances and would deliver the exact same result. For that I could just use the SORT function over the keyspace, which is already present in Redis.

However, even if the keyspace changed on s1 after the BGSAVE, any SCAN that works over a sorted keyspace would deliver a deterministic result. The determinism here is not about consistency but merely about the guarantee of the SCAN function to never return duplicate keys, which makes the cursor concept much more intuitive and would probably open up many more use cases.

 
However, I do understand that it might be too much to explain every detail here. Maybe I'll find some time to implement what I envision in Redis and share it with you.

If you are talking about making writing after a SCAN call allowed, I can guarantee that Salvatore is going to reject your proposal for exactly the reason I mentioned earlier in this email.

I don't think I said that. See my comment above to better understand what I mean.
In any case, this "experiment" can only have two outcomes: either you find the resulting deterministic SCAN function reasonable, or I finally understand what the problem with such a function in Redis is, because as of now I just don't.


Why do you want to call SCAN? What do you hope to do with the results?

My use case is simple data parallelism: I want to split the key space into N parts and assign each part to a worker. Duplicate or missing keys are not allowed for my use case.

- Sebastian


Salvatore Sanfilippo

unread,
Jan 14, 2014, 4:02:20 AM
to Redis DB
On Thu, Jan 9, 2014 at 1:55 PM, Sebastian Ertel
<sebasti...@gmail.com> wrote:

> What is the purpose of its random nature?

That order costs space and time, especially in the context of
iterating unordered collections with a cursor.

SCAN is an exercise in doing just enough to maximize the gains in space and time, to the point that its entire state is a single 64-bit cursor.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

To "attack a straw man" is to create the illusion of having refuted a
proposition by replacing it with a superficially similar yet
unequivalent proposition (the "straw man"), and to refute it
— Wikipedia (Straw man page)

Sebastian Ertel

unread,
Jan 14, 2014, 4:31:48 AM
to redi...@googlegroups.com
Now that answer, I totally understand. Thanks for that.

So it is actually possible. Do you think it would be hard to provide this functionality for specific databases?
Say in my redis.conf I specify that database 10's keys should be sorted. Then the only additional thing to store is the sort order of the keys. Alternatively, one could create that additional information via another flavor of the SORT command.

Redis is very efficient, but at the cost of guarantees. In the benchmarks you show that reading a large number of keys is very fast. However, in order to do so I always need to know the keys, i.e. the keys must either be deterministically derivable from something or be stored somewhere else. If this is not the case and I want to explore the contents of Redis, then I think I'm in trouble. In that case I'm left with the KEYS * option, which retrieves or at least duplicates all keys, and an uncertain SCAN operation.

Cheers,
Sebastian

Salvatore Sanfilippo

unread,
Jan 14, 2014, 4:36:49 AM
to Redis DB
Unfortunately, adding such a feature is not planned. Inside scripts you have KEYS, which is marked with the "Sort" flag in the command table, so when called from a Lua script its return value gets sorted.
Otherwise you can still SCAN and pass N keys at a time to a Lua script, or use a sorted set into which you insert whatever you need sorted. And so forth.
Depending on the problem to solve, it is likely possible to find a decent solution with the tools we have; otherwise the task may simply not be a perfect fit for something as unstructured as Redis.

Regards,
Salvatore



Sebastian Ertel

unread,
Jan 14, 2014, 4:49:27 AM
to redi...@googlegroups.com
Thanks, that actually answered another question of mine with respect to the sorted output of the KEYS command in Lua scripts. I currently retrieve all keys via KEYS and store them in a list (all in a Lua script, of course) to achieve what I want. However, I have a huge key space, so this almost doubles the size of the DB.

Would there be any interest in such an addition to Redis, or would it violate one of its design decisions? If not, I might take a crack at it as soon as I find some spare cycles.

Thanks again for the clarifying answers,
Sebastian

Salvatore Sanfilippo

unread,
Jan 14, 2014, 4:53:34 AM
to Redis DB
Currently there is no plan to have keys sorted. In the future it is not clear what will happen, since Redis Cluster instances are forced to keep keys sorted in a specific way; currently only instances in cluster mode pay the cost for this. Basically, in the short term no change is planned. I believe that if you have to duplicate your key names into a list, there are likely other possible solutions. Maybe you can describe in simple terms what you are trying to achieve, in the hope that there is a better schema.

Regards,
Salvatore


Sebastian Ertel

unread,
Jan 14, 2014, 10:03:07 AM
to redi...@googlegroups.com
I have a rather large set of keys. The keys are strings, e.g. words or even multiple words. I am running a benchmark, so I cannot change that data set. Redis is only meant as temporary storage.

After sorting the key space, I want to split the data in N partitions such that N workers can retrieve them and perform further processing of the data in parallel.

I also read the section on memory optimization in the documentation, but I do not think that hashes can help here.
Currently my Lua script retrieves all keys via KEYS * and stores them as a list under a new key called "ordered_keys". From there, each of the N workers can retrieve its partition of the key space, e.g. worker 1 is assigned keys 1 up to 1000, and so on.
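The per-worker slice of the "ordered_keys" list comes down to index arithmetic; a sketch (the function name and the even-split policy are my own):

```python
def partition_bounds(total, n_workers, i):
    # Inclusive (start, stop) LRANGE bounds for worker i when a list
    # of `total` keys is split as evenly as possible across n_workers.
    base, extra = divmod(total, n_workers)
    start = i * base + min(i, extra)
    stop = start + base + (1 if i < extra else 0) - 1
    return start, stop

# Worker i would then fetch its partition with, hypothetically:
#   keys = r.lrange('ordered_keys', *partition_bounds(total, N, i))
```

Since LRANGE bounds are inclusive on both ends, the `stop` value is the index of the last key in the partition, and adjacent partitions never overlap.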

I do not know how to preserve the sorting information without duplicating the keyspace. Do you have any idea?

Thanks for your help,
Sebastian

Salvatore Sanfilippo

unread,
Jan 14, 2014, 10:16:26 AM
to Redis DB
You should probably simulate a poor man's tuple space.

You put your keys in database 0. When a worker needs a new key to process, it sends a Lua script that moves the key into database 1 for processing (as a form of locking).
There are no race conditions here: either the key is moved (with the MOVE command) to the other DB, or it was already taken by another worker.

After processing, the key is moved further, into database 2.

In database 1, while the key is being processed, there is always a hash associated with the key holding meta-data about it, such as when the key was locked. When a key stays locked for too long, it is returned to database 0 so that another worker can process it; this way you handle crashed workers well. The job of incrementally moving keys from DB 1 back to DB 0 can be performed by the workers themselves from time to time (for example, every 100 keys processed, they check a random entry in DB 1 to see if it is stale and needs to be moved).

I understand I probably did not cover all the details, but it sounds like a setup that could work for you.
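A rough Python rendering of this scheme with redis-py (the DB numbers follow the message above; the metadata key prefix and timeout are invented). Only the staleness check is plain logic; the Redis calls are sketched in comments:

```python
import time

LOCK_TIMEOUT = 300  # invented: seconds before a locked key is considered abandoned

def is_stale(locked_at, now, timeout=LOCK_TIMEOUT):
    # A key sitting in DB 1 longer than `timeout` belongs to a crashed
    # worker and should be moved back to DB 0 for reprocessing.
    return now - locked_at > timeout

# Hypothetical worker loop (assumes redis-py connections r0/r1 to DB 0/1):
#
#   key = r0.randomkey()
#   if key is not None and r0.move(key, 1):        # atomic claim: one worker wins
#       r1.hset(b'meta:' + key, 'locked_at', time.time())
#       process(key)                               # application-specific work
#       r1.delete(b'meta:' + key)
#       r1.move(key, 2)                            # done: park the key in DB 2
```

The MOVE command fails if the key already exists in the target DB, which is exactly what makes the claim step race-free without any explicit lock.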

Salvatore


Salvatore Sanfilippo

unread,
Jan 14, 2014, 10:18:46 AM
to Redis DB
P.S.: please note that this also works in a context where Redis is just one node in the path of a more complex flow.

You could have other workers moving keys into Redis DB 0 from another entity, and once keys have been processed into DB 2, yet another worker moving them from DB 2 to another external entity.

This external entity may be an on-disk database, a filesystem, whatever.

Sebastian Ertel

unread,
Jan 14, 2014, 10:38:31 AM1/14/14
to redi...@googlegroups.com
Moving keys from one database to the other is definitely an interesting approach.

It would definitely work when every worker retrieves all its keys at once. It seems I need as many databases as there are chunks to be retrieved by the workers. Hmm, if there is no limit on the number of databases, then this might be a valid option.

It would double the sorting cost (one global sort, plus a sort when retrieving all keys from a particular "partition DB"), but it would certainly solve the space issue.

I will try that out, thanks a lot!

Cheers,
Sebastian