Any way to use strings as the scores in a Redis sorted set (zset)?

656 views
Skip to first unread message

Timothy Burgess

unread,
Oct 18, 2014, 4:26:22 PM10/18/14
to redi...@googlegroups.com

Or maybe the question should be: What's the best way to represent a string as a number, such that sorting their numeric representations would give the same result as if sorted as strings? I devised a way that could sort up to 9 characters per string, but it seems like there should be a much better way.

In advance, I don't think using Redis's lexicographical commands will work. (See the following example.)

Example: Suppose I want to presort all of the names linked to some ID. Ideally I would have the IDs as the zset's members, and the numeric representation of each name would be the zset's scores.

Does that make sense? Or am I going about it wrong?

binoy...@gmail.com

unread,
Oct 19, 2014, 7:05:03 AM10/19/14
to redi...@googlegroups.com
Experiment with Hashcodes of each string, see what values they show.

Secondly you can create a Zset for each ID and the names are members of that zset

Itamar Haber

unread,
Oct 19, 2014, 7:43:22 AM10/19/14
to redi...@googlegroups.com
You can also refer to this earlier discussion in the mailing list: https://groups.google.com/d/msg/redis-db/naP9H2WHK_I/KQc2Jn4oFvkJ - but I do think that you can use ZLEX* if you construct your set's members accordingly (e.g. `name:id`)

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/d/optout.



--

Itamar Haber | Chief Developers Advocate
Redis Watch Newsletter - Editor and Janitor
Redis Labs - Enterprise-Class Redis for Developers

Mobile: +1 (415) 688 2443
Mobile (IL): +972 (54) 567 9692
Email: ita...@redislabs.com
Skype: itamar.haber

Blog  |  Twitter  |  LinkedIn


Timothy Burgess

unread,
Oct 19, 2014, 8:58:24 AM10/19/14
to redi...@googlegroups.com
That is really clever!  Seems like it should work without issue.  Thanks a ton!

Josiah Carlson

unread,
Oct 19, 2014, 2:54:46 PM10/19/14
to redi...@googlegroups.com
If you are looking for cleaner code, I use the function linked below in github.com/josiahcarlson/rom to generate scores from string prefixes/suffixes to be used in prefix/suffix matching:

Note that the function properly distinguishes between strings of nulls and strings that are shorter than 7 characters.

 - Josiah
Message has been deleted

Timothy Burgess

unread,
Oct 20, 2014, 8:47:14 AM10/20/14
to redi...@googlegroups.com
Well, I just sat down and started working on the project that this question was about, and immediately realized this won't work for what I want to do, or at least not as efficiently as I would like. I would like to be able to use ZINTERSTORE to intersect a SET of IDs with the ZSET, and this would require the ZSET to use the IDs as its members.


On Sunday, October 19, 2014 7:43:22 AM UTC-4, Itamar Haber wrote:

Itamar Haber

unread,
Oct 20, 2014, 10:42:08 AM10/20/14
to redi...@googlegroups.com
In that case, Josiah's approach makes even more sense as it will allows you to use "pure" IDs as the sorted set's members, thus allowing for set operations. You can pick up and use the _prefix function as is (if you're working in Python that is) as long as the 7-character "limit" is acceptable in your use case.

Perhaps I'm stating the obvious, but in some cases it is easier & cheaper to sort only the end result (e.g. the result of an intersect is usually smaller than the inputs). Furthermore, sorting can be done either server- or client-side - your choice should be based on the use case's requirements and then on the functionality offered by your tools of choice. Put differently - why not intersect the ID sets in Redis and then sort them by name? Is there a specific reason for keeping the IDs sorted by name in the first place (apart from saving CPU cycles)?

Timothy Burgess

unread,
Oct 20, 2014, 5:18:46 PM10/20/14
to redi...@googlegroups.com
I suppose I'll start from the beginning and explain what I'm trying to do.

1)  Users create any resource with arbitrary properties via a RESTful API, where the properties of that resource can be of any Redis data type.  I should probably note that users don't normally need to bother with the API, as most calls to it are handled automatically by the app.  Users typically only specify the data types (and even then, some properties' data types are determined automatically).

2)  The keys representing each property look something like <namespace>:<property>:<id>, but for certain data types (like strings and numbers) it defaults to Redis hashes in the form of <namespace>:<property>:<id.substring(0,4)> for the key and <id.substring(4)> for the field.

3)  The API works similarly to Redis.  For example, suppose you have some list called "exampleList".  You could POST some value to /<id>/exampleList?_bUnshift=1 to put it at the beginning of the list, or DELETE /<id>/exampleList?_start=1&_stop=3 to trim it.

4)  Users can retrieve multiple resources at once via queries.  So for example, suppose you wanted to retrieve all the properties of red shirts of a certain brand, that request might look like: GET /query?type=shirt&color=red&brand=foo&_limit=10.  Or if you only want to retrieve a few specific properties: GET /query/name,description,price?type=shirt&color=red&brand=foo&_limit=10.  This would return the latest 10 resources matching those properties.  Or if you wanted to sort by name: GET /query?type=shirt&color=red&brand=foo&_limit=10&_sortBy=name.

All of that is fairly standard, basic stuff, I believe.  But hopefully that's enough info to understand my use case, and I'll try to explain my reasoning for wanting to be able to somehow use strings as scores in ZSETs.

On the backend, the queries are handled using SINTER.  When some resource is created or its properties are changed, there are sets in Redis whose keys look like <namespace>:by:<property>:<value> and contain all the IDs of the property:value pairs.  So for instance, suppose some resource whose ID is 5 has a property called "color" and its value is "red".  The Redis set at key <namespace>:by:color:red would contain the ID "5".  And the call to Redis for the above query would look something like "SINTER <namespace>:by:type:shirt <namespace>:by:color:red <namespace>:by:brand:foo" and it would return all of the matching IDs.

Now for the problem of sorting by strings... I anticipate there sometimes being an incredibly large number of IDs resulting from the intersection, and so I would rather not have to retrieve the (string) property for each ID if it's feasible to have everything presorted (which I seem to be able to do up to 9 characters) and then intersected using ZINTERSTORE.

I am aware of Redis's SORT command, but wouldn't it be considerably slower than using ZINTERSTORE?  Plus, I think I would need to restructure certain property types (see point 2 above) to make it work, and I would prefer to not do that.

Does all of that make sense?  If I'm completely overlooking something or if this is simply a bad design or if there's a flaw in my logic somewhere, please let me know!

Thanks,

Tim

--
You received this message because you are subscribed to a topic in the Google Groups "Redis DB" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/redis-db/ifmotqIDwi8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to redis-db+u...@googlegroups.com.

Josiah Carlson

unread,
Oct 21, 2014, 5:44:38 PM10/21/14
to redi...@googlegroups.com
On Mon, Oct 20, 2014 at 2:18 PM, Timothy Burgess <timothy.b...@gmail.com> wrote:
I suppose I'll start from the beginning and explain what I'm trying to do.

1)  Users create any resource with arbitrary properties via a RESTful API, where the properties of that resource can be of any Redis data type.  I should probably note that users don't normally need to bother with the API, as most calls to it are handled automatically by the app.  Users typically only specify the data types (and even then, some properties' data types are determined automatically).

2)  The keys representing each property look something like <namespace>:<property>:<id>, but for certain data types (like strings and numbers) it defaults to Redis hashes in the form of <namespace>:<property>:<id.substring(0,4)> for the key and <id.substring(4)> for the field.

3)  The API works similarly to Redis.  For example, suppose you have some list called "exampleList".  You could POST some value to /<id>/exampleList?_bUnshift=1 to put it at the beginning of the list, or DELETE /<id>/exampleList?_start=1&_stop=3 to trim it.

4)  Users can retrieve multiple resources at once via queries.  So for example, suppose you wanted to retrieve all the properties of red shirts of a certain brand, that request might look like: GET /query?type=shirt&color=red&brand=foo&_limit=10.  Or if you only want to retrieve a few specific properties: GET /query/name,description,price?type=shirt&color=red&brand=foo&_limit=10.  This would return the latest 10 resources matching those properties.  Or if you wanted to sort by name: GET /query?type=shirt&color=red&brand=foo&_limit=10&_sortBy=name.

All of that is fairly standard, basic stuff, I believe.  But hopefully that's enough info to understand my use case, and I'll try to explain my reasoning for wanting to be able to somehow use strings as scores in ZSETs.

On the backend, the queries are handled using SINTER.  When some resource is created or its properties are changed, there are sets in Redis whose keys look like <namespace>:by:<property>:<value> and contain all the IDs of the property:value pairs.  So for instance, suppose some resource whose ID is 5 has a property called "color" and its value is "red".  The Redis set at key <namespace>:by:color:red would contain the ID "5".  And the call to Redis for the above query would look something like "SINTER <namespace>:by:type:shirt <namespace>:by:color:red <namespace>:by:brand:foo" and it would return all of the matching IDs.

This is pretty standard searching in Redis, though your #2 is a little funky. What is more common (in the #2 case) is to just store data as:

<namespace>:id: -> {<property>: <value>, ...}

Then at least for properties, pulling all of them is trivial for a given id (HGETALL), and you can optionally sort by those fields after (as you mention).


Now for the problem of sorting by strings... I anticipate there sometimes being an incredibly large number of IDs resulting from the intersection, and so I would rather not have to retrieve the (string) property for each ID if it's feasible to have everything presorted (which I seem to be able to do up to 9 characters) and then intersected using ZINTERSTORE.

I am aware of Redis's SORT command, but wouldn't it be considerably slower than using ZINTERSTORE?

Some testing tells me that SORT is going to be more expensive for your use-case (and compared to sort vs. ZINTERSTORE/ZUNIONSTORE generally), if only because Redis has to perform main hash lookups (plus type checks) in the SORT call. Here are a few timings of doing 100 sorts on a SET of 10,000 items in varying conditions.

SORT plain 0.197018861771
ZINTERSTORE 0.384052991867
SORT no extra 0.53533411026
SORT extra first 2.41256999969
SORT extra last 0.579102039337
SORT extra no ziplist 0.8206179142

In "SORT plain", we are just sorting data in the list with "SORT <key> STORE <other key>".

All of the rest of the sort versions use "SORT <key> BY ex:*->foo STORE <other key>".

In the "SORT no extra" version, hashes that contain the data to sort by *only* contain the one field/property to sort by.

In "SORT extra first", 200 unrelated field names and values are added to the data hashes, then the sort by field.

In "SORT extra last", 200 unrelated field names and values are added to the data hashes after the sort by field has already been added.

In "SORT extra no ziplist", one field of 65 characters was added, disabling ziplist hash encoding for the hashes (which is based on the 64 character limit in my Redis config)


Looking at the timings, it is interesting to note the substantial difference between no extra entries in the hash, and having 200 entries in the hash that were added first (this has to do with the ziplist encoding of small hashes), compared to having the sort column added first, vs. a regular hash, and even vs. no external data.

All in all, using a ZSET for sorting in this case will be a bit faster in general, assuming that you are happy with your sort prefix.
 
Plus, I think I would need to restructure certain property types (see point 2 above) to make it work, and I would prefer to not do that.

Aside from minimizing the number of keys in Redis (which can help reduce memory to some extent), why are you sharding by id prefix (other sharding methods may offer better memory savings)?

Does all of that make sense?  If I'm completely overlooking something or if this is simply a bad design or if there's a flaw in my logic somewhere, please let me know!

As long as you are okay with only being able to sort by a prefix of a string, you are just fine.

If your data is of a specific type (like plain text as opposed to binary), you may be able to sort by longer prefixes via pre-encoding your data. Say, for example, you know that your data is all alpha-numeric plus spaces; you could get down to 6 bits per character, for a 10 character prefix to sort by. If you only care about alpha plus space, ignoring case, you can get 5 bits, then you can sort on a 12 character prefix easily, or 13 characters with work..

Or... if you have a special pre-trained arithmetic coder (trained on example prefix data), you could get a varying-length prefix sort (more characters for common prefixes, fewer characters for less common prefixes), but that's probably a bit too much work for the situation (you can squeeze a couple extra characters in your prefix comparisons).

 - Josiah

Timothy Burgess

unread,
Oct 21, 2014, 6:19:22 PM10/21/14
to redi...@googlegroups.com
Hey Josiah, thanks for the detailed reply!

> This is pretty standard searching in Redis, though your #2 is a little funky. What is more common (in the #2 case) is to just store data as:
> <namespace>:id: -> {<property>: <value>, ...}
> Aside from minimizing the number of keys in Redis (which can help reduce memory to some extent), why are you sharding by id prefix (other sharding methods may offer better memory savings)?

I've read that storing data using that method, where possible, is the most efficient way and can reduce memory usage quite a bit with the proper config, which makes sense to me.  What other methods would you recommend?  To me, it doesn't make sense to use the typical property:value hashes unless I were using the SORT command, which I don't think I will actually use.  Plus, I obviously can't store other data types (sets, lists, bitmaps, etc.) within said hash table, which are quite common.

I don't anticipate string searches being common, at least not for anything important, so I think a 12 character case-insensitive alphanumeric prefix will work just fine.  How can this be achieved?

Thanks a ton for your help!

Tim

Josiah Carlson

unread,
Oct 21, 2014, 7:25:14 PM10/21/14
to redi...@googlegroups.com
On Tue, Oct 21, 2014 at 3:19 PM, Timothy Burgess <timothy.b...@gmail.com> wrote:
Hey Josiah, thanks for the detailed reply!

> This is pretty standard searching in Redis, though your #2 is a little funky. What is more common (in the #2 case) is to just store data as:
> <namespace>:id: -> {<property>: <value>, ...}
> Aside from minimizing the number of keys in Redis (which can help reduce memory to some extent), why are you sharding by id prefix (other sharding methods may offer better memory savings)?

I've read that storing data using that method, where possible, is the most efficient way and can reduce memory usage quite a bit with the proper config, which makes sense to me.  What other methods would you recommend?  To me, it doesn't make sense to use the typical property:value hashes unless I were using the SORT command, which I don't think I will actually use.  Plus, I obviously can't store other data types (sets, lists, bitmaps, etc.) within said hash table, which are quite common.

It's the *prefix* of the id that is the problem. In order to store data in a hash as a ziplist, you must tell Redis the maximum size of the ziplist. But by using the *prefix* of the id, your hashes will vary in size depending on the total number of items you are storing. More specifically...

1234 -> into hash 1234 (one of these)
1234? -> into hash 1234 (10 of these)
1234?? -> into hash 1234 (100 of these)
1234??? -> into hash 1234 (1000 of these)

So, if you have ids up to 1235000 (without skipping any), your hash will be 1111 elements in size. And as your number of digits increases, your fully-prefixed hashes will grow. On the other side of things, you will also have hashes like 1, 12, 123, etc., with only a single element.


If instead you used a shard/entry generator function like:

def shard_pair(id, size=512):
shard, entry = divmod(id, size)
return (shard, entry)

Then none of your shards will ever be larger than the size you want, and you will get sequential fills (similar to the existing sharding method you are using, which actually does sequential fills while rotating through 9000 of the 9999 different shards). This also lets you use the hash-max-ziplist-entries and hash-max-ziplist-value configuration options and have them work consistently.

 
I don't anticipate string searches being common, at least not for anything important, so I think a 12 character case-insensitive alphanumeric prefix will work just fine.  How can this be achieved?

I'm guessing you really mean sorting here. I only made promises for 12 character alpha-only, but it turns out that case-insensitive alpha-numeric can be done with 64 bit doubles too, and only using 63 bits of it :).

First you translate your string to an integer (you need 63 bit integers at least here):

CHARACTERS = 'abcdefghijklmnopqrstuvwxyz0123456789 '
LOOKUP = dict((c, i+1) for i, c in enumerate(CHARACTERS))

def prefix_to_bigint(st):
    # get the lowercased 12 character prefix
    st = st.lower()[:12]
    if len(st) < 12:
        # extend out to 12 characters if too short
        st += (12 - len(st)) * '\0'
    # generate the integer from the string
    value = 0
    for ch in st:
        value *= 38
        value += LOOKUP.get(ch, 0)
    return value

Then you translate your integer into an IEEE 754 FP double via bigint_to_float() here: https://gist.github.com/josiahcarlson/8459874 , remembering to provide safe=False (or only translating the 'safe=False' side of the if statement).

This functionality will not work in a language that does not have 64 bit integers (Javascript being one in particular, as all numbers in Javascript are IEEE 754 fp doubles). You can test to see if your PHP has 64 bit integers by checking PHP_INT_SIZE.

 - Josiah

Timothy Burgess

unread,
Oct 21, 2014, 7:52:32 PM10/21/14
to redi...@googlegroups.com
Thanks again for the quick reply!!

> So, if you have ids up to 1235000 (without skipping any), your hash will be 1111 elements in size. And as your number of digits increases, your fully-prefixed hashes will grow. On the other side of things, you will also have hashes like 1, 12, 123, etc., with only a single element.

I probably should have mentioned that every ID is actually the time the resource was created, using JavaScript's Date object and checked for uniqueness: (new Date()).getTime().  Whether or not this was a good decision... I guess I'll find out haha.  I did it mainly for the following reasons: 1) It's extremely common for the client to want to know the time some resource was created 2) For the sharding reason you explained, although your method is obviously much better.  I might convert everything to your method if you think it would be worth the effort.

> This functionality will not work in a language that does not have 64 bit integers (Javascript being one in particular, as all numbers in Javascript are IEEE 754 fp doubles). You can test to see if your PHP has 64 bit integers by checking PHP_INT_SIZE.

I will definitely try out the algorithm you outlined.  I am using Node, but surely there's a way to somehow make Redis interpret the proper value via JavaScript??

Thanks again!

Tim
...

Josiah Carlson

unread,
Oct 22, 2014, 1:56:50 AM10/22/14
to redi...@googlegroups.com
On Tue, Oct 21, 2014 at 4:52 PM, Timothy Burgess <timothy.b...@gmail.com> wrote:
Thanks again for the quick reply!!

> So, if you have ids up to 1235000 (without skipping any), your hash will be 1111 elements in size. And as your number of digits increases, your fully-prefixed hashes will grow. On the other side of things, you will also have hashes like 1, 12, 123, etc., with only a single element.

I probably should have mentioned that every ID is actually the time the resource was created, using JavaScript's Date object and checked for uniqueness: (new Date()).getTime().  Whether or not this was a good decision... I guess I'll find out haha.

Convenient: yes. Good decision: future hazy, ask again later.

Some drawbacks to using millisecond-resolution times generated by clients for ids:
1. Your old sharding method puts all properties of a given name generated over the course of ~11.57 days in a single hash, at which point it starts writing to a new hash (and will swap every ~11.57 days from then on)
2. The sharding method I described before will basically put all properties of a given name generated over .5 seconds in a single hash
3. It is not clear whether any sharding method would be practical here, as id density is the single most important factor of determining how to shard
4. Maybe 1 in 100 clients will have a system clock within 5 seconds of reality (or your server), and easily 5% of your clients will have system clocks at least 5 minutes out of sync
5. You have to verify that the id is available

Given #5, as well as 1-4, there is no reason why you couldn't just increment a counter in Redis to get your (guaranteed unique) id, which eliminates all issues with sharding (at least using the method I described), ids being generated in the past (or future), and verifying that the id is available. If you need to be able to have a single-call insert operation, a simple Lua script could handle the increment, the data updating in Redis, and returning the generated id.
 
I did it mainly for the following reasons: 1) It's extremely common for the client to want to know the time some resource was created 2) For the sharding reason you explained, although your method is obviously much better.  I might convert everything to your method if you think it would be worth the effort.

Your method, especially with timestamps, is not anywhere close to what you were trying for unless you get <1000 inserts per 11.5 day period.

> This functionality will not work in a language that does not have 64 bit integers (Javascript being one in particular, as all numbers in Javascript are IEEE 754 fp doubles). You can test to see if your PHP has 64 bit integers by checking PHP_INT_SIZE.

I will definitely try out the algorithm you outlined.  I am using Node, but surely there's a way to somehow make Redis interpret the proper value via JavaScript??

The reason is because Javascript doesn't actually have an integer type, it's all FP doubles. If you can find a "big integer" package for Javascript that supports multiplication, division, and modulo, that would work. Though it might even be simpler to write a C extension to Node that takes the string and returns the proper double.

Or as an alternative, you can get 10 characters in Javascript without much work with a small variation in what I wrote (you can skip the bigint -> double conversion):

var CHARACTERS = ' abcdefghijklmnopqrstuvwxyz0123456789';
var LOOKUP = {};
for (var i=0; i<CHARACTERS.length; i++) {
    LOOKUP[CHARACTERS[i]] = i;
}

function prefix_to_double(st) {
    st = st.toLowerCase().slice(0, 10);
    var value = 0;
    for (var i=0; i<st.length; i++) {
        value *= 38;
        value += LOOKUP[st[i]] || 0;
    }
    for (var i=st.length; i<10; i++) {
        value *= 38;
    }
    return value;
}

I slightly changed the ordering of the CHARACTERS table from which the LOOKUP table is generated - spaces are a word break, so should sort before any individual character.

 - Josiah
Reply all
Reply to author
Forward
0 new messages