indexing strings as the scores of a sorted set

557 views
Skip to first unread message

Dvir Volk

unread,
Jun 16, 2012, 3:59:13 PM6/16/12
to redi...@googlegroups.com
I was wondering: has anyone ever tried using some order-preserving hash function as a means to store strings as the *scores* of a sorted set, rather than the values?
It could be useful for indexing a specific value of a group objects in a single sorted set, and sorting them by it. 
but I don't know much about order preserving hash functions and what practical implementations exist of such functions, so I was wondering if anyone had tried this approach.

Josiah Carlson

unread,
Jun 16, 2012, 5:34:22 PM6/16/12
to redi...@googlegroups.com
Any order-preserving hash function is tied to the character set that
you care about sorting on within your string, and relies on you using
the upper to choose your bin (rather than the lower bits). Ultimately,
you are limited to 32/64 bits, and essentially 4/8 byte prefixes of
your strings. If your sortable character set is smaller, you can get
longer prefixes, but you are limited by those 32/64 bits.

I've got a section on constructing scores from strings in the chapter
I'm currently writing in my book, which I've also talked about here.
Without resorting to changing Redis source code, that's about the best
that you can do. In the section, I talk about a method of using 6 byte
prefixes of binary strings, though you could get a 7th with a little
effort (as long as your client library sends values out to enough
decimal places). An 8th prefix byte is literally impossible with IEEE
754 fp double scores (you can get a little over 63 bits with some
effort, but strictly less than 64 thanks to NANs).

If you provide your character set, I can give you a method of
constructing a score for it that will sort the same as the binary
string comparison.

Regards,
- Josiah
> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To post to this group, send email to redi...@googlegroups.com.
> To unsubscribe from this group, send email to
> redis-db+u...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/redis-db?hl=en.

Dvir Volk

unread,
Jun 16, 2012, 5:45:42 PM6/16/12
to redi...@googlegroups.com
Thanks Josiah! I was hoping to get a response from you on this :)

That's what I had in mind, I just figured there are ways to compress the representation even more than 6 bytes, while keeping the ordering intact.

The thing is, it looks like this will not allow ZRANGE paging on the key, because say you have 100 strings with the same prefix, you are not guaranteed anything if you get only 10 of them. 
It will allow ZRANGEBYSCORE filtering, but with strings that's not a common need.

So for now I just used python hashes trimmed to 48 bit on the whole string, just to get an exact match on a string using ZRANGEBYSCORE. 
not different than storing a set for each string with object keys, but much simpler and cheaper to update if an attribute's value changes.

Josiah Carlson

unread,
Jun 16, 2012, 6:02:49 PM6/16/12
to redi...@googlegroups.com
On Sat, Jun 16, 2012 at 2:45 PM, Dvir Volk <dv...@everything.me> wrote:
> Thanks Josiah! I was hoping to get a response from you on this :)

You could always email me directly. No one ever thinks of that ;)

> That's what I had in mind, I just figured there are ways to compress the
> representation even more than 6 bytes, while keeping the ordering intact.
>
> The thing is, it looks like this will not allow ZRANGE paging on the key,
> because say you have 100 strings with the same prefix, you are not
> guaranteed anything if you get only 10 of them.
> It will allow ZRANGEBYSCORE filtering, but with strings that's not a common
> need.

It depends on what your members are. Commonly, people will use scores
of 0, then as the member use the string they want to sort on, followed
by some sort of null+id suffix (as necessary). By picking a score that
is constructed from the string -> score, you get primary sorting on
the score (which will actually be faster than character-by-character
strcmp() calls), and secondary sorting on the member. You can trim the
member of the 6 character prefix, so you aren't re-performing the same
operation.

> So for now I just used python hashes trimmed to 48 bit on the whole string,
> just to get an exact match on a string using ZRANGEBYSCORE.
> not different than storing a set for each string with object keys, but much
> simpler and cheaper to update if an attribute's value changes.

Python's hashes are not necessarily unique. And depending on your
command-line arguments, hash('foo') != hash('foo') on subsequent runs
(http://bugs.python.org/issue13703 which addresses the same security
issue that Redis addressed in 2.6).

With FP doubles, you can actually use 53 bits without issue with
integers (52 generally, but 53 for integers because the leading 1 is
assumed except in the case of denormalized values).

- Josiah

Dvir Volk

unread,
Jun 16, 2012, 6:21:21 PM6/16/12
to redi...@googlegroups.com

Python's hashes are not necessarily unique. And depending on your
command-line arguments, hash('foo') != hash('foo') on subsequent runs
(http://bugs.python.org/issue13703 which addresses the same security
issue that Redis addressed in 2.6).

yeah, I should use a better hash function anyway, my favorite being FNV because I can always implement it in 5 minutes.
 

With FP doubles, you can actually use 53 bits without issue with
integers (52 generally, but 53 for integers because the leading 1 is
assumed except in the case of denormalized values).

I know, 48 bit just sounded more "right"... 

Xiangrong Fang

unread,
Jun 16, 2012, 10:46:13 PM6/16/12
to redi...@googlegroups.com
I have similar problem.   I need to use string as scores, to build indices for objects.   Sometime ago, I asked, whether I can use 64 bit integer as score instead of float, but the answer is no.

Now what I did is this (PHP):

$score = '0x'.substr(sha1($str), 0, 13);

I tested 13 is the longest value I can use.  I don't care about preserving orders, but only care about minimizing possible collisions.   Am I doing it correctly? Any better ideas?

Thanks,
Shannon

2012/6/17 Dvir Volk <dv...@everything.me>



--


Josiah Carlson

unread,
Jun 17, 2012, 2:30:34 AM6/17/12
to redi...@googlegroups.com
Your use of 13 here is the same as a 52 bit prefix of the sha1 hash of the string. In this case, it is fine. You could get another 8 bits, but it would take more work, probably wouldn't help a whole lot (you'd need a ZSET of roughly 64 million items to have your chance of having a collision roughly 50% right now), and I'm not sure that PHP would transfer the score correctly (some clients like to truncate certain large floats).

Regards,
 - Josiah

Dvir Volk

unread,
Jun 17, 2012, 2:32:01 AM6/17/12
to redi...@googlegroups.com
PHP and number bit length can be a nightmare, and it is platform dependent.

Xiangrong Fang

unread,
Jun 17, 2012, 5:15:05 AM6/17/12
to redi...@googlegroups.com
I know the possibility of collision is quite low in practice, but I am quite interested on how to gain the extra 8 bit? This might help me better understand the IEEE fp number format, or how to take advantage of that storage.

Could you please elaborate the algorithm, or any sample code?

Thanks!

2012/6/17 Josiah Carlson <josiah....@gmail.com>



--


Xiangrong Fang

unread,
Jun 17, 2012, 5:15:55 AM6/17/12
to redi...@googlegroups.com
I am just present a number as string, because the predis client just want a string to send to redis. Redis use a string protocol anyway.

2012/6/17 Dvir Volk <dv...@everything.me>



--


Josiah Carlson

unread,
Jun 17, 2012, 11:56:58 AM6/17/12
to redi...@googlegroups.com
$thsa = sha1($str);
$shift = 2.0 ** (0 +  '0x'.substr($tsha, 0, 2));
$score = $shift * '0x'.substr($tsha, 2, 13);

Floating point values are typically described as being made up of a 'sign', 'mantissa', and 'exponent'. The 'mantissa' is what you were calculating; it is the part where you care about precision. But the exponent is where the magic in floating point numbers happens. In particular, IEEE 754 FP doubles (as is used by Redis) has an 11 bit exponent, which we use 8 bits of (you can't use all 11 due to special meanings of some exponents). What value is represented by the float is the equivalent of:

sign * 2**exponent * mantissa

Basically, you use the first 8 bits to calculate an exponent, and the next 52 bits to calculate the mantissa. This should require roughly 1 billion entries in a ZSET to have roughly a 50% chance of collision, though I believe that most structures in Redis are limited to 512 Million entries, so you should be pretty safe (I've only ever made a ZSET of ~160 million entries, and that used roughly 30 gigs of memory).

Regards,
 - Josiah

Xiangrong Fang

unread,
Jun 17, 2012, 10:01:57 PM6/17/12
to redi...@googlegroups.com
Thanks Josiah.   I tested the script, there are 2 syntactic problems. After correction, it worked:

$tsha = sha1($str);
$shift = pow(2, '0x'.substr($tsha, 0, 2));
$score = $shift * ('0x'.substr($tsha, 2, 13));

The two problems are:

1) ** does not work in PHP.
2) you need () to indicate "." shall be calculated prior to + or *.

Although I got the formula, but I still need to study some math to understand what you described. :)  Thanks again.

2012/6/17 Josiah Carlson <josiah....@gmail.com>



--


Josiah Carlson

unread,
Jun 17, 2012, 10:55:18 PM6/17/12
to redi...@googlegroups.com
I'm not a php programmer, so I'm not surprised there were problems :)

There used to be a great site that offered a live floating point calculator that showed what went on with everything in javascript, but I can't find it today, and none of the other sites are even half as good.

If you have specific questions about what is going on and why it works, feel free to ask.

Regards,
 - Josiah
Reply all
Reply to author
Forward
0 new messages