ZDIFFSTORE


George Bashi

May 18, 2010, 6:29:26 PM
to redi...@googlegroups.com
Hi there,

I've implemented the ZDIFFSTORE command, which I needed for a project I'm working on. It works similarly to SDIFF but for sorted sets, keeping the "style" of ZUNION/INTERSTORE. From the commit log:


Acts similarly to SDIFF, but on sorted sets.
ZDIFFSTORE by default ignores scores when
comparing values; pass the WITHSCORES option to
only exclude when both the value and score match.
Any weights passed are applied to the resulting
set similarly to ZUNION/INTERSTORE, and are
applied before testing WITHSCORES.

I'm not sure if "overloading" the meaning of WITHSCORES like this is the right thing to do; I'm open to suggestions for other terms. I've provided the option because comparing both with and without scores seems equally sensible.
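
To make the calling convention concrete, here's roughly how I expect it to be used, borrowing the ZUNIONSTORE/ZINTERSTORE argument style (the key names are just examples):

ZDIFFSTORE out 2 zset_a zset_b
ZDIFFSTORE out 2 zset_a zset_b WEIGHTS 2 1 WITHSCORES

The first form keeps every member of zset_a whose value doesn't appear in zset_b; the second form only excludes a member when both its value and its (weighted) score match.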

I've created a fork on github (http://github.com/georgebashi/redis) and attached the patch to this post; I don't know how patches are normally submitted, so apologies if I've missed the usual "protocol"!

This is my first open-source patch so I'd love to hear any feedback / constructive criticism anyone has about the patch, the idea or the code (but please be nice!).

Thanks,

George

0001-Adds-the-ZDIFFSTORE-command.patch

Pieter Noordhuis

May 19, 2010, 4:14:17 AM
to redi...@googlegroups.com
Hi George,

Thanks for the patch. To start: the code looks great. There are some extra conditionals thrown into the mix of the main function, which makes the behavior a bit less evident, but that's not too bad. Patch files aren't used much here, because GitHub makes it easy to pull straight from a fork, which is much simpler than applying patches, so you don't need to attach one in the future.

I'm eager to hear about a use case for this type of DIFF with two different modes (compare score, ignore score), because I can't think of one right now. I agree that it would be useful to complete the sorted set operations analogous to the regular set operations, as long as the commands make sense for the majority of use cases.

Furthermore: you should join #redis on IRC if you're creating patches
and need some instant feedback (on code or the idea) :-).

Cheers,
Pieter

parrish

Apr 11, 2012, 2:01:46 AM
to redi...@googlegroups.com
Hi all,

I realize I'm resurrecting a dead topic, but it seems the most relevant.
ZDIFFSTORE would actually be a very common operation in our projects.

We use Redis to select the next object to be shown to a user for classification.

Two things are important in this selection:
  1. Users are shown objects only once (unique selection)
  2. Objects are scored based upon prior classifications (the score is updated in real time) and users should be shown the most highly scored object they haven't seen

Currently we do the following:
  • Store many keys for object scores (objects_score_<id>)
  • Store a set of object ids seen by a user (seen_objects_for_<user_id>)
  • Store a set of object ids available for classification (objects)
  • Remove the diff set after selection is done (since the stored order from the score is invalidated shortly after creation)

Then perform:
  • sdiffstore user_<id>_unseen_objects objects seen_objects_for_user_<id>
  • sort user_<id>_unseen_objects DESC BY objects_score_* LIMIT 0 1
  • del user_<id>_unseen_objects
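
To make that concrete, here's the same flow for a hypothetical user 42 with three objects (the ids and scores below are made up purely for illustration):
  • set objects_score_7 90
  • set objects_score_8 40
  • set objects_score_9 70
  • sadd objects 7 8 9
  • sadd seen_objects_for_user_42 8
  • sdiffstore user_42_unseen_objects objects seen_objects_for_user_42
  • sort user_42_unseen_objects DESC BY objects_score_* LIMIT 0 1   (returns 7, the highest-scored unseen object)
  • del user_42_unseen_objects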

Given we have over 600,000 users and well over 10,000,000 objects between our projects, staying responsive while handling rates above 100 requests/second is non-trivial.  Optimizing this specific use-case is pretty critical for us.

Using George's patch as a reference, I rewrote this command against the 2.4 branch, added tests, and merged it against unstable in a feature branch.

I'll be sending a pull request on github shortly.

Please feel free to suggest changes, revisions, etc.

Thanks!

-Michael

Chris Barnett

Apr 12, 2012, 9:59:14 PM
to redi...@googlegroups.com
I'm only just beginning to learn Redis but I found your problem an interesting one to think about. I came up with this model:
  • Store object ids and associated scores in a zset: scored_objects
  • Store a set of object ids seen by a user: seen_objects_for_<user_id>
  • Store a set of object ids available for classification: available_objects
    • This may be a subset of all objects in the system - perhaps because some objects have been deleted or made inactive
  • Possibly store other sets of object ids for various categories or indexes an object might appear in
To get a sorted list of unseen objects:
  • SDIFFSTORE user_<id>_unseen_objects available_objects seen_objects_for_user_<id>
    • This could be cached
  • ZINTERSTORE user_<id>_unseen_objects_sorted 2 scored_objects user_<id>_unseen_objects WEIGHTS 1 0
    • Note that this is operating on a zset and a set. According to this post you can do this.
  • ZRANGE  user_<id>_unseen_objects_sorted 0 -1 [WITHSCORES]
The only real difference from your model is that the sorting is done incrementally whenever an object's score changes. The set operations all happen on normal sets (which have the sdiff command) and their results can all be cached because they don't depend on object scores. The ZINTERSTORE step basically just filters the already sorted zset according to whatever query conditions you implemented with set operations over the normal sets.
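
For instance, with made-up ids and scores, the score-preserving effect of WEIGHTS 1 0 (a plain set member counts as score 1, which the zero weight cancels out, so the result keeps the scores from scored_objects) would look like this:
  • ZADD scored_objects 90 7 40 8 70 9
  • SADD available_objects 7 8 9
  • SADD seen_objects_for_user_42 8
  • SDIFFSTORE user_42_unseen_objects available_objects seen_objects_for_user_42
  • ZINTERSTORE user_42_unseen_objects_sorted 2 scored_objects user_42_unseen_objects WEIGHTS 1 0
  • ZRANGE user_42_unseen_objects_sorted 0 -1 WITHSCORES   (returns 9 with score 70, then 7 with score 90)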

This seems like a more general paradigm than one that uses ZDIFFSTORE directly on zsets because it allows for arbitrarily complex set operations before the final intersection with a sorted set. If you have any intersection operations to do, you want to do them (along with the diffs) over normal sets (which don't depend on scores) so the results can be cached, and since these cached results can only be combined with the zset using intersection, there is no need for ZDIFFSTORE.

There is one rather important assumption I am making for this model to make any sense, and that is that redis is actually able to achieve efficiency gains when intersecting a sorted set and a smaller unsorted set compared with an ad-hoc sort of the unsorted set. 

Having just examined the implementation of ZINTERSTORE, it looks like this is not the case: Redis just iterates through the smallest set, discarding any elements that aren't in all the others, and then inserts each element into a fresh zset (which takes at least as much time as a sort over the resulting set).

So your way is better than mine given Redis' current implementation, and I can see why you wanted a ZDIFFSTORE command, as it could avoid re-sorting the sorted set elements. Unfortunately, it currently doesn't, because insertion of pre-sorted elements into a new zset takes as long as sorting them all over again: O(n log n).

It would be very useful if we could do zset operations that preserve only the scores of the first zset and do not re-sort the results. If in-order insertion of elements into a zset took linear time instead of O(n log n), then your ZDIFFSTORE algorithm and a tweaked version of ZINTERSTORE that iterates over the single non-zero-weighted zset instead of the smallest set would achieve this - that is, linear time set operations on a sorted set.

Linear time insertion of an ordered list into a zset could be achieved if insertion of the largest element so far into the underlying skip list were a constant time operation. I think this could be done by storing some extra information in a tail structure that points to the rightmost node in each level, and then inserting from the right whenever the new element is bigger than the current rightmost element.

This modification of the skip list insertion algorithm would always benefit your ZDIFFSTORE algorithm, but it would only be useful for ZINTERSTORE if the weights of all sets except one were 0 and the size of that non-zero-weighted set was not significantly larger than the result set. This would be the case in the scenario I describe above (and I suspect quite a lot of other use cases) where there is a single sorted set with weight 1 and one zero-weighted set of only slightly smaller size to intersect with.


--- tl;dr ---

As far as I can tell, ZDIFFSTORE won't significantly improve performance over SDIFFSTORE and then SORT until the zset skip list insertion algorithm is modified to make appending an element on the rightmost extreme a constant-time operation.

parrish

Apr 19, 2012, 4:00:58 PM
to redi...@googlegroups.com
Thanks for the feedback.  Your approach sounded compelling, so I put together some benchmarks to see how it worked out.
For the tests, I ran 1,000 trials against a data set of 50,000 users, 50,000 objects, and a representative distribution of classifications per user (a Weibull distribution, if you're curious).

Test 1, our current implementation outlined in my previous post:
average:  90.201 ms
    min:  86.838 ms
    max: 119.961 ms
std dev:   4.092 ms

Test 2, the zdiffstore implementation:
average:  68.663 ms
    min:  42.957 ms
    max: 460.402 ms
std dev:  56.241 ms

Test 3, your proposed implementation:
average: 112.120 ms
    min:  78.318 ms
    max: 565.326 ms
std dev:  52.418 ms

Test 2 was implemented as:
  • ZSET 'objects'
  • SETs of 'seen_objects_for_<id>'
  • ZDIFFSTORE user_<id>_unseen_objects 2 objects seen_objects_for_<id>
  • ZREVRANGEBYSCORE user_<id>_unseen_objects +inf -inf LIMIT 0 1
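
To give a concrete picture of Test 2 (the ids and scores are made up, and this assumes the ZDIFFSTORE from the feature branch):
  • ZADD objects 90 7 40 8 70 9
  • SADD seen_objects_for_42 8
  • ZDIFFSTORE user_42_unseen_objects 2 objects seen_objects_for_42
  • ZREVRANGEBYSCORE user_42_unseen_objects +inf -inf LIMIT 0 1   (returns 7, the highest-scored unseen object)
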
A couple of observations:
Test 1 is by far the most conservative in memory usage and the most predictable in timing.
Tests 2 and 3 are more aggressive in memory use due to the z*store operations.  Also, the temporary set expiry seems to be the cause of the larger variances.


The set operations all happen on normal sets (which have the sdiff command) and their results can all be cached because they don't depend on object scores.
True, but given the number of user and object sets, the memory hit is too big to be able to hang on to those sets.


There is one rather important assumption I am making for this model to make any sense, and that is that redis is actually able to achieve efficiency gains when intersecting a sorted set and a smaller unsorted set compared with an ad-hoc sort of the unsorted set.
There don't appear to be any gains.  In fact, the SDIFFSTORE required to generate the sorted set appears to be slower than a SORT operation on the diff set.


It would be very useful if we could do zset operations that preserve only the scores of the first zset and do not re-sort the results
The ZDIFFSTORE implementation only uses the scores from the first zset.  Since it iterates the first set in order and therefore only does insertions in order, there shouldn't be any re-sorting afterward.  As for the complexity of ZDIFFSTORE, I'm pretty sure it's going to be an amortized O( N * log(M) ) where N is the number of elements in the input set and M is the number of elements in all other sets being diffed.


As far as I can tell, ZDIFFSTORE won't significantly improve performance over SDIFFSTORE and then SORT until the zset skip list insertion algorithm is modified to make appending an element on the rightmost extreme a constant-time operation.
A 27% difference is a significant improvement for us.  I'd like to optimize it more heavily though, so I may look at implementing non-storing ZSET operations as well.

parrish

Apr 19, 2012, 4:34:54 PM
to redi...@googlegroups.com
It's also worth mentioning that I was able to shave 15-20 ms off those timings by combining the operations into a MULTI op and by replacing key deletion with short-period expiration.
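
Roughly, the combined version looks like this (key names as in the Test 2 example, and the 10-second TTL is just illustrative):

MULTI
ZDIFFSTORE user_42_unseen_objects 2 objects seen_objects_for_42
ZREVRANGEBYSCORE user_42_unseen_objects +inf -inf LIMIT 0 1
EXPIRE user_42_unseen_objects 10
EXEC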

Josiah Carlson

Apr 19, 2012, 6:54:34 PM
to redi...@googlegroups.com

Your runtime is incorrect for the definitions of N and M that you have chosen; I can't think of definitions of N and M that would be correct for your description.

Algorithmically, there are two (obvious) methods for building ZDIFFSTORE...
1. For each entry in your source zset, check for entries in the other zsets. If it's not there, add the entry to your destination zset. O(M log(M) + N*k)
2. Copy your source zset to your destination, iterate over each of your other zsets, removing entries from your destination as necessary. O(P log(P) + Q)

M = size of the result zset
N = number of items in the source zset
k = number of zsets to remove items from
P = size of the source zset
Q = number of items in all non-source zsets in total

The first algorithm is what Redis uses for SDIFFSTORE. Which one is
faster depends on M, P, N, k, and Q.

Regards,
- Josiah

>> As far as I can tell, ZDIFFSTORE won't significantly improve performance
>> over SDIFFSTORE and then SORT until the zset skip list insertion algorithm
>> is modified to make appending an element on the rightmost extreme a
>> constant-time operation.
>
> A 27% difference is a significant improvement for us.  I'd like to more
> heavily optimize it though, so I may look at implementing non-storing ZSET
> operations as well.
>
>


Chris Barnett

Apr 19, 2012, 7:36:53 PM
to redi...@googlegroups.com
Thanks for going and actually testing my idea. It's always great to have one's theories tested.

I must say I'm surprised that using ZDIFFSTORE improves your average performance, especially since you only want to retrieve the one maximum scored object from the filtered set. SORT should be able to do this in O(N) time because its partial quicksort algorithm reduces to a maximum search when asked to return only the first result. 

As Josiah says below, the time complexity of ZDIFFSTORE will currently be basically the same as ZINTERSTORE, that is: O(N*K)+O(M*log(M)), where N is the size of the first input set (as opposed to the smallest in the case of ZINTERSTORE), K is the number of input sets and M is the size of the resulting zset.

Looking at the code and stated time complexity of SDIFFSTORE, though, it looks like it actually uses the second of Josiah's algorithms. It takes time O(N) where N is the total number of elements of all input sets. This would be less efficient than the first algorithm (the one your ZDIFFSTORE uses) for 2 input sets, especially if the second is large. I wonder whether this is the cause of ZDIFFSTORE's efficiency increase.

I encourage you to look into the inefficiency I alluded to with inserting elements in order into a new zset under the current implementation. The underlying skip list implementation does nothing to improve insertion time when elements are inserted in ascending order into an empty zset, so this is just as slow as inserting them in random order, and thus as slow as sorting the elements. The skip list could be implemented to keep track of the last node in each level, so that insertion at the right end of the skip list would be a constant time operation, making the creation of a new zset by in-order insertion of elements a linear time operation rather than O(N*log(N)).


parrish

Apr 19, 2012, 10:07:54 PM
to redi...@googlegroups.com
1. For each entry in your source zset, check for entries in the other zsets. If it's not there, add the entry to your destination zset. O(M log(M) + N*k)

This is the algorithm I implemented.

It's been a minute since I've proven complexity, so bear with me.  :)
  • For each element in the source zset: N
    • For each non-source zset: k
      • Perform a lookup to see if the source element exists in the non-source zset: log(Q/k), where Q/k is the average number of elements per non-source zset
    • Insert the element into the result set if it doesn't exist in any of them: log(M) per insertion, M log(M) in total
Which makes the timing:
O( Nk log(Q/k) + M log(M) )

In most cases the number of non-source zsets is small enough to be a constant.
This would reduce the complexity to:
O( N log(Q) + M log(M) )


I must say I'm surprised that using ZDIFFSTORE improves your average performance
 
The complexity of the SDIFFSTORE, SORT operation:
O( N + Q ) + O( N log(N) )
At least it looks like sorting an unsorted set on an external key makes it O( N log(N) ), probably a bit worse

The complexity of the ZDIFFSTORE, ZREVRANGEBYSCORE operation:
O( N log(Q) + M log(M) ) + O( log(M) )

The complexity of the SDIFFSTORE, ZINTERSTORE, ZREVRANGEBYSCORE operation:
O( N + Q ) + O( N + M log(M) ) + O( log(M) )


The underlying skip list implementation does nothing to improve upon insertion time when elements are inserted in ascending order into an empty zset
 
Ah, I see what you were saying now.  It would certainly be worthwhile to improve upon that, though the skip list indexing would still keep element insertion at logarithmic time.
