ZUNIONSTORE and average calculation


Oleg Ruchovets

Sep 7, 2013, 6:08:44 PM
to redi...@googlegroups.com
Hi,
I am trying to calculate an average (mean) using ZUNIONSTORE. I understand that there is no average aggregation out of the box (only SUM | MIN | MAX).

I have sorted sets structured like this:

    zset key:
        score (count) - member (timestamp in milliseconds)

    example:
        area1
            50 - 1378590935960
            80 - 1378590935860
            60 - 1378590935760

        area2
            40 - 1378590935960
            70 - 1378590935860
I need to calculate the average of the counters that share the same timestamp.
Using ZUNIONSTORE: ZUNIONSTORE out 2 area1 area2 WEIGHTS 1 1

    out
        90  - 1378590935960
        150 - 1378590935860
        60  - 1378590935760

That gives me the sums, but in order to calculate the average for each member of out I would also need to know how many input sets contained it:
    90/2  = 45
    150/2 = 75
    60/1  = 60

So, the actual questions:
    Is it possible to calculate an average over sorted sets with different numbers of entries?
    Is there another way to calculate the average for my use case?

Thanks
Oleg.

Aaron Meriwether

Sep 9, 2013, 9:36:18 AM
to redi...@googlegroups.com
Have you considered Lua?

Here's a script that should fit the bill:

-- EVAL $ZUNIONSTOREAVG numkeys+1 dest key [key ...]
-- Union the source sets (KEYS[2..]) into the destination with the default SUM aggregate.
local c = redis.call("ZUNIONSTORE", KEYS[1], #KEYS - 1, select(2, unpack(KEYS)))
local s = redis.call("ZRANGE", KEYS[1], 0, -1, "WITHSCORES")
for i = 1, #s, 2 do
  -- Count how many source sets contain this member (ZSCORE returns false if absent)...
  local n = 0
  for j = 2, #KEYS do
    if redis.call("ZSCORE", KEYS[j], s[i]) then
      n = n + 1
    end
  end
  -- ...then overwrite the summed score with the average.
  redis.call("ZADD", KEYS[1], s[i + 1] / n, s[i])
end
return c

And here is how it works on your test data:

$ redis-cli ZADD area1 50 1378590935960
(integer) 1
$ redis-cli ZADD area1 80 1378590935860
(integer) 1
$ redis-cli ZADD area1 60 1378590935760
(integer) 1
$ redis-cli ZADD area2 40 1378590935960
(integer) 1
$ redis-cli ZADD area2 70 1378590935860
(integer) 1
$ redis-cli EVAL "$(cat ZUNIONSTOREAVG.lua)" 3 out area1 area2
(integer) 3
$ redis-cli ZRANGE out 0 -1 WITHSCORES
1) "1378590935960"
2) "45"
3) "1378590935760"
4) "60"
5) "1378590935860"
6) "75"

If you want to be able to specify weights in order to compute weighted averages, that is possible too, but the script becomes a little less legible:

-- EVAL $ZUNIONSTOREAVGW numkeys+1 dest key [key ...] weight [weight ...]
-- Build the ZUNIONSTORE command, appending a WEIGHTS clause from ARGV.
local a = {"ZUNIONSTORE", KEYS[1], #KEYS - 1, select(2, unpack(KEYS))}
a[#a + 1] = "WEIGHTS"
for i = 1, #ARGV do
  a[#a + 1] = ARGV[i]
end
local c = redis.call(unpack(a))
local s = redis.call("ZRANGE", KEYS[1], 0, -1, "WITHSCORES")
for i = 1, #s, 2 do
  -- Sum the weights of the source sets that contain this member...
  local n = 0
  for j = 2, #KEYS do
    if redis.call("ZSCORE", KEYS[j], s[i]) then
      n = n + ARGV[j - 1]
    end
  end
  -- ...then divide the weighted sum by the total weight present.
  redis.call("ZADD", KEYS[1], s[i + 1] / n, s[i])
end
return c
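
For instance, if you wanted area1 to count twice as heavily as area2, an invocation on the same test data might look like this (untested sketch; the weights 2 and 1 are just for illustration):

$ redis-cli EVAL "$(cat ZUNIONSTOREAVGW.lua)" 3 out area1 area2 2 1

Member 1378590935960 would then score (2*50 + 1*40) / (2 + 1) ≈ 46.67, member 1378590935860 would score (2*80 + 1*70) / 3 ≈ 76.67, and member 1378590935760 (present only in area1) would score (2*60) / 2 = 60.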


Hope that helps.

-Aaron

Oleg Ruchovets

Sep 9, 2013, 12:34:35 PM
to redi...@googlegroups.com
Thank you Aaron for the interesting solution.
I've never used Lua scripting in a production environment.
Is it considered good practice to use Lua scripts in production?

I am always looking for ways to use Redis's secret weapons :-) like ZUNIONSTORE and ZINTERSTORE, but they never quite fit my requirements.
Maybe I need to model my data differently to be able to use them.
Any advice on how I could model the data for my use case so that ZUNIONSTORE / ZINTERSTORE would apply?

Thanks
Oleg.

Josiah Carlson

Sep 9, 2013, 3:32:25 PM
to redi...@googlegroups.com
Oleg,

What you want is not natively available in Redis. It is perfectly fine to use Lua scripting in production, though in this case you may want to be concerned about running time if you have a lot of entries.
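
As an aside, when you do run Lua in production, the usual pattern is to load the script once and invoke it by SHA1 rather than shipping the source on every call; a minimal sketch (the hash shown is a placeholder, not the script's real SHA1):

$ redis-cli SCRIPT LOAD "$(cat ZUNIONSTOREAVG.lua)"
"<sha1-of-script>"
$ redis-cli EVALSHA <sha1-of-script> 3 out area1 area2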

 - Josiah



Oleg Ruchovets

Sep 9, 2013, 4:21:13 PM
to redi...@googlegroups.com
Sure,
I understand that it is not available natively. I thought about refactoring/remodeling my data structure so that the sum is executed on the server side, and then dividing the aggregated result by the number of entries to get the average.
I understand that I have to make the switch and start thinking in Redis, but that takes time :-), so my question is:
how can I model my data structure so that most of the calculation happens on the server, with the client only finishing the average calculation?



Aaron Meriwether

Sep 9, 2013, 4:49:30 PM
to redi...@googlegroups.com
I don't think there is currently a server-side aggregate function (besides a Lua script) that can merge sets and produce a count for each entry.

If you want to move the computation to the client side, you could certainly fetch all of the individual sets and aggregate them on the client (the fact that they are already sorted should lend itself to some nice streamlining). But I don't see much benefit in doing the sum on the server and the count on the client: the data the client would need in order to compute the count would, simply by adding WITHSCORES, be sufficient to compute the full average anyway. (Side note: if you merge the sets on the client side, you also won't have to worry about the one-key-at-a-time limitation if you plan to use Redis sharding or clustering in the future.)
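
As a sketch, the client-side variant is just two fetches plus a local tally (the averaging loop itself lives in your client language):

$ redis-cli ZRANGE area1 0 -1 WITHSCORES
$ redis-cli ZRANGE area2 0 -1 WITHSCORES

The client then accumulates a (sum, count) pair per member across the replies and divides at the end.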

Maybe if you describe in a little more detail what you are trying to accomplish overall, I can suggest a different data structure entirely...

-Aaron

Josiah Carlson

Sep 10, 2013, 3:32:15 PM
to redi...@googlegroups.com
Well... you can get counts, but it requires the use of a SET instead of a ZSET (you can use ZSETs too, but it uses more memory). More specifically, for every entry you have in a ZSET with a score, you would also have an entry in a SET. Later, when you are calculating your sums over the ZSETs, you can also calculate a sum over SETs (SETs can be used as input to zinterstore/zunionstore and have a score of 1, which can be modified via the WEIGHTS parameter). Then at least you have the sums *and* the count, though it is a bit nasty (in terms of data duplication).
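
A sketch of that layout on the earlier test data (the *:members key names are made up for illustration):

$ redis-cli SADD area1:members 1378590935960 1378590935860 1378590935760
$ redis-cli SADD area2:members 1378590935960 1378590935860
$ redis-cli ZUNIONSTORE sums 2 area1 area2
$ redis-cli ZUNIONSTORE counts 2 area1:members area2:members

For each member, the client then computes ZSCORE sums <member> divided by ZSCORE counts <member>.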

That said, I know exactly what you want, and have wanted it myself. In my book, I ran the MIN and MAX aggregates over the ZSETs and took the average of those (which is trivial using the WEIGHTS parameter with weights of .5), accepting that the result is wrong but hopefully "good enough".
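
Something like this, in terms of the earlier test data (untested sketch; the final union's default SUM aggregate with 0.5 weights yields (min + max) / 2 per member):

$ redis-cli ZUNIONSTORE lo 2 area1 area2 AGGREGATE MIN
$ redis-cli ZUNIONSTORE hi 2 area1 area2 AGGREGATE MAX
$ redis-cli ZUNIONSTORE mid 2 lo hi WEIGHTS 0.5 0.5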

 - Josiah

Aaron Meriwether

Sep 10, 2013, 10:57:13 PM
to redi...@googlegroups.com
I just read over the implementation in t_zset.c, and it doesn't look like it would be too hard to add "COUNT" and "AVG" aggregation functions there.  If this really is a feature that is in high demand, that might be worth doing... (and it might even be simple and efficient enough that antirez would accept a pull request for it)
-Aaron

Josiah Carlson

Sep 11, 2013, 2:34:36 AM
to redi...@googlegroups.com
Both would be good additions and actually would be pretty quick. If no one beats me to it, I'll have a branch that implements them by the afternoon tomorrow Pacific.

I've been contemplating adding LOG, EXP, MUL, and DIV for a long time, which would open up a whole new collection of operations and features.

 - Josiah

Aaron Wirtz

Sep 11, 2013, 10:02:05 AM
to redi...@googlegroups.com
LOG, EXP, and DIV would be order-specific. Also, you'd need to deal with division by zero (probably return +/-inf), and with negative numbers raised to fractional exponents. And those three, along with MUL, would probably need to treat an absent set member on the right side as equivalent to 1.0, while an absent member on the left side might be considered 0.0? It just starts to get weird.

So for the sake of sanity, we should probably stick to the same aggregate functions as Oracle uses:


-Aaron


Sent from my iPhone

Josiah Carlson

Sep 11, 2013, 2:07:45 PM
to redi...@googlegroups.com
Multiplication and addition are also order-specific, thanks to the actual details of floating point arithmetic ;) , but I get your point.

On the log/exp side of things, LOG would take the natural log (base e) of the value, EXP would raise e to the value, and both would only work when a single zset was passed. With those two, multiplication and division are trivially implemented as SUM aggregates using weights, and a geometric average can be calculated by LOG(all of the zsets you care about) + union/intersect with the AVG aggregate + EXP(result zset). General exponentiation is also possible (to a point) by LOG(zset) -> zinterstore (with a != 1 weight) -> EXP(zset). Values that lie outside of reasonable input ranges could become 0.
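
(For reference, the identity behind the geometric average: exp((ln x1 + ln x2 + ... + ln xn) / n) = (x1 * x2 * ... * xn)^(1/n), i.e. an arithmetic mean in log-space is a geometric mean in value-space.)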

It's a bit slow, but having access to these kinds of manipulations can be important for some applications. LOG/EXP in particular can greatly simplify a variety of statistical calculations.


But I won't be adding them for this patch - which is basically done, I'm just going to do some verification, benchmarks, and add a few tests before sending a pull request with the COUNT and AVG aggregates.

 - Josiah

Aaron Wirtz

Sep 11, 2013, 2:56:12 PM
to redi...@googlegroups.com
When you said "afternoon pacific", I thought I might actually have until afternoon pacific to beat you to an implementation, but it looks like you made it before noon...
My attempt so far compiles and passes the existing unit tests, and seems to work correctly from redis-cli, but I haven't gotten as far as heavy benchmarking yet.

127.0.0.1:6379> ZADD area1 50 1378590935960
(integer) 1
127.0.0.1:6379> ZADD area1 80 1378590935860
(integer) 1
127.0.0.1:6379> ZADD area1 60 1378590935760
(integer) 1
127.0.0.1:6379> ZADD area2 40 1378590935960
(integer) 1
127.0.0.1:6379> ZADD area2 70 1378590935860
(integer) 1
127.0.0.1:6379> ZUNIONSTORE out 2 area1 area2 WEIGHTS 1 1 AGGREGATE AVG
(integer) 3
127.0.0.1:6379> ZRANGE out 0 -1 WITHSCORES
1) "1378590935960"
2) "45"
3) "1378590935760"
4) "60"
5) "1378590935860"
6) "75"
127.0.0.1:6379> ZUNIONSTORE out 2 area1 area2 WEIGHTS 1 1 AGGREGATE COUNT
(integer) 3
127.0.0.1:6379> ZRANGE out 0 -1 WITHSCORES
1) "1378590935760"
2) "1"
3) "1378590935860"
4) "2"
5) "1378590935960"
6) "2"


-Aaron

Oleg Ruchovets

Sep 11, 2013, 4:13:33 PM
to redi...@googlegroups.com
Yes :-), I've seen a similar idea here: http://www.starkiller.net/2013/05/02/hacking-redis-adding-a-command/
My only concern is that it is hard for me to understand the implications. I am a Java guy and my C skills are far behind me (those were very cool times :-)).

Out of curiosity, is there a plan to allow adding custom operations to Redis (executed on the server side) without using Lua?

Josiah Carlson

Sep 11, 2013, 6:33:31 PM
to redi...@googlegroups.com
My change is similar, though I don't believe that weight should contribute to the count as part of the COUNT or AVG aggregates, primarily because weights can be negative or fractional. Also, you missed the use of weights WRT min/max aggregates, which would change the current semantics of the command.

Here's the diff for my change with tests:

 - Josiah

Aaron Meriwether

Sep 11, 2013, 10:18:41 PM
to redi...@googlegroups.com
Good eye re: MIN/MAX, I've fixed that now.

With weights in AVG, my thought was that it would be good to allow a weighted-average calculation, where one set counts more heavily than the other(s).

For example, if sets represent population groups, and members represent attributes (already reduced to a single score per attribute per group), then using a weighted AVG function where weight equals population size makes sense.  Specifically, imagine that each set represents a country, and each member represents liters per year per capita consumption of a particular alcoholic beverage.  Then you could use AVG, weighted by country population to calculate beverage popularity by continent or worldwide.
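
(In other words, the weighted AVG I have in mind is: score(m) = sum_i(w_i * s_i(m)) / sum_i(w_i), where both sums run only over the input sets that actually contain member m.)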

If weight is applied only to the sum component of AVG, and not to the count component, I can't think of any realistic application.

For the use case where you might want to negate the scores of a set and then average it with another set, you could accomplish it in two passes (unary SUM with negative weight, then AVG).
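
For example (sketch, assuming hypothetical keys setA and setB, plus the new AVG aggregate):

$ redis-cli ZUNIONSTORE negA 1 setA WEIGHTS -1
$ redis-cli ZUNIONSTORE out 2 negA setB AGGREGATE AVG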

Negative (and zero) weights in AVG do make it possible to cause a division-by-zero, but that is already somewhat handled in my code (though perhaps it should result in +inf/-inf/0 instead of just 0 - fixed now).  Not sure what you might use negative weights with this calculation for, but I see no harm in allowing it.

As far as COUNT goes, since it is fairly trivial to do (float addition instead of integer increment), why not allow weights to have an effect? The default weight of 1.0 means the default behavior stays as expected, but leaves open the possibility of other calculations if the user happens to need them.
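
(For example, with WEIGHTS 3 0.5, a member present in both input sets would come out with a "count" of 3.5.)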

I've now had a chance to run some benchmarks on my code vs the antirez/redis/unstable branch, and here are the results:

$ # prep the DB
$ ./redis-cli --eval /dev/stdin << 'EOF'
> for n=1,1000000 do
>   for k=1,3 do
>     if(math.random(3) ~= 3) then
>       redis.call('ZADD', 'set'..k, math.random(), 'm'..n)
>     end
>   end
> end
> EOF
(nil)
$ redis-cli zcount set1 -inf +inf
(integer) 666431
$ redis-cli zcount set2 -inf +inf
(integer) 666122
$ redis-cli zcount set3 -inf +inf
(integer) 666756
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate min
(integer) 962659
real 4.21

$ # tests against p120ph37/redis/unstable
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate min
(integer) 962659
real 4.71
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate max
(integer) 962659
real 4.71
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate sum
(integer) 962659
real 5.26
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate count
(integer) 962659
real 8.65
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate avg
(integer) 962659
real 5.37

$ # tests against antirez/redis/unstable
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate min
(integer) 962659
real 4.71
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate max
(integer) 962659
real 4.70
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate sum
(integer) 962659
real 5.25
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate count
(error) ERR syntax error
real 0.01
$ time -p redis-cli zunionstore out 3 set1 set2 set3 aggregate avg
(error) ERR syntax error
real 0.00


This shows negligible performance impact for union of 3 x 2/3 million-member sets.
The longer time for COUNT is because the resulting set has very low cardinality, so writing to it is less efficient.
The faster first run of MIN is because no time was needed to delete the not-yet-populated "out" set.

I've now also incorporated your unit tests, and added a few more to cover additional cases (SUM, weighted aggregations, and the ZINTER counterparts.)

-Aaron

P.S.
I noticed that you compulsively edited the "*target = *target + val;" line too. :-)

Josiah Carlson

Sep 12, 2013, 1:31:23 AM
to redi...@googlegroups.com
Replies inline.

On Wed, Sep 11, 2013 at 7:18 PM, Aaron Meriwether <m...@awirtz.com> wrote:
> Good eye re: MIN/MAX, I've fixed that now.
>
> With weights in AVG, my thought was that it would be good to allow a weighted-average calculation, where one set counts more heavily than the other(s).
>
> For example, if sets represent population groups, and members represent attributes (already reduced to a single score per attribute per group), then using a weighted AVG function where weight equals population size makes sense.  Specifically, imagine that each set represents a country, and each member represents liters per year per capita consumption of a particular alcoholic beverage.  Then you could use AVG, weighted by country population to calculate beverage popularity by continent or worldwide.
>
> If weight is applied only to the sum component of AVG, and not to the count component, I can't think of any realistic application.

I can think of a few, but all of them are operating under the "you are actually storing logs of values, not the values themselves", which gets you a form of exponentiation, leading to some geometric average shenanigans, but it seems that I'm the only one who does crazy things with logs in Redis anyway :P

> For the use case where you might want to negate the scores of a set and then average it with another set, you could accomplish it in two passes (unary SUM with negative weight, then AVG).
>
> Negative (and zero) weights in AVG do make it possible to cause a division-by-zero, but that is already somewhat handled in my code (though perhaps it should result in +inf/-inf/0 instead of just 0 - fixed now).  Not sure what you might use negative weights with this calculation for, but I see no harm in allowing it.

I'm personally not going to use it. But that doesn't mean that someone isn't going to use it for some reason (whether sane or not). But I like your reasoning behind using weights for weighted averages, and your tests are much more complete than mine were :P

> As far as COUNT goes, since it is fairly trivial to do (float addition instead of integer increment), why not allow weights to have an effect? The default weight of 1.0 means the default behavior stays as expected, but leaves open the possibility of other calculations if the user happens to need them.

You've convinced me :)
I've also tried a switch/case statement as well as the use of a function pointer (instead of passing the aggregate to a single function, you call one of several functions, a la the C standard library's qsort()), both of which were slower. Seems that using the inline keyword + if/else/... is great for exploiting branch prediction. Part of me wonders if using some macro magic to generate a collection of the containing for loops with the aggregates inlined would make a difference, but then I realized that the difference in runtimes between the different styles was about 5%, which implies that the overhead of structure manipulation is fairly substantial in comparison to the actual aggregate calculation.

> I've now also incorporated your unit tests, and added a few more to cover additional cases (SUM, weighted aggregations, and the ZINTER counterparts.)
>
> -Aaron
>
> P.S.
> I noticed that you compulsively edited the "*target = *target + val;" line too. :-)

I'm not sure it matters on the compilation side though, but I'm too lazy to compare the compiled results :P

 - Josiah

Josiah Carlson

Sep 12, 2013, 3:52:58 AM
to redi...@googlegroups.com
P.S. I don't experience more than 10% difference in my runtimes on my machine (I pre-delete keys before running my tests).

Aaron Wirtz

Sep 12, 2013, 12:27:43 PM
to redi...@googlegroups.com
Yeah, I had the initial urge to convert the if block to a switch as well, but from what I've read, the point at which that becomes more efficient is somewhere above 5 cases, and only with small, contiguous integers (which the defines do happen to be). I also considered subroutine-pointer indirection like you, but realized that would defeat the benefit of the inline keyword. Seems like antirez knows what he's doing as far as performance. :)

Macros to produce separate versions of the loop per aggregate operation sounds intriguing. I think the main concern would be readability. I may have a look in some of the other source files to see if there is a precedent for that. Of course, if it improves performance significantly, it may be worth it anyway.

-Aaron


Sent from my iPhone

Aaron Meriwether

Sep 12, 2013, 10:36:15 PM
to redi...@googlegroups.com
I tried turning the loop/if inside-out (with the help of some macros), and it seems to improve performance measurably.
Using your benchmarking protocol of pre-deleting the key before each run, I also get much more consistent measurements (often about 1% variance on my workstation).

Here are the results (using the same test data mentioned previously):
                   |  min  |  max  |  sum  | count |  avg  |
unstable           | 3.69s | 3.61s | 4.18s |  N/A  |  N/A  |
unstable-agg       | 3.70s | 3.62s | 4.19s | 7.41s | 4.20s |
unstable-agg-macro | 3.43s | 3.38s | 3.83s | 7.28s | 3.84s |

So it does appear that there are some minor performance gains - around 8% for min/max/sum, but probably not enough to justify the mess.

-Aaron

Oleg Ruchovets

Sep 13, 2013, 4:13:36 AM
to redi...@googlegroups.com
Guys,
Do you think it is possible to get AVG and the other additions into the main branch of Redis?
It will be hard for me to convince my colleagues to use the AVG calculation if it is not part of a major release.
Correct me if I'm wrong, but AVG calculation is a very useful and often-required feature.

Thanks
Oleg.

Aaron Wirtz

Sep 13, 2013, 10:21:01 AM
to redi...@googlegroups.com
I've submitted a pull request here:

It's up to Salvatore now to decide if this should be merged into the main branch. It is very unlikely it will be included in the 2.8.0 release, since that is under feature freeze, but if he accepts it into the unstable branch, maybe that will be sufficient to convince your colleagues of the quality of the code, and you can apply it to 2.8.0 as a patch yourself.

-Aaron


Sent from my iPhone

Aaron Wirtz

Sep 13, 2013, 10:22:27 AM
to redi...@googlegroups.com
Sorry, bumped "send" before I pasted the link: https://github.com/antirez/redis/pull/1284
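
If you want to try it locally before any merge, something like this should work (untested sketch; it assumes GitHub's .patch endpoint for pull requests and a 2.8 checkout):

$ git clone https://github.com/antirez/redis.git && cd redis
$ git checkout 2.8   # or whichever 2.8.x tag/branch you deploy
$ curl -L https://github.com/antirez/redis/pull/1284.patch | git am
$ make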

Sent from my iPhone

Oleg Ruchovets

Sep 14, 2013, 1:03:38 PM
to redi...@googlegroups.com
Great.
Hope the feature will be in a release soon!

Thanks
Oleg.

