map/reduce standard deviation is way off

752 views
Skip to first unread message

Dmitry

unread,
Feb 22, 2012, 11:00:27 AM2/22/12
to mongodb-user
I've tried different algorithms and implementations, and numbers are
always off. I am trying to calculate standard deviation, here is my
code - https://gist.github.com/1885591 . There is a similar
implementation - https://gist.github.com/1211724 , which has the same
problem.

Now, if I change (from https://gist.github.com/1885591):
> db.stats.mapReduce(map, reduce, { out: "video_stats"});
to:
> db.stats.mapReduce(map, reduce, { out: {replace : "video_stats"}, query: {"user":"abc"} });

Also, when I add min and max calculations, which are straight forward
(https://gist.github.com/1211724 has them, and are also wrong), the
numbers are way off (much much higher).

What is going on? I'm on MongoDB 2.0.2.

Dmitry

unread,
Feb 22, 2012, 11:27:19 AM2/22/12
to mongodb-user
few corrections to my original post, the following should have red:

Now, if I change (from https://gist.github.com/1885591):

> db.stats.mapReduce(map, reduce, { out: "video_stats"});
to:
> db.stats.mapReduce(map, reduce, { out: {replace : "video_stats"}, query: {"channel_id":"123"} });

then the numbers are correct.

Mathias Stearn

unread,
Feb 22, 2012, 3:30:53 PM2/22/12
to mongod...@googlegroups.com
The issue is that the output from reduce() must be the same "shape" as the value emitted from map() because the reduced output will in some cases be fed back into reduce. This is how it is able to scale across many servers easily. The reason that it seems to work with the query is most likily that a small enough subset of the data matches that it only does one reduce call.

Take a look at https://gist.github.com/1886960 for a version that works correctly. It is based on the algorithm described at http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm to merge the results of multiple parallel computations of variance. I have tested it with a large dataset and it produces the correct results.

Dmitry

unread,
Feb 22, 2012, 3:33:21 PM2/22/12
to mongodb-user
Not possible to do standard deviation with mongo's mapreduce
accurately. Reduce is called multiple times for each key with subset
of values, but to calculate standard deviation (and variance) entire
set needs to be considered. Doing it in subsets skews results too much
in an unpredictable way.

Dmitry

unread,
Feb 22, 2012, 3:37:14 PM2/22/12
to mongodb-user
Mathias, thank you. I'll try your solution, although I'm certainly
skeptical this will work 100% given the characteristics of what
standard deviation is.

Mathias Stearn

unread,
Feb 22, 2012, 3:44:18 PM2/22/12
to mongod...@googlegroups.com
This algorithm is specifically designed to break the computation of standard deviation up and merge the results of two subsets. It seems to work well in my testing. If you are still unsure, you can break it into two passes, one to compute the mean, then pass that result into a second MR using the scope option to compute the stddev. 

Dmitry

unread,
Feb 22, 2012, 3:52:51 PM2/22/12
to mongodb-user
Mathias, tested your solution on my collection, and after spot
checking it's looking very good. Thank you!
Reply all
Reply to author
Forward
0 new messages