Computing standard deviation the same way as AverageBy

Bertrand Dechoux

May 23, 2012, 7:33:30 AM

to cascading-user

Hi,

I found out about AverageBy, one of the cases where Cascading uses its own
pseudo-combiner instead of the standard MapReduce Combiner. The
approach makes sense.

ref : http://www.cascading.org/1.2/javadoc/cascading/pipe/assembly/AverageBy.html

I haven't delved into the code yet, but I assume the same could be
done for computing the standard deviation. Before looking into it
further, I was wondering whether something like this already exists somewhere.

Regards

Bertrand

Bertrand Dechoux

May 27, 2012, 4:35:19 PM

to cascading-user

I have done it here, in case it is of interest to anyone:
https://github.com/BertrandDechoux/cascading-deviation

By the way, AverageBy might have a bug.
In the context class of AverageFinal, the reset() method resets 'count'
but not 'sum'.
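
To illustrate why a partial reset would matter, here is a minimal sketch of a context object that is reused from one group to the next. The class AverageContext and its methods are hypothetical names made up for this example; this is not the Cascading source, only an assumption about the general shape of such a context.

```java
// Hypothetical context object reused across groups; not the Cascading source.
final class AverageContext {
    double sum;
    long count;

    // Clear ALL accumulated state before the next group starts.
    AverageContext reset() {
        sum = 0.0;   // omitting this line lets the previous group's sum leak in
        count = 0L;
        return this;
    }

    void add(double value) {
        sum += value;
        count++;
    }

    double average() {
        return sum / count;
    }

    public static void main(String[] args) {
        AverageContext context = new AverageContext();
        for (double v : new double[] {1, 2, 3}) {
            context.add(v);
        }
        System.out.println("group 1 average: " + context.average());

        context.reset();
        for (double v : new double[] {10, 20}) {
            context.add(v);
        }
        System.out.println("group 2 average: " + context.average());
    }
}
```

With the full reset, the second group averages (10 + 20) / 2 = 15. If 'sum' were left at 6 from the first group, the same call would return (6 + 30) / 2 = 18.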

Regards
Bertrand

On May 23, 1:33 pm, Bertrand Dechoux <decho...@gmail.com> wrote:
> Hi,
>
> I found about AverageBy, one of the case where cascading uses its own
> pseudo-combiner instead of the MapReduce standard Combiner. The
> approach makes sense.
>

> ref : http://www.cascading.org/1.2/javadoc/cascading/pipe/assembly/AverageB...

Ted Dunning

May 27, 2012, 6:07:27 PM

to cascadi...@googlegroups.com

Sum of squares is a poor way to compute standard deviation.

See http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm for an explanation of Welford's method for doing this. This method is available in the Apache Mahout class OnlineSummarizer:

http://search-lucene.com/jd/mahout/math/org/apache/mahout/math/stats/OnlineSummarizer.html

The issue is that you may very easily find yourself subtracting large numbers (squared). This gives very poor accuracy and can even lose all significant bits of the answer including the sign bit.
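
The difference between the two approaches can be sketched in a few lines of plain Java. This is an illustration only, not the OnlineSummarizer code: the class and method names are invented for the example, and the sample data is an assumption chosen so that the values are large compared with their spread.

```java
// Illustration only; names are hypothetical and not taken from Mahout or Cascading.
final class WelfordDemo {

    /** Sum-of-squares formula: subtracts two large, nearly equal quantities. */
    static double naiveVariance(double[] xs) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (double x : xs) {
            sum += x;
            sumSq += x * x;
        }
        double n = xs.length;
        return (sumSq - sum * sum / n) / (n - 1);
    }

    /** Welford's online update: tracks the mean and the sum of squared deviations. */
    static double welfordVariance(double[] xs) {
        long n = 0;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the current mean
        for (double x : xs) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        return n > 1 ? m2 / (n - 1) : 0.0;
    }

    public static void main(String[] args) {
        // Large offset, small spread: the exact sample variance is 30.
        double[] xs = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};
        System.out.println("sum of squares: " + naiveVariance(xs));
        System.out.println("Welford:        " + welfordVariance(xs));
    }
}
```

On this input the Welford update works with small differences throughout, while the sum-of-squares version has to subtract two numbers of magnitude around 4e18, far beyond the roughly 16 significant digits a double can hold.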

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.

Ken Krugler

May 27, 2012, 9:31:05 PM

to cascadi...@googlegroups.com

On May 27, 2012, at 3:07pm, Ted Dunning wrote:

> Sum of squares is a poor way to compute standard deviation.
>
> See http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm for an explanation of Welford's method for doing this. This method is available in the Apache Mahout class OnlineSummarizer:
>
> http://search-lucene.com/jd/mahout/math/org/apache/mahout/math/stats/OnlineSummarizer.html
>
> The issue is that you may very easily find yourself subtracting large numbers (squared). This gives very poor accuracy and can even lose all significant bits of the answer including the sign bit.

FWIW, here's a StdDeviation Aggregator that uses code from OnlineSummarizer.

https://github.com/bixolabs/cascading.utils/blob/master/src/main/java/com/bixolabs/cascading/StdDeviation.java

Some caveats, of course...

- It's not an AggregateBy, just a regular Aggregator.

Not sure how you'd merge in the output of two map-side results that had switched over to incremental mode.

And if you didn't have 100 results map-side, I assume you'd flush that work and have to re-process in the reducer.

But Ted would know better.

- I'm only calculating std deviation

You could easily change it to be a base class Statistics class that other classes are based on: StdDeviation, Mean, Median, Quartiles, etc.

-- Ken

--------------------------------------------

http://about.me/kkrugler

+1 530-210-6378

--------------------------

Ken Krugler

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Mahout & Solr

Ted Dunning

May 28, 2012, 3:03:47 AM

to cascadi...@googlegroups.com

On Mon, May 28, 2012 at 1:31 AM, Ken Krugler <kkrugle...@transpac.com> wrote:

> Not sure how you'd merge in the output of two map-side results that had switched over to incremental mode.

The OnlineSummarizer has some extra complexity devoted to computing rank statistics like quartiles and median. Without that, you don't need to keep the explicit samples. That makes the merging of results much easier as well since you can use a bit of algebra to take two samples that have current estimates of mean and variance and produce a weighted mean and variance.
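
The "bit of algebra" can be sketched as follows. This is a minimal example under stated assumptions, not the OnlineSummarizer API: the class MomentSummary is a hypothetical name, and it keeps a count, a mean, and the sum of squared deviations (M2) rather than the variance itself, which makes the merge a single formula (the pairwise update usually attributed to Chan et al.).

```java
// Hypothetical summary of one sample; not part of Mahout or Cascading.
final class MomentSummary {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the mean

    /** Welford update for a single value. */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Fold another summary into this one without revisiting the raw values. */
    void merge(MomentSummary other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        // Weighted mean of the two samples.
        mean += delta * other.n / total;
        // Within-sample spread plus the spread between the two means.
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        n = total;
    }

    long count() { return n; }

    double mean() { return mean; }

    double variance() { return n > 1 ? m2 / (n - 1) : 0.0; }

    public static void main(String[] args) {
        MomentSummary a = new MomentSummary();
        a.add(4);
        a.add(7);
        MomentSummary b = new MomentSummary();
        b.add(13);
        b.add(16);
        a.merge(b);
        System.out.println(a.count() + " values, mean " + a.mean() + ", variance " + a.variance());
    }
}
```

Merging the summaries of {4, 7} and {13, 16} gives the same count, mean (10) and sample variance (30) as summarizing {4, 7, 13, 16} in one pass, which is the property a map-side partial result needs.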

> And if you didn't have 100 results map-side, I assume you'd flush that work and have to re-process in the reducer.

If you don't care about rank statistics, then this can be dumped.

If you do care, then you have two cases:

a) combining two samples where one has <= 100 data points. Just apply the buffered data points to the other summarizer.

b) combining two samples where both have > 100 data points. Combining the mean and variance is trivial (see above). Combining the rank statistics can be done by averaging if you assume that you have randomized ordering. Since the OnlineSummarizer pretty much assumes this anyway, you may be OK. If this assumption is grossly violated, then you may want to try to detect that the two samples are incompatible and flag the rank statistic results as inaccurate. The mean and variance will be fine.

Bertrand Dechoux

May 29, 2012, 3:28:43 AM

to cascading-user

Thanks for the leads.

I asked before doing anything because I knew the subject is not
trivial: you have to consider accuracy as well as performance.

I will look into the OnlineSummarizer and see how I can reuse the
implementation with an AggregateBy.

Thanks again

adam ilardi

May 30, 2012, 7:24:04 PM

to cascadi...@googlegroups.com

Is it possible to turn off map side aggregation in an AggregateBy? For instance we don't need the functor in this case.

Adam

Bertrand Dechoux

May 31, 2012, 5:24:53 AM

to cascading-user

Strictly speaking, I would say it is not possible to turn it off.
However, there is a threshold parameter that controls how much is kept
in memory, and that is closely related to the amount of aggregation
done during the map phase.

"In this case": are you referring to computing the mean? I am not
following you.

By the way, if you don't need map-side aggregation, then you only
need to use a standard Aggregator.
The whole point of an AggregateBy is to allow for map-side
aggregation (a bit like the Hadoop combiner).

http://docs.cascading.org/cascading/2.0/userguide/htmlsingle/#N214DD
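
The relationship between the threshold and the amount of map-side aggregation can be illustrated with a small plain-Java sketch. It does not use the Cascading API: all names here are hypothetical, and the least-recently-used eviction policy is an assumption made for the example. Keys are aggregated in a bounded cache, a partial (sum, count) is emitted whenever a key is evicted, and the "reduce side" merges the partials.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names come from Cascading.
final class PartialAverageSketch {

    /** Partial aggregate for one grouping key. */
    static final class Partial {
        final String key;
        double sum;
        long count;

        Partial(String key) {
            this.key = key;
        }
    }

    /** "Map side": aggregate in a bounded cache, emitting a partial on each eviction. */
    static List<Partial> mapSide(String[] keys, double[] values, int threshold) {
        List<Partial> emitted = new ArrayList<>();
        // accessOrder = true: iteration starts at the least recently used key.
        LinkedHashMap<String, Partial> cache = new LinkedHashMap<>(16, 0.75f, true);
        for (int i = 0; i < keys.length; i++) {
            Partial p = cache.get(keys[i]);
            if (p == null) {
                if (!cache.isEmpty() && cache.size() >= threshold) {
                    // Cache is full: emit the least recently used partial and drop it.
                    Iterator<Partial> it = cache.values().iterator();
                    emitted.add(it.next());
                    it.remove();
                }
                p = new Partial(keys[i]);
                cache.put(keys[i], p);
            }
            p.sum += values[i];
            p.count++;
        }
        emitted.addAll(cache.values()); // flush whatever is still cached
        return emitted;
    }

    /** "Reduce side": merge the partials per key and finish the average. */
    static Map<String, Double> reduceSide(List<Partial> partials) {
        Map<String, Partial> merged = new TreeMap<>();
        for (Partial p : partials) {
            Partial m = merged.computeIfAbsent(p.key, Partial::new);
            m.sum += p.sum;
            m.count += p.count;
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Partial m : merged.values()) {
            averages.put(m.key, m.sum / m.count);
        }
        return averages;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "c", "b"};
        double[] values = {1, 2, 3, 4, 6};
        for (int threshold : new int[] {1, 10}) {
            List<Partial> partials = mapSide(keys, values, threshold);
            System.out.println("threshold " + threshold + ": " + partials.size()
                    + " partials, averages " + reduceSide(partials));
        }
    }
}
```

With a threshold of 1 every input value is emitted as its own partial, so almost nothing is aggregated on the map side; with a threshold larger than the number of distinct keys, each key is emitted once. The final averages are the same either way, only the volume of data sent to the reducers changes.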

Bertrand

On May 31, 1:24 am, adam ilardi <adamila...@gmail.com> wrote:
> Is it possible to turn off map side aggregation in an AggregateBy? For
> instance we don't need the functor in this case.
>
> Adam
>
> On Wednesday, May 23, 2012 7:33:30 AM UTC-4, Bertrand Dechoux wrote:
>
> > Hi,
>
> > I found about AverageBy, one of the case where cascading uses its own
> > pseudo-combiner instead of the MapReduce standard Combiner. The
> > approach makes sense.
>
> > ref : http://www.cascading.org/1.2/javadoc/cascading/pipe/assembly/AverageB...

adam ilardi

May 31, 2012, 10:28:53 AM

to cascadi...@googlegroups.com

Makes sense. The standard Aggregator will work for me in this case. The reason I want to have it in an AggregateBy class is to chain it together with other operations.

var minMaxMean:Pipe = new AggregateBy("name",Array(pipes),normalizationGroup,Max Function,Min Function, Mean Function, Std Dev Function)

Adam

Chris K Wensel

May 31, 2012, 10:41:11 AM

to cascadi...@googlegroups.com

Let me see if it makes sense to work that into 2.1. I think it would be useful to support Aggregator chaining with AggregateBy for prototyping, etc.

ckw

--

Chris K Wensel

ch...@concurrentinc.com

http://concurrentinc.com

Kavitha Raghavachar

Jan 10, 2013, 12:43:06 AM

to cascadi...@googlegroups.com

Oscar Boykin

Jan 10, 2013, 12:48:49 AM

to cascadi...@googlegroups.com

In Scalding, you call .forceToReducers in the groupBy to turn off map-side aggregation.

--
Oscar Boykin :: @posco :: https://twitter.com/intent/user?screen_name=posco

Chris K Wensel

Jan 10, 2013, 1:44:37 AM

to cascadi...@googlegroups.com

Unsure if this was a mistaken reply. If there was a bug, it was resolved a very long time ago.
