On Wed, Apr 11, 2012 at 8:17 AM, Amit <
mahal...@gmail.com> wrote:
> Can someone example the computation of median/quantiles in map reduce?
>
> My understanding of Datafu's median is that the 'n' mappers sort the
> data and send the data to "1" reducer which is responsible for sorting
> all the data from n reducers and finding the median(middle value)
> Is my understanding correct?, if so, does this approach scale for
> massive amounts of data as i can clearly see the one single reducer
> struggling to do the final task.
Hi Amit:
There are two median (quantile) functions in DataFu:
1. Quantile(), which takes a sorted input list and computes quantiles.
2. StreamingQuantile(), which doesn't require the input to be sorted,
but is computing approximate quantiles. In practice, the results
are usually good enough.
There's a blog post on the quantile DataFu UDFs [1].
You are correct in that data must be pushed to a single reducer, so
there are data limits there. Unfortunately, this isn't distributed
quantile. We'd welcome a contribution though :). But be warned that
distributed quantiles using the MapReduce paradigm isn't easy.
1. http://engineering.linkedin.com/open-source/introducing-datafu-open-source-collection-useful-apache-pig-udfs