How does the Median/Quantiles work

451 views
Skip to first unread message

Amit

unread,
Apr 11, 2012, 11:17:39 AM4/11/12
to DataFu
Can someone example the computation of median/quantiles in map reduce?

My understanding of Datafu's median is that the 'n' mappers sort the
data and send the data to "1" reducer which is responsible for sorting
all the data from n reducers and finding the median(middle value)
Is my understanding correct?, if so, does this approach scale for
massive amounts of data as i can clearly see the one single reducer
struggling to do the final task.

Thanks


Amit

unread,
Apr 11, 2012, 11:50:35 AM4/11/12
to DataFu
**Correction**
which is responsible for sorting all the data from n mappers*

Sam Shah

unread,
Apr 11, 2012, 12:41:36 PM4/11/12
to dat...@googlegroups.com
On Wed, Apr 11, 2012 at 8:17 AM, Amit <mahal...@gmail.com> wrote:
> Can someone example the computation of median/quantiles in map reduce?
>
> My understanding of Datafu's median is that the 'n' mappers sort the
> data and send the data to "1" reducer which is responsible for sorting
> all the data from n reducers and finding the median(middle value)
> Is my understanding correct?, if so, does this approach scale for
> massive amounts of data as i can clearly see the one single reducer
> struggling to do the final task.

Hi Amit:

There are two median (quantile) functions in DataFu:

1. Quantile(), which takes a sorted input list and computes quantiles.
2. StreamingQuantile(), which doesn't require the input to be sorted,
but is computing approximate quantiles. In practice, the results
are usually good enough.

There's a blog post on the quantile DataFu UDFs [1].

You are correct in that data must be pushed to a single reducer, so
there are data limits there. Unfortunately, this isn't distributed
quantile. We'd welcome a contribution though :). But be warned that
distributed quantiles using the MapReduce paradigm isn't easy.

1. http://engineering.linkedin.com/open-source/introducing-datafu-open-source-collection-useful-apache-pig-udfs

Reply all
Reply to author
Forward
0 new messages