Actually, simply using MapReduce, one can compute percentiles with a
single MapReduce.
Let's say we want to have the 50th percentile of download par day for
a list of product.
Each product is identified by its productId. And we have a file with
entries such as (productId,nDayDownloads).
0) trick : emit the count as a unreachable nDayDownloads, exemple : -1
1) combine in order to reduce the size : you basically want to get the
data in the form of a histogram (ie not a suite of nDayDownloads but a
mapping of how many times nDayDownloads there was for a product).
1) group by productI
2) secondary sort by nDayDownloads
3) With the first value(s) of each group, you can know the size of the
group
4) you then iterate (only once) and pick the percentile where you want
I did in plain Java and would be interested to do it in Cascading, but
I haven't had the time so far.
Caveat : it is indeed a huge sort. Reducing the histogram width (eg :
well.. finally, I am only interested in thousands of downloads per
day, you can truncate the number) or sampling are way to improve
performance but you will lose in precision. I would like to say that
the precision loss does not matter but it depends why your are looking
for percentile ie for contractual reasons, a estimated percentile
might be irrelevant.
Bertrand
On 11 juil, 00:41, Tim James <
tja...@change.org> wrote:
> Thank you Ken, Ted, Philippe. Very helpful! We'll be looking at all of
> this and I'll let you know what we come up with.
>
> I've looked now at the code inhttps://
github.com/bixolabs/cascading.utils/blob/master/src/main/java...
> and we may be able to modify it for our needs (quartiles are insufficient),
> or we may end up with some simpler sampling approach.
>
> Tim
>
>
>
>
>
>
>
> On Monday, July 9, 2012 11:07:29 PM UTC-7, Ted Dunning wrote:
>
> > Sent from my iPhone
>
> > On Jul 9, 2012, at 8:43 PM, Ken Krugler <
kkrugler_li...@transpac.com>