Calculating percentile on arbitrary list


JPatrick Davenport

Oct 28, 2012, 2:24:13 PM
to cascadi...@googlegroups.com
Hello,
I need to use Cascading to dig through a lot of data and roll up reports from it. I understand how to do the actual report math; what I'm not sure about is how to calculate percentile ranks over the data. Specifically, I'm looking for the records at or above the 95th percentile. My understanding of how to find this on an ordinary list is to first take 1 - 0.95 (the percentile is configurable; 95th is just the current setting), which yields 0.05. I then multiply the size of the list by 0.05, so a list of 100 gives 5. With that count, I would take the first 5 entries of the descending-sorted list.
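In code form (a hypothetical helper, just to pin down the arithmetic):

    // Number of records at or above the p-th percentile of n records,
    // e.g. p = 0.95, n = 100 gives 5.
    static int topTailCount(int n, double p) {
        return (int) Math.ceil(n * (1.0 - p));
    }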

My problem is that the list size is arbitrary. I can figure out how to count the unique entries in the list with Cascading, but after that I'm not sure how to calculate the needed information.

For example purposes, the data is straightforward. I have a list of users and their scores: userid, score. It can be any size.

How should I approach this?

Thanks,
JPD

Ted Dunning

Oct 28, 2012, 2:44:26 PM
to cascadi...@googlegroups.com
There are better ways to do this.

If you want the exact 95th percentile from a list, you can build an accumulator that remembers the number of elements seen so far, along with a priority queue that keeps the top 5% of the data seen so far. The maximum size of the priority queue should be set just larger than necessary, so that most elements can be rejected out of hand (anything smaller than the queue's current minimum never enters it).
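A minimal sketch of such an accumulator (names and details are illustrative, not taken from any library):

import java.util.PriorityQueue;

// Keeps the top `fraction` of the stream in a bounded min-heap.
public class TopTailAccumulator {
    private final double fraction;  // e.g. 0.05 for the 95th percentile
    private final PriorityQueue<Double> top = new PriorityQueue<>();  // min-heap
    private long count = 0;

    public TopTailAccumulator(double fraction) {
        this.fraction = fraction;
    }

    public void add(double x) {
        count++;
        long capacity = (long) Math.ceil(count * fraction) + 1;  // a little slack
        if (top.size() < capacity) {
            top.add(x);
        } else if (x > top.peek()) {  // most elements fail this test cheaply
            top.poll();
            top.add(x);
        }
    }

    // Smallest retained element approximates the (1 - fraction) quantile.
    // Assumes add() has been called at least once.
    public double percentile() {
        return top.peek();
    }
}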

This approach can become unwieldy as the number of elements seen increases.  There are a variety of advanced algorithms that will give you good approximations, but one of the simplest algorithms is to simply keep several queues that each get progressively more down-sampled versions of your data stream.  I usually keep 5-8 of these queues and down-sample each one 10x more severely than the next.

Then to compute an extreme quantile (like say the 99.999-th %-ile), you compute where this percentile would be in each of your queues and take the least down-sampled version as your estimate.
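A sketch of the multi-queue scheme, with illustrative parameters (TopTailAnalyzer, linked below, differs in details):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Several top-K queues; each sees a 10x more down-sampled copy of the stream.
public class SampledTopTails {
    private static final int QUEUES = 6;   // 5-8 in practice
    private static final int K = 1000;     // samples kept per queue
    private final List<PriorityQueue<Double>> queues = new ArrayList<>();
    private final Random rng = new Random();
    private long count = 0;

    public SampledTopTails() {
        for (int i = 0; i < QUEUES; i++) {
            queues.add(new PriorityQueue<>());
        }
    }

    public void add(double x) {
        count++;
        double keepProbability = 1.0;   // queue i keeps ~10^-i of the stream
        for (PriorityQueue<Double> q : queues) {
            if (rng.nextDouble() >= keepProbability) {
                break;                  // dropped here, and for all deeper queues
            }
            if (q.size() < K) {
                q.add(x);
            } else if (x > q.peek()) {
                q.poll();
                q.add(x);
            }
            keepProbability /= 10.0;
        }
    }

    // Estimate quantile q (e.g. 0.99999) from the least down-sampled queue
    // whose retained tail is deep enough to contain that rank.
    public double quantile(double q) {
        double rate = 1.0;
        for (PriorityQueue<Double> queue : queues) {
            long tailRank = Math.round((1.0 - q) * count * rate);
            if (tailRank < queue.size()) {
                Double[] tail = queue.toArray(new Double[0]);
                Arrays.sort(tail);      // ascending
                return tail[tail.length - 1 - (int) tailRank];
            }
            rate /= 10.0;
        }
        throw new IllegalStateException("not enough data to resolve quantile " + q);
    }
}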

This algorithm is not as accurate as the more clever algorithms available, but it has three major advantages:

1) it is simple enough to get right the first time you code it

2) it can't be fooled by data that is in a diabolical order

3) you can snag a copy of an implementation at https://github.com/tdunning/load (look at src/main/java/com/mapr/load/TopTailAnalyzer.java )

Hope this helps.




Paco Nathan

Oct 28, 2012, 6:44:25 PM
to cascadi...@googlegroups.com
+1 to what Ted suggests

A formalism for determining percentile ranks while streaming through
large amounts of data is to use an indexable skip list:
http://en.wikipedia.org/wiki/Skip_list#Indexable_skiplist
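A minimal insert-only sketch of that structure (fixed level cap, no deletes; illustrative rather than production code):

import java.util.Random;

// Indexable skip list: each forward link stores a "width" (elements skipped),
// so the k-th smallest of the values seen so far is found in O(log n).
public class IndexableSkipList {
    private static final int MAX_LEVEL = 32;

    private static final class Node {
        final double value;
        final Node[] next;
        final int[] width;
        Node(double value, int levels) {
            this.value = value;
            next = new Node[levels];
            width = new int[levels];
        }
    }

    private final Random rng = new Random();
    private final Node nil = new Node(Double.POSITIVE_INFINITY, 0);  // sentinel
    private final Node head = new Node(Double.NEGATIVE_INFINITY, MAX_LEVEL);
    private int size = 0;

    public IndexableSkipList() {
        for (int i = 0; i < MAX_LEVEL; i++) {
            head.next[i] = nil;
            head.width[i] = 1;
        }
    }

    public void insert(double value) {
        Node[] chain = new Node[MAX_LEVEL];      // rightmost node per level
        int[] steps = new int[MAX_LEVEL];        // steps taken at each level
        Node node = head;
        for (int level = MAX_LEVEL - 1; level >= 0; level--) {
            while (node.next[level].value <= value) {
                steps[level] += node.width[level];
                node = node.next[level];
            }
            chain[level] = node;
        }
        int levels = 1;                          // geometric level distribution
        while (levels < MAX_LEVEL && rng.nextBoolean()) {
            levels++;
        }
        Node x = new Node(value, levels);
        int below = 0;                           // distance from chain[level] to x
        for (int level = 0; level < levels; level++) {
            x.next[level] = chain[level].next[level];
            chain[level].next[level] = x;
            x.width[level] = chain[level].width[level] - below;
            chain[level].width[level] = below + 1;
            below += steps[level];
        }
        for (int level = levels; level < MAX_LEVEL; level++) {
            chain[level].width[level]++;         // links jumping over x got longer
        }
        size++;
    }

    // k-th smallest value, 0-based.
    public double get(int k) {
        int i = k + 1;
        Node node = head;
        for (int level = MAX_LEVEL - 1; level >= 0; level--) {
            while (node.width[level] <= i && node.next[level] != nil) {
                i -= node.width[level];
                node = node.next[level];
            }
        }
        return node.value;
    }

    // e.g. quantile(0.95) for the 95th percentile of everything seen so far.
    public double quantile(double q) {
        return get(Math.min(size - 1, (int) Math.floor(q * size)));
    }
}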

That exact rank lookup is what the progressively down-sampled queues approximate.

Writing this for the general case in a Cascading app is another matter. I'd
like to add that capability to a general library for Cascading, and
would be willing to write/maintain the code. Definitely open to
suggestions. Using a column store comes to mind.

FWIW, here's the Ryan Tibshirani paper comparing quickselect,
binmedian, and binapprox:
http://www.stat.cmu.edu/~ryantibs/papers/median.pdf

Paco

Ken Krugler

Oct 28, 2012, 7:44:28 PM
to cascadi...@googlegroups.com
There was a previous discussion about calculating standard deviations using an online algorithm. Ted had posted:

See http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm for an explanation of Welford's method for doing this.  This method is available in the Apache Mahout class OnlineSummarizer:


The OnlineSummarizer code calculates quartiles, which I assume could be extended to calculating your target 95th percentile.
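For reference, Welford's update itself is only a few lines; a generic sketch (not the Mahout code):

// Welford's online algorithm: single-pass, numerically stable mean/variance.
public class OnlineVariance {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;        // running sum of squared deviations

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // second factor uses the updated mean
    }

    public double mean() { return mean; }

    public double sampleVariance() { return n > 1 ? m2 / (n - 1) : 0.0; }

    public double standardDeviation() { return Math.sqrt(sampleVariance()); }
}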

I posted some quick & dirty code for doing standard deviation here:


That could be a useful starting point, though it doesn't do any map-side pre-calculation.

-- Ken



Ted Dunning

Oct 28, 2012, 7:56:35 PM
to cascadi...@googlegroups.com
Also, while binapprox and binmedian are nice algorithms, neither is on-line.  

The algorithm used in OnlineSummarizer is on-line but not easily decomposed for use in a parallel setting.

For highly skewed percentiles, the successive sub-sampling algorithm (which, as Paco correctly points out, is similar to skip-lists) is *both* on-line and decomposable. That means you can make a version that uses a combiner.

The reason that you can do this is that keeping the top 1000 samples of a data stream can trivially be done by keeping the top 1000 samples of portions of that stream and then keeping the top 1000 samples from the concatenation of the partial top-1000 samples.  This applies as well to downsampled versions of a stream.  Thus, doing the successive down-sample algorithm on each mapper output and then keeping the best 1000 samples from each mapper gives you a combiner form for this summarization.
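In code, the combiner-side step is just a bounded top-K fold over the partial results (an illustrative sketch, not Cascading-specific):

import java.util.PriorityQueue;

// The top K of a stream equals the top K of the concatenation
// of per-split top-K sets, so this merge is safe to run in a combiner.
public class TopKMerge {
    public static PriorityQueue<Double> merge(int k, Iterable<PriorityQueue<Double>> partials) {
        PriorityQueue<Double> merged = new PriorityQueue<>();   // min-heap
        for (PriorityQueue<Double> partial : partials) {
            for (double x : partial) {
                if (merged.size() < k) {
                    merged.add(x);
                } else if (x > merged.peek()) {
                    merged.poll();
                    merged.add(x);
                }
            }
        }
        return merged;
    }
}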

If you actually want to compute the median, you have a considerably harder problem.

Paco Nathan

Oct 28, 2012, 8:12:16 PM
to cascadi...@googlegroups.com
Also, having an online approach is useful for anomaly detection, as in
anti-fraud work. Calculating a median can be costly, but running medians
are not so much; they come in quite handy in risk apps.

There's definitely room among common use cases for alternating between
large batch jobs and online/streaming. That's one of the "patterns" we
see where Hadoop meets "real-time" integrations in practice.

JPatrick Davenport

Oct 28, 2012, 10:37:39 PM
to cascadi...@googlegroups.com
Wow. Thank you all. I will spend the better part of tomorrow researching these topics. I knew my heart sank for a reason when I got the functional requirement to just find the 95th percentile and above.

Ted Dunning

Oct 29, 2012, 1:49:26 AM
to cascadi...@googlegroups.com
For that, you have the answer.  Just look at the referenced TopTailAnalyzer.  Take it if you like.  Adapt it if you need to.
