Hi, I am looking for a distributed algorithm that can estimate a percentile of a large, distributed data set. Here is the problem:
Imagine you have a cluster with a large number of servers running the same web app, and each server independently records the processing time of every request it serves. Every hour, each server reports its statistics to a centralized data store. These statistics include, but are not limited to, the following (a sketch of the per-server computation follows the list):
- Number of requests received in this hour;
- Average processing time per request in this hour;
- Variance of processing time in this hour;
- 99th percentile of the processing time in this hour.
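For concreteness, here is a minimal sketch of what each server computes locally before reporting. This is purely illustrative (the function name `hourly_report` and the dict layout are my own, not an existing API), assuming the raw latencies for the hour fit in memory on each server:

```python
import math
import statistics

def hourly_report(processing_times):
    """Summarize one server's request processing times for one hour.

    `processing_times` is a hypothetical in-memory list of per-request
    latencies (e.g., in milliseconds); assumes at least two samples.
    """
    times = sorted(processing_times)
    n = len(times)
    p99_index = max(0, math.ceil(0.99 * n) - 1)   # nearest-rank 99th percentile
    return {
        "count": n,                               # number of requests this hour
        "mean": statistics.fmean(times),          # average processing time
        "variance": statistics.variance(times),   # sample variance
        "p99": times[p99_index],                  # 99th percentile for the hour
    }
```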
Now we want a rough estimate of the 99th percentile for the hour across the whole cluster, based on the above data collected in centralized storage from all the servers. Is there an existing algorithm to calculate this aggregated 99th percentile? If additional data needs to be collected on each server to make this calculation possible, what would that data be? The bottom line is that we don't want to ship all the raw processing-time data to the centralized store just to do this calculation.
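For context on why I suspect the percentile is the hard part: as far as I understand, the count, mean, and variance from the list above can be combined exactly across servers, so no extra data is needed for those; only the 99th percentile has no comparable merge rule. A sketch of the exact pooling, using the hypothetical report format above:

```python
def pool_stats(reports):
    """Combine per-server (count, mean, variance) exactly.

    `reports` is a list of dicts shaped like hourly_report() output.
    Uses the pooled-mean and law-of-total-variance identities (population
    form; the n-1 sample correction is omitted for brevity). No analogous
    identity exists for merging per-server 99th percentiles.
    """
    total = sum(r["count"] for r in reports)
    pooled_mean = sum(r["count"] * r["mean"] for r in reports) / total
    # Within-server variance plus between-server spread of the means.
    pooled_var = sum(
        r["count"] * (r["variance"] + (r["mean"] - pooled_mean) ** 2)
        for r in reports
    ) / total
    return {"count": total, "mean": pooled_mean, "variance": pooled_var}
```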
Thank you!
Weian