Hi, I am looking for a distributed algorithm that can estimate a percentile of a large, distributed data set. Here is the problem:
Imagine you have a cluster with a large number of servers running the same web app, and each server independently records the processing time of every request it serves. Every hour, each server reports its statistics to a centralized data store. These statistics include, but are not limited to, the following (a sketch of the per-server computation follows the list):
- Number of requests received in this hour;
- Average processing time per request in this hour;
- Variance of processing time in this hour;
- 99th percentile of the processing time in this hour.
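For concreteness, here is a minimal sketch of what each server computes locally before reporting. This is purely illustrative (the function name `hourly_report` and the dict layout are my own, not an existing API), assuming the raw latencies for the hour fit in memory on each server:

```python
import math
import statistics

def hourly_report(processing_times):
    """Summarize one server's request processing times for one hour.

    `processing_times` is a hypothetical in-memory list of per-request
    latencies (e.g., in milliseconds); assumes at least two samples.
    """
    times = sorted(processing_times)
    n = len(times)
    p99_index = max(0, math.ceil(0.99 * n) - 1)   # nearest-rank 99th percentile
    return {
        "count": n,                               # number of requests this hour
        "mean": statistics.fmean(times),          # average processing time
        "variance": statistics.variance(times),   # sample variance
        "p99": times[p99_index],                  # 99th percentile for the hour
    }
```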
Now we want a rough estimate of the 99th percentile for the hour across the whole cluster, based on the above data collected in centralized storage from all the servers. Is there an existing algorithm to calculate this aggregated 99th percentile? If additional data needs to be collected on each server to make this calculation possible, what would that data be? The bottom line is that we don't want to ship all the raw processing-time data to the centralized store just to do this calculation.
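For context on why I suspect the percentile is the hard part: as far as I understand, the count, mean, and variance from the list above can be combined exactly across servers, so no extra data is needed for those; only the 99th percentile has no comparable merge rule. A sketch of the exact pooling, using the hypothetical report format above:

```python
def pool_stats(reports):
    """Combine per-server (count, mean, variance) exactly.

    `reports` is a list of dicts shaped like hourly_report() output.
    Uses the pooled-mean and law-of-total-variance identities (population
    form; the n-1 sample correction is omitted for brevity). No analogous
    identity exists for merging per-server 99th percentiles.
    """
    total = sum(r["count"] for r in reports)
    pooled_mean = sum(r["count"] * r["mean"] for r in reports) / total
    # Within-server variance plus between-server spread of the means.
    pooled_var = sum(
        r["count"] * (r["variance"] + (r["mean"] - pooled_mean) ** 2)
        for r in reports
    ) / total
    return {"count": total, "mean": pooled_mean, "variance": pooled_var}
```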
Thank you!
Weian