I am a computer programmer with only little knowledge of statistics. I am working on a program which computes some statistical indices (min, max,..) of very large input data vectors. As you can imagine, there is a little problem in computing quantiles (or the median), because the normal algorithms for that problem assume, that you have access to all values.
In
Raj Jain, Imrich Chlamtac "The P2 Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations" in: Communcations of the ACM, Oct 1985, Vol. 28, No. 10
a heuristic algorithm is proposed, which calculates estimates for the quantiles without storing the values.
Now my questions:
* I tested the algorithm with 5000 test values. The estimate of the median was only 0.002 % - 0.005 % away from the exact value. Is this good enough for an statistican ?
* Are there other (better?) algorithms for that purpose ?
> Raj Jain, Imrich Chlamtac > "The P2 Algorithm for Dynamic Calculation of Quantiles > and Histograms Without Storing Observations" > in: Communcations of the ACM, Oct 1985, Vol. 28, No. 10
> a heuristic algorithm is proposed, which calculates estimates for the > quantiles without storing the values.
> Now my questions:
> * I tested the algorithm with 5000 test values. The estimate of the median > was only 0.002 % - 0.005 % away from the exact value. Is this good enough > for an statistican ?
Quantiles are often used for descriptive or informal comparative purposes only and hence high accuracy is not required, so the method, on the face of it, looks very promising, and more than adequate for most purposes. Thanks for the reference.
However the accuracy standards required of statistical calcualations are as variable as in any discipline. "Good enough for the statistician" is a meaningless phrase, (perhaps even slightly insulting :-). Whether it is good enough for your application is a matter for your client, not for statistics as a discipline in general. The usual convention in general prupose packages is to offer such an approximation as a cheap alternative.
Just a point on quoting the error they way you have: quantiles are location rather than scale measures and errors should, in principle, be given in additive rather than multiplicative terms, relative to some suitable measure of error variation. Any method of estimating quantiles could be made to appear arbitrarily good, in simple percentage terms, simply by adding a sufficiently large number to each sample value.
Also the median is the easiest quantile to estimate, so what you have quoted is the best possible case. It may be a different story at the 10% or 5% quantile.
> * Are there other (better?) algorithms for that purpose ?
D. E. Knuth "The art of computer programming" vol 3: "Sorting and searching" discusses the problem of finding the sample median at some length, but only for exact calculation and not for general quantiles.
-- Bill Venables, Dept. of Statistics, | Email: venab...@spam.adelaide.edu.au Univ. of Adelaide, South Australia. | Phone: +61 8 228 5412
[ A question about how to calculate the quantiles ]
There are essentially two methods of calculating the quantiles accurately, and the 'approximate' methods are necessarily all dependent on the data occuring in a suitably 'random' order. I would use an exact method, which are:
1) Sorting the observations. This is effectively O(N ln N) for any number of quantiles.
2) Multiple passes through the data. This is effectively O(Q N) for Q quantiles.
This means that sorting is most efficient UNLESS the number of data points exceeds K**Q, where K is a constant and Q is the number of quantiles needed. K is somewhere between 50 and 200, depending on the implementation. Unless you are looking for only the median, I advise sorting the data. Sorry.
Nick Maclaren University of Cambridge Computer Laboratory n...@cl.cam.ac.uk
In article <1991Apr13.200005.26...@cl.cam.ac.uk> n...@cl.cam.ac.uk (Nick Maclaren) writes:
>[ A question about how to calculate the quantiles ]
>There are essentially two methods of calculating the quantiles accurately, >and the 'approximate' methods are necessarily all dependent on the data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>occuring in a suitably 'random' order. I would use an exact method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>which are:
Not so; one useful possibility for large data sets is to use a one-pass algorithm to select a random subset of N (user-selected) of the data, and to get the exact quantiles of the random subset. By suitable choice of N you can make the approximation as accurate as you like. The use of the one-pass sampling algorithm means that no more than N items need to be stored, regardless of the size of the whole data set.
Douglas M Hawkins Department of Applied Statistics e-mail d...@umnstat.stat.umn.edu University of Minnesota Voice (612) 625-8159 St Paul Fax (612) 624-2719 MN 55108