Methods for computing percentiles


Liz

Feb 8, 2008, 8:01:24 AM
to MedStats
Dear Medstats,
Recently I was calculating quartiles for a relatively small dataset
(10 observations) in the 'Frequencies' dialogue in SPSS. The result
for the 75th percentile was 1.375, despite the fact that the 7th and
8th observations were both 1 (the 9th and 10th observations were
higher). This led me to do some digging, with the help of our kind uni
stats software expert, and we discovered that within the syntax for
the 'Examine' procedure within 'Explore' one can specify 5 different
methods for computing percentiles (this is limited to 1 default
setting in 'Frequencies'). I also found a macro online that offers 6
different methods. I have played around with a few datasets, looking
at the results for all 6 methods, and sometimes they differ quite
dramatically. For the dataset I mentioned, all of the other 5 methods
apart from the default agreed that the 75th percentile should be 1!
SAS offers 5 different options for calculating quartiles; method '4'
in SAS corresponds to the default setting in the 'Frequencies'
dialogue in SPSS, which is a weighted-average algorithm. Is this something
that everyone is generally aware of? Is there a consensus as to which
method is recommended? The preamble to the macro stated that the
different methods should be all but indistinguishable for large N, but
discrepancies will be magnified in small datasets. Since non-parametric
stats are frequently used for smaller datasets, this seems
like an important issue to resolve. I'd be interested to hear your
thoughts.
Thanks,
Liz
More info on the various methods:
http://www.data-for-all.com/documents/computing-percentiles.pdf
http://www.graphpad.com/index.cfm?cmd=library.page&pageID=1&categoryID=4
http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Software/median.txt

Bruce Weaver

Feb 8, 2008, 8:43:27 AM
to MedStats
I suspect many users of SPSS and SAS are not aware of it. You might
consider posting this to the SPSS newsgroup
(comp.soft-sys.stat.spss).

Thanks for posting those links.

--
Bruce Weaver
bwe...@lakeheadu.ca
www.angelfire.com/wv/bwhomedir
"When all else fails, RTFM."

Ray Koopman

Feb 9, 2008, 2:15:23 AM
to MedStats
Mathematica uses four parameters to specify the method by which
it will compute percentiles ("quantiles"). Eight different sets
are shown as "common choices", and it is claimed that "About ten
different choices of parameters are in use in statistical work."
For details, see
http://reference.wolfram.com/mathematica/ref/Quantile.html
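
To illustrate how a handful of parameters can span many of these
methods, here is a minimal Python sketch of a four-parameter family.
It follows my recollection of the formula in the Wolfram documentation
linked above: the position is x = a + (n + b)*q, and the result
interpolates between the order statistics at floor(x) and ceil(x) with
weight c + d*frac(x). That formula, the function name quantile_4param,
and the clamping of out-of-range positions to the first or last
observation are all assumptions for illustration, not a statement of
what Mathematica actually does. The data are made up.

```python
import math

def quantile_4param(values, q, a, b, c, d):
    """Four-parameter sample quantile (illustrative sketch, not a library API)."""
    s = sorted(values)
    n = len(s)
    x = a + (n + b) * q                     # 1-based position in the sorted data
    lo = min(max(math.floor(x), 1), n)      # clamp to 1..n (assumption)
    hi = min(max(math.ceil(x), 1), n)
    frac = x - math.floor(x)
    return s[lo - 1] + (s[hi - 1] - s[lo - 1]) * (c + d * frac)

data = [10, 20, 30, 40]                     # hypothetical data
print(quantile_4param(data, 0.75, 0, 0, 1, 0))    # position n*q, no interpolation
print(quantile_4param(data, 0.75, 0, 1, 0, 1))    # position (n+1)*q, interpolated
print(quantile_4param(data, 0.75, 1, -1, 0, 1))   # position (n-1)*q + 1, interpolated
```

With these four values the three parameter sets give 30, 37.5 and 32.5
for the 75th percentile, which shows how far the methods can diverge
when n is small.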

Ted Harding

Feb 9, 2008, 9:06:59 AM
to Liz, MedStats
On 08-Feb-08 13:01:24, Liz wrote:
> Dear Medstats,
> Recently I was calculating quartiles for a relatively small
> dataset (10 observations) in the 'Frequencies' dialogue in
> SPSS. The result for the 75th percentile was 1.375, despite
> the fact that the 7th and 8th observations were both 1
> (the 9th and 10th observations were higher). This led me to
> do some digging, with the help of our kind uni stats software
> expert, and we discovered that within the syntax for the
> 'Examine' procedure within 'Explore' one can specify 5 different
> methods for computing percentiles (this is limited to 1 default
> setting in 'Frequencies'). I also found a macro online that offers 6
> different methods.

Liz,
Congratulations on doing the digging! I may have more to say
on this issue later (and, if so, then at length).

Meanwhile: I am puzzled that a result should be returned,
according to which 80% of the data lie below the "75th percentile".

Did you manage to find out what procedure was used by the
software to arrive at this result?

If so, would you share it?

With best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 09-Feb-08 Time: 14:06:54
------------------------------ XFMail ------------------------------

Liz

Feb 11, 2008, 7:55:07 AM
to MedStats
On Feb 9, 2:06 pm, (Ted Harding) <Ted.Hard...@manchester.ac.uk> wrote:
> On 08-Feb-08 13:01:24, Liz wrote:
> Did you manage to find out what procedure was used by the
> software to arrive at this result?
>
> If so, would you share it?

Our applications support manager has been in contact with the company,
and they say:
"The basic method in FREQUENCIES is what SAS PROC UNIVARIATE calls
PCTLDEF=4. The five versions offered in SPSS EXAMINE are the same as
those offered in SAS PROC UNIVARIATE. We call the method used in
FREQUENCIES HAVERAGE in EXAMINE. If you click on Help>Algorithms, then
scroll down and click on the link to FREQUENCIES Algorithms, you can
navigate to the detailed formulas.
The complication is that FREQUENCIES, while basically using the
HAVERAGE method, does something additional under certain
circumstances. If the percentile requested is large enough that X(p(n+1))
is larger than n, the HAVERAGE method in EXAMINE will return a
missing value, while FREQUENCIES usually will not. FREQUENCIES will
attempt to provide an estimate by reversing the data (multiplying it
by -1), computing the 100-pth percentile on the reversed data, and
taking -1 times that as the value for the pth percentile."

I've found the relevant formula but can't paste it accurately into
Medstats. I haven't had time to go through it in detail yet to figure
out why it comes up with 1.375 - I think it is adding 1 to the total N
before calculating the location of the 75th percentile. This would
produce a position of 0.75*(10+1)=8.25 instead of 0.75*10=7.5. The 8th
observation is 1, whilst the 9th observation is 2.5, and interpolating
between the two would give 1+(0.25*1.5)=1.375, so that sounds logical.
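
The arithmetic above can be checked with a short Python sketch. It is
only an illustration under stated assumptions: the function name
weighted_percentile is made up, the end-of-range handling (falling back
to the smallest or largest observation) is a simplification and is not
the FREQUENCIES behaviour quoted above, and the dataset is hypothetical
apart from the facts given in this thread (10 observations, the 7th and
8th equal to 1, the 9th equal to 2.5, the 10th higher).

```python
import math

def weighted_percentile(values, p, plus_one):
    """Interpolate at position p*(n+1) if plus_one is True, else at p*n (1-based)."""
    xs = sorted(values)
    n = len(xs)
    pos = p * (n + 1) if plus_one else p * n
    j = math.floor(pos)          # whole part: index of the lower observation
    g = pos - j                  # fractional part: weight on the upper observation
    if j < 1:                    # simplification: fall back to the smallest value
        return xs[0]
    if j >= n:                   # simplification: fall back to the largest value
        return xs[-1]
    return (1 - g) * xs[j - 1] + g * xs[j]

# Hypothetical data: only observations 7 to 9 are taken from the thread.
data = [0, 0, 0, 0.5, 0.5, 0.5, 1, 1, 2.5, 3]

print(weighted_percentile(data, 0.75, plus_one=True))    # position 8.25
print(weighted_percentile(data, 0.75, plus_one=False))   # position 7.5
```

On this data the first call gives 1.375 and the second gives 1.0, which
matches the difference between the default and the other methods.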

It would be helpful if there was some convention on which method to
use - is anyone aware of a discussion of this matter?

Peter Lane

Feb 12, 2008, 4:57:36 AM
to MedStats
We fell over this quartile problem recently in GSK, in an interface to
S-Plus graphics that we have been developing with Insightful. Somebody
noticed that the boxplots produced by S-Plus were different to those
from SAS -- not seriously different, but enough to cause concern in an
environment where "validated" software soaks up so much time and
effort. Here is the result of our investigation into what was going
on. Since then, Insightful has provided our interface with a method
that matches the SAS standard method.

Why has this come about...

Whilst quantiles may seem one of the simpler statistical concepts,
there is no universally agreed way of calculating them precisely
(though all methods will give results that are similar enough for
practical purposes). None of these methods are statistically
incorrect.

Details on algorithms for calculating quartiles and commentary on
which is better can be found in an article in the American
Statistician journal - Hyndman, R.J., and Fan, Y. (1996) Sample
quantiles in statistical packages. American Statistician, 50,
361-365. To give a brief review, the article gives nine definitions
of a quantile and notes that SAS offers Defs 1, 2, 3, 4 & 6, while
S-Plus has Def 7. In the end the paper recommends one that was not
implemented in any package at the time: Def 8!

What SAS does...

There are actually five different methods built into SAS for
calculating quantiles. The definitions of these can be found in the
SAS Online documentation,
http://v8doc.sas.com/sashtml/proc/zormulas.htm#z0093467
and it's well worth being aware of them as SAS might not be doing
exactly what you think it is doing! As far as I can tell (in PROC
SUMMARY) the default is Method 3 (the empirical distribution
function). SAS looks for the position where the quartile should lie
and then takes the observation closest to it.

NB: SAS does this for all quantiles including the median, which may
come as a surprise to those of you who, like me, were taught as far
back as school to take the average of the middle observations from a
dataset with an even number of observations. The other methods
available, particularly Method 1 (weighted average) may be more what
you were expecting SAS to do, depending on how you were taught.

What S-Plus does...

The S-Plus quantile function (which is used to create boxplots) takes
a weighted average approach. Unfortunately there is a subtle
difference in the way S-Plus calculates the position of the quantile
compared to SAS, which means that none of the SAS methods, including
Method 1, is identical to it. The S-Plus
algorithm is actually the same one that is implemented in Excel, and
you can find an easy-to-follow explanation here:
http://support.microsoft.com/kb/214072/en-us
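
To make the difference in position concrete, here is a rough Python
sketch of three of the approaches mentioned above. The function names
are made up, and the formulas are my reading of the SAS and Excel
documentation rather than anything confirmed in this thread: Method 3
takes the observation at n*p, rounding the position up when it is not
whole; Method 1 interpolates at position n*p; the S-Plus/Excel approach
interpolates at position (n-1)*p + 1. Treat them as assumptions.

```python
import math

def _split(pos):
    """Return the whole and fractional parts of a 1-based position."""
    j = math.floor(pos)
    return j, pos - j

def q_edf(values, p):
    """Empirical distribution function, no averaging (SAS Method 3, as I read it)."""
    xs = sorted(values)
    n = len(xs)
    j, g = _split(n * p)
    k = j if g == 0 else j + 1
    return xs[min(max(k, 1), n) - 1]

def q_weighted_np(values, p):
    """Weighted average at position n*p (SAS Method 1, as I read it)."""
    xs = sorted(values)
    n = len(xs)
    j, g = _split(n * p)
    lo = xs[max(j, 1) - 1]           # position 0 is treated as position 1
    hi = xs[min(j + 1, n) - 1]
    return (1 - g) * lo + g * hi

def q_excel_style(values, p):
    """Weighted average at position (n-1)*p + 1 (S-Plus/Excel style)."""
    xs = sorted(values)
    n = len(xs)
    j, g = _split((n - 1) * p + 1)
    lo = xs[j - 1]
    hi = xs[min(j + 1, n) - 1]
    return (1 - g) * lo + g * hi

data = [1, 2, 3, 4]                  # hypothetical data, even n
print(q_edf(data, 0.5))              # 2
print(q_weighted_np(data, 0.5))      # 2.0
print(q_excel_style(data, 0.5))      # 2.5
```

For the median of four observations, only the last of the three
averages the two middle values in this sketch.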

Peter Lane
Research Statistics Unit, GlaxoSmithKline