We fell over this quartile problem recently at GSK, in an interface to
S-Plus graphics that we have been developing with Insightful. Somebody
noticed that the boxplots produced by S-Plus were different to those
from SAS -- not seriously different, but enough to cause concern in an
environment where "validated" software soaks up so much time and
effort. Here is the result of our investigation into what was going
on. Since then, Insightful has provided our interface with a method
that matches the SAS standard method.

Why has this come about...

Whilst quantiles may seem one of the simpler statistical concepts,
there is no universally agreed way of calculating them precisely
(though all methods will give results that are similar enough for
practical purposes). None of these methods is statistically
incorrect.

Details of the algorithms for calculating quantiles, and commentary on
which is better, can be found in Hyndman, R.J., and Fan, Y. (1996),
Sample quantiles in statistical packages, American Statistician, 50,
361-365. In brief, the article gives nine definitions of a quantile
and notes that SAS offers Defs 1, 2, 3, 4 & 6, while S-Plus has Def 7.
In the end the paper recommends one that was not implemented in any
package at the time: Def 8!
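
To make the nine definitions concrete, here is a minimal sketch in
Python. It assumes the usual parameterisation from the paper: the
quantile is (1 - gamma) * x[j] + gamma * x[j+1], where j is the integer
part of n*p + m, g is the fractional part, and each definition chooses
its own m and its own rule for gamma. The function name hf_quantile is
made up for illustration, and this is not code from any of the
packages discussed.

```python
from fractions import Fraction
from math import floor


def hf_quantile(data, p, definition):
    """Sample quantile under Hyndman & Fan (1996) definitions 1 to 9.

    Illustrative sketch only; `hf_quantile` is a hypothetical name.
    """
    x = sorted(data)
    n = len(x)
    p = Fraction(p)  # exact arithmetic, so the test g == 0 is reliable

    # Offset m for each definition (assumed parameterisation, see lead-in)
    m = {
        1: 0, 2: 0, 3: Fraction(-1, 2),           # discontinuous definitions
        4: 0, 5: Fraction(1, 2), 6: p, 7: 1 - p,  # linear interpolation
        8: (p + 1) / 3, 9: p / 4 + Fraction(3, 8),
    }[definition]

    h = n * p + m
    j = floor(h)  # index (1-based) of the lower order statistic
    g = h - j     # fractional part

    if definition == 1:
        gamma = 1 if g > 0 else 0
    elif definition == 2:
        gamma = 1 if g > 0 else Fraction(1, 2)
    elif definition == 3:
        gamma = 0 if (g == 0 and j % 2 == 0) else 1
    else:
        gamma = g

    # Clamp to the ends of the sample: x[0] := x[1], x[n+1] := x[n]
    lower = x[min(max(j, 1), n) - 1]
    upper = x[min(max(j + 1, 1), n) - 1]
    return float((1 - gamma) * lower + gamma * upper)


if __name__ == "__main__":
    sample = range(1, 11)  # the values 1, 2, ..., 10
    for d in range(1, 10):
        print(d, hf_quantile(sample, Fraction(1, 4), d))
```

For the values 1 to 10, the lower quartile comes out as 3.25 under
Def 7 but 2.5 under Def 4 and 3.0 under Def 1, which is the size of
discrepancy that shows up between boxplots from different packages.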

What SAS does...

There are actually five different methods built into SAS for
calculating quantiles. The definitions of these can be found in the
SAS Online documentation,
http://v8doc.sas.com/sashtml/proc/zormulas.htm#z0093467
and it's well worth being aware of them as SAS might not be doing
exactly what you think it is doing! As far as I can tell (in PROC
SUMMARY) the default is Method 3 (the empirical distribution
function). SAS looks for the position where the quartile should lie
and then takes the observation closest to it.

NB: SAS does this for all quantiles including the median, which may
come as a surprise to those of you who, like me, were taught as far
back as school to take the average of the middle observations from a
dataset with an even number of observations. The other methods
available, particularly Method 1 (weighted average) may be more what
you were expecting SAS to do, depending on how you were taught.
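
The contrast between Method 3 and Method 1 can be sketched in Python
as follows. This is only an illustration under my reading of the two
definitions: both locate the position n*p, then Method 3 returns an
actual observation (the next one up when n*p is not a whole number),
whereas Method 1 interpolates between the two neighbouring
observations. The function names are invented here and are not SAS
syntax.

```python
from fractions import Fraction
from math import floor


def sas_method1(data, p):
    """Weighted average of the observations either side of position n*p.

    Hypothetical helper; assumes the formula described in the lead-in.
    """
    x = sorted(data)
    n = len(x)
    pos = n * Fraction(p)
    j = floor(pos)  # 1-based index of the lower neighbour
    g = pos - j     # fractional part, used as the weight
    lower = x[max(j, 1) - 1]
    upper = x[min(j + 1, n) - 1]
    return float((1 - g) * lower + g * upper)


def sas_method3(data, p):
    """Empirical distribution function: always returns an observed value.

    Hypothetical helper; assumes the formula described in the lead-in.
    """
    x = sorted(data)
    n = len(x)
    pos = n * Fraction(p)
    j = floor(pos)
    g = pos - j
    i = j if g == 0 else j + 1  # step up to the next observation
    return x[min(max(i, 1), n) - 1]


if __name__ == "__main__":
    values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print(sas_method1(values, Fraction(1, 4)))  # 25.0, between two values
    print(sas_method3(values, Fraction(1, 4)))  # 30, an actual data value
```

Passing p as a Fraction keeps the check for a whole-number position
exact; with ordinary floating point a value such as 0.1 * 10 could
miss the g == 0 case.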

What S-Plus does...

The S-Plus quantile function (which is used to create boxplots) takes
a weighted average approach. Unfortunately, S-Plus calculates the
position of the quantile in a subtly different way from SAS, which
means that SAS's Method 1 (or any other method) is still not identical
to this method. The S-Plus algorithm is actually the same one that is
implemented in Excel, and you can find an easy-to-follow explanation
here:
http://support.microsoft.com/kb/214072/en-us
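
As a rough illustration of that difference in position, here is a
short Python sketch. It assumes the S-Plus/Excel rule corresponds to
Def 7 of Hyndman and Fan, i.e. it interpolates at position
(n - 1)*p + 1 among the sorted values, where a weighted average in the
style of SAS Method 1 would interpolate at n*p. The function name is
made up for this example.

```python
from math import floor


def splus_style_quantile(data, p):
    """Linear interpolation at position (n - 1) * p + 1 (1-based).

    Hypothetical helper; assumes the Def 7 rule described in the lead-in.
    """
    x = sorted(data)
    n = len(x)
    pos = (n - 1) * p  # the same position, counted from zero
    j = floor(pos)     # zero-based index of the lower neighbour
    g = pos - j        # fractional part, used as the weight
    if j + 1 >= n:     # p == 1: nothing above the maximum
        return float(x[-1])
    return x[j] + g * (x[j + 1] - x[j])


if __name__ == "__main__":
    values = list(range(1, 11))  # 1, 2, ..., 10
    print(splus_style_quantile(values, 0.25))  # 3.25
    print(splus_style_quantile(values, 0.75))  # 7.75
```

For the same ten values, a weighted average taken at position n*p
would give 2.5 and 7.5 instead, so the box of the boxplot is drawn in
a slightly different place even though both methods interpolate.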
Peter Lane
Research Statistics Unit, GlaxoSmithKline