Box and box-percentile plots

Tim Hesterberg

Nov 27, 1997

I'll add to Frank Harrell's comments regarding boxplots: the standard
method for setting whiskers can be very misleading, because the upper
(lower) whisker depends on the lower (upper) half of the distribution
as well as on its own half. The whisker length is capped at a multiple
of the interquartile range, and that range spans both halves.

I have an example where the visual impression imparted by the whiskers
was the exact opposite of what should have been imparted.
I did a boxplot of Y vs year, for a large data set of fairly skewed
data, with 10 different years. One year was quite distinct -- the
upper half of the distribution was about the same as for other years,
but the lower quartile was lower (as were other percentiles below the
median). Because of that the interquartile range was wider, and the
upper whisker extended much further, giving the impression of higher
claims that year (the lower whisker didn't extend much more, because
it hit the minimum of the data).
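
The effect described above can be reproduced with a small sketch in R. The data are synthetic and purely illustrative (they are not the insurance claims), and the variable names are made up for this example. Both "years" share the same median and upper half; only the lower half of year B is stretched downward.

```r
# Synthetic data only (not the insurance claims from the example above).
# Both "years" share the same median and upper half; year B has its
# lower half stretched downward.
upper   <- 10 + 4 * qexp(seq(0, 0.999, length.out = 501))  # skewed upper tail
lower_a <- seq(8, 9.99, length.out = 500)                  # typical year
lower_b <- seq(4, 9.99, length.out = 500)                  # unusual year

a <- boxplot.stats(c(lower_a, upper))
b <- boxplot.stats(c(lower_b, upper))

# R orders stats as: lower whisker, lower hinge, median,
# upper hinge, upper whisker.
rbind(year_a = a$stats, year_b = b$stats)

# Number of points flagged beyond the whiskers in each year.
c(outliers_a = length(a$out), outliers_b = length(b$out))
```

In this sketch the upper whisker of year B is longer and fewer upper points are flagged, although the two upper halves are identical.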

There was an interesting story in this data set.
Y is cube root of dollar values for auto insurance claims, and the
unique year was the Desert Storm year, when lots of young guys were
in Kuwait and Iraq. There were many fewer claims that year and the
claims there were tended to be lower, particularly in the lower half
of the distribution.

Tim Hesterberg

Lutz Prechelt

Nov 27, 1997

ti...@statsci.com said:
> I have an example where the visual impression imparted by the whiskers
> was the exact opposite of what should have been imparted.
> I did a boxplot of Y vs year, for a large data set of fairly skewed
> data, with 10 different years. One year was quite distinct -- the
> upper half of the distribution was about the same as for other years,
> but the lower quartile was lower (as were other percentiles below the
> median). Because of that the interquartile range was wider, and the
> upper whisker extended much further, giving the impression of higher
> claims that year

But then in your example there should be fewer upper outliers
in the case where the upper whisker was 'too high', compared
to the other boxplots, right?

Hence, the moral from your car insurance story seems twofold to me:

1. Whiskers should indeed be drawn as 'muted marks' (as
suggested by Bert Gunter) to avoid over-interpretation.

2. The standard definition of whisker lengths is appropriate only
if outliers are plotted. If outliers are not to be shown,
the whiskers should instead mark some fixed quantile.
[The boxplot() documentation (and maybe even the boxplot function)
of S-Plus could be improved in this respect.]
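
One way to draw whiskers at fixed quantiles, as point 2 suggests, is sketched below. It is written for R rather than S-Plus, the helper name quantile_box_stats is hypothetical, and the 5% / 95% whisker positions are only an assumed choice. R's bxp() expects the five statistics ordered from lower whisker to upper whisker.

```r
# Hypothetical helper: box-plot statistics whose whiskers sit at fixed
# quantiles (5% and 95% by default) instead of 1.5 * IQR fences.
quantile_box_stats <- function(x, probs = c(0.05, 0.25, 0.50, 0.75, 0.95)) {
  x <- x[!is.na(x)]
  stats <- as.numeric(quantile(x, probs = probs))
  list(stats = matrix(stats, ncol = 1),      # lower whisker ... upper whisker
       n     = length(x),
       conf  = matrix(NA_real_, nrow = 2, ncol = 1),  # only used for notches
       out   = numeric(0),                   # no individual points are drawn
       group = numeric(0),
       names = "")
}

s <- quantile_box_stats(1:101)
s$stats

# Draw the box only in an interactive session.
if (interactive()) bxp(s)
```

Because no points are plotted beyond the whiskers, the whisker ends have a direct interpretation as percentiles.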

Lutz

Lutz Prechelt http://wwwipd.ira.uka.de/~prechelt/ | Whenever you
Institut f. Programmstrukturen und Datenorganisation | complicate things,
Universitaet Karlsruhe; D-76128 Karlsruhe; Germany | they get
(Phone: +49/721/608-4068, FAX: +49/721/608-7343) | less simple.
>>> Ever had negative research results? http://wwwipd.ira.uka.de/fnr <<<

Christmann, Andreas

Nov 27, 1997

I like the Hmisc library developed by F. Harrell, and I find his box and
box-percentile plots very interesting.

> I would be interested in other users posting to s-news their opinions
> about box plots and especially which if any modifications they like
> best, i.e., continuous box-percentile plot vs. selected quantiles and
> which quantiles they like to depict. If other statisticians find that
> "outer values" are easier to explain to non-statisticians than I have
> found, I would like to hear that too.

I like a modification of the usual boxplot that tries to avoid
flagging too many observations as outliers.
Some people interpret the usual boxplot as follows: every
observation outside the whiskers is an outlier (with respect to the
normal distribution).
This often results in many observations being called "outliers",
especially for large sample sizes.
But this is often too pessimistic even if the assumption of normality
is valid!

Consider n iid random variables from the standard normal distribution.
The probability that at least one of the n random variables falls
outside the interval
lower quartile - c*MAD to upper quartile + c*MAD
where c>0 is fixed (e.g. 1.5 or 3) depends on the sample size n
(and this is of course true not only for the normal distribution).
If c is fixed and n is sufficiently large, one expects at least one of
the n random variables to fall outside this interval, even under the
iid N(0,1) assumption.
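
A rough numerical illustration in R follows. It is a sketch under simplifying assumptions: it uses the population quartiles of N(0,1) instead of sample values, and the familiar 1.5 * IQR fences instead of the MAD-based interval above. The qualitative point (a fixed constant flags more and more points as n grows) is the same.

```r
# Population quartiles of N(0,1) and the usual 1.5 * IQR fences.
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
upper_fence <- q3 + 1.5 * (q3 - q1)

# Probability that a single N(0,1) observation lies outside the fences
# (both tails, by symmetry).
p_single <- 2 * pnorm(upper_fence, lower.tail = FALSE)

# Probability that at least one of n independent observations lies outside.
p_any <- function(n) 1 - (1 - p_single)^n

n <- c(10, 100, 800, 10000)
data.frame(n = n,
           p_any = p_any(n),
           expected_flagged = n * p_single)
```

Under these assumptions a little under 1% of perfectly normal observations lie outside the fences, so for samples of a few hundred points some "outliers" are almost certain to appear.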
Therefore
  Davies, P.L. and Gather, U. (1993).
  The identification of multiple outliers (with discussion).
  J. Amer. Statist. Assoc. 88, 782-801.
proposed the alpha_n outlier approach which reflects this.
In their approach the constant c depends on n in an appropriate way.
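
To give a feeling for how such a constant grows with n, the sketch below evaluates only the leading normal-quantile term that the ac.oir code in this message uses for n >= 10, namely qnorm((1-alpha/2)^(1/n)). The finite-sample corrections from the code are left out, and the function name z_n is chosen here for illustration.

```r
# Leading term of the critical constant, as it appears in ac.oir for n >= 10.
# The additional finite-sample correction terms are deliberately omitted.
z_n <- function(n, alpha = 0.05) qnorm((1 - alpha / 2)^(1 / n))

n <- c(10, 100, 1000, 10000)
data.frame(n = n, z_n = z_n(n))
```

The value increases slowly with n, so larger samples need more extreme observations before a point is flagged.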

I use the following prototypes of their proposal (based on MEDIAN and
MAD).
I apologize that my code is not well written.
The function ac.oir computes one such boxplot for a single vector of
observations x.
The function ac.oircomp computes such boxplots for a matrix x with two
columns.
The first column, with integers ranging from 1 to at most 10, defines
the groups. The second column contains the observations.
Unfortunately I did not manage to write S-Plus code that allows more
general group membership (not just 1 to k), or a function for
trellis.
Comments are welcome.
Andreas Christmann

#################################
ac.oir <- function (x,alpha=0.05,plot=T)
{
# Andreas Christmann, Dortmund, 03.06.1997
# Program: S-Plus for Windows 3.3, Release 1
# External functions: even(x), odd(x)
# !!! no warranty !!!
# Boxplot with outlier identification rule
# in the sense of
# Davies, P.L. and Gather, U. (1993).
# The identification of multiple outliers.
# J.Amer.Statist.Assoc. 88, 782-801.
# implemented rule: Hampel (MED,MAD) with
# level alpha (0.05 or 0.01) and formulae (4) and (16).
# missing values will be excluded
# Examples for possible calls:
# ac.oir(x)
# ac.oir(x,0.05)
# ac.oir(x,0.01)
# describe(ac.oir(ballon.dat$y,0.05)$out)
# sapply( split(x[,2], x[,1]), ac.oir)
#
# drop missing values (comparing with the string "NA" does not do this)
x <- x[!is.na(x)]
tmptmp <- boxplot(x,plot=F)
# medmad.crit contains the simulated critical constants for
# sample sizes 3(1)9. First column for alpha=0.05, second column for 0.01
medmad.crit <- matrix( c( 21.3061, 109.0697,
6.7567, 16.7149,
7.7893, 18.6976,
5.5897, 10.4136,
5.6985, 10.3709,
4.8029, 7.8734,
5.1160, 8.3660 ), ncol=2, byrow=T)
if (tmptmp$n < 3) stop("at least 3 observations required")
if (alpha != 0.05 & alpha != 0.01) stop("alpha must be 0.05 or 0.01")
n <- tmptmp$n
if (2 < n & n < 10)
{ if (alpha == 0.05) c1 <- medmad.crit[n-2,1]
if (alpha == 0.01) c1 <- medmad.crit[n-2,2] }
if (10 <= n)
{ # alphan <- 1- (1-alpha)**(1/n)
zn <- qnorm( (1-(alpha/2))^(1/n) )
if (even(n) & alpha == 0.05) c1 <- zn + 21.61*( (n+1)^(-0.8655) ) / 1.4826
if (odd(n) & alpha == 0.05) c1 <- zn + 14.43*( (n-3)^(-0.7939) ) / 1.4826
if (even(n) & alpha == 0.01) c1 <- zn + 41.39*( n^(-0.9143) ) / 1.4826
if (odd(n) & alpha == 0.01) c1 <- zn + 24.48*( (n-5)^(-0.8236) ) / 1.4826
}
xupper <- median(x) + c1 * mad(x)
xlower <- median(x) - c1 * mad(x)
tmptmp$stats[[1]] <- max(x [x <= xupper])
tmptmp$stats[[5]] <- min(x [x >= xlower])
tmptmp2 <- sort(x [ x > tmptmp$stats[[1]] | x < tmptmp$stats[[5]] ] )

extremes <- 0
if (length(as.vector(tmptmp2)) > 0 ) extremes <- 1
if (extremes == 1)
{ tmptmp$out <- tmptmp2
tmptmp$group <- rep(1,nrow(as.matrix(tmptmp$out)))
tmptmp <- list(stats=tmptmp$stats,n=tmptmp$n, conf=tmptmp$conf,
names=tmptmp$names, out=tmptmp$out,
group=tmptmp$group) }
if (extremes == 0) tmptmp <- list(stats=tmptmp$stats, n=tmptmp$n,
conf=tmptmp$conf, names=tmptmp$names)

if (plot == T) {
bxp(tmptmp)
if (alpha == 0.05) title(sub="OIR-(MED,MAD), alpha=0.05")
if (alpha == 0.01) title(sub="OIR-(MED,MAD), alpha=0.01") }

if (extremes == 1)
{ tmptmp$out <- tmptmp2
tmptmp$group <- rep(1,nrow(as.matrix(tmptmp$out)))
tmptmp <- list(stats=tmptmp$stats,n=tmptmp$n, conf=tmptmp$conf,
names=tmptmp$names, out=tmptmp$out,
group=tmptmp$group,
alpha=alpha,critconst=c1 ) }
if (extremes == 0) tmptmp <- list(stats=tmptmp$stats, n=tmptmp$n,
conf=tmptmp$conf, names=tmptmp$names,

alpha=alpha,critconst=c1)

return(tmptmp)
}

#############
even <- function(x)
{
cond <- F
if(abs(x - floor(x/2) * 2) < 1e-006)
cond <- T
return(cond)
}

#############
odd <- function(x)
{
cond <- F
if(abs(x - floor(x/2) * 2) >= 1e-006)
cond <- T
return(cond)
}

#################
ac.oir(x,0.05)

###########################################
ac.oircomp <- function(x,groupnames=as.character(c(1:10)),
alpha=0.05,plot=F) {
# Andreas Christmann, Dortmund, 03.06.1997
# Program: S-Plus for Windows 3.3, Release 1
# External functions: even(x), odd(x), ac.oir
# !!! no warranty !!!
# MULTIPLE Boxplot with outlier identification rule
# in the sense of
# Davies, P.L. and Gather, U. (1993).
# The identification of multiple outliers.
# J.Amer.Statist.Assoc. 88, 782-801.
# implemented rule: Hampel (MED,MAD) with
# level alpha (0.05 or 0.01) and formulae (4) and (16).
# missing values will be excluded
# ARGUMENTS:
# x (n,2) matrix, column 1: group identifiers (1 to 10)
# column 2: observations
# groupnames text (labels) for group identifiers
# HAVE TO BE IN CORRECT INCREASING ORDER (1 to 10)
# alpha 0.01 or 0.05 (default)
# plot T or F (default)
# EXAMPLES FOR POSSIBLE CALLS:
# x <- matrix(c( rep(1,200), rep(2,100), rep(3,100),
#                rep(4,100), rep(5,500),
#                rnorm(200), rlogis(100)/1.6, rt(100,3),
#                rstab(100,1.5,0.5), rchisq(500,3)),
#             ncol=2, byrow=F)
# ac.oircomp(x, c("N01", "logistic/1.6", "t(3)", "stable(1.5,0.5)",
#                 "chisq3"))
# title ( main="S-Plus function ac.oircomp")
# ac.oircomp(x, c("N01", "logistic/1.6", "t(3)", "stable(1.5,0.5)",
#                 "chisq3"),plot=T)
# ac.oircomp(x, c("N01", "logistic/1.6", "t(3)", "stable(1.5,0.5)",
#                 "chisq3"),alpha=0.01)
# ac.oircomp(x, c("N01", "logistic/1.6", "t(3)", "stable(1.5,0.5)",
#                 "chisq3"), alpha=0.01, plot=T)
#
x <- x[order(x[,1]),]
alpha <- alpha
plot <- plot
tmpx <- split(x[,2],x[,1])
if (plot == T) {
# par(mfrow=c(3,4))
tmp <- lapply(tmpx, ac.oir, alpha, plot=T) }
if (plot == F) tmp <- lapply(tmpx, ac.oir, alpha, plot=F)
tmpstats <- cbind(tmp$"1"$stats, tmp$"2"$stats, tmp$"3"$stats,
tmp$"4"$stats, tmp$"5"$stats, tmp$"6"$stats,
tmp$"7"$stats, tmp$"8"$stats, tmp$"9"$stats,
tmp$"10"$stats)
tmpn <- c(tmp$"1"$n, tmp$"2"$n, tmp$"3"$n, tmp$"4"$n,
tmp$"5"$n, tmp$"6"$n, tmp$"7"$n, tmp$"8"$n,
tmp$"9"$n, tmp$"10"$n)
tmpconf <- cbind(tmp$"1"$conf, tmp$"2"$conf, tmp$"3"$conf,
tmp$"4"$conf,
tmp$"5"$conf, tmp$"6"$conf, tmp$"7"$conf,
tmp$"8"$conf,
tmp$"9"$conf, tmp$"10"$conf)
tmpgroup <- c( 1 * tmp$"1"$group, 2 * tmp$"2"$group, 3 * tmp$"3"$group,

4 * tmp$"4"$group, 5 * tmp$"5"$group, 6 *
tmp$"6"$group,
7 * tmp$"7"$group, 8 * tmp$"8"$group, 9 *
tmp$"9"$group,
10 * tmp$"10"$group)
tmpgroup <- tmpgroup[!is.na(tmpgroup)]
tmpout <- c(tmp$"1"$out, tmp$"2"$out, tmp$"3"$out, tmp$"4"$out,
tmp$"5"$out, tmp$"6"$out, tmp$"7"$out, tmp$"8"$out,
tmp$"9"$out, tmp$"10"$out)
tmpout <- tmpout[!is.na(tmpout)]
tmpnames <- groupnames[1:length(tmpstats[3,])]
if (length(tmpout) == 0)
tmp <- list(stats = tmpstats, n = tmpn, conf = tmpconf,
names = tmpnames)
if (length(tmpout) > 0)
tmp <- list(stats = tmpstats, n = tmpn, conf = tmpconf,
names = tmpnames, out = tmpout, group =
tmpgroup)
bxp(tmp)
if (alpha == 0.05) title(sub = "OIR-(MED,MAD), alpha=0.05")
if (alpha == 0.01) title(sub = "OIR-(MED,MAD), alpha=0.01")
return(tmp)
}
###########################################
x <- matrix(c( rep(1,200), rep(2,100), rep(3,100),
rep(4,100), rep(5,500),
rnorm(200), rlogis(100)/1.6, rt(100,3),
rstab(100,1.5,0.5), rchisq(500,3)),
ncol=2, byrow=F)
ac.oircomp(x, c("N01", "logistic/1.6", "t(3)", "stable(1.5,0.5)",
"chisq3"),plot=F)
title ( main="S-Plus function ac.oircomp")
############################################
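
The restriction to groups numbered 1 to 10 could perhaps be lifted along the following lines. This is only a sketch, written for R rather than S-Plus, with hypothetical helper names (medmad_box, medmad_boxes). It uses a fixed placeholder constant crit = 3 in place of the sample-size-dependent constants of ac.oir, and it follows R's bxp() convention of ordering the statistics from lower whisker to upper whisker, whereas the S-Plus code above stores the upper whisker first.

```r
# Box statistics for one group, using a median +/- crit * MAD rule.
# crit = 3 is a placeholder, not the Davies-Gather constant.
medmad_box <- function(x, crit = 3) {
  x <- x[!is.na(x)]
  med <- median(x)
  s <- mad(x)
  inside <- x >= med - crit * s & x <= med + crit * s
  q <- as.numeric(quantile(x, c(0.25, 0.75)))
  list(stats = c(min(x[inside]), q[1], med, q[2], max(x[inside])),
       n = length(x),
       out = x[!inside])
}

# Combine the groups into one list in the layout bxp() expects.
# g may hold arbitrary labels (numbers, strings, or a factor).
medmad_boxes <- function(y, g, crit = 3) {
  parts <- lapply(split(y, g), medmad_box, crit = crit)
  n_out <- sapply(parts, function(p) length(p$out))
  list(stats = sapply(parts, function(p) p$stats),   # 5 x k matrix
       n     = sapply(parts, function(p) p$n),
       conf  = matrix(NA_real_, 2, length(parts)),   # only used for notches
       out   = unlist(lapply(parts, function(p) p$out), use.names = FALSE),
       group = rep(seq_along(parts), n_out),         # box index of each point
       names = names(parts))
}

y <- c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14)
g <- rep(c("b", "a"), each = 5)
res <- medmad_boxes(y, g)
if (interactive()) bxp(res)
```

Here split() takes care of the grouping, so the number of groups and their labels no longer need to be hard-coded.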

--
----------------------------------------------
A.Chri...@hrz.uni-dortmund.de

Dr. Andreas Christmann ///////
Universitaet Dortmund U N I D O ///
Hochschulrechenzentrum ______///////
Wissenschaftl. Anwendungen \_\_\_\/////
D-44221 Dortmund \_\_\_\///
Tel. 049-231-755-2763 \_\_\_\/
Fax. -2731

Tim Hesterberg

Nov 28, 1997

>ti...@statsci.com said:
>> I have an example where the visual impression imparted by the whiskers
>> was the exact opposite of what should have been imparted.
>> I did a boxplot of Y vs year, for a large data set of fairly skewed
>> data, with 10 different years. One year was quite distinct -- the
>> upper half of the distribution was about the same as for other years,
>> but the lower quartile was lower (as were other percentiles below the
>> median). Because of that the interquartile range was wider, and the
>> upper whisker extended much further, giving the impression of higher
>> claims that year

Lutz Prechelt wrote:
>But then in your example there should be fewer upper outliers
>in the case where the upper whisker was 'too high', compared
>to the other boxplots, right?

That is correct. However, with large samples there are so many
"outliers" that they overplot, and give a weaker visual impression
than do the whiskers. As Andreas Christmann pointed out, the default
whiskers flag too many observations as outliers.

Tim Hesterberg

Doug Moog

Dec 3, 1997

I very much appreciate the recent contributions and discussion on this
subject. Whisker/outlier definition can indeed be problematic. When I
plotted some "large" data sets (800 points), the number of default outliers
was so great that they tended to catch the eye, and reviewers focused more
on the "outliers" than the rest of the distributions. My solution, not
too satisfying, was to reduce the size of the data sets by redefining
them. Redefining the whisker ranges would probably work better.

I am generally uncomfortable with the whiskers, for reasons well
demonstrated by Tim Hesterberg's example. The problem is that they often
don't reflect what, as I understand it, they're supposed to: the
"limits" of the data, with the "outliers" regarded as aberrant points which
aren't really part of the distribution (in which case might it not
sometimes be preferable not to plot them at all?). But with larger data sets,
they seem more often to simply reflect the interquartile range (i.e., 1.5
times this range), while giving a false impression that they represent
some kind of percentile. For this reason, replacing them by an actual
percentile may be preferable, which I gather is a step in the direction of
the box-percentile plot. I am grateful to Frank Harrell for making this
available.
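
To see which percentile the default upper whisker limit corresponds to, the following R sketch evaluates the 1.5 * IQR fence at the population level for two distributions. The helper name fence_percentile is made up for this illustration, and real samples will of course deviate from these population values.

```r
# Percentile of the population located at the upper fence
# (upper quartile + coef * IQR), given quantile and distribution functions.
fence_percentile <- function(qfun, pfun, coef = 1.5) {
  q1 <- qfun(0.25)
  q3 <- qfun(0.75)
  pfun(q3 + coef * (q3 - q1))
}

p_norm <- fence_percentile(qnorm, pnorm)   # symmetric, light tails
p_exp  <- fence_percentile(qexp,  pexp)    # right-skewed

c(normal = p_norm, exponential = p_exp)
```

The two values differ noticeably, so the same whisker rule marks different percentiles depending on the shape of the distribution, which is what makes a fixed percentile easier to explain.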

_________________________________________________________________________
Douglas B. Moog, Ph.D. phone: 216-368-1688
Research Associate fax: 216-368-3691
Dept. of Geological Sciences internet: db...@po.cwru.edu
Case Western Reserve University
Cleveland, OH 44106-7216
USA