What does length of whiskers in box plot represent (geom_boxplot)?

Adi

Dec 27, 2011, 5:13:44 AM
to ggp...@googlegroups.com
Dear all,

The box of a box plot is plotted as follows:
  • the lower bound is Q1 (or the 25th percentile)
  • the upper bound is Q3 (or the 75th percentile)
  • the median is indicated within the box.

For the whiskers, different variants are in use (http://en.wikipedia.org/wiki/Box_plot).
I am wondering whether ggplot2 does it in the following way:
  • upper whisker = highest observed value within [Q3, 1.5*Q3]
  • lower whisker = lowest observed value within [0.5*Q1, Q1]

Thanks for your help!

Adi

Adi

Dec 27, 2011, 5:54:43 AM
to ggp...@googlegroups.com
Is it in general correct to assume that ggplot2 follows the same logic as base graphics?

Adi

Adi

Dec 27, 2011, 6:33:05 AM
to ggp...@googlegroups.com

Winston Chang

Dec 27, 2011, 11:52:21 AM
to ggp...@googlegroups.com
Hi Adi -

ggplot2 uses the method described on the Wikipedia page. The base graphics version of boxplot sometimes does something slightly different, depending on the sample size. See here:


The formulas for the whiskers are slightly different from what you listed in your first message:

IQR = Q3-Q1
upper whisker = max (Q3, Q3+1.5*IQR)
lower whisker = min (Q1, Q1-1.5*IQR)

-Winston


On Tue, Dec 27, 2011 at 5:33 AM, Adi <bhagwa...@gmail.com> wrote:

--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: http://gist.github.com/270442
 
To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

Winston Chang

Dec 27, 2011, 11:57:13 AM
to ggp...@googlegroups.com

> IQR = Q3-Q1
> upper whisker = max (Q3, Q3+1.5*IQR)
> lower whisker = min (Q1, Q1-1.5*IQR)

Oops, this is slightly wrong. Wikipedia has a good description of the whiskers:
"the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile"

Dennis Murphy

Dec 27, 2011, 1:12:43 PM
to Winston Chang, ggp...@googlegroups.com
In other words...

upper whisker = min(max(x), Q3 + 1.5 * IQR)
lower whisker = max(min(x), Q1 - 1.5 * IQR)

Dennis

Adi

Dec 27, 2011, 3:54:28 PM
to ggp...@googlegroups.com, Winston Chang
Thanks guys!

Adi

Winston Chang

Dec 27, 2011, 5:03:18 PM
to Dennis Murphy, ggp...@googlegroups.com
On Tue, Dec 27, 2011 at 12:12 PM, Dennis Murphy <djm...@gmail.com> wrote:
> In other words...
>
> upper whisker = min(max(x), Q3 + 1.5 * IQR)
> lower whisker = max(min(x), Q1 - 1.5 * IQR)


Still not quite right... 

The upper whisker goes to the largest x that's less than or equal to Q3 + 1.5 * IQR. It won't actually go to Q3 + 1.5 * IQR unless there's a data point right there, even if there are outliers. Similarly, the lower whisker goes to the smallest x that's greater than or equal to Q1 - 1.5 * IQR.
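
To make the rule concrete, here is a minimal Python sketch (not ggplot2's own code). The function names `fences` and `whisker_ends` and the sample data are made up for illustration, and the quartiles are assumed to use linear interpolation between order statistics, which should correspond to the default of R's quantile(). It contrasts the "largest datum inside the fence" rule with the min(max(x), fence) formula quoted above.

```python
from statistics import quantiles


def fences(values, coef=1.5):
    """Return (lower_fence, upper_fence) = (Q1 - coef*IQR, Q3 + coef*IQR)."""
    # method="inclusive" interpolates linearly between order statistics
    # (an assumption about the quartile convention, not taken from the thread).
    q1, _, q3 = quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - coef * iqr, q3 + coef * iqr


def whisker_ends(values, coef=1.5):
    """Return (lower, upper): the most extreme data points within the fences."""
    lower_fence, upper_fence = fences(values, coef)
    inside = sorted(v for v in values if lower_fence <= v <= upper_fence)
    return inside[0], inside[-1]


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical data with one outlier

lower_fence, upper_fence = fences(x)
print(lower_fence, upper_fence)        # -3.5 14.5
print(whisker_ends(x))                 # (1, 9)

# The min(max(x), fence) formula would put the whisker at the fence itself,
# where there is no data point:
print(min(max(x), upper_fence))        # 14.5
```

With these numbers the upper whisker stops at 9, the largest observation below the fence at 14.5, and 30 would be drawn as a separate point.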

-Winston

Dennis Murphy

Dec 27, 2011, 6:44:17 PM
to Winston Chang, ggp...@googlegroups.com
You're right. Let's go to the source (Tukey, 1977, EDA, pp. 44-47):

Definitions:

What we now call the interquartile range (IQR) was called the
'H-spread' by Tukey.
A 'step' = 1.5 * IQR
The inner fences are one step above Q3 and one step below Q1.
The outer fences are two steps above Q3 and two steps below Q1.
An adjacent value is a value at each end closest to, but still inside,
an inner fence.
Values between an inner and outer fence are called 'outside values'.
Values beyond outer fences are said to be 'far out'.
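
The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tukey_classify` and the data are hypothetical, and the quartiles are computed by linear interpolation rather than with Tukey's hinges, so results can differ slightly from the book for some sample sizes.

```python
from statistics import quantiles


def tukey_classify(values):
    """Split data into adjacent, outside and far out values (Tukey's terms)."""
    data = sorted(values)
    # Assumption: interpolated quartiles stand in for Tukey's hinges.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)                      # one 'step' = 1.5 * H-spread
    inner_lo, inner_hi = q1 - step, q3 + step   # inner fences
    outer_lo, outer_hi = q1 - 2 * step, q3 + 2 * step  # outer fences

    inside = [v for v in data if inner_lo <= v <= inner_hi]
    return {
        # Adjacent values: the data closest to, but still inside, each inner fence.
        "adjacent": (inside[0], inside[-1]),
        # Outside values: between an inner and an outer fence.
        "outside": [v for v in data
                    if outer_lo <= v < inner_lo or inner_hi < v <= outer_hi],
        # Far out values: beyond an outer fence.
        "far_out": [v for v in data if v < outer_lo or v > outer_hi],
    }


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 40]   # hypothetical data
print(tukey_classify(x))
# {'adjacent': (1, 9), 'outside': [18], 'far_out': [40]}
```

Here Q1 = 3.5 and Q3 = 8.5, so a step is 7.5, the inner fences are at -4 and 16, and the outer fences are at -11.5 and 23.5.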

Today the convention is to call any value outside the inner fences
an outlier, but Tukey deliberately avoided that terminology.

On p. 47, Tukey suggested the following set of rules for a boxplot:

* outside and far out values should appear separately.
* whiskers should end at the adjacent values.
* 'far out values should be marked impressively and identified in
capital letters'.
* outside and adjacent values should all be identified in small letters.

The convention in R is not to label the outside and far out values,
but to draw them as individual points; the adjacent values are marked
by the whisker ends. The user has the option of identifying them with
text annotation.

Dennis

Dennis Murphy

Dec 28, 2011, 5:28:21 PM
to Winston Chang, ggp...@googlegroups.com, Hadley Wickham
In the context of this discussion yesterday, take a look at
stat-boxplot.r and look at the return value. Do we need to amend this?

Dennis

Winston Chang

Dec 28, 2011, 5:37:11 PM
to Dennis Murphy, ggp...@googlegroups.com, Hadley Wickham
Hi Dennis -

I actually made some changes to the docs yesterday because of that discussion. You can see the changes here:
https://github.com/wch/ggplot2/commit/ff3ef9df8dcce699bf4c64c527e8aa5c6f336dbf

If you see something that still doesn't look right, please let me know. You can see all of stat-boxplot.r here:

-Winston