Notched boxplots

Winston Chang

Dec 7, 2011, 11:49:28 AM
to ggplo...@googlegroups.com
Hi all -

I just made an implementation of notched box plots. There are two new parameters for geom_boxplot:
- notch = TRUE/FALSE
- notchwidth = the relative width of the notch to the overall width (default .5)

set.seed(110)
dat <- data.frame(x=LETTERS[1:3], y=round(rnorm(90),2))

# Notched box plot
ggplot(dat, aes(x=x, y=y)) + geom_boxplot(notch=TRUE)
boxplot-1.png

# Can set notch width
ggplot(dat, aes(x=x, y=y)) + geom_boxplot(notch=TRUE, notchwidth=.2)
boxplot-2.png

# Manually-specified positions
ggplot(NULL, aes(x = 1, y=NULL, ymin = 0, lower = 25, middle = 50,
                 upper = 75, ymax = 100, notchupper=60, notchlower=40)) +
  geom_boxplot(stat="identity", notch=TRUE) 
boxplot-manual.png

# Gives a warning if notches go past hinges
ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot(notch=TRUE) + ylim(10,35)
# Warning messages:
# 1: In get(x, envir = this, inherits = inh)(this, ...) :
#   notch went outside hinges. Try setting notch=FALSE.
# 2: In get(x, envir = this, inherits = inh)(this, ...) :
#   notch went outside hinges. Try setting notch=FALSE.
boxplot-3.png

# Regular boxplot function gives similar warning
boxplot(mpg ~ cyl, mtcars, notch=TRUE)
# Warning message:
# In bxp(list(stats = c(21.4, 22.8, 26, 30.4, 33.9, 17.8, 18.65, 19.7,  :
#   some notches went outside hinges ('box'): maybe set notch=FALSE
boxplot-base.png
(You may have noticed that the outliers are different between geom_boxplot and the base boxplot. This is a known issue with geom_boxplot.)
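
For anyone wondering which groups trigger the warning, here is a rough check in base R. It relies on boxplot.stats(), whose conf element holds the notch limits and whose stats element holds the hinges. The variable names are only for illustration, and geom_boxplot may compute its quartiles slightly differently from base R's hinges, so borderline groups could differ.

```r
# For each cyl group, compare the notch limits with the hinges.
stats_by_cyl <- lapply(split(mtcars$mpg, mtcars$cyl), boxplot.stats)

notch_outside <- sapply(stats_by_cyl, function(s) {
  lower_hinge <- s$stats[2]
  upper_hinge <- s$stats[4]
  # conf[1] and conf[2] are the lower and upper notch limits
  s$conf[1] < lower_hinge || s$conf[2] > upper_hinge
})

print(notch_outside)
```

Groups flagged TRUE are small relative to their spread, so the notch is wider than the box itself.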


It also supports weighted boxplots. Normally the notch limits are:
  median +- 1.58*IQR/sqrt(n)
where n is the number of observations. Weighted boxplots require a change to the calculation of n, so it uses the sum of the weights over all observations where neither the weight nor the y value is NA. For example, three unweighted observations at a given y value are the same as one observation with weight=3, in terms of calculating notch size.
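
A minimal sketch of that calculation, using the usual notch half-width of 1.58 * IQR / sqrt(n). The helper notch_limits() is hypothetical and not the code in the patch; it assumes non-negative integer weights so that the quartiles can be computed by simply repeating each observation.

```r
# Hypothetical helper: notch limits with optional integer weights.
notch_limits <- function(y, weight = rep(1, length(y))) {
  ok <- !is.na(y) & !is.na(weight)   # drop rows where y or weight is NA
  y <- y[ok]
  weight <- weight[ok]
  n <- sum(weight)                   # effective number of observations
  expanded <- rep(y, times = weight) # assumes non-negative integer weights
  q <- quantile(expanded, c(0.25, 0.5, 0.75), names = FALSE)
  half <- 1.58 * (q[3] - q[1]) / sqrt(n)
  c(notchlower = q[2] - half, notchupper = q[2] + half)
}

# Three unweighted observations at y = 2 ...
unweighted <- notch_limits(c(1, 2, 2, 2, 5, 7, 9))
# ... give the same notch as one observation at y = 2 with weight 3.
weighted <- notch_limits(c(1, 2, 5, 7, 9), weight = c(1, 3, 1, 1, 1))
```

Non-integer weights would need a proper weighted quantile, which this sketch does not attempt.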

It required some changes to geom_crossbar; instead of drawing the body with a GeomRect, it now uses GeomPolygon.



As usual, comments and suggestions are welcome!
-Winston

Hadley Wickham

Dec 7, 2011, 11:56:49 AM
to Winston Chang, ggplo...@googlegroups.com
> It required some changes to geom_crossbar; instead of drawing the body with a GeomRect, it now uses GeomPolygon.
> -Winston

Looks good.  Just made a couple of little comments.

Have you tested it in non-Cartesian coordinates?

Hadley


--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Winston Chang

Dec 7, 2011, 12:42:54 PM
to ggplo...@googlegroups.com
> Have you tested it in non-Cartesian coordinates?


(Oops, forgot to reply-all before...)


I just tried testing the unmodified geom_boxplot code (the latest version from master), and I get weird results. It's hard for me to say what should happen with notch and log together, given that the results are strange even without notch.

These examples are made with the version from the master branch:

ggplot(dat2, aes(x=x, y=y)) + geom_boxplot()
boxplot-linear.png

ggplot(dat2, aes(x=x, y=y)) + geom_boxplot() + scale_y_log10()
boxplot-log10.png
It looks like the whiskers are undergoing some sort of transformation, but the boxes aren't. 

-Winston

Hadley Wickham

Dec 7, 2011, 12:47:55 PM
to Winston Chang, ggplo...@googlegroups.com
Now I'm starting to remember the hideous complexity of boxplots. I think those bits are currently special-cased for transformation, or maybe it applies to any aesthetic that starts with x or y? I think whiskermin probably needs to be renamed to xwiskermin for it to work.

Winston Chang

Dec 7, 2011, 1:19:14 PM
to Hadley Wickham, ggplo...@googlegroups.com
> Now I'm starting to remember the hideous complexity of boxplots. I think those bits are currently special-cased for transformation, or maybe it applies to any aesthetic that starts with x or y? I think whiskermin probably needs to be renamed to xwiskermin for it to work.


Does "working" mean the entire boxplot is unaffected by the log scaling? That is, if the box goes from [1,10] in regular coordinates, should it also go from [1,10] instead of [0,1] when scale_y_log10 is used?


Also, I've made the minor changes based on your code comments.

-Winston

Hadley Wickham

Dec 7, 2011, 1:22:33 PM
to Winston Chang, ggplo...@googlegroups.com
> Does "working" mean the entire boxplot is unaffected by the log scaling?
> That is, if the box goes from [1,10] in regular coordinates, should it also
> go from [1,10] instead of [0,1] when scale_log10 is used?

Working means that qplot(x, y) + scale_y_log10() is identical to
qplot(x, log10(y)), modulo the axes.

But thinking about it more, the transformation happens before
stat_boxplot sees the data, so you shouldn't have to do anything.

Winston Chang

Dec 7, 2011, 2:51:27 PM
to Hadley Wickham, ggplo...@googlegroups.com
> Working means that qplot(x, y) + scale_y_log10() is identical to
> qplot(x, log10(y)), modulo the axes.
>
> But thinking about it more, the transformation happens before
> stat_boxplot sees the data, so you shouldn't have to do anything.


In that case, I think everything works. These all look the same (except that the notch versions have notches, of course):


set.seed(110)
dat <- data.frame(x=LETTERS[1:3], y=round(rnorm(90),2))
dat2 <- transform(dat,y=y+3)  # Make sure all positive before log transform

# No notch, scale_y_log10
ggplot(dat2, aes(x=x, y=y)) + scale_y_log10() +
    geom_boxplot() + geom_point(shape=21, colour="red") 

# No notch, log(y)
ggplot(dat2, aes(x=x, y=log10(y))) +
    geom_boxplot() + geom_point(shape=21, colour="red")


# Notch, scale_y_log10
ggplot(dat2, aes(x=x, y=y)) + scale_y_log10() +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red") 

# Notch, log(y)
ggplot(dat2, aes(x=x, y=log10(y))) +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red")

-Winston

Hadley Wickham

Dec 7, 2011, 2:55:00 PM
to Winston Chang, ggplo...@googlegroups.com
On Wed, Dec 7, 2011 at 2:51 PM, Winston Chang <winsto...@gmail.com> wrote:
>> Working means that qplot(x, y) + scale_y_log10() is identical to
>> qplot(x, log10(y)), modulo the axes.
>>
>> But thinking about it more, the transformation happens before
>> stat_boxplot sees the data, so you shouldn't have to do anything.
>>
>
> In that case, I think everything works. These all look the same (except that
> the notch versions have notches, of course):

Ok, great. What about with coord_polar or coord_transform? Those
will test that you've constructed the final grob appropriately.

Would you also mind taking a look at
https://github.com/hadley/ggplot2/issues/108 ? It might be the same
problem as https://github.com/hadley/ggplot2/issues/281

Winston Chang

Dec 7, 2011, 4:14:56 PM
to Hadley Wickham, ggplo...@googlegroups.com
On Wed, Dec 7, 2011 at 1:55 PM, Hadley Wickham <had...@rice.edu> wrote:
On Wed, Dec 7, 2011 at 2:51 PM, Winston Chang <winsto...@gmail.com> wrote:
>> Working means that qplot(x, y) + scale_y_log10() is identical to
>> qplot(x, log10(y)), modulo the axes.
>>
>> But thinking about it more, the transformation happens before
>> stat_boxplot sees the data, so you shouldn't have to do anything.
>>
>
> In that case, I think everything works. These all look the same (except that
> the notch versions have notches, of course):

Ok, great.  What about with coord_polar or coord_transform?  Those
will test that you've constructed the final grob appropriately.


Thanks for the idea for the test. It turns out that with coord_polar, GeomPolygon was closing the box with a straight line rather than a curved one. To fix it, I made sure to explicitly close the polygon by setting the last point to the same coordinates as the first point.
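
To make the fix concrete, here is a rough sketch of how the outline of a notched box could be built as an explicitly closed polygon. The function notched_box_polygon() and its arguments are hypothetical names for illustration, not the actual geom_crossbar code; the y values are taken from the manually-specified example earlier in the thread, and the x range is made up.

```r
# Hypothetical helper: vertices of a notched box, explicitly closed.
notched_box_polygon <- function(xmin, xmax, lower, middle, upper,
                                notchlower, notchupper, notchwidth = 0.5) {
  # How far the notch cuts in from each side of the box
  indent <- (1 - notchwidth) * (xmax - xmin) / 2
  verts <- data.frame(
    x = c(xmin, xmin, xmin + indent, xmin, xmin,
          xmax, xmax, xmax - indent, xmax, xmax),
    y = c(lower, notchlower, middle, notchupper, upper,
          upper, notchupper, middle, notchlower, lower)
  )
  # Repeat the first vertex so the closing edge is an explicit segment
  rbind(verts, verts[1, ])
}

box <- notched_box_polygon(0.625, 1.375, lower = 25, middle = 50, upper = 75,
                           notchlower = 40, notchupper = 60)
```

With the first vertex repeated, the bottom edge is an ordinary segment between two listed points, so a non-Cartesian coord can presumably bend it like every other edge instead of leaving it as an implicit straight closing line.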


ggplot(dat2, aes(x=x, y=y)) + scale_y_log10() +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red") 
boxplot-scale_y_log10.png

ggplot(dat2, aes(x=x, y=log10(y))) +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red")
boxplot-log10_y.png

ggplot(dat2, aes(x=x, y=y)) + coord_trans(y="log10") +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red") 
boxplot-coord_trans_log10.png

ggplot(dat2, aes(x=x, y=y)) + coord_polar() +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red") 
boxplot-coord_polar.png

One odd thing about the coord_polar version is that the A and C boxes are right up against each other. This is because the AC angle is smaller than the AB and BC angles. I think this is a general issue with using x-axis factors in polar coordinates. I'll file a bug for that...

-Winston


Winston Chang

Dec 7, 2011, 4:38:21 PM
to Hadley Wickham, ggplo...@googlegroups.com

ggplot(dat2, aes(x=x, y=log10(y))) +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red")
boxplot-log10_y.png

ggplot(dat2, aes(x=x, y=y)) + coord_trans(y="log10") +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red") 
boxplot-coord_trans_log10.png


Just noticed that the lower whisker for B is different between the coord_trans(y="log10") version and the other log10 graphs. Again, this looks like an issue on the master branch. Here comes another bug report...




Winston Chang

Dec 7, 2011, 4:55:34 PM
to Hadley Wickham, ggplo...@googlegroups.com
On Wed, Dec 7, 2011 at 3:38 PM, Winston Chang <winsto...@gmail.com> wrote:

ggplot(dat2, aes(x=x, y=log10(y))) +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red")


ggplot(dat2, aes(x=x, y=y)) + coord_trans(y="log10") +
    geom_boxplot(notch=TRUE) + geom_point(shape=21, colour="red") 


Just noticed that the lower whisker for B is different between the coord_trans(y="log10") version and the other log10 graphs. Again, this looks like an issue on the master branch. Here comes another bug report...


I just realized that this isn't a bug; it's doing things correctly. :-/
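
A small base R sketch of why the two can legitimately differ: a scale transformation is applied before the boxplot statistics are computed, whereas a coord transformation is applied afterwards, so the 1.5*IQR whisker rule is evaluated on different scales. The data here are made up for illustration (they are not dat2), and boxplot.stats() stands in for stat_boxplot.

```r
# Made-up data with one small value.
y <- c(0.1, 5, 6, 7, 8, 9, 10, 11, 12)

# scale_y_log10()-style: transform first, then compute the statistics
log_then_stats <- boxplot.stats(log10(y))$stats

# coord_trans(y = "log10")-style: compute the statistics, then transform
stats_then_log <- log10(boxplot.stats(y)$stats)

print(rbind(log_then_stats, stats_then_log))
```

In this example the hinges and median agree, but the lower whisker does not: 0.1 lies within 1.5*IQR of the lower hinge on the raw scale, while on the log scale it is an outlier and the whisker stops at log10(5).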