geom_text Labels not agreeing with geom_boxplot Hinges


Ryan Pugh

Apr 2, 2015, 11:19:05 AM
to ggp...@googlegroups.com
In trying to place labels on the median lines of my boxplots, I've found that the y-axis positions of the geom_text labels don't agree with the medians drawn by geom_boxplot. Mathematically, the geom_text y values are correct (I've compared the median values calculated in R with Excel), so it appears that the median lines aren't drawn correctly by ggplot2.

In this code, the median should be labelled exactly on top of the median hinge.

library(ggplot2)

# Generate random dataframe of values and factors
df <- data.frame(value = rnorm(5000, mean = 5, sd = 0.5),
                 factor = factor(sample(1:5, 5000, replace = TRUE), levels = 1:5))

# Calculate the value medians by factor
df.m <- aggregate(value ~ factor, data = df, FUN = median)

# Plot the dataframe value by boxplot and overlay df.m as a label directly
ggplot(data = df, aes(x = factor, y = value)) +
  geom_boxplot() +
  geom_text(data = df.m, aes(x = factor, y = value, label = value)) +
  scale_y_continuous(limits = c(4.5, 5.5))

I'm working with a much larger dataset and this effect is much more pronounced at larger scales; to the point where I can't use the graphs due to the glaring mislabeling.

Has anyone else come across this problem?

eipi10

Apr 2, 2015, 12:21:06 PM
to ggp...@googlegroups.com
This happens because scale_y_continuous(limits = ...) discards any data outside the limits (ggplot should have warned you about removed rows when you ran the code). geom_boxplot therefore computes its statistics on a truncated data set and gets different medians from those in df.m, which is computed from the full data set. Remove the scale_y_continuous line and replace it with coord_cartesian(ylim = c(4.5, 5.5)). coord_cartesian zooms the view without excluding data outside the ylim range.
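
To see why dropping out-of-range values moves the median, here is a minimal base-R sketch. The toy vector is made up for illustration (it is not from the original data), and the ifelse() line only mimics what the scale limits do; it is not ggplot2's actual implementation.

```r
# Hypothetical toy values, chosen so the shift is easy to check by hand
x <- c(4.2, 4.8, 5.0, 5.1, 5.3, 5.4, 5.9, 6.3)
limits <- c(4.5, 5.5)

# Median of the full data, as aggregate() computes it for the labels
full_median <- median(x)

# Mimic the scale limits: values outside the range become NA and are dropped
censored <- ifelse(x >= limits[1] & x <= limits[2], x, NA)
truncated_median <- median(censored, na.rm = TRUE)

full_median       # 5.2, the mean of the two middle values 5.1 and 5.3
truncated_median  # 5.1, the middle of the five values that survive
```

The label would sit at 5.2 while the box drawn from the truncated values would put its median line at 5.1, which is the same kind of mismatch as in the plot.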

Dennis Murphy

Apr 2, 2015, 1:13:30 PM
to Ryan Pugh, ggplot2
This falls into the category of "scale transformations in ggplot2 take
place before statistical transformations, which in turn take place
before coordinate transformations". By using scale_y_continuous() in
your call, you have excluded all of the observations outside of (4.5,
5.5) **before computing a boxplot**, which is a statistical
transformation of the data. This is why you get the observed results.
As the other poster correctly diagnosed, if you change the range of
the y-scale *after* the boxplots have been constructed, then the
perceived anomaly disappears.
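
One way to check this ordering for yourself is to pull the computed statistics out of each plot. The following is a rough sketch, assuming a ggplot2 version that provides layer_data(); the seed and the object names (p_scale, p_coord, and so on) are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the run is repeatable
df <- data.frame(value = rnorm(5000, mean = 5, sd = 0.5),
                 f = factor(sample(1:5, 5000, replace = TRUE), levels = 1:5))

base <- ggplot(df, aes(x = f, y = value)) + geom_boxplot()

# Limits on the scale: out-of-range rows are dropped before the boxplot stat runs
p_scale <- base + scale_y_continuous(limits = c(4.5, 5.5))
# Limits on the coordinate system: the stat sees all rows, only the view is zoomed
p_coord <- base + coord_cartesian(ylim = c(4.5, 5.5))

# Medians computed directly from the full data, one per level of f
true_medians <- tapply(df$value, df$f, median)

# layer_data() returns the data computed for a layer; `middle` is the median line
scale_medians <- suppressWarnings(layer_data(p_scale, 1)$middle)
coord_medians <- layer_data(p_coord, 1)$middle

data.frame(f = levels(df$f),
           true = as.numeric(true_medians),
           scale_limits = scale_medians,
           coord_limits = coord_medians)
```

The coord_limits column should match the true medians, while the scale_limits column reflects the truncated data and will generally differ.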

For the sake of readability, I'd suggest placing the medians slightly
above the median line and rounding them to something sensible - I doubt
that you need 8-10 digit accuracy on a graph. I'm also concerned about
zooming in on the boxplots when it is often the case that the most
interesting observations are in the extremes of the distribution, but
you may have reasonable grounds for doing that.

library(ggplot2)

# Generate random dataframe of values and factors
df <- data.frame(value = rnorm(5000, mean = 5, sd = 0.5),
                 f = factor(sample(1:5, 5000, replace = TRUE),
                            levels = 1:5))
# Round each group's median to 4 digits
df.m <- aggregate(value ~ f, data = df,
                  FUN = function(x) round(median(x), 4))

# Plot the dataframe value by boxplot and overlay df.m as a label directly
# Insert label 0.05 units above the median line
ggplot(data = df, aes(x = f, y = value)) + geom_boxplot() +
  geom_text(data = df.m, aes(y = value + 0.05, label = value)) +
  coord_cartesian(ylim = c(4.5, 5.5))


Dennis

Ryan Pugh

Apr 2, 2015, 1:25:38 PM
to ggp...@googlegroups.com
As I began to dig into this, I started to feel I was missing a key concept in ggplot, and sure enough that's the case! Thanks so much for your insight, it's really appreciated.

As for Dennis' concern, you're right: censoring data can be very misleading, but the discussion being furthered by these graphs doesn't suffer by excluding outlier data. Also, the reason I was plotting directly on top of the median hinges was to enable a clearer illustration of the issue I was facing.

Once again, thanks so much. I've been struggling with this for a few days and was beginning to question a lot of work I had done in the past!