Messages when stat="bin" is used with y mapping

Winston Chang

Sep 25, 2012, 1:06:46 PM
to ggplot2-dev
As many of you are no doubt aware, the default behavior of geom_bar is to use stat="bin", and this can be confusing for new users. 

Normally, there are two ways to use geom_bar:
- Map a variable to y, and use stat="identity"
- Don't map a variable to y, and use stat="bin"


If you map a variable to y and use stat="bin", ggplot2 will (correctly) give an error. However, when there is exactly one y value per bin, it behaves exactly as if you had used stat="identity". This makes things a lot more confusing.


# Data to be used with stat="identity"
dat_val <- data.frame(x=c("a","b"), y=c(3,2))
# OK
ggplot(dat_val, aes(x, y)) + geom_bar(stat="identity")
# No error, and looks the same as stat="identity", but it shouldn't
ggplot(dat_val, aes(x, y)) + geom_bar(stat="bin")


# Data to be used with stat="bin"
dat_bin <- data.frame(x=c("a","a","a", "b","b"))
# Error, as expected
ggplot(dat_bin, aes(x)) + geom_bar(stat="identity")
# OK
ggplot(dat_bin, aes(x)) + geom_bar(stat="bin")
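
As a rough sketch of what stat="bin" does for the data above, the counts can also be tabulated by hand and then drawn with stat="identity". The `counts` data frame and its `count` column are names chosen here for illustration; they are not part of the original example.

```r
# Same data as in the stat="bin" example
dat_bin <- data.frame(x = c("a", "a", "a", "b", "b"))

# Tabulate the number of cases per level of x.
# The result is a data frame with one row per level and columns 'x' and 'count'.
counts <- as.data.frame(table(x = dat_bin$x), responseName = "count")

# Plot the precomputed counts as values in the data
library(ggplot2)
ggplot(counts, aes(x, count)) + geom_bar(stat = "identity")
```

Here y is an explicit column in the data, so mapping it and using stat="identity" is the unambiguous combination.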


I've modified stat_bin to print out a message when it is used with a variable mapped to y. For backward compatibility, it will continue to give the same graphical output, for at least a few more versions of ggplot2.

This is what it says now in that case:
> ggplot(dat, aes(x, val)) + geom_bar(stat="bin")
Mapping a variable to y and also using stat="bin".
  With stat="bin", it will attempt to set the y value to the count of cases in each group.
  This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
  If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
  If you want y to represent values in the data, use stat="identity".
  See ?geom_bar for examples.

I've also added some text and examples to the geom_bar help to try to clarify the issue:


What do you think? Is this clear enough? I have a feeling this change will hit a lot of ggplot2 users, so I want to make it as clear as possible what's happening, and why.

-Winston

Bryan Hanson

Nov 19, 2012, 8:01:53 PM
to ggplo...@googlegroups.com
I get the message twice when running this (partial) code. It looks like a slightly different case than the one anticipated. Here's the aesthetic mapping:

    if (identical(fac2, NULL)) {
        p <- ggplot(data, aes_string(x = fac1, y = res, color = fac1))
    } else {
        p <- ggplot(data, aes_string(x = fac1, y = res, color = fac1)) +
            facet_grid(paste(". ~", fac2, sep = ""))
    }

Then, I think this is the piece that generates the message (once for each facet in my example, I think):

    p <- p + geom_text(aes_string(x = fac1,
        y = paste("min(",res,") - 0.1 * diff(range(",res,"))", sep=""),
        label = 'paste("n = ", ..count.. , sep = "")'),
        color = "black", size = 4.0, stat = "bin", data = data)

There are no other uses of stat = "bin", but I have a number of calls to stat_summary() with my own custom function.

What this does (or did) is print something like 'n = 5' giving the number of observations under each pointrange.  Is there a different way to write this geom_text under ggplot2 0.9.3?

Thanks, Bryan

Winston Chang

Nov 29, 2012, 3:58:21 PM
to Bryan Hanson, ggplot2-dev
It's hard for me to say for sure just by looking at this piece of code, but it seems that you're doing something tricky with stat="bin" by using ..count.. for the label. The usual behavior with stat="bin" is to use ..count.. for the y values, but in your case, you're already mapping something else to y. Since this is a pretty unusual use case, I think the best solution is to calculate all the values you want before the ggplot call, put them in a separate data frame, and then use something like geom_text(data=textdata, ...) instead.

For example if your data is called 'data', you might want something like this:

# Use ddply and summarise to calculate groupwise values
library(plyr)
data_labels <- ddply(data, "fac1", summarise,
    y = min(res) - 0.1 * diff(range(res)),  # Get the y value
    label = paste("n =", length(res)))    # A way of getting the count

Then you can do:
p + geom_text(data=data_labels, aes(x=fac1, y=y, label=label))

In this example, I've assumed that 'res' is a column in your data. Also note that using length() to get the count will include NAs, which may or may not be appropriate for your data.
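
If the NAs should be left out of the count, one possible variant is sketched below. The toy `data` frame is invented for illustration, with 'fac1' and 'res' mirroring the column names assumed above; sum(!is.na(res)) counts only the non-missing values, and na.rm = TRUE keeps min() and range() from returning NA.

```r
library(plyr)

# Toy data, invented for illustration; group "a" contains one NA
data <- data.frame(
    fac1 = c("a", "a", "a", "b", "b"),
    res  = c(1, 2, NA, 4, 6)
)

# Groupwise label positions and NA-aware counts
data_labels <- ddply(data, "fac1", summarise,
    y     = min(res, na.rm = TRUE) - 0.1 * diff(range(res, na.rm = TRUE)),
    label = paste("n =", sum(!is.na(res))))  # count non-missing values only
```

With this data, group "a" is labelled "n = 2", where length(res) would have given 3. The result can be passed to geom_text(data=data_labels, ...) in the same way as before. If the plot is faceted, grouping by the facet column as well (for example c("fac1", "fac2"), assuming those are column names in the data) should place each label in its own panel.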

-Winston

Bryan Hanson

Dec 2, 2012, 7:52:15 PM
to ggplo...@googlegroups.com
If the code I have is not right for 0.9.3, can someone suggest something more sensible?  Thanks, Bryan