Winston Chang

Sep 25, 2012, 1:06:46 PM
to ggplot2-dev
As many of you are no doubt aware, the default behavior of geom_bar is to use stat="bin", and this can be confusing for new users. 

Normally, there are two ways to use geom_bar:
- Map a variable to y, and use stat="identity"
- Don't map a variable to y, and use stat="bin"

If you map a variable to y and use stat="bin", it will (correctly) give an error. However, when there is exactly one y value per bin, it silently behaves as if you had used stat="identity", which makes things a lot more confusing.

# Data to be used with stat="identity"
dat_val <- data.frame(x=c("a","b"), y=c(3,2))
# OK
ggplot(dat_val, aes(x, y)) + geom_bar(stat="identity")
# No error, and looks the same as stat="identity", but it shouldn't
ggplot(dat_val, aes(x, y)) + geom_bar(stat="bin")

# Data to be used with stat="bin"
dat_bin <- data.frame(x=c("a","a","a", "b","b"))
# Error, as expected
ggplot(dat_bin, aes(x)) + geom_bar(stat="identity")
# OK
ggplot(dat_bin, aes(x)) + geom_bar(stat="bin")
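
For discrete x, what stat="bin" does here amounts to counting the rows for each value of x. As a rough base-R sketch of that idea (not the actual stat_bin code; the name 'counts' is made up for illustration), the counts can be computed by hand, which turns dat_bin into the same shape as dat_val:

```r
# Data meant for stat="bin": one row per case, no y column
dat_bin <- data.frame(x = c("a", "a", "a", "b", "b"))

# Count the cases per value of x by hand; 'counts' is a hypothetical name.
# The result has one row per x value and a y column holding the count.
counts <- as.data.frame(table(x = dat_bin$x), responseName = "y")

print(counts)
```

With the counts in a y column, ggplot(counts, aes(x, y)) + geom_bar(stat="identity") should draw the same bars as the stat="bin" call on dat_bin.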

I've modified stat_bin to print out a message when it is used with a variable mapped to y. For backward compatibility, it will continue to give the same graphical output, for at least a few more versions of ggplot2.

This is what it says now in that case:
> ggplot(dat, aes(x, val)) + geom_bar(stat="bin")
Mapping a variable to y and also using stat="bin".
  With stat="bin", it will attempt to set the y value to the count of cases in each group.
  This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
  If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
  If you want y to represent values in the data, use stat="identity".
  See ?geom_bar for examples.

I've also added some text and examples to the geom_bar help to try to clarify the issue:

What do you think? Is this clear enough? I have a feeling this is a change that will hit a lot of ggplot2 users so I want to make it as clear as possible what's happening, and why.


Bryan Hanson

Nov 19, 2012, 8:01:53 PM
I get the warning message 2x when running this (partial) code.  It looks to be a little different case than anticipated.  Here's the aesthetic:

    if (is.null(fac2)) {
        p <- ggplot(data, aes_string(x = fac1, y = res, color = fac1))
    } else {
        p <- ggplot(data, aes_string(x = fac1, y = res, color = fac1)) +
            facet_grid(paste(". ~", fac2, sep = ""))
    }

Then, I think this is the piece that generates the message (once for each facet in my example):

    p <- p + geom_text(aes_string(x = fac1,
        y = paste("min(",res,") - 0.1 * diff(range(",res,"))", sep=""),
        label = 'paste("n = ", ..count.. , sep = "")'),
        color = "black", size = 4.0, stat = "bin", data = data)

There are no other uses of stat = "bin", but I have a number of calls to stat_summary() with my own custom function.

What this does (or did) is print something like 'n = 5' giving the number of observations under each pointrange.  Is there a different way to write this geom_text under ggplot2 0.9.3?

Thanks, Bryan

Winston Chang

Nov 29, 2012, 3:58:21 PM
to Bryan Hanson, ggplot2-dev
It's hard for me to say for sure just by looking at this piece of code, but it seems that you're doing something tricky with stat="bin" by using ..count.. for the label. The usual behavior with stat="bin" is to use ..count.. for the y values, but in your case you're already mapping something else to y. Since this is a pretty unusual use case, I think the best solution is to calculate all the values you want before the ggplot call, put them in a separate data frame, and then use something like geom_text(data=textdata, ...) instead.

For example if your data is called 'data', you might want something like this:

# Use ddply and summarise (both from the plyr package) to calculate groupwise values
library(plyr)
data_labels <- ddply(data, "fac1", summarise,
    y = min(res) - 0.1 * diff(range(res)),  # Get the y value
    label = paste("n =", length(res)))    # A way of getting the count

Then you can do:
p + geom_text(data=data_labels, aes(x=fac1, y=y, label=label))

In this example, I've assumed that 'res' is a column in your data. Also note that the length() method of getting the count will also include NAs, which may or may not be appropriate for your data.
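
If the NAs should not be counted, one option is to drop them within each group before computing the label and the y position. The following is a minimal base-R sketch of that variant, not a tested drop-in replacement: the example data is made up, and only the column names 'fac1' and 'res' are taken from this thread.

```r
# Hypothetical example data; only the column names come from the thread
data <- data.frame(
  fac1 = c("a", "a", "a", "b", "b"),
  res  = c(1.0, 2.0, NA, 4.0, 6.0)
)

# Build one row of label data per level of fac1, ignoring missing values
data_labels <- do.call(rbind, lapply(split(data, data$fac1), function(d) {
  r <- d$res[!is.na(d$res)]                    # keep non-missing values only
  data.frame(
    fac1  = d$fac1[1],                         # the group this row describes
    y     = min(r) - 0.1 * diff(range(r)),     # position just below the data
    label = paste("n =", length(r))            # count of non-missing values
  )
}))

print(data_labels)
```

Here group "a" is labeled "n = 2" rather than "n = 3", because its NA is excluded. The resulting data frame can be passed to geom_text(data=data_labels, ...) in the same way as above. A group whose values are all NA would need separate handling, since min() of an empty vector is not a usable y position.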


Bryan Hanson

Dec 2, 2012, 7:52:15 PM
If the code I have is not right for 0.9.3, can someone suggest something more sensible?  Thanks, Bryan
