Messages when stat="bin" is used with y mapping

Winston Chang

Sep 25, 2012, 1:06:46 PM

to ggplot2-dev

As many of you are no doubt aware, the default behavior of geom_bar is to use stat="bin", and this can be confusing for new users.

Normally, there are two ways to use geom_bar:

- Map a variable to y, and use stat="identity"

- Don't map a variable to y, and use stat="bin"

If you map a variable to y and use stat="bin", it will usually (and correctly) give an error. However, when there is exactly one y value per bin, it gives no error and the plot looks exactly as if you had used stat="identity". This makes things a lot more confusing.

# Data to be used with stat="identity"

dat_val <- data.frame(x=c("a","b"), y=c(3,2))

# OK

ggplot(dat_val, aes(x, y)) + geom_bar(stat="identity")

# No error, and looks the same as stat="identity", but it shouldn't

ggplot(dat_val, aes(x, y)) + geom_bar(stat="bin")

# Data to be used with stat="bin"

dat_bin <- data.frame(x=c("a","a","a", "b","b"))

# Error, as expected

ggplot(dat_bin, aes(x)) + geom_bar(stat="identity")

# OK

ggplot(dat_bin, aes(x)) + geom_bar(stat="bin")
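The two data layouts above describe the same bars: counting the cases in dat_bin gives the values stored in dat_val. A minimal base R sketch of that relationship (the name counts is introduced here only for illustration):

```r
# Case-level data, one row per observation (the stat="bin" layout)
dat_bin <- data.frame(x = c("a", "a", "a", "b", "b"))

# Count the cases per level of x; this yields the pre-summarised
# layout (one row per bar) that stat="identity" expects
counts <- as.data.frame(table(x = dat_bin$x), responseName = "y")

print(counts)  # y is 3 for "a" and 2 for "b", matching dat_val
```

So the choice of stat should follow from which layout the data is already in, not from what the plot happens to look like.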

I've modified stat_bin to print out a message when it is used with a variable mapped to y. For backward compatibility, it will continue to give the same graphical output, for at least a few more versions of ggplot2.

This is what it says now in that case:

> ggplot(dat, aes(x, val)) + geom_bar(stat="bin")

Mapping a variable to y and also using stat="bin".

With stat="bin", it will attempt to set the y value to the count of cases in each group.

This can result in unexpected behavior and will not be allowed in a future version of ggplot2.

If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.

If you want y to represent values in the data, use stat="identity".

See ?geom_bar for examples.
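For readers curious what a check like this amounts to, here is a rough standalone sketch. It is not the actual stat_bin code; check_bin_mapping and its arguments are hypothetical names made up for illustration, and it only uses base R's message().

```r
# Hypothetical helper, NOT part of ggplot2: report when a variable is
# mapped to y while stat = "bin" is in use.
check_bin_mapping <- function(mapped_aesthetics, stat) {
  y_with_bin <- identical(stat, "bin") && "y" %in% mapped_aesthetics
  if (y_with_bin) {
    message('Mapping a variable to y and also using stat="bin".')
  }
  # Return the result invisibly so callers can test it if they want
  invisible(y_with_bin)
}

check_bin_mapping(c("x", "y"), "bin")       # emits the message
check_bin_mapping(c("x"), "bin")            # silent
check_bin_mapping(c("x", "y"), "identity")  # silent
```

Using message() rather than stop() keeps existing plots working while still flagging the problem, which is the backward-compatibility trade-off described above.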

I've also added some text and examples to the geom_bar help to try to clarify the issue:

https://github.com/hadley/ggplot2/blob/master/R/geom-bar-.r

What do you think? Is this clear enough? I have a feeling this is a change that will hit a lot of ggplot2 users so I want to make it as clear as possible what's happening, and why.

-Winston

Bryan Hanson

Nov 19, 2012, 8:01:53 PM

to ggplo...@googlegroups.com

I get the warning message 2x when running this (partial) code. It looks to be a little different case than anticipated. Here's the aesthetic:

    if (is.null(fac2)) {
        p <- ggplot(data, aes_string(x = fac1, y = res, color = fac1))
    } else {
        p <- ggplot(data, aes_string(x = fac1, y = res, color = fac1)) +
            facet_grid(paste(". ~", fac2, sep = ""))
    }

Then, I think this is the piece that generates the message (once for each facet in my example, I think):

    p <- p + geom_text(aes_string(x = fac1,
        y = paste("min(",res,") - 0.1 * diff(range(",res,"))", sep=""),
        label = 'paste("n = ", ..count.. , sep = "")'),
        color = "black", size = 4.0, stat = "bin", data = data)

There are no other calls to stat = "bin" but I have a bunch to stat_summary(my custom function).

What this does (or did) is print something like 'n = 5' giving the number of observations under each pointrange. Is there a different way to write this geom_text under ggplot2 0.9.3?

Thanks, Bryan

Winston Chang

Nov 29, 2012, 3:58:21 PM

to Bryan Hanson, ggplot2-dev

It's hard for me to say for sure just by looking at this piece of code, but it seems that you're doing something tricky with stat="bin" and using ..count.. for the label. The usual behavior with stat="bin" is to use ..count.. for the y values, but in your case, you're already mapping something else to y. Since this is a pretty unusual use case, I think the best solution is to calculate all the values you want before the ggplot call, put them in a separate data frame, and then use something like geom_text(data=textdata, ...) instead.

For example if your data is called 'data', you might want something like this:

# Use ddply and summarise to calculate groupwise values

library(plyr)

data_labels <- ddply(data, "fac1", summarise,
    y = min(res) - 0.1 * diff(range(res)),  # Get the y value
    label = paste("n =", length(res)))      # A way of getting the count

Then you can do:

p + geom_text(data=data_labels, aes(x=fac1, y=y, label=label))
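To make the shape of data_labels concrete, here is a small worked sketch of the same groupwise calculation in base R, without plyr. The toy data frame is made up for illustration (it only assumes columns named fac1 and res), so treat it as a sketch rather than a drop-in replacement.

```r
# Toy data, invented for illustration: a grouping column and a response
data <- data.frame(fac1 = c("a", "a", "a", "b", "b"),
                   res  = c(1, 2, 4, 3, 5))

# One row of label information per level of fac1
label_rows <- lapply(split(data, data$fac1), function(d) {
  data.frame(fac1  = d$fac1[1],
             y     = min(d$res) - 0.1 * diff(range(d$res)),  # just below the lowest point
             label = paste("n =", nrow(d)))                  # number of rows in the group
})
data_labels <- do.call(rbind, label_rows)

print(data_labels)  # y is 0.7 for "a" and 2.8 for "b"; labels "n = 3" and "n = 2"
```

The result has one row per group, which is the layout geom_text needs when each label is drawn once.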

In this example, I've assumed that 'res' is a column in your data. Also note that the length() method of getting the count will also include NAs, which may or may not be appropriate for your data.
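If the NAs should not be counted, one option is to count only the non-missing values. A minimal sketch with a made-up vector (res here is a standalone example, not your column):

```r
# Made-up values with one missing observation
res <- c(1.2, NA, 3.4, 2.2)

n_all         <- length(res)       # counts every element, NA included
n_non_missing <- sum(!is.na(res))  # counts only the observed values

label <- paste("n =", n_non_missing)
print(label)  # "n = 3"
```

Note that min() and range() return NA when the input contains NA unless na.rm = TRUE is passed, so the y calculation would need the same care.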

-Winston

Bryan Hanson

Dec 2, 2012, 7:52:15 PM

to ggplo...@googlegroups.com

If the code I have is not right for 0.9.3, can someone suggest something more sensible? Thanks, Bryan
