As many of you are no doubt aware, the default behavior of geom_bar is to use stat="bin", and this can be confusing for new users.
Normally, there are two ways to use geom_bar:
- Map a variable to y, and use stat="identity"
- Don't map a variable to y, and use stat="bin"
If you map a variable to y and use stat="bin", it will usually (and correctly) give an error. However, when there is exactly one y value per bin, it gives no error and behaves exactly as if you had used stat="identity". This makes things a lot more confusing.
library(ggplot2)

# Data to be used with stat="identity"
dat_val <- data.frame(x=c("a","b"), y=c(3,2))
# OK
ggplot(dat_val, aes(x, y)) + geom_bar(stat="identity")
# No error, and looks the same as stat="identity", but it shouldn't
ggplot(dat_val, aes(x, y)) + geom_bar(stat="bin")
# Data to be used with stat="bin"
dat_bin <- data.frame(x=c("a","a","a", "b","b"))
# Error, as expected
ggplot(dat_bin, aes(x)) + geom_bar(stat="identity")
# OK
ggplot(dat_bin, aes(x)) + geom_bar(stat="bin")
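To make the difference between the two kinds of data concrete, here is a minimal base R sketch of the counting step that stat="bin" is described as doing for you. It is only an illustration, not how ggplot2 computes it internally, and the name dat_counts is made up for this example. Counting the cases in dat_bin by hand gives one row per group with counts 3 and 2, which is the same shape as dat_val and is the kind of data you would plot with stat="identity".

```r
# Raw cases, one row per observation (same data as dat_bin above)
dat_bin <- data.frame(x = c("a", "a", "a", "b", "b"))

# Count the cases in each group by hand.
# dat_counts is a hypothetical name; the count column is called y to match dat_val.
dat_counts <- as.data.frame(table(x = dat_bin$x), responseName = "y")

# One row per group, with the count of cases in y
print(dat_counts)
```

In other words, if your data already looks like dat_counts, you have y values and want stat="identity"; if it looks like dat_bin, you have raw cases and want stat="bin" with nothing mapped to y.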
I've modified stat_bin to print a message when it is used with a variable mapped to y. For backward compatibility, it will continue to give the same graphical output for at least a few more versions of ggplot2.
This is what it says now in that case:
> ggplot(dat_val, aes(x, y)) + geom_bar(stat="bin")
Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples.
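For anyone curious what a check like this might look like, here is a rough sketch in plain R. It is not the actual code in stat_bin: the function warn_if_y_mapped and its aes_names argument are hypothetical, and it only assumes that the names of the mapped aesthetics are available as a character vector.

```r
# Hypothetical helper, NOT ggplot2's actual implementation.
# aes_names: character vector of the aesthetics that have a variable mapped.
warn_if_y_mapped <- function(aes_names) {
  if ("y" %in% aes_names) {
    # Use message() rather than stop(), so existing code keeps working
    message("Mapping a variable to y and also using stat=\"bin\".")
    return(invisible(TRUE))
  }
  invisible(FALSE)
}

warn_if_y_mapped(c("x", "y"))  # emits the message
warn_if_y_mapped("x")          # silent
```

Using a message instead of an error is what keeps the change backward compatible: the plot is still drawn, but the user is told that the combination is going away.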
I've also added some text and examples to the geom_bar help to try to clarify the issue.
What do you think? Is this clear enough? I suspect this change will affect a lot of ggplot2 users, so I want to make it as clear as possible what's happening and why.
-Winston