A more natural default would be one that depends on both the spread
*and* the number of observations, with decreasing binwidth for
increasing number of observations, so that the histogram will converge
to the density as the number of observations tends to infinity.
For instance, it is natural that a histogram based on a million
observations can (and should) have much smaller bins than one based on
only ten observations, but in ggplot2 this doesn’t happen.
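For a concrete sketch of that scaling: the Freedman-Diaconis rule used
below chooses a binwidth of roughly 2 * IQR(x) / n^(1/3), so a
thousandfold increase in the number of observations gives bins ten times
narrower. (The fd_binwidth helper is just for illustration here, not
from any package.)

fd_binwidth = function(x) 2 * IQR(x) / length(x)^(1/3)
fd_binwidth(rnorm(1e3))   # roughly 0.27 for a standard normal sample
fd_binwidth(rnorm(1e6))   # roughly 0.027, i.e. ten times narrower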
Other histogram plotting functions in R do use a variable binwidth
(though hist has a poor default; truehist in MASS is more sensible).
Here’s an example, using the Freedman-Diaconis rule (my favourite!) for
the number of bins:
library(MASS)       # for truehist()
library(lattice)
library(ggplot2)
set.seed(2)
ng = 5                       # number of groups
m = .5 * 10^(1:ng)           # group sizes: 5, 50, 500, 5000, 50000
x = rnorm(sum(m))
gr = rep(1:ng, times = m)    # group membership
par(mfrow = c(1, ng))        # one panel per group
sapply(1:ng, function(ind)
  truehist(x[gr == ind], nbins = "FD", col = "wheat"))
As we increase the number of observations, we can use more bins.
However, in ggplot2, similar histograms look like the following.
d = data.frame(x, gr)
ggplot(d, aes(x = x)) +
  geom_histogram(aes(y = ..density..)) +   # default binwidth: 1/30 of the data range
  facet_grid(~ gr)
Here the default is good for panel 4 (and perhaps usable for panels 3
and 5), but not for the others.
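A possible workaround, just as a sketch, is to compute a
Freedman-Diaconis binwidth for each group and build one plot per panel,
since a single geom_histogram call accepts only one binwidth for all the
facets:

# One histogram per group, each with its own FD binwidth
fd_plots = lapply(split(d$x, d$gr), function(xi) {
  bw = 2 * IQR(xi) / length(xi)^(1/3)      # Freedman-Diaconis binwidth
  ggplot(data.frame(x = xi), aes(x = x)) +
    geom_histogram(aes(y = ..density..), binwidth = bw)
})

The resulting list of plots could then be arranged side by side with,
e.g., grid.arrange from the gridExtra package.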
--
Regards,
Karl Ove Hufthammer
> The default binwidth for histograms in ggplot2 is 1/30 of the range of
> the data, which seems like a very curious choice.
Yes, it's a bad choice - but the idea is to encourage you to explore
using different bin widths for your data. There is no ideal bin width
- the theoretical methods make strong assumptions about the true
underlying distribution, assumptions which rarely hold. I realise that
this is a bit of a contentious choice, but I just don't believe you
can rely on any of the automated solutions.
Hadley