My understanding was that data outside the limits of the scales was
dropped, appropriate stats were applied, and then any resulting elements
that fell outside the limits of the scales were dropped.
I know that censoring does (can?) take place before stats are
calculated; this is the source of the
boxplots-changed-when-I-zoomed-in-using-scale problem that new users
sometimes have:
library(ggplot2)
DF <- data.frame(x="A", y=rep(1:20, 20:1))
ggplot(DF, aes(x,y)) + geom_boxplot()
ggplot(DF, aes(x,y)) + geom_boxplot() + scale_y_continuous(limits=c(5,15))
The latter even gives a warning:
Warning message:
Removed 89 rows containing non-finite values (stat_boxplot).
because the points outside the y-limits are not included in the
calculation of the boxplot stat.
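The count in that warning can be checked without ggplot2. The following is a minimal base-R sketch, assuming the scale simply discards values outside its (inclusive) limits before the stat sees them. The names `inside`, `n_dropped`, `med_all` and `med_kept` are only for illustration.

```r
# Same y values as in DF above: 1 appears 20 times, 2 appears 19 times, ...
y <- rep(1:20, 20:1)

# Assumption: values outside limits=c(5,15) are discarded before the stat
inside <- y >= 5 & y <= 15

n_dropped <- sum(!inside)        # rows the scale would remove
med_all   <- median(y)           # median behind the first boxplot
med_kept  <- median(y[inside])   # median after the censoring step

print(c(dropped = n_dropped, median_all = med_all, median_kept = med_kept))
```

If the aim is only to zoom, coord_cartesian(ylim=c(5,15)) changes the visible region without removing data from the stat calculation.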
I know that censoring does take place after stats are calculated on the
resulting computed data:
DF <- data.frame(x=1:20, y=(1:20)^1.3)
ggplot(DF, aes(x,y)) + geom_point() + geom_smooth(method="lm")
ggplot(DF, aes(x,y)) + geom_point() + geom_smooth(method="lm") +
    scale_y_continuous(limits=c(0,50))
The second of these has its fit line truncated at y=0, and its bands
around the line are truncated where their lower bound first crosses y=0.
It gives the warning message:
Removed 3 rows containing missing values (geom_path).
I know that both can happen (data outside limits dropped, then stats
applied, then stats outside of limits dropped):
ggplot(DF, aes(x,y)) + geom_point() + geom_smooth(method="lm") +
    scale_y_continuous(limits=c(0,30))
This has a different fit curve, and it is similarly truncated at y=0. It
gives 3 warning messages:
1: Removed 7 rows containing missing values (stat_smooth).
2: Removed 7 rows containing missing values (geom_point).
3: Removed 2 rows containing missing values (geom_path).
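The three steps can be imitated in base R. This is only a sketch of the order of operations described above, not of ggplot2's internals. It assumes inclusive limits and a prediction grid of 80 evenly spaced x values; the grid size and all variable names are illustrative assumptions.

```r
DF <- data.frame(x = 1:20, y = (1:20)^1.3)
limits <- c(0, 30)

# Step 1: observations outside the limits are dropped
kept <- DF$y >= limits[1] & DF$y <= limits[2]
n_removed <- sum(!kept)

# Step 2: the linear fit is computed on what is left
fit_all  <- lm(y ~ x, data = DF)           # fit without any limits
fit_kept <- lm(y ~ x, data = DF[kept, ])   # fit after censoring
slope_all  <- unname(coef(fit_all)["x"])
slope_kept <- unname(coef(fit_kept)["x"])

# Step 3: fitted values that fall outside the limits are dropped again
grid <- data.frame(x = seq(min(DF$x[kept]), max(DF$x[kept]), length.out = 80))
pred <- predict(fit_kept, newdata = grid)
n_pred_outside <- sum(pred < limits[1] | pred > limits[2])

print(c(removed = n_removed, slope_all = slope_all, slope_kept = slope_kept,
        fitted_outside = n_pred_outside))
```

The 7 removed observations agree with the first two warnings, and the smaller slope of the censored fit is why the curve differs from the previous plot.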
All this makes sense to me. Now to set up what didn't. Here is a
dataset and three histograms based on it; all make sense to me. The data
are 3 groups of 6 points evenly spaced between 0 and 1, with a histogram
of fixed binwidth of 0.2. This should, in general, give 6 equal-sized
bars for a histogram. The first two give that, whether dodged or faceted.
DF <- data.frame(x=(0:5)/5L, group=rep(LETTERS[1:3], each=6))
ggplot(DF, aes(x=x)) + geom_bar(aes(fill=group), colour="black",
    binwidth=0.2, position="dodge")
ggplot(DF, aes(x=x)) + geom_bar(aes(fill=group), colour="black",
    binwidth=0.2, position="dodge") + facet_wrap(~group, ncol=3)
If I set the limits to c(0,1), I expect the highest point (1) to be
dropped, leaving 5 bars (the 1-1.2 bar is not there, since there is no
data for it). The next bar (0.8 to 1) might or might not be suppressed,
depending on whether the upper limit of 1 is inclusive or exclusive.
Presumably everything has to be strictly less than 1, since the data
point at 1 was dropped, though floating point issues may come into play.
In the first graph below, groups A and B have 5 bars and group C has
4. I assume that since the physical range that the 0.8-1 bar takes up
for groups A and B does not include 1, those bars are drawn, while the
corresponding bar for group C is not (since the x range for that bar
would be 0.9333 to 1). If faceted (the second plot), every group has
only 4 bars (no group has the 0.8-1 bar included, since in each facet
that bar spans the axis from 0.8 to 1).
ggplot(DF, aes(x=x)) + geom_bar(aes(fill=group), colour="black",
    binwidth=0.2, position="dodge") + xlim(0,1)
ggplot(DF, aes(x=x)) + geom_bar(aes(fill=group), colour="black",
    binwidth=0.2, position="dodge") + facet_wrap(~group, ncol=3) + xlim(0,1)
So I can explain this (data outside the limits are dropped, stats are
applied, positioning is applied, then any resulting points outside the
range are dropped), but getting different data displayed/included in the
position="dodge" and facet cases was surprising.
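The x ranges quoted above can be reproduced with a little arithmetic. The sketch below assumes that dodging splits the 0.8-1 bin into equal side-by-side slices, one per group. The data frame and column names are illustrative, not ggplot2 internals.

```r
# The bin from 0.8 to 1, shared by three dodged groups
bin_min <- 0.8
bin_max <- 1.0
groups  <- c("A", "B", "C")

# Assumption: dodging gives each group an equal slice of the bin width
slice <- (bin_max - bin_min) / length(groups)
dodged <- data.frame(
  group = groups,
  xmin  = bin_min + (seq_along(groups) - 1) * slice,
  xmax  = bin_min + seq_along(groups) * slice,
  stringsAsFactors = FALSE
)
print(dodged)

# Faceted: every group keeps the whole bin, so every bar reaches the limit
faceted <- data.frame(group = groups, xmin = bin_min, xmax = bin_max,
                      stringsAsFactors = FALSE)

# Bars whose right edge lands on the upper limit of 1 (within tolerance)
at_limit_dodged  <- dodged$group[abs(dodged$xmax - 1) < 1e-8]
at_limit_faceted <- faceted$group[abs(faceted$xmax - 1) < 1e-8]
```

Only group C's dodged bar reaches the limit, whereas all three faceted bars do. Whether a bar whose edge sits on the limit is kept depends on boundary handling and floating point, which this sketch does not try to reproduce.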
If anyone has made it this far, does this seem like what should happen?
Or should the positioning not affect the suppression? Or should there be
another suppression step after stats are applied but before positioning
is applied (so that any stat results that are out of range are
suppressed, even if a positioning effect would bring them nominally in
range, since they represent data/information that is outside the range)?
I've included a PDF of all the graphs generated from the code for reference.
--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University
I can see how this is a bit inconsistent, but it makes sense to me:
limits are applied to the final position-adjusted geoms. This
introduces some inconsistency between dodging and facets, but I don't
see why it's any different from the following plot displaying a
different number of points each time you plot it:
DF <- data.frame(x=0:3, g=rep(LETTERS[1:2], each=4))
qplot(g, x, data = DF, position = position_jitter(height = 2)) + ylim(0, 4)
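The same effect can be seen without plotting. This sketch assumes the jitter is uniform noise of up to 2 in either direction and that points outside the inclusive limits 0 to 4 are dropped afterwards. The helper `count_visible` and the seed are hypothetical, chosen only for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# The x column of DF above (0:3 recycled over both groups), shown on the y axis
x <- rep(0:3, times = 2)

# Hypothetical helper: jitter the values, then count those still inside the limits
count_visible <- function(values, limits = c(0, 4), amount = 2) {
  jittered <- jitter(values, amount = amount)  # adds runif noise in [-amount, amount]
  sum(jittered >= limits[1] & jittered <= limits[2])
}

# Each "replot" draws new noise, so the number of surviving points can change
counts <- replicate(10, count_visible(x))
print(counts)
```

All eight original values lie inside the limits, so any variation in the counts comes purely from the position adjustment being applied before the limits.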
Hadley
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
position_adjust takes place before building grobs, doesn't it?
--
Kohske Takahashi <takahash...@gmail.com>
Research Center for Advanced Science and Technology,
The University of Tokyo, Japan.
http://www.fennel.rcast.u-tokyo.ac.jp/profilee_ktakahashi.html
Yes, that's right. Grob building is the last thing that happens.