What is dropped due to scale limits

Brian Diggs

Apr 10, 2012, 11:58:48 AM
to ggplo...@googlegroups.com
I was taking a stab at a Stack Overflow question yesterday
(http://stackoverflow.com/q/10064080/892313) and came across some
behavior that broke my mental model of how ggplot2 works. I'd like to
think I have a pretty good mental model, so I wanted to see where I was
lacking (or if maybe it was a bug).

My understanding was that data outside the limits of the scales was
dropped, appropriate stats were applied, and then any resulting elements
that fell outside the limits of the scales were dropped.

I know that censoring does (can?) take place before stats are
calculated; this is the source of the
boxplots-changed-when-I-zoomed-in-using-scale problem that new users
sometimes have:

DF <- data.frame(x="A", y=rep(1:20, 20:1))
ggplot(DF, aes(x,y)) + geom_boxplot()
ggplot(DF, aes(x,y)) + geom_boxplot() + scale_y_continuous(limits=c(5,15))

The latter even gives a warning:

Warning message:
Removed 89 rows containing non-finite values (stat_boxplot).

because the points outside the y-limits are not included in the
calculation of the boxplot stat (stat_boxplot).
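
To see how much the censoring changes what the stat receives, here is a minimal base-R sketch that mimics the limits by hand. The names `kept` and `n_dropped` are illustrative only, and the filter is an assumption about the effect of the scale limits, not ggplot2's actual code.

```r
# Same y values as DF above
y <- rep(1:20, 20:1)

# Mimic scale_y_continuous(limits = c(5, 15)): values outside are discarded
kept <- y[y >= 5 & y <= 15]
n_dropped <- length(y) - length(kept)

n_dropped      # number of rows that never reach the stat
median(y)      # median of the full data
median(kept)   # median of what the boxplot stat would see
```

If the goal is only to zoom, coord_cartesian(ylim = c(5, 15)) changes the visible region without removing data before the stat is computed.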

I know that censoring does take place after stats are calculated on the
resulting computed data:

DF <- data.frame(x=1:20, y=(1:20)^1.3)
ggplot(DF, aes(x,y)) + geom_point() + geom_smooth(method="lm")
ggplot(DF, aes(x,y)) + geom_point() + geom_smooth(method="lm") +
scale_y_continuous(limits=c(0,50))

The second of these has its fit line truncated at y=0, and its bands
around the line truncated where their lower bound first crosses y=0.
(with warning message:
Removed 3 rows containing missing values (geom_path).
)
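
A rough base-R sketch of why only the computed values are affected here: every raw point lies inside c(0, 50), but a straight-line fit dips below 0 at the low end. The 80-point grid and the column name `y_censored` are assumptions made for illustration, and `lm()` stands in for what geom_smooth(method="lm") fits.

```r
DF <- data.frame(x = 1:20, y = (1:20)^1.3)
fit <- lm(y ~ x, data = DF)

# Evaluate the fit on a grid over the x range (grid size is an assumption)
grid <- data.frame(x = seq(1, 20, length.out = 80))
grid$y <- predict(fit, newdata = grid)

# All raw points are within the limits, so nothing is dropped before the fit
all(DF$y >= 0 & DF$y <= 50)

# Fitted values outside the limits are what get censored afterwards
grid$y_censored <- ifelse(grid$y < 0 | grid$y > 50, NA, grid$y)
head(grid)
```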

I know that both can happen (data outside limits dropped, then stats
applied, then stats outside of limits dropped):

ggplot(DF, aes(x,y)) + geom_point() + geom_smooth(method="lm") +
scale_y_continuous(limits=c(0,30))

This has a different fit curve, and it is similarly truncated at y=0. It
gives 3 warning messages:

1: Removed 7 rows containing missing values (stat_smooth).
2: Removed 7 rows containing missing values (geom_point).
3: Removed 2 rows containing missing values (geom_path).
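
The three steps (drop data, fit, drop fitted values) can be mimicked in base R. This is only a sketch of the sequence described above, not ggplot2's implementation; `inside`, `fit_censored`, and the 80-point grid are hypothetical names and choices.

```r
DF <- data.frame(x = 1:20, y = (1:20)^1.3)

# Step 1: censor the raw data to the limits c(0, 30) before fitting
inside <- DF$y >= 0 & DF$y <= 30
sum(!inside)                       # points dropped before the stat

# Step 2: fit on what is left; the slope differs from the full-data fit
fit_full     <- lm(y ~ x, data = DF)
fit_censored <- lm(y ~ x, data = DF[inside, ])
coef(fit_full)[["x"]]
coef(fit_censored)[["x"]]

# Step 3: censor the fitted values as well (this is where the line gets cut)
grid <- data.frame(x = seq(1, 13, length.out = 80))
pred <- predict(fit_censored, newdata = grid)
pred_censored <- ifelse(pred < 0 | pred > 30, NA, pred)
```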

All this makes sense to me. Now to set up what didn't. Here is a
dataset and three histograms based on it; all make sense to me. The data
are 3 groups of 6 points evenly spaced between 0 and 1, with a histogram
of fixed binwidth of 0.2. This should, in general, give 6 equal-sized
bars for a histogram. The first two give that, whether dodged or faceted.

DF <- data.frame(x=(0:5)/5L, group=rep(LETTERS[1:3], each=6))
ggplot(DF, aes(x=x)) + geom_bar(aes(fill=group), colour="black",
binwidth=0.2, position="dodge")
ggplot(DF, aes(x=x)) + geom_bar(aes(fill=group), colour="black",
binwidth=0.2, position="dodge") + facet_wrap(~group, ncol=3)

If I set the limits to c(0,1), I expect the highest point (1) to be
dropped, leaving 5 bars (so that the 1-1.2 bar is not there since there
is no data for it). The next bar (the 0.8 to 1) might or might not be
suppressed depending on whether the 1 limit is inclusive or exclusive
(presumably everything has to be less than 1 (not less than or equal to
1) since the data point of 1 was dropped, though floating point issues
may come into play).

In the following first graph, groups A and B have 5 bars and group C has
4. I assume that since the physical range the 0.8-1 bar takes up for
groups A and B does not include 1, they are drawn, while the
corresponding bar for group C is not (since the x range for that bar
would be 0.9333 to 1). If faceted (the second plot), every group has
only 4 bars (no group has the 0.8-1 bar included, since in each facet,
that spans the axis from 0.8 to 1).

ggplot(DF, aes(x=x)) + geom_bar(aes(fill=group), colour="black",
binwidth=0.2, position="dodge") + xlim(0,1)
ggplot(DF, aes(x=x)) + geom_bar(aes(fill=group), colour="black",
binwidth=0.2, position="dodge") + facet_wrap(~group, ncol=3) + xlim(0,1)

So I can explain this (data outside the limits is dropped, stats are
applied, positioning is applied, then any resulting points outside the
range are dropped), but getting different data displayed/included in the
position="dodge" and facet cases was surprising.
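
The dodge arithmetic behind this explanation can be checked by hand. The base-R sketch below assumes, as the explanation does, that dodging splits each 0.2-wide bin into three equal slots, one per group; it models the reasoning rather than ggplot2's internals, and `slot`, `bars`, and `tol` are made-up names.

```r
binwidth <- 0.2
bin_left <- 0.8                      # left edge of the 0.8-1 bin
groups   <- c("A", "B", "C")
slot     <- binwidth / length(groups)

bars <- data.frame(
  group = groups,
  xmin  = bin_left + (seq_along(groups) - 1) * slot,
  xmax  = bin_left + seq_along(groups) * slot
)

# Flag bars whose right edge reaches the upper limit of xlim(0, 1).
# A small tolerance sidesteps floating-point noise in the comparison.
tol <- 1e-8
bars$reaches_limit <- bars$xmax >= 1 - tol
bars
```

Under this model only group C's rectangle reaches x = 1 when dodged, whereas in the faceted plot each bar spans the whole bin from 0.8 to 1, so every group's bar reaches it.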

If anyone has made it this far, does this seem like what should happen?
Or should the positioning not affect the suppression? Or should there be
another suppression step after stats are applied but before positioning
is applied (so that any stat results that are out of range are
suppressed, even if a positioning effect would bring them nominally in
range, since they represent data/information that is outside the range)?

I've included a PDF of all the graphs generated from the code for reference.

--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University

DroppedGeoms.pdf

Winston Chang

Apr 11, 2012, 12:23:34 AM
to Brian Diggs, ggplo...@googlegroups.com
Interesting examples... it does look like the data that's out of range is dropped before the stat, and then after the stat and position adjustment, the geoms that go out of range are dropped.

Here's some simple code that may illustrate the problem a bit more clearly. These examples make use of the property that bins are closed on the left and open on the right.

DF <- data.frame(x=0:3, g=rep(LETTERS[1:2], each=4))
 # x g
 # 0 A
 # 1 A
 # 2 A
 # 3 A
 # 0 B
 # 1 B
 # 2 B
 # 3 B

# Four bars in a row with count=2 alternating colors
p <- ggplot(DF, aes(x=x, fill=g)) + 
  geom_histogram(binwidth=2, colour="black", position="dodge")
p

# The data points at 0 are removed, resulting in bars with count=1
# Also, the left-most bar (group A) is removed because the rectangle goes
# out of the x scale range
p + xlim(0.001, 4)

# Facet by g: two bars in each facet
p + facet_grid(g ~ .)

# In both facets, the left bar is removed because the rectangle goes
# out of the x scale range
p + facet_grid(g ~ .) + xlim(0.001, 4)
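
The counts quoted in the comments above can be reproduced with base R's `cut()`, which gives left-closed, right-open bins when `right = FALSE`. This sketch assumes bin breaks at 0, 2, and 4 and handles one group at a time; it does not call ggplot2, so the actual breaks chosen by the histogram stat may differ.

```r
x <- 0:3                      # x values for one group
breaks <- c(0, 2, 4)          # binwidth = 2 (assumed break positions)

# right = FALSE gives left-closed, right-open bins: [0,2) and [2,4)
counts_all <- table(cut(x, breaks = breaks, right = FALSE))

# Mimic xlim(0.001, 4): the point at 0 is out of range and is discarded
x_kept <- x[x >= 0.001 & x <= 4]
counts_kept <- table(cut(x_kept, breaks = breaks, right = FALSE))

counts_all
counts_kept
```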


As to whether this is how it should work, I'll let someone else answer that question...

-Winston
p.png
p-xlim.001.png
p-facet.png
p-facet-xlim.001.png

Hadley Wickham

Apr 16, 2012, 9:52:32 AM
to Brian Diggs, Winston Chang, ggplo...@googlegroups.com
> If anyone has made it this far, does this seem like what should happen? Or
> should the positioning not affect the suppression? Or should there be
> another suppression step after stats are applied but before positioning is
> applied (so that any stat results that are out of range are suppressed, even
> if a positioning effect would bring them nominally in range, since they
> represent data/information that is outside the range)?

I can see how this is a bit inconsistent, but it makes sense to me -
limits are applied to the final position adjusted geoms. This
introduces some inconsistency between dodging and facets, but I don't
see why it's any different to the following plot displaying a
different number of points each time you plot it:

DF <- data.frame(x=0:3, g=rep(LETTERS[1:2], each=4))

qplot(g, x, data = DF, position = position_jitter(height = 2)) + ylim(0, 4)
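
The run-to-run variation can be imitated without plotting. In this sketch, base R's `jitter(x, amount = 2)`, which adds uniform noise in [-2, 2], is assumed to stand in for position_jitter(height = 2), and the seed is arbitrary.

```r
set.seed(1)                         # arbitrary seed, only for reproducibility
x <- rep(0:3, times = 2)            # the 8 x values in DF

# Jitter, then count how many points would survive ylim(0, 4)
n_kept <- replicate(5, {
  jittered <- jitter(x, amount = 2)
  sum(jittered >= 0 & jittered <= 4)
})
n_kept
```

Each element of `n_kept` is the number of points left after one simulated draw of the plot.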

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Kohske Takahashi

Apr 16, 2012, 10:58:55 AM
to Hadley Wickham, Brian Diggs, Winston Chang, ggplo...@googlegroups.com
> but I don't
> see why it's any different to the following plot displaying a
> different number of points each time you plot it:
>
> DF <- data.frame(x=0:3, g=rep(LETTERS[1:2], each=4))
> qplot(g, x, data = DF, position = position_jitter(height = 2)) + ylim(0, 4)

position_adjust takes place before building grobs, doesn't it?


> Hadley
>
> --
> Assistant Professor / Dobelman Family Junior Chair
> Department of Statistics / Rice University
> http://had.co.nz/

--
Kohske Takahashi <takahash...@gmail.com>

Research Center for Advanced Science and Technology,
The University of  Tokyo, Japan.
http://www.fennel.rcast.u-tokyo.ac.jp/profilee_ktakahashi.html

Hadley Wickham

Apr 16, 2012, 11:00:38 AM
to Kohske Takahashi, Brian Diggs, Winston Chang, ggplo...@googlegroups.com
>> but I don't
>> see why it's any different to the following plot displaying a
>> different number of points each time you plot it:
>>
>> DF <- data.frame(x=0:3, g=rep(LETTERS[1:2], each=4))
>> qplot(g, x, data = DF, position = position_jitter(height = 2)) + ylim(0, 4)
>
> position_adjust takes place before building grobs, doesn't it?

Yes, that's right. Grob building is the last thing that happens.
