Set different y axis limits for facets to fit summary data


Samuel Dennis

Nov 29, 2011, 6:57:15 PM
to ggplot2
I am trying to produce a bar graph with several facets, with different
y axis scales to show the data clearly. I wish to display only the means
and confidence intervals, but because the raw data has a wide range,
there are large blank spaces at the top of my graphs.

How can I adjust the y axis limits to show this data clearly? Normally
I would use coord_cartesian for this; however, that only allows me to
set a single pair of y limits that applies to all facets (unless I
have missed something).

An example is:

test <- data.frame(animal = rep(c("cattle","sheep","none"),120),
                   irrigation = rep(rep(c("dry","wet"),each=3),60),
                   paddock = rep(c(1:3),each=120),
                   value = c(1:358,2000,5000))
ggplot(test, aes(reorder(animal,value,mean),value)) +
    stat_summary(fun.y = mean, geom = "bar") +
    stat_summary(fun.data = mean_cl_normal, geom = "pointrange") +
    facet_grid(paddock ~ ., scales = "free")

On this example, how would I either:
- Manually set ymax to around 80, 210 and 1000 for the three facets,
or
- Persuade scales = "free" to scale the axis to the visible data only
rather than the full dataset.

For reference, this question is related to the following two previous
posts but not answered in them:
http://groups.google.com/group/ggplot2/browse_thread/thread/6ded9538cd848279
http://groups.google.com/group/ggplot2/browse_thread/thread/bb96b59a99c947b1/1c766aabd12bf81b

Samuel Dennis

Nov 29, 2011, 7:03:23 PM
to ggplot2
Immediately after posting I received a private response from Hadley
Wickham to a separate query on this issue, stating that ggplot2 does
not yet have this functionality but it is planned for a future
release. If anyone has a suggested workaround in the meantime, that
would be appreciated, though it seems I might have to use base graphics
for this one.
Sorry for the emails,
Samuel Dennis


Winston Chang

Nov 29, 2011, 7:49:24 PM
to Samuel Dennis, ggplot2
It can be done if you summarize the data yourself and give the summarized data to ggplot. Here's one way to do it, using the function smean.cl.normal from Hmisc to calculate 95% confidence intervals:

library(plyr)
library(Hmisc)

testc <-
ddply(test, .(animal, paddock), summarise,
        value= smean.cl.normal(value)[1], 
        min  = smean.cl.normal(value)[2],
        max  = smean.cl.normal(value)[3])
        

ggplot(testc, aes(reorder(animal,value), value)) +
    geom_bar(stat="identity", colour="black", fill="white") + 
    geom_errorbar(aes(ymin=min, ymax=max), width=.1) +
    facet_grid(paddock ~ ., scales = "free")


You can also calculate confidence intervals without Hmisc, but it takes a little more work:
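The code that followed this sentence is missing from the archived message. As a stand-in, here is a minimal sketch (not Winston's original) of one way to build the same columns with base R only. It assumes the usual t-based interval, mean +/- qt(0.975, n - 1) * sd / sqrt(n), which is what smean.cl.normal computes by default; the helper name ci_mean is made up for this illustration.

```r
# Same example data as in the original question
test <- data.frame(animal = rep(c("cattle", "sheep", "none"), 120),
                   irrigation = rep(rep(c("dry", "wet"), each = 3), 60),
                   paddock = rep(1:3, each = 120),
                   value = c(1:358, 2000, 5000))

# Hypothetical helper: mean and t-based confidence limits of a numeric vector
ci_mean <- function(x, conf = 0.95) {
  n <- length(x)
  half <- qt((1 + conf) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(value = mean(x), min = mean(x) - half, max = mean(x) + half)
}

# One row of summary statistics per animal x paddock combination
groups <- split(test, list(test$animal, test$paddock), drop = TRUE)
testc <- do.call(rbind, lapply(groups, function(d) {
  data.frame(animal = d$animal[1], paddock = d$paddock[1],
             t(ci_mean(d$value)))
}))
rownames(testc) <- NULL
testc
```

The resulting testc has the same columns (animal, paddock, value, min, max) as the ddply version, so it can be passed to the ggplot call above unchanged.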

-Winston




test.png

Samuel Dennis

Nov 29, 2011, 8:01:09 PM
to Winston Chang, ggplot2
Perfect, thanks heaps. I have adapted it to my real dataset and it works well there too. The clear explanation of how to use ddply in your response is particularly useful. I am still getting my head around that package, but when it works it saves an enormous amount of time.

Thank you again,
Samuel

Winston Chang

Nov 29, 2011, 8:14:44 PM
to Samuel Dennis, ggplot2
Great, I'm glad it helped.

It just occurred to me that you can extend the y range in each facet to any arbitrary value by adding invisible points (alpha=0). The code below adds a column called "top" that contains the max y coordinate for each facet, and plots invisible points there. This is a hack, but it gets the job done.

library(plyr)
library(Hmisc)

testc <-
ddply(test, .(animal, paddock), summarise,
        value= smean.cl.normal(value)[1], 
        min  = smean.cl.normal(value)[2],
        max  = smean.cl.normal(value)[3])

# Add y coordinates for "phantom" points
testc$top <- NA
testc$top[testc$paddock==1] <- 80
testc$top[testc$paddock==2] <- 210
testc$top[testc$paddock==3] <- 1000
# animal paddock   value       min       max  top
# cattle       1  59.500  48.28364  70.71636   80
# cattle       2 179.500 168.28364 190.71636  210
# cattle       3 299.500 288.28364 310.71636 1000
#   none       1  61.500  50.28364  72.71636   80
#   none       2 181.500 170.28364 192.71636  210
#   none       3 417.500 179.58864 655.41136 1000
#  sheep       1  60.500  49.28364  71.71636   80
#  sheep       2 180.500 169.28364 191.71636  210
#  sheep       3 341.525 254.83492 428.21508 1000

ggplot(testc, aes(reorder(animal,value,mean),value)) +
    geom_bar(stat="identity", colour="black", fill="white") + 
    geom_errorbar(aes(ymin=min, ymax=max), width=.1) +
    geom_point(aes(y=top), alpha=0) +
    facet_grid(paddock ~ ., scales = "free")
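
A variation on the same idea, offered as a sketch rather than as part of the original reply: geom_blank() draws nothing but still takes part in scale training, so the per-facet tops can live in a small separate data frame instead of an extra column. This assumes a ggplot2 version recent enough to have geom_col(); the data frame name tops is made up here, and the summary values are copied from the table above.

```r
library(ggplot2)

# Summary values copied from the table above
testc <- data.frame(
  animal  = rep(c("cattle", "none", "sheep"), each = 3),
  paddock = rep(1:3, times = 3),
  value   = c(59.5, 179.5, 299.5, 61.5, 181.5, 417.5, 60.5, 180.5, 341.525),
  min     = c(48.28364, 168.28364, 288.28364, 50.28364, 170.28364, 179.58864,
              49.28364, 169.28364, 254.83492),
  max     = c(70.71636, 190.71636, 310.71636, 72.71636, 192.71636, 655.41136,
              71.71636, 191.71636, 428.21508)
)

# Desired upper y limit for each facet (hypothetical helper data frame)
tops <- data.frame(animal = "cattle", paddock = 1:3, top = c(80, 210, 1000))

p <- ggplot(testc, aes(reorder(animal, value, mean), value)) +
  geom_col(colour = "black", fill = "white") +
  geom_errorbar(aes(ymin = min, ymax = max), width = 0.1) +
  # Invisible layer: its only effect is to stretch each facet's y scale to `top`
  geom_blank(data = tops, aes(x = animal, y = top), inherit.aes = FALSE) +
  facet_grid(paddock ~ ., scales = "free")
p
```

Compared with alpha=0 points, this keeps the plotting data untouched and makes the intent of the extra layer explicit.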


-Winston
test-expand.png

Hadley Wickham

Dec 1, 2011, 11:07:56 AM
to Winston Chang, Samuel Dennis, ggplot2
Obligatory response to this type of plot:
http://biostat.mc.vanderbilt.edu/wiki/Main/DynamitePlots
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/TatsukiRcode/Poster3.pdf

"Beware of dynamite!"

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Samuel Dennis

Dec 1, 2011, 2:56:32 PM
to Hadley Wickham, Winston Chang, ggplot2
You're dead right about barplots, and I try to avoid them myself. However, my actual dataset in this case is particularly complex, and I am trying to find a tidy way of presenting it.

Here is a simulation of what I am actually ending up with (my actual graphs using random data), presented as either a barplot or a dotplot. I am struggling to decide which is best. This is intended for journal publication, so it must be grayscale. I am presenting the effect of four grazing treatments, nitrogen and irrigation on the populations of four species of earthworm (1-4). As nitrogen had no real effect, and I wish to present statistical data on the effect of irrigation, I then overlay larger points displaying the mean of each irrigation treatment and a 95% confidence interval for it.

In your opinion, which of the two options is better? The barplot is much easier to read, but presents little data. The dotplot is great for a paper: with a digitiser you could get all the raw data back out again, it makes it easier for me to label the N treatment, and in my real dataset it shows that a number of the treatments have mostly zero values, which has a major effect on the mean. But it is a very daunting graph to try to make sense of. I don't want to use a boxplot or the like, as this would clash with the statistics I present for the irrigation treatment.

Keen to hear your thoughts.

wormfake <- data.frame(harvest = rep(c("Cattle","Fallow","Mown","Sheep"),160), invertebrate = factor(rep(rep(c(1:4),each=4),40)), N = rep(rep(c("-N","+N"),each=16),20), irrigation = rep(rep(c("Dry","Irrigated"),each=32),10), tha = abs(rnorm(640)*5+rep(1:4)+rep(rep(c(1,5),each=32),10)))
wormfake1 <- ddply(wormfake, .(harvest, invertebrate, N, irrigation), summarise, tha = smean.cl.normal(tha)[1], min = smean.cl.normal(tha)[2], max = smean.cl.normal(tha)[3])
wormfake1$labels <- ""
wormfake1$labels[as.character(wormfake1$N) == "+N"] <- "N"  # "+N" without a leading space, so it matches the N values
wormfake2 <- ddply(wormfake, .(harvest, invertebrate, irrigation), summarise, tha = smean.cl.normal(tha)[1], min = smean.cl.normal(tha)[2], max = smean.cl.normal(tha)[3])

ggplot(data=wormfake1, aes(harvest, tha/10, colour = N, fill = irrigation)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_pointrange(aes(ymin=min/10, ymax=max/10), position = position_dodge(width = 0.9), colour = "black", data=wormfake2) +
  facet_grid(invertebrate ~ ., scales = "free") +
  scale_colour_manual(breaks = c("-N","+N"), values = c("grey50","grey50"), legend=FALSE) +
  scale_fill_manual(breaks = c("Dry","Irrigated"), values = c("grey70","grey50"), labels = c("Dry","Irrigated")) +
  geom_text(aes(x = harvest, y = tha/10+rep(c(0.025,0.003,0.005,0.002),each=4), label = labels), position = position_dodge(width=1, height=1)) +
  labs(x = NULL, y = "Total mass of earthworms (kg / square m)", fill = "Irrigation")

ggplot(data=wormfake, aes(x = harvest, y = tha/10, colour = N, shape = irrigation)) +
  geom_point(position = position_dodge(width = 0.7)) +
  geom_pointrange(aes(ymin=min/10, ymax=max/10), position = position_dodge(width = 0.7), colour = "black", data=wormfake2, size = I(1), legend = FALSE) +
  facet_grid(invertebrate ~ ., scales = "free") +
  scale_colour_manual(breaks = c("-N","+N"), values = c("grey50","grey20"), legend=TRUE) +
  labs(x = NULL, y = "Total mass of earthworms (kg / square m)", shape = "Irrigation")

Samuel

Hadley Wickham

Dec 1, 2011, 4:12:50 PM
to Samuel Dennis, Winston Chang, ggplot2
Hmmm, it's a tricky problem. I did a bit of noodling around and came
up with this:

ggplot(data=wormfake, aes(harvest, tha/10, colour = irrigation)) +
  geom_point(aes(shape = N), position = position_jitter(width = 0.1),
             alpha = 1/2) +
  geom_errorbar(aes(ymin=min/10, ymax=max/10), data=wormfake2,
                width = 0.1, size = 1) +
  geom_line(aes(group = irrigation), data = wormfake2, size = 1) +
  facet_wrap(~ invertebrate, nrow = 2)

A few comments:

* I don't think free scales help here (maybe they do in your real
data), and aligning the data in columns makes it very hard to compare
species

* Given that there isn't a significant nitrogen effect, you might
consider dropping it from this plot altogether

* I like being able to see the raw data, but faintly, in the
background. This plot still needs tweaking to make the lines and
error bars pop out over the raw data - see another attempt below.

ggplot(data=wormfake, aes(harvest, tha/10, colour = irrigation)) +
  geom_point(aes(shape = N), position = position_jitter(width = 0.1),
             alpha = 1/2) +
  geom_errorbar(aes(ymin=min/10, ymax=max/10), data=wormfake2,
                width = 0.12, size = 1.5, colour = "black") +
  geom_line(aes(group = irrigation), data = wormfake2, size = 1.5,
            colour = "black") +
  geom_errorbar(aes(ymin=min/10, ymax=max/10), data=wormfake2,
                width = 0.1, size = 1) +
  geom_line(aes(group = irrigation), data = wormfake2, size = 1) +
  facet_wrap(~ invertebrate, nrow = 2)

* Lines can be very effective at forming groups. The shape of the
lines here is also helpful for detecting interactions.

* Are you sure you want 95% confidence intervals? People will try and
use them to determine if dry is different to irrigated for a given
harvest/species combination. If you want to use them for comparison
you want (IIRC) 84% CIs
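
The 84% figure can be reproduced with a short calculation. This is a sketch under simplifying assumptions that are not from the thread: two independent means with equal standard errors, and normal quantiles. Two intervals of half-width z * SE fail to overlap when the means differ by more than 2 * z * SE, while a two-sided 5% test rejects when they differ by more than qnorm(0.975) * sqrt(2) * SE. Equating the two gives z = qnorm(0.975) / sqrt(2).

```r
# Multiplier z at which "intervals do not overlap" matches a 5% two-sided test
z <- qnorm(0.975) / sqrt(2)

# Confidence level of an interval built with that multiplier (roughly 0.83 to 0.84)
level <- 2 * pnorm(z) - 1

# Conversely: the two-sided test level implied by non-overlap of two 95% intervals
alpha_95 <- 2 * (1 - pnorm(2 * qnorm(0.975) / sqrt(2)))

round(c(z = z, level = level, alpha_95 = alpha_95), 4)
```

With Hmisc, the interval width can then be changed through the conf.int argument, for example smean.cl.normal(tha, conf.int = 0.84).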

Hadley

Dennis Murphy

Dec 1, 2011, 5:06:55 PM
to Samuel Dennis, ggplot2
Hi:

I took a different direction from Hadley; there are several graphs to
choose from below. The primary thing I did differently was to facet by
both harvest and invertebrate so that the comparisons among nitrogen
and irrigation would be easier to visualize. Comparisons across the
columns would give some indication of how harvests compared for a
given invertebrate.

(And I thank Hadley for pointing out the dynamite plot references -
that's usually my rant :)

# The first two plots use geom_pointrange() - one should not
# take the overlap (or lack thereof) between apposed pointranges
# as representing (non)significant differences. These plot the
# min and max from the wormfake1 data frame; the large point
# represents the group average.

# irrigation is the x variable, N the color aesthetic
ggplot(data = wormfake1, aes(irrigation, tha, colour = N)) +
  geom_pointrange(aes(ymin = min, ymax = max),
                  position = position_dodge(width = 0.4),
                  width = 0.2, size = 1) +
  facet_grid(invertebrate ~ harvest, scales = "free_y") +
  scale_colour_manual(breaks = levels(wormfake1$N),
                      values = c("grey50","grey20"), legend=TRUE) +
  labs(x = NULL,
       y = "Total mass of earthworms (kg / square m)",
       colour = "Nitrogen")

# switch: N is the x variable, irrigation the color aesthetic
ggplot(data = wormfake1, aes(N, tha, colour = irrigation)) +
  geom_pointrange(aes(ymin = min, ymax = max),
                  position = position_dodge(width = 0.4),
                  width = 0.2, size = 1) +
  facet_grid(invertebrate ~ harvest, scales = "free_y") +
  scale_colour_manual(breaks = levels(wormfake1$irrigation),
                      values = c("grey50","grey20"), legend=TRUE) +
  labs(x = NULL,
       y = "Total mass of earthworms (kg / square m)",
       colour = "Irrigation")

# Interaction plot between N and irrigation within panel.
# Again, one should not attach too much importance to
# the nonparallelism of profiles since no measure of variation
# is present in the plot
ggplot(data = wormfake1, aes(irrigation, tha, colour = N)) +
  geom_point(size = 3) +
  geom_line(aes(group = N), size = 1) +
  facet_grid(invertebrate ~ harvest, scales = "free_y") +
  scale_colour_manual(breaks = levels(wormfake1$N),
                      values = c("grey50","grey20"), legend=TRUE) +
  labs(x = NULL,
       y = "Total mass of earthworms (kg / square m)",
       colour = "Nitrogen")

# Assuming that the nitrogen effect is nonsignificant, one can
# compare the distributions by irrigation type in each panel
# with boxplots as follows (n = 20 per group):

ggplot(data = wormfake, aes(irrigation, tha, colour = irrigation,
                            fill = irrigation)) +
  geom_boxplot(outlier.size = 0) + # since points included
  geom_point(position = position_jitter(width = 0.2)) +
  facet_grid(invertebrate ~ harvest, scales = "free_y") +
  scale_colour_manual(breaks = levels(wormfake$irrigation),
                      values = c("grey40","grey20")) +
  scale_fill_manual(breaks = levels(wormfake$irrigation),
                    values = c("grey70","grey50")) +
  labs(x = NULL,
       y = "Total mass of earthworms (kg / square m)",
       colour = "Irrigation", fill = 'Irrigation')

HTH,
Dennis

Samuel Dennis

Dec 1, 2011, 6:11:11 PM
to Dennis Murphy, Hadley Wickham, ggplot2
Thank you very much for all those suggestions, Hadley and Dennis; you've both been rather busy on it!

I'd never heard of 84% confidence intervals before. Thank you very much for pointing that out to me; these would be far more appropriate. For reference for future readers, Payton et al (2003) Journal of Insect Science 3:34 explains that, when there is no true difference between the means, 84% CIs overlap around 95% of the time, while 95% CIs actually overlap around 99% of the time. I will change to using 84% CIs.
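
A quick simulation gives a feel for those overlap rates. This is only a sketch under assumptions that are not taken from the paper: two groups of 20 observations drawn from the same normal distribution, t-based intervals, and 5000 repetitions. The function names ci and overlaps are made up for the illustration.

```r
set.seed(1)

# t-based confidence interval for the mean of x
ci <- function(x, conf) {
  half <- qt((1 + conf) / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(mean(x) - half, mean(x) + half)
}

# Do the intervals of two samples from the same population overlap?
overlaps <- function(conf, n = 20) {
  a <- ci(rnorm(n), conf)
  b <- ci(rnorm(n), conf)
  a[1] <= b[2] && b[1] <= a[2]
}

# Share of repetitions in which the two intervals overlap
rate84 <- mean(replicate(5000, overlaps(0.84)))
rate95 <- mean(replicate(5000, overlaps(0.95)))
c(rate84 = rate84, rate95 = rate95)
```

Both rates should land near the figures quoted above; the exact values vary with the seed and the sample size.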

After carefully considering both of your approaches, I think I find it easier to relate to Hadley's faint raw data + CIs. Dennis' graphs give a better direct pairwise comparison between two treatments, but Hadley's gives a more comfortable overall picture of the data. I'm not sure why; it's probably subconscious. The wrapped layout suggested in Hadley's example works very well, as do the crossbars on the error bars.

However, I will remove the lines: although they show the means well, they also imply a trend from one unrelated factor to another, which could mislead some readers. I try to only use lines when x is continuous. The overlap between the points also makes it more difficult to see the N data, which I must show, as on this topic it is something readers will be particularly interested in, and there is one significant effect in one treatment that I need to discuss. With all the dots blended together it is hard to spot this, which is why I dodged the points earlier. Free scales are necessary in my real dataset (order of magnitude differences between species).
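
For anyone wanting to try the layout described above, here is a rough sketch rather than the final figure: faint raw points dodged by irrigation with shape showing N, 84% t-based intervals with crossbars, no connecting lines, greyscale colours, and free y scales in a wrapped layout. The simulated data follow the wormfake example earlier in the thread; the helper ci84 and the column names lo and hi are made up for this illustration.

```r
library(ggplot2)

set.seed(42)
# Simulated data in the style of the wormfake example earlier in the thread
wormfake <- data.frame(
  harvest = rep(c("Cattle", "Fallow", "Mown", "Sheep"), 160),
  invertebrate = factor(rep(rep(1:4, each = 4), 40)),
  N = rep(rep(c("-N", "+N"), each = 16), 20),
  irrigation = rep(rep(c("Dry", "Irrigated"), each = 32), 10),
  tha = abs(rnorm(640) * 5 + rep(1:4, 160) + rep(rep(c(1, 5), each = 32), 10))
)

# Hypothetical helper: mean and 84% t-based confidence limits
ci84 <- function(x) {
  half <- qt(0.92, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(tha = mean(x), lo = mean(x) - half, hi = mean(x) + half)
}

# Summarise by harvest, species and irrigation (pooling over N)
groups <- split(wormfake,
                list(wormfake$harvest, wormfake$invertebrate, wormfake$irrigation),
                drop = TRUE)
summ <- do.call(rbind, lapply(groups, function(d) {
  data.frame(harvest = d$harvest[1], invertebrate = d$invertebrate[1],
             irrigation = d$irrigation[1], t(ci84(d$tha)))
}))

p <- ggplot(wormfake, aes(harvest, tha / 10, colour = irrigation,
                          group = irrigation)) +
  # Raw data, faint, dodged so the two irrigation levels sit side by side
  geom_point(aes(shape = N), position = position_dodge(width = 0.6),
             alpha = 0.5) +
  # 84% intervals with crossbars, dodged by the same amount as the points
  geom_errorbar(aes(ymin = lo / 10, ymax = hi / 10), data = summ,
                position = position_dodge(width = 0.6), width = 0.2) +
  facet_wrap(~ invertebrate, nrow = 2, scales = "free_y") +
  scale_colour_grey(start = 0.2, end = 0.6) +
  labs(x = NULL, y = "Total mass of earthworms (kg / square m)",
       colour = "Irrigation", shape = "N") +
  theme_bw()
p
```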

I'll keep playing with it. But you have both given me excellent suggestions, and the final version will be the better for it, particularly with the smaller CIs.

And I can assure you that there are no other "dynamite" plots in my paper, the rest of the data is in lines or tables!

Thank you very much for all your help,

Samuel