calculating means within ggplot

3,263 views
Skip to first unread message

Stefan

unread,
Mar 14, 2011, 8:47:07 PM3/14/11
to ggplot2
Hey all!

I just started using ggplot and I am blown away by the functionality!
I did figure out quite a bit already but have the following question:

I am measuring leaf senescence of 7different trees at 5 different
times (x) in scores ranging from 1 to 7 (y).
So, my factor is tree with 7 levels and 45 and 60 independent
observations per tree. Hence I have a lot of over-plotting (since the
score can only be 1 to 7). I would like to do a line plot with the
mean of each of the 7 trees.

Can I calculate the means+standard errors from within ggplot?

I did faceting per tree and geom_smooth() for the mean line with
confidence limits, which looks great but I would like to have it in
one plot to see whether it looks better.

Any suggestions are much appreciated!

Thank you all!

Stefan

Ista Zahn

unread,
Mar 14, 2011, 9:46:16 PM3/14/11
to Stefan, ggplot2
Hi Stefan,
See the examples for stat_summary. If you haven't seen the website, I
strongly recommend book marking it.

http://had.co.nz/ggplot2/stat_summary.html

Best,
Ista

> --
> You received this message because you are subscribed to the ggplot2 mailing list.
> Please provide a reproducible example: http://gist.github.com/270442
>
> To post: email ggp...@googlegroups.com
> To unsubscribe: email ggplot2+u...@googlegroups.com
> More options: http://groups.google.com/group/ggplot2
>

--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org

Dennis Murphy

unread,
Mar 14, 2011, 11:09:28 PM3/14/11
to Stefan, ggplot2
Hi:

There are basically two ways to go about this type of problem:
  (1) Let ggplot2 do the summaries.
  (2) Provide the summaries yourself and add layers to the plot using the summary data.

The latter strategy gives you more control. Here's an example based on your description.

# Manufactured data:
d <- data.frame(tree = rep(1:7, each = 50),
                score = sample(1:7, 350, replace = TRUE))

# (1) Let ggplot2 do the work:
g <- ggplot(d, aes(x = tree, y = score))
g + geom_point(position = position_jitter(width = 0.2, height = 0.2), alpha = 0.2) +
    stat_summary(fun.data = 'mean_sdl', geom = 'errorbar', width = 0.2, size = 1) +
    stat_summary(fun.y = mean, geom = 'point', size = 3, color = 'red') +
    stat_summary(fun.y = mean, geom = 'line', size = 1, color = 'red')

This 'works', but the mean_sdl() function computes means and standard deviations, not means and standard errors. I don't know offhand how, or if, mean_sdl() can produce standard errors instead of standard deviations; perhaps someone else can chime in on that point, with perhaps a different summary function.

The second approach, which I prefer in my own work, is to summarize the data prior to plotting so that I have control over what gets plotted. This also aids in the design of the graphic.

# Summary data frame, generated by the ddply() function in package plyr, which loads with ggplot2
ds <- ddply(d, 'tree', summarise, mean = mean(score),
                                   se = sd(score)/sqrt(length(score)) )

g + geom_point(position = position_jitter(width = 0.2, height = 0.2), alpha = 0.2) +
    geom_point(data = ds, aes(y = mean), size = 3, colour = 'red') +
    geom_line(data = ds, aes(y = mean), size = 1, colour = 'red') +
    geom_errorbar(data = ds, aes(y = mean, ymin = mean - se, ymax = mean + se),
                     width = 0.3, size = 1)

There are just as many lines of plot code, but in geom_errorbar() I can specify ymin and ymax in terms of standard errors computed in the summary data frame. The latter graph is meant to be consonant with your expectations; I'll let you decide whether or not they do. I jittered the points and reduced the alpha transparency so that the error bars and mean lines were emphasized, but it's merely an idea for you to consider.

HTH,
Dennis


Stefan

unread,
Mar 15, 2011, 1:56:02 AM3/15/11
to ggplot2
Thanks Ista and Dennis!!

I should be a bit more specific, sorry for not doing so in the first
place.

I have a dataframe with the columns:
tree (with 7 levels), sampling.date(my x values), score(my y
values).

Here is the plot I created:
http://tinyurl.com/5ts7qb2

And here my code:
p<-ggplot(x,aes(date,score))
p+geom_point(alpha=I(1/6))+geom_smooth(size=0.5)+facet_wrap(~tree)
+xlab('Observation date')+ylab('Leaf senescence score')+theme_bw()
+opts(axis.text.x=theme_text(angle=90))

Now, I would like to have all 7 lines (without any points) in one plot
connected by their means over the 5 different sampling dates. When I
can, I avoid calculating mean and standard errors; instead I really
like boxplots since they show everything. In that case though, it
would look a bit messy. So, I now have to get the standard errors (or
conf.limits) that describe the mean. I usually use the doSummary()
function in the doBy package and calculate the mean by 'score~tree
+date' (but I checked out the ddply, which is pretty cool, too! Thanks
for pointing it out Dennis!). But the problem is my uneven sample size
per clone. I did calculate the SEs in excel pretty quickly and
reimported everything into R, which is a workaround I really don't
like.

So, I was wondering if there is a way that ggplot2 does recognizes the
length of the variables it summarizes. I agree with you Dennis that
doing it outside ggplots2 gives you more control about what is going
to be plotted but I can't really figure out a smart way of doing so.

Thanks again guys for all the time you put into reading and answering
all these questions!

Stefan
> <stefan.schrei...@ales.ualberta.ca>wrote:

Ista Zahn

unread,
Mar 15, 2011, 10:46:56 AM3/15/11
to Stefan, ggplot2
Hi Stefan,
Dennis pretty much gave you the tools for this already, unless I
missed something. The unequal length in each group is not a problem at
all. Here is Dennis' example tweeked to match you description:

d <- data.frame(date = rep(1:7, each = 50),
score = sample(1:7, 350, replace = TRUE),
tree = rep(1:10, 35 ))

## stat_summary can actually take any function, so we define one to
return the values we want.
se <- function(x) {
return(c(ymin=(mean(x) - (sd(x)/sqrt(length(x)))), ymax=(mean(x) +
(sd(x)/sqrt(length(x))))))
}

g <- ggplot(d, aes(x = date, y = score))
g + stat_summary(fun.data = 'se', geom = 'errorbar', width = 0.2, size = 1) +


stat_summary(fun.y = mean, geom = 'point', size = 3, color = 'red') +

stat_summary(fun.y = mean, geom = 'line', size = 1, color = 'red') +
facet_wrap(~ tree)

# Or we can do the summarizing separately.

ds <- ddply(d, .(tree, date ), summarise, mean = mean(score),
se = sd(score)/sqrt(length(score)) )

g + geom_point(position = position_jitter(width = 0.2, height = 0.2),
alpha = 0.2) +
geom_point(data = ds, aes(y = mean), size = 3, colour = 'red') +
geom_line(data = ds, aes(y = mean), size = 1, colour = 'red') +
geom_errorbar(data = ds, aes(y = mean, ymin = mean - se, ymax = mean + se),

width = 0.3, size = 1) +
facet_wrap(~ tree )


Best,
Ista

--

Schreiber, Stefan

unread,
Mar 15, 2011, 3:14:06 PM3/15/11
to Ista Zahn, ggplot2, had...@rice.edu

Thanks again Ista!!

I forgot to mention that I have NAs in my data as well and using different approaches I get different standard errors:
So I used your example dataset and included some NAs

require(plyr)


d <- data.frame(date = rep(1:7, each = 50),
                score = sample(1:7, 350, replace = TRUE),
                tree = rep(1:10, 35 ))

d$score <- ifelse(d$score==6,NA,d$score)

#First approach removing the NAs within the ddply function:
d1 <- ddply(d, .(tree, date ), summarise, mean = mean(score,na.rm=T),se = sd(score,na.rm=T)/sqrt(length(score)))

#Second approach removing NAs before using ddply function:
d0 <- subset(d,!is.na(d$score))
d2 <- ddply(d0, .(tree, date ), summarise, mean = mean(score),se = sd(score)/sqrt(length(score)))

Now, looking at these to datasets there some standard errors that are not the same.

I assume that when removing the NAs within ddply the sqrt(length(score)) still includes the rows which have NAs.

I think in this special case it would be better to go for the second approach, or what would you suggest?

Also then (assuming all the NAs), I cannot really do the calculation within ggplot2 using the stat_summary() function?

Thanks for your help!

Stefan

Dennis Murphy

unread,
Mar 15, 2011, 8:28:37 PM3/15/11
to Schreiber, Stefan, ggplot2
Hi:


On Tue, Mar 15, 2011 at 12:14 PM, Schreiber, Stefan <Stefan.S...@ales.ualberta.ca> wrote:

Thanks again Ista!!

I forgot to mention that I have NAs in my data as well and using different approaches I get different standard errors:
So I used your example dataset and included some NAs

require(plyr)


d <- data.frame(date = rep(1:7, each = 50),
                score = sample(1:7, 350, replace = TRUE),
                tree = rep(1:10, 35 ))

d$score <- ifelse(d$score==6,NA,d$score)

#First approach removing the NAs within the ddply function:
d1 <- ddply(d, .(tree, date ), summarise, mean = mean(score,na.rm=T),se = sd(score,na.rm=T)/sqrt(length(score)))


Almost - you need to adjust the length as well:
 
d1 <- ddply(d, .(tree, date ), summarise, mean = mean(score,na.rm=T), se = sd(score,na.rm=T)/sqrt(length(score[!is.na(score)])))

#Second approach removing NAs before using ddply function:
d0 <- subset(d,!is.na(d$score))
d2 <- ddply(d0, .(tree, date ), summarise, mean = mean(score),se = sd(score)/sqrt(length(score)))

Now, looking at these to datasets there some standard errors that are not the same.

I assume that when removing the NAs within ddply the sqrt(length(score)) still includes the rows which have NAs.

I think in this special case it would be better to go for the second approach, or what would you suggest?

Also then (assuming all the NAs), I cannot really do the calculation within ggplot2 using the stat_summary() function?


You can, but you have to use the right summary function. [Thanks for the example, Ista!] To me, one advantage of the 'create the summaries yourself' approach is that you can see in advance what you'll be plotting. There's nothing wrong with stat_summary(), but it's a black box of sorts - you have to trust that it's giving you the right answers. I prefer to control the input and the output, even if it means a few extra lines of code. If something doesn't look right, I can backtrack to find the problem easily. YMMV.

I assume that if you want to plot means + errorbars for each tree that you'll be doing so in separate panels of a faceted plot rather than a combined one. I can see plotting the mean profiles over day in a single plot - that isn't hard and overlaps are not that visually distracting - but if you intend to augment them with error bars, I predict you'll end up with a mess unless your mean profiles are well separated and your standard errors are small. With a response range of 1-7 and seven tree profiles, that doesn't seem likely...but I'm a skeptic by nature :)

Example (admittedly a bit extreme but it illustrates the point):

d <- data.frame(tree = factor(rep(c(18, 24, 32, 33, 36, 48, 52), each = 50)),
                date = rep(rep(c(1, 2, 3, 5, 8), each = 10), 7),

                score = sample(1:7, 350, replace = TRUE))
d$score[c(27, 59, 122, 205, 300)] <- NA

dsumm <- ddply(d, .(tree, date), summarise, mean = mean(score, na.rm = TRUE),
         se = sd(score,na.rm=T)/sqrt(length(score[!is.na(score)])) )

g <- ggplot(dsumm, aes(x = date, y = mean, group = tree))
g + geom_point(aes(colour = tree), size = 2.5) +
    geom_line(aes(colour = tree), size = 1)

last_plot() + geom_errorbar(aes(colour = tree, ymin = mean - se, ymax = mean + se),
                    width = 0.2, size = 1)
last_plot() + facet_wrap(~ tree)

HTH,
Dennis

Schreiber, Stefan

unread,
Mar 16, 2011, 12:05:27 AM3/16/11
to Dennis Murphy, ggplot2

Thanks Dennis!

Now it's all clear :))

With this '7 lines in one plot' approach I was just trying to increase the information to ink ratio. I was thinking of providing 4 average standard errors (one for the 7 trees per day) in the figure caption and no standard errors in the actual figure. Not sure if this is appropriate or not... I was just scoping out the options to present the data.

Thanks again!

Stefan

Reply all
Reply to author
Forward
0 new messages