Overall trendline


Joshua Wiley

Oct 27, 2010, 8:39:23 PM
to ggplot2
Hi All,

I am sure I am missing something very obvious, but I've been going in circles trying to figure out why this does not work.  I have some (discrete) data that were measured at T1 and T2.  I want to plot all the individual lines (à la a spaghetti plot) and then add an overall line of best fit.  The sample data below illustrate my problem and are fairly representative of my actual data.

# An example of my data
set.seed(1716)
sampdat <- data.frame(ids = rep(1:50, each = 2),
                     variable = factor(rep(c(0:1), 50)),
                     value = sample(0:4, 100, replace = TRUE))

# individual lines, works fine
ggplot(data = sampdat, aes(x = variable, y = value, group = ids)) +
 geom_line()

# How I *thought* I should add the overall trend line
ggplot(data = sampdat, aes(x = variable, y = value, group = ids)) +
 geom_line() + stat_smooth(aes(group = 1), method = "lm", size = 2, colour = "blue")

Thanks for your help,

Josh


FWIW:

> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: i486-pc-linux-gnu (32-bit)

other attached packages:
[1] ggplot2_0.8.8  proto_0.3-8    reshape_0.8.3  plyr_1.2.1


--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/

Dennis Murphy

Oct 27, 2010, 10:51:37 PM
to Joshua Wiley, ggplot2
Hi Josh:

I think the problem is that group = 1 is meant to be used for a double purpose here. In the book example of a spaghetti plot (sec. 4.9.3), geom_smooth(aes(group = 1)) was applied to a continuous x variable and rendered an overall mean + SE envelope of the individual profiles. When the x-variable is a factor, one uses geom_line(aes(group = 1)) to plot a line across the levels of that factor. You're apparently trying to do both in one call, and evidently ggplot() is balking :)

How about something like this?
 
# Get mean/SD summaries from each level of variable:
dsumm <- ddply(sampdat, .(variable), summarise, m = mean(value), s = sd(value))

# Spaghetti plots
g <- ggplot(data = sampdat, aes(x = variable)) +
      geom_line(aes(y = value, group = ids))

# Mean line with geom_ribbon to add SDs
g + geom_line(data = dsumm, aes(y = m, group = 1), color = 'blue', size = 1) +
     geom_ribbon(data = dsumm, aes(ymin = m - s, ymax = m + s, group = 1),
                   colour = 'gray80', alpha = I(0.2))

Not perfect, but perhaps enough to get you started.

HTH,
Dennis
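
A possible alternative to pre-computing the summaries, sketched here and not taken from the thread: let stat_summary() compute the per-level means and join them with a single line via group = 1. This assumes a recent ggplot2 (3.3.0 or later, where stat_summary() takes a `fun` argument; older releases such as the 0.8.8 used in this thread spelled it `fun.y`).

```r
library(ggplot2)

# Same sample data as in the original post
set.seed(1716)
sampdat <- data.frame(ids = rep(1:50, each = 2),
                      variable = factor(rep(c(0:1), 50)),
                      value = sample(0:4, 100, replace = TRUE))

# Individual profiles, plus one line joining the per-level means.
# group = 1 makes stat_summary() treat all rows as a single group,
# so the mean at each factor level is connected into one line.
p <- ggplot(sampdat, aes(x = variable, y = value)) +
  geom_line(aes(group = ids)) +
  stat_summary(aes(group = 1), fun = mean, geom = "line", colour = "blue")

p
```

Unlike the ddply() approach above, this draws only the mean line; an SD band would still need a separate summary or a fun.data function.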


Joshua Wiley

Oct 27, 2010, 11:49:35 PM
to Dennis Murphy, ggplot2
Hi Dennis,

You're right (as usual).  I thought I had checked for the factor issue by leaving 'variable' as numeric 0s and 1s, but stat_smooth() appears to need at least 3 different values on the predictor (actually, it works as a factor too if there are 3 levels).

On Wed, Oct 27, 2010 at 7:51 PM, Dennis Murphy <djm...@gmail.com> wrote:
> Hi Josh:
>
> I think the problem is that group = 1 is meant to be used for a double purpose here. In the book example of a spaghetti plot (sec. 4.9.3), geom_smooth(aes(group = 1)) was applied to a continuous x variable and rendered an overall mean + SE envelope of the individual profiles. When the x-variable is a factor, one uses geom_line(aes(group = 1)) to plot a line across the levels of that factor. You're apparently trying to do both

Curse my tautologic coding

> in one call, and evidently ggplot() is balking :)

> How about something like this?
>
> # Get mean/SD summaries from each level of variable:
> dsumm <- ddply(sampdat, .(variable), summarise, m = mean(value), s = sd(value))

Now that I'm thinking in these terms, this is also pretty easy to get from predict.lm()
 

> # Spaghetti plots
> g <- ggplot(data = sampdat, aes(x = variable)) +
>       geom_line(aes(y = value, group = ids))
>
> # Mean line with geom_ribbon to add SDs
> g + geom_line(data = dsumm, aes(y = m, group = 1), color = 'blue', size = 1) +
>      geom_ribbon(data = dsumm, aes(ymin = m - s, ymax = m + s, group = 1),
>                    colour = 'gray80', alpha = I(0.2))
>
> Not perfect, but perhaps enough to get you started.

Seems pretty close... this is not really a good plot choice for this type of data, I was just exploring things visually and got sidetracked when my code did not work as expected.

> HTH,

It does, thanks!

Josh
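
The predict.lm() route mentioned above could look roughly like the following. This is only a sketch under assumptions, not code posted in the thread: the names fit, newd and pred are made up for illustration, and the band is a 95% confidence interval for the mean, which is not the same thing as the mean +/- 1 SD ribbon in Dennis's version.

```r
# Same sample data as in the original post
set.seed(1716)
sampdat <- data.frame(ids = rep(1:50, each = 2),
                      variable = factor(rep(c(0:1), 50)),
                      value = sample(0:4, 100, replace = TRUE))

# With a single two-level factor as the only predictor,
# the fitted values of lm() are simply the two level means.
fit <- lm(value ~ variable, data = sampdat)

# One row per factor level to predict at
newd <- data.frame(variable = factor(levels(sampdat$variable),
                                     levels = levels(sampdat$variable)))

# interval = "confidence" returns a matrix with columns fit, lwr, upr
pred <- cbind(newd, predict(fit, newdata = newd, interval = "confidence"))
pred
```

pred could then stand in for dsumm in the plot, e.g. geom_line(data = pred, aes(y = fit, group = 1)) and geom_ribbon(data = pred, aes(ymin = lwr, ymax = upr, group = 1)).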
 

Hadley Wickham

Oct 28, 2010, 12:07:44 AM
to Joshua Wiley, ggplot2
> # How I *thought* I should add the overall trend line
> ggplot(data = sampdat, aes(x = variable, y = value, group = ids)) +
>  geom_line() + stat_smooth(aes(group = 1), method = "lm", size = 2, colour = "blue")

This may technically be a bug, but stat_smooth assumes that if you
want to fit a smooth line then you have at least 3 x values.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Joshua Wiley

Oct 28, 2010, 12:19:56 AM
to Hadley Wickham, ggplot2
On Wed, Oct 27, 2010 at 9:07 PM, Hadley Wickham <had...@rice.edu> wrote:
> This may technically be a bug, but stat_smooth assumes that if you
> want to fit a smooth line then you have at least 3 x values.

Thanks Hadley; that seems reasonable. Perhaps there could be a
warning (there may be a very good reason you did not)?

http://github.com/hadley/ggplot2/blob/master/R/stat-smooth.r
lines 16 - 19
############################
if (length(unique(data$x)) <= 2) {
  # Not enough data to perform fit
  message("geom_smooth: There must be at least 3 unique x values.",
          " No smooth will be added.")
  return(data.frame())
}
############################


Hadley Wickham

Oct 30, 2010, 8:03:59 PM
to Joshua Wiley, ggplot2
>> This may technically be a bug, but stat_smooth assumes that if you
>> want to fit a smooth line then you have at least 3 x values.
>
> Thanks Hadley; that seems reasonable.  Perhaps there could be a
> warning (there may be a very good reason you did not)?

I'm a bit torn on this issue - there's a similar problem with
geom_line (which needs at least two points). The problem is that it
often crops up for a single group in a single panel, and if you
already know about it, then it's a bit of a pain. If I'm going to do
it in one place, I really should do it everywhere to be consistent
(and have some argument to turn it off?)

Joshua Wiley

Oct 30, 2010, 11:12:44 PM
to Hadley Wickham, ggplot2
On Sat, Oct 30, 2010 at 5:03 PM, Hadley Wickham <had...@rice.edu> wrote:
> I'm a bit torn on this issue - there's a similar problem with
> geom_line (which needs at least two points).  The problem is that it
> often crops up for a single group in a single panel, and if you
> already know about it, then it's a bit of a pain.  If I'm going to do

As long as it is just a message, to me it does not seem any more of a
pain than the warning when missing values are removed.

> it in one place, I really should do it everywhere to be consistent
> (and have some argument to turn it off?)

I appreciate why you are torn; warning that a line requires two points
is rather obvious. I was initially thrown because lm() will fit a
line to two values, but the default for stat_smooth is se = TRUE,
which obviously requires at least 3 values, suggesting that it
actually followed the expected behavior. Lacking a clearly "better"
or "right" choice, staying with the status quo seems most efficient.

Josh
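
For reference, the difference between "two data points" and "two unique x values" can be checked with plain lm(). This is a small sketch, not code from the thread, and it says nothing about how stat_smooth() works internally: with only two observations the line fits but there are no residual degrees of freedom left for a standard error, whereas many observations at two unique x values (as in the sample data) give a usable one.

```r
# Two observations: lm() fits the line exactly ...
two_points <- data.frame(x = c(0, 1), y = c(1, 3))
fit_two <- lm(y ~ x, data = two_points)
coef(fit_two)

# ... but no residual degrees of freedom remain,
# so the standard error of the fit is not a finite number.
df_two <- df.residual(fit_two)
se_two <- predict(fit_two, se.fit = TRUE)$se.fit

# 100 observations at only two unique x values: plenty of residual df
set.seed(1716)
many <- data.frame(x = rep(0:1, 50),
                   y = sample(0:4, 100, replace = TRUE))
fit_many <- lm(y ~ x, data = many)
df_many <- df.residual(fit_many)
se_many <- predict(fit_many, newdata = data.frame(x = 0:1),
                   se.fit = TRUE)$se.fit

df_two
se_two
df_many
se_many
```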
