Passing models to geom_smooth

2,621 views
Skip to first unread message

Winston Chang

unread,
Jan 22, 2012, 12:49:56 PM1/22/12
to ggplot2-dev
Hi everyone -

A recent discussion on the main ggplot2 list spurred me on to try to make it possible to pass a model object to geom_smooth/stat_smooth. 

I made some changes to make it possible to use the syntax geom_smooth(model=mymodel), where mymodel is a model object. The change was fairly trivial -- you can see it here:

It doesn't make use of fortify, which was one possibility that was discussed. Instead, it uses the predictdf functions in stat-smooth-methods.r.

Right now it works with a single predictor. But it doesn't work if you have multiple predictors, which you might want with color mappings or facets. I have a feeling that getting multiple predictors to work will require much more extensive changes.

You can try it out by installing my experimental branch:
install_github("ggplot2", "wch", "experimental")


Here are some examples:

set.seed(123)
x <- runif(20, 0, 10)
z <- factor(rep(c("foo", "bar"), each=10))
y <- x*(0.5*as.numeric(z)) + (3+as.numeric(z)) + runif(20, 0, 3)
dat <- data.frame(x, y, z)

# Works
modlm <- lm(y ~ x, data = dat)
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(model = modlm)

# Works
modloess <- loess(y ~ x, data = dat)
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(model = modloess)

# Doesn't work: 2 variables 
modlm2 <- lm(y ~ x + z + x:z, data = dat)
ggplot(dat, aes(x=x, y=y)) + facet_wrap( ~ z) + geom_point() + geom_smooth(model = modlm2)


-Winston

Winston Chang

unread,
Jan 22, 2012, 4:24:26 PM1/22/12
to ggplot2-dev
I just realized that this modification only works if the columns in the original data are named 'x' and 'y'. It looks like I need to figure out how to get the aesthetic mappings, such as aes(x=xvar, y=yvar, colour=zvar).

Joshua Wiley

unread,
Jan 22, 2012, 4:43:02 PM1/22/12
to Winston Chang, ggplot2-dev
This is completely different in a way but Does relate to graphing model objects.  I think it would be nice if there could be other RHS variables.  Thus for example you could have a scatter plot for x and y with one line showing the effect of x on y; and a second showing the conditional effect of x on with w and z in the model.  Equivalent to:

y ~ x
y ~ x + w + z

To use predict() this would require the creation of a new data frame where x was the same but w and z were held at their mean/median or for factors, perhaps mode or the referent group based on contrasts.  In the past, I have done all of this outside ggplot2 and ten just passed a final data frame with c and expected y values to geom_line() but if work is being done with stat_smooth that would allow models, perhaps it could also allow selection of hitch aspects of the model to plot and what values to hold the other terms at.

Josh

Winston Chang

unread,
Jan 23, 2012, 2:57:33 PM1/23/12
to Joshua Wiley, ggplot2-dev
On Sun, Jan 22, 2012 at 3:43 PM, Joshua Wiley <jwiley...@gmail.com> wrote:
This is completely different in a way but Does relate to graphing model objects.  I think it would be nice if there could be other RHS variables.  Thus for example you could have a scatter plot for x and y with one line showing the effect of x on y; and a second showing the conditional effect of x on with w and z in the model.  Equivalent to:

y ~ x
y ~ x + w + z

To use predict() this would require the creation of a new data frame where x was the same but w and z were held at their mean/median or for factors, perhaps mode or the referent group based on contrasts.  In the past, I have done all of this outside ggplot2 and ten just passed a final data frame with c and expected y values to geom_line() but if work is being done with stat_smooth that would allow models, perhaps it could also allow selection of hitch aspects of the model to plot and what values to hold the other terms at.


Now that I've thought about this a bit more, I've realized that this whole thing may need more consideration about how it should work... The way things currently operate, if your data is grouped (say, with color, or facets) and you add a geom_smooth() to it, it makes a model for each individual group, independent of each other. This is different from passing in a single model that contains these variables.

Here's an example that will illustrate. 

dat <- data.frame(
        x = c(0,10,0,10,0,10),
        y = c(0,10,10,0,5,5),
        z = c(1,1,2,2,3,3))

# Get lm 
mod <- lm(y ~ x + z + x:z, data=dat)
fortify(mod)
#  y  x z      .hat   .sigma .cooksd .fitted .resid .stdresid
#  0  0 1 0.8333333 6.123724   1.250     2.5   -2.5        -1
# 10 10 1 0.8333333 6.123724   1.250     7.5    2.5         1
# 10  0 2 0.3333333 6.123724   0.125     5.0    5.0         1
#  0 10 2 0.3333333 6.123724   0.125     5.0   -5.0        -1
#  5  0 3 0.8333333 6.123724   1.250     7.5   -2.5        -1
#  5 10 3 0.8333333 6.123724   1.250     2.5    2.5         1

ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) + 
    geom_line(aes(y=.fitted), colour="red", data=mod) +
    facet_grid(. ~ factor(z))


The blue lines are the result of stat_smooth, where each facet gets its own linear model. The red lines are the result of a linear model y ~ x + z + x:z. As you can see, the lines are different. This itself is not a problem, but it does show that passing in a single model is different from using stat_smooth.

Here are the two cases spelled out:
1. A model is built for each group. (This is what stat_smooth does.)
2. A single model is used across all groups. (This is done in the example above with the geom_line.)

I wanted to allow syntax like this:
  mod <- lm(y ~ x+z+x:z, data=dat)
  ggplot(dat, aes(x,y)) + geom_point() + stat_smooth(model=mod) +
    facet_grid(. ~ factor(z))
... but I don't think this syntax is expressive enough to handle case (1), since it just passes in one object.

My original goal was to make it possible to have access to the model objects _and_ to easily plot the model objects with ggplot2. Presently, in case (1), it's easy to plot the models, but not to have access to the objects. In case (2), it's possible to plot the objects with the geom_line thing I used above (which internally makes use of fortify), and also have access to the model objects. But if you do it this way, the interface is not particularly clear and there's not a simple way to get confidence regions. Also, there presently are not S3 methods for fortify for many other types of model objects besides lm -- there's no fortify.glm or fortify.loess, for example.

At any rate, what would be nice is a simple, consistent interface to plot and get the model objects in cases (1) and (2). If anyone has thoughts about what such an interface interface would look like, please feel free to share.

-Winston

models.png

Jean-Olivier Irisson

unread,
Jan 23, 2012, 3:59:15 PM1/23/12
to Winston Chang, ggplot2-dev
On 2012-Jan-23, at 20:57 , Winston Chang wrote:
>
> dat <- data.frame(
> x = c(0,10,0,10,0,10),
> y = c(0,10,10,0,5,5),
> z = c(1,1,2,2,3,3))
>
> # Get lm
> mod <- lm(y ~ x + z + x:z, data=dat)
> fortify(mod)
> # y x z .hat .sigma .cooksd .fitted .resid .stdresid
> # 0 0 1 0.8333333 6.123724 1.250 2.5 -2.5 -1
> # 10 10 1 0.8333333 6.123724 1.250 7.5 2.5 1
> # 10 0 2 0.3333333 6.123724 0.125 5.0 5.0 1
> # 0 10 2 0.3333333 6.123724 0.125 5.0 -5.0 -1
> # 5 0 3 0.8333333 6.123724 1.250 7.5 -2.5 -1
> # 5 10 3 0.8333333 6.123724 1.250 2.5 2.5 1
>
> ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) +
> geom_line(aes(y=.fitted), colour="red", data=mod) +
> facet_grid(. ~ factor(z))
>
>
> The blue lines are the result of stat_smooth, where each facet gets its own linear model. The red lines are the result of a linear model y ~ x + z + x:z. As you can see, the lines are different. This itself is not a problem, but it does show that passing in a single model is different from using stat_smooth.

That's because the behaviour of lm is different depending wether the explanatory variables are numeric or factors. When an explanatory variable is numeric, then it just uses that as an additional predictor (think of it as y = a.x1 + b.x2 + c instead of just y = a.x + b , in pseudo mathematical terms). When an explanatory variable is a factor, it splits the regression per level. Either the intercepts are different, or the slope is different, or both.

In you example above, the difference comes from the fact that z is a factor for ggplot (hence the ability to split) and not for the regression. With this:

dat <- data.frame(
x = c(0,10,0,10,0,10),
y = c(0,10,10,0,5,5),
z = c(1,1,2,2,3,3))

dat$z <- factor(dat$z)

mod <- lm(y ~ x + z + x:z, data=dat)

ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) +

geom_line(aes(y=.fitted), colour="red", data=mod) +
facet_grid(. ~ factor(z))

the two approaches are indeed identical.

You can check this further in a less caricatural case this way:

library("plyr")
set.seed(123)
dat <- data.frame(x=runif(20), y=rnorm(20), z=factor(rep(c(1,2), each=10)))

# fit a model per level, separately
ddply(dat, ~z, function(X) {
lm(y ~ x, data=X)$coefficients
})

# fit a model per level, with one call
lm(y ~ x + z + x:z, data=dat)$coefficients

In that case, z=1 is taken as the reference level and the coefficient z2 is the difference in intercept between z=1 and z=2, and x:z2 is the difference in slope between z=1 and z=2. You could also write it this way

lm(y ~ z + x:z, data=dat)$coefficients

in which case the slopes (z1:x and z2:x) will be computed independently and not with one as the difference from the other (and you would find exactly what you get from the separate models). But the first syntax is the usual ANCOVA approach (because the test for the second slope has H0 = differences of slope = 0, and it therefore tells you wether you need to actually consider a separate slope model or wether the same slope can be used across levels).

And, you probably guessed it, but to get the same intercept but a different slope you do

lm(y ~ x + x:z, data=dat)

while to get the same slope but a different intercept

lm(y ~ x + z, data=dat)

I hope this clears things a little.

Of course for the representation, things get tricky when you get two (or more) numeric predictors: the regression is not a line anymore relative to anyone of those two, it is a plane in the y,x,z space (where y is the response and x and z two numeric explanatory variables). I've been looking for a good way to represent the results of multivariate regressions or a while. In the general case (n explanatory variables, n>2), I can't find one. Suggestions are welcome ! ;)

Jean-Olivier Irisson
---
Observatoire Océanologique
Station Zoologique, B.P. 28, Chemin du Lazaret
06230 Villefranche-sur-Mer
Tel: +33 04 93 76 38 04
Mob: +33 06 21 05 19 90
http://jo.irisson.com/

Winston Chang

unread,
Jan 23, 2012, 4:54:36 PM1/23/12
to Jean-Olivier Irisson, ggplot2-dev
On Mon, Jan 23, 2012 at 2:59 PM, Jean-Olivier Irisson <jo.ir...@gmail.com> wrote:

That's because the behaviour of lm is different depending wether the explanatory variables are numeric or factors. When an explanatory variable is numeric, then it just uses that as an additional predictor (think of it as y = a.x1 + b.x2 + c instead of just y = a.x + b , in pseudo mathematical terms). When an explanatory variable is a factor, it splits the regression per level. Either the intercepts are different, or the slope is different, or both.

In you example above, the difference comes from the fact that z is a factor for ggplot (hence the ability to split) and not for the regression. With this:

   dat <- data.frame(
           x = c(0,10,0,10,0,10),
           y = c(0,10,10,0,5,5),
           z = c(1,1,2,2,3,3))

   dat$z <- factor(dat$z)

   mod <- lm(y ~ x + z + x:z, data=dat)

   ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) +
       geom_line(aes(y=.fitted), colour="red", data=mod) +
       facet_grid(. ~ factor(z))

the two approaches are indeed identical.

Cool - I didn't know that lm did this when you have a factor for a predictor. I wonder if other types of models do that -- in other words, can this behavior (having a separate regression for each factor level) be expected to generalize so that code like this can be used with models other than lm?
 
-Winston

Jean-Olivier Irisson

unread,
Jan 23, 2012, 5:07:15 PM1/23/12
to Winston Chang, ggplot2-dev
On 2012-Jan-23, at 22:54 , Winston Chang wrote:
>
> Cool - I didn't know that lm did this when you have a factor for a predictor. I wonder if other types of models do that -- in other words, can this behavior (having a separate regression for each factor level) be expected to generalize so that code like this can be used with models other than lm?

All those that I know about (glm of course, rq, gam I think etc.). I would say this is a property of the formula syntax of R, not of the underlying model, but that is just a gut feeling.

Hadley Wickham

unread,
Jan 23, 2012, 6:15:16 PM1/23/12
to Jean-Olivier Irisson, Winston Chang, ggplot2-dev
On Mon, Jan 23, 2012 at 2:59 PM, Jean-Olivier Irisson
<jo.ir...@gmail.com> wrote:

That's a slight simplification - they're very close, but not
identical. The difference is that if you fit the models separately
each model has a different estimate for the overall standard error.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Winston Chang

unread,
Jan 24, 2012, 1:48:42 AM1/24/12
to Hadley Wickham, Jean-Olivier Irisson, ggplot2-dev
On Mon, Jan 23, 2012 at 5:15 PM, Hadley Wickham <had...@rice.edu> wrote:
> That's because the behaviour of lm is different depending wether the explanatory variables are numeric or factors. When an explanatory variable is numeric, then it just uses that as an additional predictor (think of it as y = a.x1 + b.x2 + c instead of just y = a.x + b , in pseudo mathematical terms). When an explanatory variable is a factor, it splits the regression per level. Either the intercepts are different, or the slope is different, or both.
>
> In you example above, the difference comes from the fact that z is a factor for ggplot (hence the ability to split) and not for the regression. With this:
>
>    dat <- data.frame(
>            x = c(0,10,0,10,0,10),
>            y = c(0,10,10,0,5,5),
>            z = c(1,1,2,2,3,3))
>
>    dat$z <- factor(dat$z)
>
>    mod <- lm(y ~ x + z + x:z, data=dat)
>
>    ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) +
>        geom_line(aes(y=.fitted), colour="red", data=mod) +
>        facet_grid(. ~ factor(z))
>
> the two approaches are indeed identical.

That's a slight simplification - they're very close, but not
identical.  The difference is that if you fit the models separately
each model has a different estimate for the overall standard error.


Well, I took another stab at it and managed to solve the variable name issue. It now passes the model object to fortify() to get predicted values. I was only able to test it with lm, since fortify.xx isn't defined for very many other classes, and I'm not familiar with the ones that it is defined for.

If versions of fortify.xx are written for other types of models, I think they should work with this code, as long as the returned data frame contains a column .fitted with predicted values at each x.

A few other notes:
- There presently is no provision for including confidence regions, since that information isn't provided by fortify.lm. But if standard errors were also by fortify, it should be possible to do confidence regions. This would result in some duplication of functionality between fortify.xx and predictdf.xx in stat-smooth-methods.r. If development continue in this direction, it would probably be a good idea to look into unifying them somehow.

- Right now it will get confused if there are predictors in the model that aren't "used up" by the x axis plus the other aesthetics/facets. (e.g., if you use mod2 below without facets)

- Because it needs to access the variables in the data frame before they're renamed to x, y, colour, etc., the code must be put in the user-facing function geom_smooth, rather than the proto object GeomSmooth (because GeomSmooth is later in the pipeline and only sees the remapped data). This is definitely unusual, and hopefully won't require too much hackiness down the road.

If you want to see the code, here's the commit: https://github.com/wch/ggplot2/commit/d9578e
It can be installed using my experimental branch:
  install_github("ggplot2", "wch", "experimental")


Here are some examples, with the results attached. The shaded regions are from the stat_smooth's calculations, not from the model that's passed in manually.

set.seed(111)
dat <- data.frame(
       wvar = rep(c(1,2,3), each=3),
       xvar = rep(c(0,5,10), 3),
       yvar = c(0,5,10, 10,5,0, 5,5,5) )

# Treat wvar as a factor for lm
dat$wvar <- factor(dat$wvar)

# Add a little noise to yvar
dat$yvar <- dat$yvar + runif(9)


# Using one predictor
# Blue line and shaded region is from stat_smooth(method=lm)
# Red dashed line is from passing in model
mod1 <- lm(yvar ~ xvar, data=dat)
ggplot(dat, aes(x=xvar, y=yvar)) + geom_point() + 
  stat_smooth(method=lm) +
  geom_smooth(model=mod1, colour="red", linetype="dashed", size=1.5)

# Using two predictors, faceted on factor w
mod2 <- lm(yvar ~ xvar + wvar + xvar:wvar, data=dat)
ggplot(dat, aes(x=xvar, y=yvar)) + geom_point() +
  stat_smooth(method=lm) +
  geom_smooth(model=mod2, colour="red", linetype="dashed", size=1.5) +
  facet_grid(. ~ wvar)



-Winston
model-1.png
model-2.png

Jean-Olivier Irisson

unread,
Jan 24, 2012, 4:02:59 AM1/24/12
to Winston Chang, Hadley Wickham, ggplot2-dev
On 2012-Jan-24, at 07:48 , Winston Chang wrote:
>
> - There presently is no provision for including confidence regions, since that information isn't provided by fortify.lm. But if standard errors were also by fortify, it should be possible to do confidence regions. This would result in some duplication of functionality between fortify.xx and predictdf.xx in stat-smooth-methods.r. If development continue in this direction, it would probably be a good idea to look into unifying them somehow.

I think it should be in fortify's output. A .se variable. I'll extract the code from geom_smooth to put it in fortify in the new package unless there is an objection.

The only conceptual issue that I could see with this is that fortify should, in essence, be a way to *extract* information from the model object and that SE is not explicitly computed in the model object.
But I personally already took liberties with this: in the fortify methods for PCA, I recomputed things so that the output of a PCA performed with various R functions (prcomp, ade4:dudi.pca and FactoMineR:PCA) become identical. I think it brings additional value at the cost of making fortify's output a bit different from the content of the original object. The added values seemed worth the cost to me.

> # Using two predictors, faceted on factor w
> mod2 <- lm(yvar ~ xvar + wvar + xvar:wvar, data=dat)
> ggplot(dat, aes(x=xvar, y=yvar)) + geom_point() +
> stat_smooth(method=lm) +
> geom_smooth(model=mod2, colour="red", linetype="dashed", size=1.5) +
> facet_grid(. ~ wvar)

Strictly speaking, I would say that there is still only one predictor (xvar) used in three conditions (levels of wvar).

Winston Chang

unread,
Jan 24, 2012, 4:20:20 PM1/24/12
to Jean-Olivier Irisson, Hadley Wickham, ggplot2-dev
On Tue, Jan 24, 2012 at 3:02 AM, Jean-Olivier Irisson <jo.ir...@gmail.com> wrote:
On 2012-Jan-24, at 07:48 , Winston Chang wrote:
>
> - There presently is no provision for including confidence regions, since that information isn't provided by fortify.lm. But if standard errors were also by fortify, it should be possible to do confidence regions. This would result in some duplication of functionality between fortify.xx and predictdf.xx in stat-smooth-methods.r. If development continue in this direction, it would probably be a good idea to look into unifying them somehow.

I think it should be in fortify's output. A .se variable. I'll extract the code from geom_smooth to put it in fortify in the new package unless there is an objection.

If you're talking about moving the code I added to the new package, you might want to wait a bit, because I realized that there's probably a better way to do it. It also presently doesn't interpolate (it only gets values at the x values in the original data), so if your model is supposed to have a curve, it will end up looking like a bunch of straight line segments.
 
The only conceptual issue that I could see with this is that fortify should, in essence, be a way to *extract* information from the model object and that SE is not explicitly computed in the model object.
But I personally already took liberties with this: in the fortify methods for PCA, I recomputed things so that the output of a PCA performed with various R functions (prcomp, ade4:dudi.pca and FactoMineR:PCA) become identical. I think it brings additional value at the cost of making fortify's output a bit different from the content of the original object. The added values seemed worth the cost to me.

My view is that this isn't really a problem. The various models I've worked with seem to be inconsistent in the information they provide, and in the way the information is structured.

One concern I have is that if the SE values from predict() are used to draw confidence regions, those confidence regions might be misleading when using a single model and splitting it up on a factor. In this example, using the same data as before, the confidence regions from the single model (red) are much narrower than the confidence regions from the three separate models generated by stat_smooth (gray).

set.seed(111)
dat <- data.frame(
       wvar = rep(c(1,2,3), each=3),
       xvar = rep(c(0,5,10), 3),
       yvar = c(0,5,10, 10,5,0, 5,5,5) )

# Treat wvar as a factor for lm
dat$wvar <- factor(dat$wvar)
# Add a little noise to yvar
dat$yvar <- dat$yvar + runif(9)

mod2 <- lm(yvar ~ xvar + wvar + xvar:wvar, data=dat)
mod2pred <- cbind(mod2$mode, predict(mod2, se=TRUE, interval="confidence")$fit)

ggplot(dat, aes(x=xvar, y=yvar)) + geom_point() +
  stat_smooth(method=lm) +
  geom_line(data=mod2pred, aes(y=fit), colour="red", linetype="dashed") +
  geom_ribbon(data=mod2pred, aes(ymin=lwr, ymax=upr), fill="red", alpha=.2) +
  facet_grid(. ~ wvar)


This is the difference that Hadley mentioned. I don't know that one way is right and the other is wrong, but it would be good to make sure people know what they're getting into. (Obviously, if SE's and confidence intervals aren't provided by fortify or plotted, then this isn't an issue.)

-Winston
model-3.png

Jean-Olivier Irisson

unread,
Jan 24, 2012, 5:22:29 PM1/24/12
to Winston Chang, Hadley Wickham, ggplot2-dev
On 2012-Jan-24, at 22:20 , Winston Chang wrote:
>> I think it should be in fortify's output. A .se variable. I'll extract the code from geom_smooth to put it in fortify in the new package unless there is an objection.
>
> If you're talking about moving the code I added to the new package, you might want to wait a bit,

Not necessarily, but eventually yes.

> because I realized that there's probably a better way to do it.
> It also presently doesn't interpolate (it only gets values at the x values in the original data), so if your model is supposed to have a curve, it will end up looking like a bunch of straight line segments.

This is the way I've always seen it work. But indeed, you could use newdata in predict and create a fine, regular grid in x to compute the regression curve from it.

> One concern I have is that if the SE values from predict() are used to draw confidence regions, those confidence regions might be misleading when using a single model and splitting it up on a factor. In this example, using the same data as before, the confidence regions from the single model (red) are much narrower than the confidence regions from the three separate models generated by stat_smooth (gray).

> [...]


> This is the difference that Hadley mentioned. I don't know that one way is right and the other is wrong, but it would be good to make sure people know what they're getting into. (Obviously, if SE's and confidence intervals aren't provided by fortify or plotted, then this isn't an issue.)

I never realized that. It is probably because SE is computed with n=9 in the single model and n=3 in the split one. I can't figure out what it means with respect to the data and in which case one should be preferred over the other…

Winston Chang

unread,
Jan 25, 2012, 6:42:24 PM1/25/12
to Jean-Olivier Irisson, Hadley Wickham, ggplot2-dev
> One concern I have is that if the SE values from predict() are used to draw confidence regions, those confidence regions might be misleading when using a single model and splitting it up on a factor. In this example, using the same data as before, the confidence regions from the single model (red) are much narrower than the confidence regions from the three separate models generated by stat_smooth (gray).
> [...]
> This is the difference that Hadley mentioned. I don't know that one way is right and the other is wrong, but it would be good to make sure people know what they're getting into. (Obviously, if SE's and confidence intervals aren't provided by fortify or plotted, then this isn't an issue.)

I never realized that. It is probably because SE is computed with n=9 in the single model and n=3 in the split one. I can't figure out what it means with respect to the data and in which case one should be preferred over the other…

I think that it should be possible to have the option of (1) passing in a single model, and (2) passing in a set of models, one for each factor level combination (similar to what currently happens with stat_smooth). I'm not sure what kind of data structure would be best suited for this, though, in terms of providing a sensible user interface. It would be nice to have something like a data frame, but with a column of model objects.

-Winston
Reply all
Reply to author
Forward
0 new messages