This is somewhat different, but it does relate to graphing model objects. I think it would be nice if there could be other RHS variables. For example, you could have a scatter plot of x and y with one line showing the effect of x on y, and a second showing the conditional effect of x on y with w and z in the model. Equivalent to:

y ~ x
y ~ x + w + z

To use predict() this would require the creation of a new data frame where x was the same but w and z were held at their mean/median, or, for factors, perhaps the mode or the reference group based on contrasts. In the past, I have done all of this outside ggplot2 and then just passed a final data frame with x and expected y values to geom_line(). But if work is being done on a stat_smooth that would allow models, perhaps it could also allow selection of which aspects of the model to plot and what values to hold the other terms at.
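A minimal sketch of that manual predict()-plus-geom_line() approach (all the names here, dd, fit and newdat, are made up for illustration):

library(ggplot2)

# made-up data: y depends on x, w and z
set.seed(1)
dd <- data.frame(x = runif(50), w = runif(50), z = runif(50))
dd$y <- 2*dd$x + dd$w - dd$z + rnorm(50, sd=0.3)

fit <- lm(y ~ x + w + z, data=dd)

# new data frame: x varies over its range, w and z held at their means
newdat <- data.frame(
  x = seq(min(dd$x), max(dd$x), length.out=100),
  w = mean(dd$w),
  z = mean(dd$z))
newdat$y <- predict(fit, newdata=newdat)

ggplot(dd, aes(x=x, y=y)) + geom_point() +
  stat_smooth(method=lm) +                  # marginal fit: y ~ x
  geom_line(data=newdat, colour="red")      # conditional effect of x from y ~ x + w + z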
That's because the behaviour of lm is different depending on whether the explanatory variables are numeric or factors. When an explanatory variable is numeric, it is simply used as an additional predictor (think of it as y = a.x1 + b.x2 + c instead of just y = a.x + b, in pseudo-mathematical terms). When an explanatory variable is a factor, it splits the regression per level: either the intercepts are different, or the slopes are different, or both.
In your example above, the difference comes from the fact that z is a factor for ggplot (hence the ability to split) but not for the regression. With this:
dat <- data.frame(
  x = c(0,10,0,10,0,10),
  y = c(0,10,10,0,5,5),
  z = c(1,1,2,2,3,3))
dat$z <- factor(dat$z)

mod <- lm(y ~ x + z + x:z, data=dat)

ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) +
  geom_line(aes(y=.fitted), colour="red", data=mod) +
  facet_grid(. ~ factor(z))
the two approaches are indeed identical.
You can check this further in a less contrived case this way:
library("plyr")
set.seed(123)
dat <- data.frame(x=runif(20), y=rnorm(20), z=factor(rep(c(1,2), each=10)))
# fit a model per level, separately
ddply(dat, ~z, function(X) {
  lm(y ~ x, data=X)$coefficients
})
# fit a model per level, with one call
lm(y ~ x + z + x:z, data=dat)$coefficients
In that case, z=1 is taken as the reference level: the coefficient z2 is the difference in intercept between z=1 and z=2, and x:z2 is the difference in slope between z=1 and z=2. You could also write it this way:
lm(y ~ z + x:z, data=dat)$coefficients
in which case the slopes (z1:x and z2:x) will be computed independently, not with one expressed as the difference from the other (and you would find exactly what you get from the separate models). But the first syntax is the usual ANCOVA approach (because the test on the second slope term has H0: the difference in slopes is 0, so it tells you whether you actually need to consider a separate-slopes model or whether the same slope can be used across levels).
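To make that correspondence concrete, here is how the per-level intercept and slope can be read back out of the ANCOVA parameterisation (a small check on the same simulated dat, using nothing beyond what coef() returns):

co <- coef(lm(y ~ x + z + x:z, data=dat))
co
# level z=1: intercept and slope are (Intercept) and x directly
# level z=2: add the differences encoded by z2 and x:z2
unname(c(co["(Intercept)"] + co["z2"], co["x"] + co["x:z2"]))
# these two numbers match the intercept and slope of the separate fit for z=2 above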
And, you probably guessed it, but to get the same intercept but a different slope you do
lm(y ~ x + x:z, data=dat)
while to get the same slope but a different intercept
lm(y ~ x + z, data=dat)
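A quick way to see which terms each of these parameterisations estimates (still on the simulated dat) is to look at the coefficient names:

# different intercepts and different slopes
names(coef(lm(y ~ x + z + x:z, data=dat)))
# same intercept, different slopes
names(coef(lm(y ~ x + x:z, data=dat)))
# different intercepts, same slope
names(coef(lm(y ~ x + z, data=dat)))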
I hope this clears things up a little.
Of course, for the representation, things get tricky when you have two (or more) numeric predictors: the regression is no longer a line relative to either of them; it is a plane in (y, x, z) space (where y is the response and x and z are two numeric explanatory variables). I've been looking for a good way to represent the results of multivariate regressions for a while. In the general case (n explanatory variables, n > 2), I can't find one. Suggestions are welcome! ;)
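One partial workaround with two numeric predictors, essentially the approach suggested at the top of this thread, is to slice the plane: let x vary and draw one predicted line per chosen value of z (its quartiles, say). A sketch with made-up data (dat2, mod_xz and newdat are illustrative names):

set.seed(42)
dat2 <- data.frame(x = runif(100), z = runif(100))
dat2$y <- 1 + 2*dat2$x - 3*dat2$z + rnorm(100, sd=0.5)
mod_xz <- lm(y ~ x + z, data=dat2)

# x varies over its range, z is fixed at its quartiles
newdat <- expand.grid(
  x = seq(min(dat2$x), max(dat2$x), length.out=50),
  z = quantile(dat2$z, c(0.25, 0.5, 0.75)))
newdat$y <- predict(mod_xz, newdata=newdat)

ggplot(dat2, aes(x=x, y=y)) + geom_point() +
  geom_line(aes(colour=factor(round(z, 2))), data=newdat) +
  labs(colour="z held at")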
Jean-Olivier Irisson
---
Observatoire Océanologique
Station Zoologique, B.P. 28, Chemin du Lazaret
06230 Villefranche-sur-Mer
Tel: +33 04 93 76 38 04
Mob: +33 06 21 05 19 90
http://jo.irisson.com/
All those that I know about (glm of course, rq, gam I think etc.). I would say this is a property of the formula syntax of R, not of the underlying model, but that is just a gut feeling.
That's a slight simplification - they're very close, but not
identical. The difference is that if you fit the models separately
each model has a different estimate for the overall standard error.
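A quick way to see that difference, reusing the simulated dat (x, y and two-level factor z) from earlier in the thread: the single model pools residuals across levels into one estimate of sigma, while the separate fits each produce their own.

# pooled residual standard error from the single interaction model
summary(lm(y ~ x + z + x:z, data=dat))$sigma
# one residual standard error per level from separate fits
sapply(split(dat, dat$z), function(X) summary(lm(y ~ x, data=X))$sigma)
# the standard errors from predict(..., se.fit=TRUE) inherit this difference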
Hadley
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
I think it should be in fortify's output, as a .se variable. I'll extract the code from geom_smooth to put it in fortify in the new package unless there is an objection.
The only conceptual issue that I could see with this is that fortify should, in essence, be a way to *extract* information from the model object and that SE is not explicitly computed in the model object.
But I personally already took liberties with this: in the fortify methods for PCA, I recomputed things so that the output of a PCA performed with various R functions (prcomp, ade4::dudi.pca and FactoMineR::PCA) becomes identical. I think it brings additional value at the cost of making fortify's output a bit different from the content of the original object. The added value seemed worth the cost to me.
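For reference, the quantity being discussed is roughly the following. This is only a sketch of the idea using predict() on the simulated dat from earlier, not the actual geom_smooth or fortify code, and .se is the column name proposed in this thread:

mod <- lm(y ~ x + z + x:z, data=dat)
pred <- predict(mod, se.fit=TRUE)
# bind fitted values and their standard errors to the original data
fortified <- data.frame(dat, .fitted=pred$fit, .se=pred$se.fit)
head(fortified)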
> # Using two predictors, faceted on factor w
> mod2 <- lm(yvar ~ xvar + wvar + xvar:wvar, data=dat)
> ggplot(dat, aes(x=xvar, y=yvar)) + geom_point() +
> stat_smooth(method=lm) +
> geom_smooth(model=mod2, colour="red", linetype="dashed", size=1.5) +
> facet_grid(. ~ wvar)
Strictly speaking, I would say that there is still only one predictor (xvar) used in three conditions (levels of wvar).
On 2012-Jan-24, at 07:48, Winston Chang wrote:
>
> - There presently is no provision for including confidence regions, since that information isn't provided by fortify.lm. But if standard errors were also provided by fortify, it should be possible to do confidence regions. This would result in some duplication of functionality between fortify.xx and predictdf.xx in stat-smooth-methods.r. If development continues in this direction, it would probably be a good idea to look into unifying them somehow.
Not necessarily, but eventually yes.
> because I realized that there's probably a better way to do it.
> It also presently doesn't interpolate (it only gets values at the x values in the original data), so if your model is supposed to have a curve, it will end up looking like a bunch of straight line segments.
This is the way I've always seen it work. But indeed, you could use newdata in predict and create a fine, regular grid in x to compute the regression curve from it.
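Something along these lines, reusing the simulated two-level dat and the single interaction model from earlier (with a model containing curvature, e.g. a poly(x, 2) term, the same grid would give a smooth curve instead of segments):

library(plyr)

mod <- lm(y ~ x + z + x:z, data=dat)
# a fine, regular grid in x within each level of z
grid <- ddply(dat, "z", function(X)
  data.frame(x = seq(min(X$x), max(X$x), length.out=100)))
grid$.fitted <- predict(mod, newdata=grid)

ggplot(dat, aes(x=x, y=y)) + geom_point() +
  geom_line(aes(y=.fitted), colour="red", data=grid) +
  facet_grid(. ~ z)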
> One concern I have is that if the SE values from predict() are used to draw confidence regions, those confidence regions might be misleading when using a single model and splitting it up on a factor. In this example, using the same data as before, the confidence regions from the single model (red) are much narrower than the confidence regions from the three separate models generated by stat_smooth (gray).
> [...]
> This is the difference that Hadley mentioned. I don't know that one way is right and the other is wrong, but it would be good to make sure people know what they're getting into. (Obviously, if SE's and confidence intervals aren't provided by fortify or plotted, then this isn't an issue.)
I never realized that. It is probably because SE is computed with n=9 in the single model and n=3 in the split one. I can't figure out what it means with respect to the data and in which case one should be preferred over the other…