Using stat_smooth with more than one explanatory variable


Manuel Spínola

Oct 26, 2012, 11:46:25 AM
to ggplot2
Dear list members,

Is it possible to use stat_smooth to fit and plot a model with more than 1 explanatory variable?

I just found the package visreg that can plot models with more than 1 explanatory variable, but not in a ggplot2 framework.

Best,

Manuel

--
Manuel Spínola, Ph.D.
Instituto Internacional en Conservación y Manejo de Vida Silvestre
Universidad Nacional
Apartado 1350-3000
Heredia
COSTA RICA
mspi...@una.ac.cr
mspin...@gmail.com
Teléfono: (506) 2277-3598
Fax: (506) 2237-7036
Personal website: Lobito de río
Institutional website: ICOMVIS

Brandon Hurr

Oct 26, 2012, 4:37:18 PM
to Manuel Spínola, ggplot2
I just looked at the vignette for visreg and I'm still not sure what you mean... 

Early on, the plots in the vignette are just faceted plots, which is easy to do. They switch about halfway through to faceted plots with sub-facets. I'm sure this is possible, but it will probably take some hacking. Then later there is a heat-map plot, which can be done, next to a 3D plot, which cannot be done. 
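
For the heat-map style plot, something along these lines should work in ggplot2 (a rough sketch with made-up data and variable names, just to show the idea with geom_tile()):

# Rough sketch (made-up data): heat map of fitted probabilities with geom_tile()
library(ggplot2)

set.seed(1)
d <- data.frame(x1 = runif(200), x2 = runif(200))
d$y <- rbinom(200, 1, plogis(-1 + 2 * d$x1 + 3 * d$x2))

fit <- glm(y ~ x1 + x2, family = binomial, data = d)

# Predict over a grid of the two explanatory variables
grid <- expand.grid(x1 = seq(0, 1, length.out = 50),
                    x2 = seq(0, 1, length.out = 50))
grid$p <- predict(fit, newdata = grid, type = "response")

ggplot(grid, aes(x = x1, y = x2, fill = p)) + geom_tile()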


Manuel Spínola

Oct 27, 2012, 9:13:31 AM
to Brandon Hurr, ggplot2
Thank you very much, Dennis.

Now, in the stat_smooth example with glm, how is the confidence interval handled?  Using the delta method?

qplot(Age, as.numeric(Kyphosis) - 1, data = kyphosis) + stat_smooth(method="glm", family="binomial")

Brandon, I was thinking about the case where you have a response variable and several explanatory variables:

RV ~ EV1 + EV2 + EV3 + EV4

To make a conditional plot you need to plot, for example, EV1 against RV while keeping the other variables constant, and that is not possible with stat_smooth. You can go variable by variable, but then it is not a plot of the model, only a plot of a simpler model with one variable at a time. With visreg you can make plots of the full model keeping the other variables constant, and even plots of interactions between variables. The effects package can do something similar, but the plots are not as pretty and it is less flexible.
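
Just to illustrate the kind of plot I mean, here is a rough sketch with made-up data and variable names (EV1, EV2): fit the full model, then predict while varying EV1 and holding EV2 at its mean.

# Rough sketch (made-up data): a "conditional" plot built by hand.
# Vary EV1 over its range, hold EV2 at its mean, plot predictions +/- 1.96*SE.
library(ggplot2)

set.seed(1)
d <- data.frame(EV1 = runif(100), EV2 = runif(100))
d$RV <- rbinom(100, 1, plogis(-1 + 2 * d$EV1 + d$EV2))

fit <- glm(RV ~ EV1 + EV2, family = binomial, data = d)

newd <- data.frame(EV1 = seq(min(d$EV1), max(d$EV1), length.out = 80),
                   EV2 = mean(d$EV2))
pr <- predict(fit, newdata = newd, se.fit = TRUE)   # linear predictor scale
newd$p   <- plogis(pr$fit)
newd$lwr <- plogis(pr$fit - 1.96 * pr$se.fit)
newd$upr <- plogis(pr$fit + 1.96 * pr$se.fit)

ggplot(newd, aes(x = EV1, y = p)) +
    geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3) +
    geom_line()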

Best,

Manuel



Brandon Hurr

Oct 27, 2012, 9:46:29 AM
to Manuel Spínola, ggplot2
From what I could tell, it looked like they just created a factor from a numeric variable (a = 1-3, b = 3-5, etc.) and used that to facet. Seems like that's certainly possible in ggplot.
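
Something like this rough sketch (made-up data and variable names) is what I had in mind: cut() the conditioning variable into bins and facet on the result.

# Rough sketch (made-up data): bin the conditioning variable and facet on it
library(ggplot2)

set.seed(1)
d <- data.frame(EV1 = runif(300), EV2 = runif(300))
d$RV <- rbinom(300, 1, plogis(-1 + 2 * d$EV1 + d$EV2))
d$EV2bin <- cut(d$EV2, breaks = 3)   # factor with 3 levels of EV2

qplot(EV1, RV, data = d) +
    stat_smooth(method = "glm", family = "binomial") +
    facet_wrap(~ EV2bin)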

I'm curious what Dennis said as he's usually more correct and thorough than me....

Dennis Murphy

Oct 27, 2012, 5:43:58 PM
to Manuel Spínola, Brandon Hurr, ggplot2
Hi Manuel:

It appears that the method for computing confidence bands for GLMs in stat_smooth() follows this script:

(i)  find the predicted values on the linear predictor scale; call them LPhat;
(ii) compute LPhat +/- 1.96 * se.fit(LPhat)
(iii) apply the inverse link function on the lower and upper CIs.

What stat_smooth() does is to take 80 equally spaced x-values along its range, apply the above algorithm to those x's and then plot the result. We can see whether or not this hypothesis is correct on the kyphosis data. I'm using the version of the kyphosis data in the gam package, which supports the 1990 Hastie and Tibshirani book.

library(gam)       # for the kyphosis data
library(ggplot2)
data(kyphosis)
km <- glm(as.numeric(Kyphosis) - 1 ~ Age, family = binomial, data = kyphosis)

# predict.glm() returns a list rather than a data frame
# so we need to do a little work to get the confidence bands
# The plot from your code indicates that at fixed x values,
# the intervals are not necessarily symmetric, which suggests
# some transformation of the limits is performed.

# Get the predictions on the linear predictor scale (default):
kmp <- predict(km, se.fit = TRUE)

# Compute intervals by hand since no 'interval =' argument exists in predict.glm():
kmpdf <- data.frame(Age = kyphosis$Age,
                   lwr = kmp$fit - 1.96 * kmp$se.fit,
                   upr = kmp$fit + 1.96 * kmp$se.fit)
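
# (The predictions above are at the observed Age values; if you wanted to
# match the 80 equally spaced x-values that stat_smooth() uses, a grid like
# this sketch would do - it is not needed for the visual check below.)
agegrid  <- data.frame(Age = seq(min(kyphosis$Age), max(kyphosis$Age),
                                 length.out = 80))
kmp_grid <- predict(km, newdata = agegrid, se.fit = TRUE)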

The y-variable in the plot is the response, not the linear predictor. By definition, in logistic regression, under the canonical link function g() of a response probability p,
          g(p) = beta_0 + beta_1 * x1 + beta_2 * x2 + ... + beta_p * xp

The function g() in this case is the canonical logit link function, while the right hand side of the equation is the linear predictor. Life is simplest in this case, because we can then write
            p = g^{-1}(linear predictor).

The confidence bands for the probability of positive response p are obtained in stat_smooth() by applying the inverse link function g^{-1} to the upper and lower limits upr and lwr, respectively, in kmpdf:

The canonical link function for logistic regression is the logit function:
                   logit(p) = log(p/(1 - p))
The inverse link function is the logistic function
                   logis(x) = exp(x)/(1 + exp(x))
This is what will be applied to lwr and upr:
#----

logis <- function(x) exp(x)/(1 + exp(x))

library(plyr)
kmpdf <- mutate(kmpdf, llim = logis(lwr), ulim = logis(upr))

# Now do the plot:

qplot(Age, as.numeric(Kyphosis) - 1, data = kyphosis) +
    stat_smooth(method = "glm", family = "binomial") +
    geom_line(data = kmpdf, aes(x = Age, y = llim),
                        color = "red", linetype = "dashed") +
    geom_line(data = kmpdf, aes(x = Age, y = ulim),
                        color = "red", linetype = "dashed")

#----
The limits should match.

By contrast, another approach would be to compute the predicted values on the response scale and use a Wald-based CI for p, which I think is what Thierry proposed:

kmp2 <- predict(km, type = "response", se.fit = TRUE)
kmpdf2 <- data.frame(Age = kyphosis$Age,
                lwr = kmp2$fit - 1.96 * kmp2$se.fit,
                upr = kmp2$fit + 1.96 * kmp2$se.fit)
last_plot() +
    geom_line(data = kmpdf2, aes(x = Age, y = lwr), 
                       color = "black", linetype = "dotdash") +
    geom_line(data = kmpdf2, aes(x = Age, y = upr), 
                       color = "black", linetype = "dotdash")

I'm not going to get into which one of these is "right"; in fact, there are other ways (e.g. likelihood-based intervals) one could use. But to answer the question, it appears that stat_smooth() renders intervals that are computed on the linear predictor scale and then transformed by the inverse link function. I haven't tried this out with Poisson regression or different link functions in logistic regression to verify that hypothesis - I'll leave that as an exercise.
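
If someone wants to try the exercise, a rough sketch of the same check for Poisson regression might look like this (simulated data, so it only illustrates the recipe rather than proving it):

#----
# Rough sketch: Poisson regression, CI computed on the log (link) scale,
# then back-transformed with exp() and overlaid on the stat_smooth() band.
set.seed(1)
pd <- data.frame(x = runif(100, 0, 10))
pd$y <- rpois(100, lambda = exp(0.2 + 0.15 * pd$x))

pm <- glm(y ~ x, family = poisson, data = pd)
pp <- predict(pm, se.fit = TRUE)     # linear predictor (log) scale

poisdf <- data.frame(x = pd$x,
                     llim = exp(pp$fit - 1.96 * pp$se.fit),
                     ulim = exp(pp$fit + 1.96 * pp$se.fit))

qplot(x, y, data = pd) +
    stat_smooth(method = "glm", family = "poisson") +
    geom_line(data = poisdf, aes(x = x, y = llim),
              color = "red", linetype = "dashed") +
    geom_line(data = poisdf, aes(x = x, y = ulim),
              color = "red", linetype = "dashed")
#----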

Finally, note that se.fit on the linear predictor scale (in kmp) is not the same as se.fit on the response scale (in kmp2). This is because the response (under the canonical logit link function) is the inverse link of the linear predictor, so its standard error needs to take that transformation into account - this is where the "delta method" for the standard error of a transformed estimator comes into play. Computationally, the details are handled inside predict.glm() when you ask for type = "response".
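
For the logit link, that relationship can be checked directly; my assumption is that predict.glm() scales the link-scale standard error by |dp/deta| = p(1 - p), in which case the two se.fit vectors in kmp and kmp2 should agree like this:

#----
# Sketch of the delta-method relationship (an assumption about how
# predict.glm() scales the standard errors, worth verifying yourself):
# se(response scale) = p * (1 - p) * se(linear predictor scale)
p_hat <- plogis(kmp$fit)
all.equal(unname(p_hat * (1 - p_hat) * kmp$se.fit),
          unname(kmp2$se.fit))
#----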

HTH,
Dennis