Plotting model objects with ggplot

Winston Chang

Apr 28, 2012, 2:00:51 PM
to ggplot2-dev
Hi everyone -

Having a way to plot prediction lines from a model is one of the most common feature requests, and after some earlier attempts a couple months ago, I have another one here that I'd like to get feedback on.

There's one easy problem and one hard problem in doing this. The easy problem is making predictions with one x variable and one y variable. The hard problem is deciding how to deal with categorical grouping variables in addition to the x and y variables.

First, the easy problem. I've written a function predictvals() to predict the values from a model, here: https://gist.github.com/2520748. (Unlike predictdf() in stat-smooth-methods.r, it extracts the x range from the model.) You can load the code with:
library(devtools)
source_gist(2520748)

# Here's how to use predictvals with a single model:
loess_mod <- loess(mpg ~ wt, data=mtcars)
predicted_loess <- predictvals(loess_mod, xvar = "wt", yvar = "mpg")

# Plot
p <- ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()
p + geom_line(data=predicted_loess, colour="blue")
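
For anyone who doesn't want to open the gist, here is a rough sketch of what a function like predictvals() could look like. This is not necessarily the code in the gist: the name predictvals_sketch, the samples argument, and the set of supported model classes (lm/glm and loess only) are assumptions made for illustration. The idea is just to pull the x range out of the fitted model, build an evenly spaced grid over it, and call predict().

```r
# Hypothetical stand-in for predictvals(); not the gist's actual code.
# Assumes a single continuous predictor named by xvar.
predictvals_sketch <- function(model, xvar, yvar, xrange = NULL, samples = 100, ...) {
  if (is.null(xrange)) {
    if (inherits(model, "lm")) {
      # lm and glm objects keep the model frame in model$model
      xrange <- range(model$model[[xvar]])
    } else if (inherits(model, "loess")) {
      # loess objects keep the predictor values in model$x
      xrange <- range(model$x)
    } else {
      stop("Don't know how to extract the x range from this model class")
    }
  }

  # Evenly spaced grid over the x range, in a column named after xvar
  newdata <- data.frame(x = seq(xrange[1], xrange[2], length.out = samples))
  names(newdata) <- xvar

  # Predicted values go in a column named after yvar
  newdata[[yvar]] <- as.numeric(predict(model, newdata = newdata, ...))
  newdata
}

# Usage with the same kinds of models as above
pred_loess <- predictvals_sketch(loess(mpg ~ wt, data = mtcars), "wt", "mpg")
pred_lm    <- predictvals_sketch(lm(mpg ~ wt, data = mtcars), "wt", "mpg")
```

Because the output columns are named after the original variables, the result can be handed to geom_line() with the same aes(x=wt, y=mpg) mapping as the points.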


Now the harder problem: dealing with categorical grouping variables. It's possible to use dlply to generate a list of models, and ldply to predict values based on those models. Here's an example of how to generate a list of models:
library(plyr)
lm_mods <- dlply(mtcars, .(cyl, am), lm, formula = mpg ~ wt)

This splits up mtcars by cyl and am, and runs lm(mpg ~ wt) on each subset. With this list of models, we can use predictvals() with ldply() to generate predicted values for each level of the grouping vars.
predicted_lm_multi <- ldply(lm_mods, predictvals, xvar = "wt", yvar = "mpg")

p + geom_line(data=predicted_lm_multi, colour="blue") +
  facet_grid(cyl ~ am)

The result looks the same as using stat_smooth with faceting:
p + stat_smooth(method=lm) + facet_grid(cyl ~ am)
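
To make it clearer what the dlply/ldply steps are doing, here is a base R sketch of the same split, fit, predict, and recombine pipeline. It does not use plyr or predictvals(); the helper name predict_group and the choice of 100 samples per group are assumptions for illustration. The important part is that each group's predictions carry the cyl and am values along, which is what lets the facets match up.

```r
# Split mtcars by every existing combination of cyl and am
groups <- split(mtcars, list(cyl = mtcars$cyl, am = mtcars$am), drop = TRUE)

# Hypothetical helper: fit lm(mpg ~ wt) on one subset and predict over
# that subset's own wt range
predict_group <- function(d, samples = 100) {
  mod <- lm(mpg ~ wt, data = d)
  wt_grid <- seq(min(d$wt), max(d$wt), length.out = samples)
  data.frame(
    cyl = d$cyl[1],   # grouping values, recycled to every row
    am  = d$am[1],
    wt  = wt_grid,
    mpg = as.numeric(predict(mod, newdata = data.frame(wt = wt_grid)))
  )
}

# Recombine the per-group predictions into one data frame
predicted_base <- do.call(rbind, lapply(groups, predict_group))
rownames(predicted_base) <- NULL
```

The resulting data frame has columns cyl, am, wt, and mpg, so it can be passed to geom_line(data=...) together with facet_grid(cyl ~ am) in the same way as predicted_lm_multi.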



There are a couple ways that these models could be used to draw the prediction lines.

Option 1: Make users do both the dlply to generate the models, and the ldply to generate the predicted values, as in the examples above. This would require that the predictvals function is exported. Having predictvals be part of ggplot2 seems like a bit of an odd fit, though.


Option 2: Make users run the dlply to generate the list of models, and then let them pass it to something like stat_model(), like this:
p + stat_model(data = lm_mods) + facet_grid(cyl ~ am)

The easiest way to generate the list of models would be to use the dlply method, because the list returned by dlply has an attribute, "split_labels", which contains information about grouping variables. It's possible for people to generate the list of models themselves and then set the "split_labels" attribute.
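
Here is a sketch of what building the list by hand might look like, using only base R. It assumes "split_labels" is a data frame with one row per list element, in the same order as the list, and one column per grouping variable; that matches my understanding of what dlply produces, but it should be checked against plyr before relying on it. stat_model() itself is only the proposed function from above and doesn't exist yet.

```r
# Split the data and fit one model per cyl/am combination
groups <- split(mtcars, list(mtcars$cyl, mtcars$am), drop = TRUE)
manual_mods <- lapply(groups, function(d) lm(mpg ~ wt, data = d))

# Build the labels: one row per model, in the same order as the list
labels <- do.call(rbind, lapply(groups, function(d) d[1, c("cyl", "am")]))
rownames(labels) <- NULL

# Attach the labels (assumed layout, see note above)
attr(manual_mods, "split_labels") <- labels
```

With the attribute in place, the list carries enough information to tell which facet each model belongs to.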

If we go this route, the predictvals function should still be exported so that users can generate the predicted values themselves if they want.


Here are some other notes about this general approach:
- Using dlply and ldply might be confusing for users.
- The predictor variable must be continuous.
- It allows only one predictor variable (any other variables are categorical and are used for splitting up the data).
- Does it make sense to generalize to multiple predictors?
- The predictdf() functions in stat-smooth-methods.r could be merged with the predictvals() function. (predictdf is a bit more restrictive in that it requires the predictor to be named "x" and the predicted variable to be named "y".)


Feedback is welcome!
-Winston