Suggestion: enhancing the discovery of newly created variables

395 views
Skip to first unread message

JiHO

unread,
Jan 18, 2012, 5:11:17 PM1/18/12
to ggplot2
Hi all,

Enhancing the discoverability of the variables computed by stats and,
particularly, the variables extracted by fortify would be nice.
However, I don't have a good solution (a message would probably be
annoying). Any suggestion is welcome.

The name "fortify" itself is a bit cryptic too, even though I
understand the overall intention. as.data.frame() is a function that
most people already know and is a generic so why not simply define
methods for it: fortify.lm = as.data.frame.lm, fortify.map =
as.data.frame.map etc. ? Those do not seem to exist already. Does
anyone see a conflict?

JiHO
---
http://maururu.net

Winston Chang

unread,
Jan 19, 2012, 2:27:49 AM1/19/12
to JiHO, ggplot2
On Wed, Jan 18, 2012 at 4:11 PM, JiHO <jo.l...@gmail.com> wrote:
Hi all,

Enhancing the discoverability of the variables computed by stats and,
particularly, the variables extracted by fortify would be nice.
However, I don't have a good solution (a message would probably be
annoying). Any suggestion is welcome.

I agree that this would be very useful. For example, if you are adding a linear regression line with stat_smooth(), it's very desirable to be able get the lm object out of ggplot. The way ggplot works right now, you have to do the lm() separately to get the numbers. This makes me feel uneasy because I'm not absolutely sure that the lm object is the same as what's on the graph.

Here's an example:

dat <- data.frame(x=rnorm(20), y=rnorm(20))
# graph
ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm)
# linear model
lm(y~x, data=dat)

The lm object looks right, but what if I made a small mistake that is hard to see by inspecting the lm numbers and the graph?

Perhaps one way to do this is to have a function that takes a ggplot object and returns a list of models and fortify output. Something like this:

p <- ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) + stat_smooth()

extract_stat(p)
# Return a list with lm object, and fortify output

It could also do the same with loess or glm fits. I think this could also be extended to other stats, like stat_summary and stat_bin.

-Winston

Hadley Wickham

unread,
Jan 19, 2012, 8:14:01 AM1/19/12
to JiHO, ggplot2
> The name "fortify" itself is a bit cryptic too, even though I
> understand the overall intention. as.data.frame() is a function that
> most people already know and is a generic so why not simply define
> methods for it: fortify.lm = as.data.frame.lm, fortify.map =
> as.data.frame.map etc. ? Those do not seem to exist already. Does
> anyone see a conflict?

Because, in general, fortify takes two arguments: an object and a data
frame. As.data.frame only takes one.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Hadley Wickham

unread,
Jan 19, 2012, 8:15:58 AM1/19/12
to Winston Chang, JiHO, ggplot2
> Perhaps one way to do this is to have a function that takes a ggplot object
> and returns a list of models and fortify output. Something like this:
>
> p <- ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) +
> stat_smooth()
>
> extract_stat(p)
> # Return a list with lm object, and fortify output
>
> It could also do the same with loess or glm fits. I think this could also be
> extended to other stats, like stat_summary and stat_bin.

I'd rather make it easier to do the opposite:

mod <- gam(y ~ s(x) ,data = dat)
p <- ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(model = mod)

I generally think it's a bad idea to rely on a graphics package to do
your statistics for you. If you're doing anything moderately
complicated, you should do it outside of ggplot2 and then add the
results on to a plot. I agree that the tools for doing this could be
better, and they will get better as the statistical functionality of
ggplot2 is extracted out into separate, stand-alone, well-documented
functions.

JiHO

unread,
Jan 19, 2012, 12:56:26 PM1/19/12
to Hadley Wickham, ggplot2
On Thu, Jan 19, 2012 at 14:14, Hadley Wickham <had...@rice.edu> wrote:
>> The name "fortify" itself is a bit cryptic too, even though I
>> understand the overall intention. as.data.frame() is a function that
>> most people already know and is a generic so why not simply define
>> methods for it?

>
> Because, in general, fortify takes two arguments: an object and a data
> frame.  As.data.frame only takes one.

Uh, OK. I never realized it since most well formed modeling functions
in R keep the data in the model object I always extract it form there.

I thought the goal of fortify was to be able to do

ggplot(someModel) + geom_***

In this case, you can't specifiy both the model and the data, can you?

JiHO
---
http://maururu.net

JiHO

unread,
Jan 19, 2012, 1:06:12 PM1/19/12
to Hadley Wickham, Winston Chang, ggplot2
On Thu, Jan 19, 2012 at 14:15, Hadley Wickham <had...@rice.edu> wrote:
>> Perhaps one way to do this is to have a function that takes a ggplot object
>> and returns a list of models and fortify output. Something like this:
>>
>> p <- ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) +
>> stat_smooth()
>>
>> extract_stat(p)
>> # Return a list with lm object, and fortify output
>>
>> It could also do the same with loess or glm fits. I think this could also be
>> extended to other stats, like stat_summary and stat_bin.
>
> I'd rather make it easier to do the opposite:
>
> mod <- gam(y ~ s(x) ,data = dat)
> p  <-  ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(model = mod)
>
> I generally think it's a bad idea to rely on a graphics package to do
> your statistics for you.

I agree. And to continue on the teacher's point of view, I always
teach my students to compute the model manually and extract necessary
data from it (fitted values, residuals etc.) to make the plots.
Otherwise, they don't understand how the display of the model and the
actual modeling process are connected.

I think that ggplot can help with this data extraction part (cf
fortify) but shouldn't try to do the whole thing. A plot of the
results of the model is not enough anyway.

As for the syntax, what about

mod <- gam(y ~ s(x) ,data = dat)

ggplot(mod) + geom_point(aes(x=x, y=y)) + geom_line(aes(x=x, y=.fitted.values.))

with fortify handling the data extraction from mod behind the scenes.
This format would allow fine control and to produce any plot
imaginable. And this

autoplot(mod)
autoplot(mod, type="resid")

to produce automatically some commonly used plots (fit, residuals vs
fitted, qqplot of residuals etc.). This is the direction I took based
on your advice for factorial analysis plots. The
geom_smooth(model=mod) strikes me as kind of in-between: neither fully
automated nor giving complete control and I am not sure how it would
scale/generalize.

JiHO
---
http://maururu.net

Hadley Wickham

unread,
Jan 20, 2012, 8:46:18 AM1/20/12
to JiHO, Winston Chang, ggplot2
> mod <- gam(y ~ s(x) ,data = dat)
> ggplot(mod) + geom_point(aes(x=x, y=y)) + geom_line(aes(x=x, y=.fitted.values.))
>
> with fortify handling the data extraction from mod behind the scenes.

Doesn't that work already? (modulo a missing fortify method for gam objects)

> This format would allow fine control and to produce any plot
> imaginable. And this
>
> autoplot(mod)
> autoplot(mod, type="resid")
>
> to produce automatically some commonly used plots (fit, residuals vs
> fitted, qqplot of residuals etc.). This is the direction I took based
> on your advice for factorial analysis plots. The
> geom_smooth(model=mod) strikes me as kind of in-between: neither fully
> automated nor giving complete control and I am not sure how it would
> scale/generalize.

The devel version has an autoplot generic for just this reason.

zekai...@netscape.net

unread,
Jan 19, 2012, 10:29:54 AM1/19/12
to had...@rice.edu, winsto...@gmail.com, jo.l...@gmail.com, ggp...@googlegroups.com
Hello all,

My vote is with Hadley, data preparation, statistics should be separate from presentation layer which is graphical display in this case. I make following assumption that the results on the graph are verified as with any data presentation.
mod <- gam(y ~ s(x) ,data = dat)

p <- ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(model = mod)



Zekai 




-- 
You received this message because you are subscribed to the ggplot2 mailing
list.
Please provide a reproducible example: http://gist.github.com/270442

To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

Steven Walker

unread,
Jan 19, 2012, 10:48:18 AM1/19/12
to ggp...@googlegroups.com

On 19/01/12 8:15 AM, Hadley Wickham wrote:
>> Perhaps one way to do this is to have a function that takes a ggplot object
>> and returns a list of models and fortify output. Something like this:
>>
>> p<- ggplot(dat, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm) +
>> stat_smooth()
>>
>> extract_stat(p)
>> # Return a list with lm object, and fortify output
>>
>> It could also do the same with loess or glm fits. I think this could also be
>> extended to other stats, like stat_summary and stat_bin.
> I'd rather make it easier to do the opposite:
>
> mod<- gam(y ~ s(x) ,data = dat)
> p<- ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(model = mod)
>
> I generally think it's a bad idea to rely on a graphics package to do
> your statistics for you. If you're doing anything moderately
> complicated, you should do it outside of ggplot2 and then add the
> results on to a plot. I agree that the tools for doing this could be
> better, and they will get better as the statistical functionality of
> ggplot2 is extracted out into separate, stand-alone, well-documented
> functions.
>
> Hadley
>
I agree with Hadley here. I'm often wanting to pass fitted model
objects to geom_smooth -- for example, if I've already fitted a model,
why make ggplot fit it again? I'm not sure how difficult this would be
to implement, but a wish of mine is to be able to pass any fitted model
object of class with a predict method (e.g. lme):

mod <- lme(y ~ x, data = dat, random = ~ x | z)
p <- ggplot(dat) + facet_wrap( ~ z) + geom_point(aes(x=x, y=y)) +
geom_smooth(model = mod)

I can already foresee some potential implementation issues though --
e.g. lme predictions require the specification of the level in the
hierarchical structure of the model at which predictions are to be made;
and interval estimates of lme predictions are notoriously difficult to
make (so maybe force 'se = FALSE' in geom_smooth?). In any case, its
exciting to me that you're thinking along these lines Hadley. I'll give
it some more thought myself, in terms of specific implementation ideas.

Steve


--
----------------------------------------------------
Steven C Walker
Postdoctoral researcher
D�partement de Sciences Biologiques
Universit� de Montr�al
https://sites.google.com/site/stevencarlislewalker/
514-343-1233
----------------------------------------------------

Winston Chang

unread,
Jan 20, 2012, 11:55:59 AM1/20/12
to Steven Walker, ggp...@googlegroups.com

I'd rather make it easier to do the opposite:

mod<- gam(y ~ s(x) ,data = dat)
p<-  ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(model = mod)

I generally think it's a bad idea to rely on a graphics package to do
your statistics for you.  If you're doing anything moderately
complicated, you should do it outside of ggplot2 and then add the
results on to a plot.  I agree that the tools for doing this could be
better, and they will get better as the statistical functionality of
ggplot2 is extracted out into separate, stand-alone, well-documented
functions.

Hadley

I agree with Hadley here.  I'm often wanting to pass fitted model objects to geom_smooth -- for example, if I've already fitted a model, why make ggplot fit it again?  I'm not sure how difficult this would be to implement, but a wish of mine is to be able to pass any fitted model object of class with a predict method (e.g. lme):

mod <- lme(y ~ x, data = dat, random = ~ x | z)
p <- ggplot(dat) + facet_wrap( ~ z) + geom_point(aes(x=x, y=y)) + geom_smooth(model = mod)

I can already foresee some potential implementation issues though -- e.g. lme predictions require the specification of the level in the hierarchical structure of the model at which predictions are to be made; and interval estimates of lme predictions are notoriously difficult to make (so maybe force 'se = FALSE' in geom_smooth?).  In any case, its exciting to me that you're thinking along these lines Hadley.  I'll give it some more thought myself, in terms of specific implementation ideas.


I have some code that takes a model object and generates a data frame that you then use with geom_line(). Here's roughly how it's used:
  pline <- predictdf( lm(y ~ x, data=dat) )
  ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_line(data=pline)
 
It probably wouldn't be hard to adapt geom_smooth/stat_smooth to do this... Maybe I'll give it a shot over the weekend. (I'll probably post progress to the ggplot2-dev list instead of this one.)

One thing I'm not sure about is how to deal with aesthetic mappings. Say you have a factor mapped to colour and want to add model lines for each one. You could add each one separately and manually set the color, but this seems clunky. I guess maybe you put the model objects in a list somehow, but you'd also have to preserve information about the factor(s) levels that each model object corresponds to. I don't see a really good way to do this. Thoughts?

-Winston

JiHO

unread,
Jan 20, 2012, 4:57:48 PM1/20/12
to ggplot2
Forgot again to reply all :(

---------- Forwarded message ----------
From: JiHO <jo.l...@gmail.com>
Date: Fri, Jan 20, 2012 at 22:57
Subject: Re: Suggestion: enhancing the discovery of newly created variables

To: Hadley Wickham <had...@rice.edu>


On Fri, Jan 20, 2012 at 14:46, Hadley Wickham <had...@rice.edu> wrote:
>> mod <- gam(y ~ s(x) ,data = dat)
>> ggplot(mod) + geom_point(aes(x=x, y=y)) + geom_line(aes(x=x, y=.fitted.values.))
>>
>> with fortify handling the data extraction from mod behind the scenes.
>
> Doesn't that work already? (modulo a missing fortify method for gam objects)

It does. I was just pointing out that passing the model object to
geom_smooth is both:
- not giving as much power to the user as fortify, where the fortified
data can then be used with any geom
- not as automated as having autoplot() do a complete plot for you
I understand the conceptual appeal of it. I'm just not sure how well
it would generalize (what about multivariate models, lme as already
pointed out etc.). But maybe that's also because I more often use
techniques that don't easily sum up to a line +/- se.

>> This format would allow fine control and to produce any plot
>> imaginable. And this
>>
>> autoplot(mod)
>> autoplot(mod, type="resid")
>

> The devel version has an autoplot generic for just this reason.

Great. What's your take on this then: put the autoplot functions in
ggplot for now before moving them to a different package?

JiHO
---
http://maururu.net

JiHO

unread,
Jan 20, 2012, 5:11:28 PM1/20/12
to Winston Chang, Steven Walker, ggp...@googlegroups.com
On Fri, Jan 20, 2012 at 17:55, Winston Chang <winsto...@gmail.com> wrote:
> I have some code that takes a model object and generates a data frame that
> you then use with geom_line(). Here's roughly how it's used:
> pline <- predictdf( lm(y ~ x, data=dat) )
> ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_line(data=pline)
>
> It probably wouldn't be hard to adapt geom_smooth/stat_smooth to do this...
> Maybe I'll give it a shot over the weekend. (I'll probably post progress to
> the ggplot2-dev list instead of this one.)

That seems like exactly what fortify is meant for. You probably should
wrap this into a method for fortify and follow its convention in
naming the variables (preprend a dot "." for extracted variables). See
fortify.lm() as an example.

> One thing I'm not sure about is how to deal with aesthetic mappings. Say you
> have a factor mapped to colour and want to add model lines for each one. You
> could add each one separately and manually set the color, but this seems
> clunky. I guess maybe you put the model objects in a list somehow, but you'd
> also have to preserve information about the factor(s) levels that each model
> object corresponds to. I don't see a really good way to do this. Thoughts?

I think that the "correct" way to do this would be to have the factor
in the model in a way that affects both the intercept and slope (i.e.
fit one separate model per factor level, but do this in one model
call) and then use this in ggplot. Example:

set.seed(123)
x <- runif(20, 0, 10)
a <- factor(rep(c("foo", "bar"), each=10))
y <- x*(0.5*as.numeric(a)) + (3+as.numeric(a)) + runif(20, 0, 3)
d <- data.frame(x, y, a)
library("ggplot2")
ggplot(d) + geom_point(aes(x=x, y=y, colour=a))

m <- lm(y ~ x + a + a:x, data=d)
df <- fortify(m)
ggplot(df, aes(x=x, y=y, colour=a)) + geom_point() +
geom_line(aes(y=.fitted))

Is that what you meant?

JiHO
---
http://maururu.net

Winston Chang

unread,
Jan 21, 2012, 5:55:26 PM1/21/12
to JiHO, Steven Walker, ggp...@googlegroups.com
On Fri, Jan 20, 2012 at 4:11 PM, JiHO <jo.l...@gmail.com> wrote:
On Fri, Jan 20, 2012 at 17:55, Winston Chang <winsto...@gmail.com> wrote:
> I have some code that takes a model object and generates a data frame that
> you then use with geom_line(). Here's roughly how it's used:
>   pline <- predictdf( lm(y ~ x, data=dat) )
>   ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_line(data=pline)
>
> It probably wouldn't be hard to adapt geom_smooth/stat_smooth to do this...
> Maybe I'll give it a shot over the weekend. (I'll probably post progress to
> the ggplot2-dev list instead of this one.)

That seems like exactly what fortify is meant for. You probably should
wrap this into a method for fortify and follow its convention in
naming the variables (preprend a dot "." for extracted variables). See
fortify.lm() as an example.

The way I did it uses predict(), so it worked with any model object that has a predict() S3 method. These include lm, glm, loess, and so on. I think fortify, as currently written, needs to be customized for each type of model. However, it might be possible to have the default fortify method use predict().


> One thing I'm not sure about is how to deal with aesthetic mappings. Say you
> have a factor mapped to colour and want to add model lines for each one. You
> could add each one separately and manually set the color, but this seems
> clunky. I guess maybe you put the model objects in a list somehow, but you'd
> also have to preserve information about the factor(s) levels that each model
> object corresponds to. I don't see a really good way to do this. Thoughts?

I think that the "correct" way to do this would be to have the factor
in the model in a way that affects both the intercept and slope (i.e.
fit one separate model per factor level, but do this in one model
call) and then use this in ggplot. Example:

   set.seed(123)
   x <- runif(20, 0, 10)
   a <- factor(rep(c("foo", "bar"), each=10))
   y <- x*(0.5*as.numeric(a)) + (3+as.numeric(a)) + runif(20, 0, 3)
   d <- data.frame(x, y, a)
   library("ggplot2")
   ggplot(d) + geom_point(aes(x=x, y=y, colour=a))

   m <- lm(y ~ x + a + a:x, data=d)
   df <- fortify(m)
   ggplot(df, aes(x=x, y=y, colour=a)) + geom_point() +
geom_line(aes(y=.fitted))

Is that what you meant?

Yes, thanks -- this makes perfect sense.

-Winston 

JiHO

unread,
Jan 22, 2012, 2:26:56 AM1/22/12
to Winston Chang, Steven Walker, ggp...@googlegroups.com
On Sat, Jan 21, 2012 at 23:55, Winston Chang <winsto...@gmail.com> wrote:
>> That seems like exactly what fortify is meant for. You probably should
>> wrap this into a method for fortify and follow its convention in
>> naming the variables (preprend a dot "." for extracted variables). See
>> fortify.lm() as an example.
>
>
> The way I did it uses predict(), so it worked with any model object that has
> a predict() S3 method. These include lm, glm, loess, and so on. I think
> fortify, as currently written, needs to be customized for each type of
> model. However, it might be possible to have the default fortify method use
> predict().

Indeed fortify requires a method for each type of model, but that's
probably because it extract more than just the fit. fortify.lm uses
predict though:

fortify.lm
function (model, data = model$model, ...)
{
infl <- influence(model, do.coef = FALSE)
data$.hat <- infl$hat
data$.sigma <- infl$sigma
data$.cooksd <- cooks.distance(model, infl)
data$.fitted <- predict(model)
data$.resid <- resid(model)
data$.stdresid <- rstandard(model, infl)
data
}

Maybe the code above is valid for more than just lm. In which case
maybe it should be called something like fortify.regression and
fortify.lm, fortify.glm, fortify.loess would just be aliases. From the
documentation, lm and glm would be OK but loess probably not.

The default fortify method is currently just a fallback which gives an
error message:

function (model, data, ...)
{
stop("ggplot2 doesn't know how to deal with data of class ",
class(model), call. = FALSE)
}

so I don't think it would be appropriate to have default actions
there. In addition, fortify can be applied to stuff other than the
result of a regression model (maps etc.) and predict does not make
sense in these other contexts.

JiHO
---
http://maururu.net

Reply all
Reply to author
Forward
0 new messages