I'm trying to fit linear models to subsets of a dataframe and collect
the fit coefficients and statistics in another dataframe for further
processing. I am having two basic problems. One is how to handle fits
that either fail completely for lack of data or fit only a single
coefficient and thus have less output data than the typical fit. The
second problem is getting the output into some for where the
coefficients for each subset are properly labeled. I've made up a toy
example to illustrate the issues:
library("plyr")
# Define the data.
r1 <- c("A", "L", 1, 2)
r2 <- c("A", "L", 2, 5)
r3 <- c("A", "L", 1, 3)
r4 <- c("A", "R", 0, 2)
r5 <- c("A", "R", 2, 6)
r6 <- c("A", "R", 1, 4)
r7 <- c("B", "L", 5, 5)
r8 <- c("B", "L", 4, 7)
r9 <- c("B", "L", 3, 8)
r10 <- c("B", "R", 5, 6)
r11 <- c("B", "R", 4, 8)
r12 <- c("B", "R", 3, 9)
r13 <- c("C", "R", 2, NaN)
r14 <- c("C", "L", 2, 8)
# foo1. Good data.
foo <- as.data.frame(rbind(r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12))
# foo2. Add in a subset that will cause the fit to fail completely.
foo <-
as.data.frame(rbind(r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12,r13))
# foo3. Add in a subset that will cause only a single coefficient to
be fit.
foo <-
as.data.frame(rbind(r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12,r13,r14))
# Define the fit.
fit <- function(df) { lm( y ~ 1 + x, data = df) }
# Fit the models.
models <- dlply(foo, .(cond1,cond2), failwith(NA, fit))
[[Up to here, everything works as expected regardless of what data is
included in "foo".]]
# Functions to extract parameters.
params1 <- function(x) {coef(x)}
params2 <- function(x) {as.data.frame(coef(x))}
params3 <- function(x) {summary(x)$coefficients}
params4 <- function(x) {as.data.frame(summary(x)$coefficients)}
# Extract the coefficients to a data frame.
fitDF <-ldply( models, failwith(NA, params1, quiet=FALSE) )
Results:
Using "foo1" and params1, everything looks great:
> fitDF
cond1 cond2 (Intercept) x
1 A L 1.025580e-15 2.5
2 A R 2.000000e+00 2.0
3 B L 1.266667e+01 -1.5
4 B R 1.366667e+01 -1.5
Using "foo2" and params1 fails:
> source("test2.R")
Error in lm.fit(x, y, offset = offset, singular.ok =
singular.ok, ...) :
0 (non-NA) cases
Error in object$coefficients : $ operator is invalid for atomic
vectors
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
This line fixes it however, although the i.d is now less useful:
models <- models[!is.na(models)]
> fitDF
.id (Intercept) x
1 A.L 1.025580e-15 2.5
2 A.R 2.000000e+00 2.0
3 B.L 1.266667e+01 -1.5
4 B.R 1.366667e+01 -1.5
switching to params2 makes the output even worse as information about
which coefficient is which is lost:
> fitDF
.id coef(x)
1 A.L 1.025580e-15
2 A.L 2.500000e+00
3 A.R 2.000000e+00
4 A.R 2.000000e+00
5 B.L 1.266667e+01
6 B.L -1.500000e+00
7 B.R 1.366667e+01
8 B.R -1.500000e+00
>
What I would really like is to get the coefficients and errors, etc:
[With foo1 and params3 and no "models <- models[!is.na(models)]"]:
> fitDF
cond1 cond2 V1 V2 V3 V4 V5
1 A L 1.025580e-15 2.5 1.224745e+00 8.660254e-01 8.373826e-16
2 A R 2.000000e+00 2.0 3.040471e-16 2.355139e-16 6.577928e+15
3 B L 1.266667e+01 -1.5 1.178511e+00 2.886751e-01 1.074802e+01
4 B R 1.366667e+01 -1.5 1.178511e+00 2.886751e-01 1.159655e+01
V6 V7 V8
1 2.886751e+00 1.000000e+00 2.122956e-01
2 8.492069e+15 9.678120e-17 7.496639e-17
3 -5.196152e+00 5.906131e-02 1.210377e-01
4 -5.196152e+00 5.476187e-02 1.210377e-01
Adding the models <- models[!is.na(models)]" call results in the
following which is less desirable (ids merged):
> fitDF
.id V1 V2 V3 V4
V5 V6
1 A.L 1.025580e-15 2.5 1.224745e+00 8.660254e-01 8.373826e-16
2.886751e+00
2 A.R 2.000000e+00 2.0 3.040471e-16 2.355139e-16 6.577928e+15
8.492069e+15
3 B.L 1.266667e+01 -1.5 1.178511e+00 2.886751e-01 1.074802e+01
-5.196152e+00
4 B.R 1.366667e+01 -1.5 1.178511e+00 2.886751e-01 1.159655e+01
-5.196152e+00
V7 V8
1 1.000000e+00 2.122956e-01
2 9.678120e-17 7.496639e-17
3 5.906131e-02 1.210377e-01
4 5.476187e-02 1.210377e-01
Changing to params4 is a step in the wrong direction:
> fitDF
.id Estimate Std. Error t value Pr(>|t|)
1 A.L 1.025580e-15 1.224745e+00 8.373826e-16 1.000000e+00
2 A.L 2.500000e+00 8.660254e-01 2.886751e+00 2.122956e-01
3 A.R 2.000000e+00 3.040471e-16 6.577928e+15 9.678120e-17
4 A.R 2.000000e+00 2.355139e-16 8.492069e+15 7.496639e-17
5 B.L 1.266667e+01 1.178511e+00 1.074802e+01 5.906131e-02
6 B.L -1.500000e+00 2.886751e-01 -5.196152e+00 1.210377e-01
7 B.R 1.366667e+01 1.178511e+00 1.159655e+01 5.476187e-02
8 B.R -1.500000e+00 2.886751e-01 -5.196152e+00 1.210377e-01
However, running with params3 and foo3 (which includes data that will
give a fit with only a single coefficient) fails:
> source("test2.R")
Error in lm.fit(x, y, offset = offset, singular.ok =
singular.ok, ...) :
0 (non-NA) cases
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
Switching to params4 fixes the error but the coefficients have now
lost their labels:
> fitDF
.id Estimate Std. Error t value Pr(>|t|)
1 A.L 1.025580e-15 1.224745e+00 8.373826e-16 1.000000e+00
2 A.L 2.500000e+00 8.660254e-01 2.886751e+00 2.122956e-01
3 A.R 2.000000e+00 3.040471e-16 6.577928e+15 9.678120e-17
4 A.R 2.000000e+00 2.355139e-16 8.492069e+15 7.496639e-17
5 B.L 1.266667e+01 1.178511e+00 1.074802e+01 5.906131e-02
6 B.L -1.500000e+00 2.886751e-01 -5.196152e+00 1.210377e-01
7 B.R 1.366667e+01 1.178511e+00 1.159655e+01 5.476187e-02
8 B.R -1.500000e+00 2.886751e-01 -5.196152e+00 1.210377e-01
9 C.L 8.000000e+00 NaN NaN NaN
So I'm sort of going in circles.
I would be happy to eliminate the fits with less that two coefficients
from the list "models" but I haven't had much success coming up with a
way to eliminate list elements in a list of lists based on the value
of elements in a sub-list.
Any help on how to clean up "models" ahead of extracting the fit
parameters or of finding a robust way to get the output formatted
better and deal with the special cases would be appreciated.
Thanks!
-A
models <- dlply(foo, .(cond1,cond2), failwith(NULL, fit))
and then do things like
ok_models <- compact(models)
etc.
Hadley
> --
> You received this message because you are subscribed to the Google Groups "manipulatr" group.
> To post to this group, send email to manip...@googlegroups.com.
> To unsubscribe from this group, send email to manipulatr+...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/manipulatr?hl=en.
>
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/