(1) SS calculation for fit.mult.impute and (2) df's for rcs


Nimish Thakore

Oct 10, 2015, 5:03:29 PM
to regmod
Dear Professor Harrell,
 
Thank you for directing me to this group.
 
I have two questions:
 
1. I am fitting ols models on multiply imputed datasets created with aregImpute, using fit.mult.impute. Comparing fits and checking non-linearity (with pol/rcs) using the anova function applied to the pooled fit is indeed very convenient. It is easy to understand the estimation of single coefficients and the testing of contrasts of multiple coefficients (C*beta) using the vcov obtained from Rubin's rules. However, I do not understand how anova computes sums of squares from multiply imputed datasets. Does it use the means of the different Y vectors and X matrices over all imputations to get Y-Yhat? Also, for model comparisons, is AIC (or logLik) taken only from the last imputation, or is it, with some justification, "averaged" for that model over all imputed datasets?

2. Restricted cubic spline df's: why are only 2 df expended by rcs(x,3) in the anova table, when there should be 3 for 3 knots? My counting: 2*2 parameters for the linear segments at each end + 4*(k-1) for the cubic segments between the k knots - 3*k continuity restrictions at the knots = k. Is it because a constant term is absorbed by the intercept?
 
This discrepancy came up with a model that had the following formula: y~x1+x2+rcs(x3,3)+rcs(x4,3). The anova table showed 1, 1, 2, 2 df for x1:x4 respectively. The above logic (absorbing one constant term of each rcs into the intercept) should mean 1+1+1+2+2 = 7 parameters, but the difference between AIC and -2*logLik was 8*2 = 16...
 
I do apologize for asking these naive questions.
 
Sincerely
Nimish Thakore

Frank Harrell

Oct 11, 2015, 9:24:20 AM
to regmod
These are good questions.  To make the anova, summary, and contrast methods work, the approach has to be general.  Upon multiple imputation, fit.mult.impute stores the new Rubin variance-covariance matrix, and downstream methods such as anova use it, ignoring sums of squares.  anova for OLS fits run through fit.mult.impute uses the classical type of Wald statistic and tests against chi-square distributions rather than F distributions, unlike when you do anova(ols()).
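[Editor's note: Rubin's rules are simple to sketch. The following is a hedged base-R illustration with a hypothetical helper function, not the actual fit.mult.impute internals; it assumes a list of per-imputation coefficient vectors and covariance matrices.]

```r
## Sketch of Rubin's rules for pooling m imputation fits.
## betas: list of coefficient vectors; vcovs: list of their
## covariance matrices (hypothetical inputs).
pool_rubin <- function(betas, vcovs) {
  m    <- length(betas)
  bbar <- Reduce(`+`, betas) / m       # pooled coefficient estimates
  W    <- Reduce(`+`, vcovs) / m       # average within-imputation variance
  B    <- Reduce(`+`, lapply(betas, function(b)
            tcrossprod(b - bbar))) / (m - 1)  # between-imputation variance
  V    <- W + (1 + 1/m) * B            # total (Rubin) variance
  list(coef = bbar, vcov = V)
}
```

A Wald chi-square test of H0: C beta = 0 then uses X2 = (C bbar)' [C V C']^-1 (C bbar), with df equal to the number of rows of C; this uses only the pooled coefficients and covariance matrix, with no sums of squares involved.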

For the following, ignore the intercept, because that goes with the whole model.  An ordinary cubic spline function with k knots has a linear term plus a regular quadratic term plus a regular cubic term plus k other cubed-term coefficients at the knots (one nonlinear term per knot), for a total of k+3 parameters.  For restricted cubic splines (natural splines), the linear tail restrictions remove the ordinary square and cube terms on the left and 2 nonlinear terms on the right, for a total of k-1 parameters.  So what remains is a linear term and k-2 nonlinear terms representing differences in cubes.
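[Editor's note: a quick way to see the k-1 count is to inspect the basis matrix itself, e.g. with Hmisc's rcspline.eval, asking it to include the linear term via inclx.]

```r
library(Hmisc)
x <- seq(0, 10, length.out = 200)
## A restricted cubic spline with k knots has k-1 basis columns:
## the linear term plus k-2 nonlinear terms.
ncol(rcspline.eval(x, nk = 3, inclx = TRUE))  # 2 columns for 3 knots
ncol(rcspline.eval(x, nk = 5, inclx = TRUE))  # 4 columns for 5 knots
```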

Nimish Thakore

Oct 11, 2015, 6:10:27 PM
to regmod
Thanks a lot, Dr. Harrell. 

1. So we can safely ignore the reported SS in the anova table obtained by anova(fit.mult.impute(...)). Basically the p-values are from Wald statistics using the Rubin-pooled parameter covariance matrix and df.

2. I just realized the df discrepancy example I mentioned had nothing to do with rcs at all. logLik (and the AIC function built on it) shows an extra df when used with an lm() object because of the additional estimation of sigma^2 under classical (ML) theory, while anova (the ols method) ignores that parameter estimation. Sorry.
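[Editor's note: this is easy to verify in base R with any built-in dataset.]

```r
## lm() counts sigma^2 as an extra estimated parameter in logLik/AIC,
## beyond the regression coefficients.
fit <- lm(mpg ~ wt + hp, data = mtcars)
length(coef(fit))        # 3 regression coefficients (intercept, wt, hp)
attr(logLik(fit), "df")  # 4 = 3 coefficients + sigma^2
```

So a model with 7 regression parameters plus the intercept carries df = 8 in logLik, and AIC - (-2*logLik) = 2*8 = 16, matching the discrepancy above.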

Frank Harrell

Oct 12, 2015, 2:39:10 PM
to regmod
anova(fit.mult.impute(...)) does not report sums of squares, for the reason that once you use multiple imputation the analysis, even with OLS, is no longer "classical".  The p-values are as you described.

Nimish Thakore

Oct 15, 2015, 7:45:44 PM
to regmod
Dear Dr. Harrell,
I hate to be a pest, and this may not be the place to write,
but anova(fit.mult.impute(...)) does report sums of squares - see the made-up example below.
Maybe I am making a terrible coding error.
Sorry
Nimish


> set.seed(2)
> x1<-runif(20)*3
> x2<-runif(20)*5
> y<-x1^2+2*x2+rnorm(20)

> x1m<-x1
> x1m[runif(20)<.2]<-NA
> x2m<-x2
> x2m[runif(20)<.2]<-NA

> DATA<-data.frame(y,x1m,x2m)
> DATA
           y       x1m       x2m
1   9.017440 0.5546468 3.3094938
2   7.115533 2.1071221 1.9377477
3  12.916858        NA 4.1844459
4   3.713839 0.5041558 0.7525072
5  11.495155        NA 1.7363612
6  10.447331 2.8304249 2.4438662
7   2.119844 0.3874769 0.7462343
8   9.225800 2.5003464 1.7853130
9  12.390016 1.4040555        NA
10  4.335696 1.6499512        NA
11  3.592121 1.6580222        NA
12  2.479019 0.7166843 0.8232112
13 14.383510 2.2815399 4.0509607
14  8.698716 0.5424603 4.3443052
15  5.844425 1.2158465 2.5714088
16 12.233207        NA 3.1359814
17 15.298496 2.9291955 4.2221450
18  2.405096        NA        NA
19  7.893892        NA 3.3361282
20  1.308782 0.2249383 0.7523488


> DATA.mi<-aregImpute(~y+x1m+x2m,DATA, n.impute=5,nk=0)
Iteration 8 
> DATA.mi

Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~y + x1m + x2m, data = DATA, n.impute = 5, 
    nk = 0)

n: 20 p: 3 Imputations: 5   nk: 0 

Number of NAs:
  y x1m x2m 
  0   5   4 

    type d.f.
y      l    1
x1m    l    1
x2m    l    1

Transformation of Target Variables Forced to be Linear

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
  x1m   x2m 
0.876 0.482 

> DATA.mi$imputed
$y
NULL

$x1m
        [,1]      [,2]      [,3]      [,4]      [,5]
3  0.5546468 0.5546468 0.5546468 0.5546468 2.8304249
5  2.9291955 0.5546468 0.5546468 0.5546468 0.5546468
16 2.8304249 2.2815399 2.2815399 2.9291955 2.9291955
18 0.7166843 0.3874769 0.2249383 0.7166843 0.3874769
19 0.7166843 0.7166843 0.2249383 0.7166843 0.7166843

$x2m
        [,1]      [,2]      [,3]      [,4]      [,5]
9  4.2221450 3.3094938 4.1844459 4.1844459 1.7363612
10 0.7523488 0.7523488 0.8232112 0.8232112 0.7525072
11 0.7523488 3.3094938 0.7523488 0.7523488 0.7462343
18 0.7462343 0.7462343 0.7525072 0.7523488 0.8232112


> f<-fit.mult.impute(y~rcs(x1,3)+rcs(x2,3),ols, DATA.mi,data=DATA)

Variance Inflation Factors Due to Imputation:

Intercept        x1       x1'        x2       x2' 
        1         1         1         1         1 

Rate of Missing Information:

Intercept        x1       x1'        x2       x2' 
        0         0         0         0         0 

d.f. for t-distribution for Tests of Single Coefficients:

Intercept        x1       x1'        x2       x2' 
      Inf       Inf       Inf       Inf       Inf 

The following fit components were averaged over the 5 model fits:

  fitted.values stats linear.predictors 

> f

Linear Regression Model

fit.mult.impute(formula = y ~ rcs(x1, 3) + rcs(x2, 3), fitter = ols, 
    xtrans = DATA.mi, data = DATA)
                Model Likelihood     Discrimination    
                   Ratio Test           Indexes        
Obs       20    LR chi2     63.21    R2       0.958    
sigma 1.0282    d.f.            4    R2 adj   0.946    
d.f.      15    Pr(> chi2) 0.0000    g        5.085    

Residuals

    Min      1Q  Median      3Q     Max 
-1.1229 -0.6170 -0.1657  0.4354  1.9382 

          Coef   S.E.   t    Pr(>|t|)
Intercept 0.5935 0.8554 0.69 0.4984  
x1        1.4888 0.7384 2.02 0.0620  
x1'       1.5526 0.8784 1.77 0.0975  
x2        1.2824 0.5104 2.51 0.0239  
x2'       1.0822 0.6734 1.61 0.1289  



> anova(f)
                Analysis of Variance          Response: y 

 Factor          d.f. Partial SS MS        F     P     
 x1               2    96.789733 48.394867 45.78 <.0001
  Nonlinear       1     3.302695  3.302695  3.12 0.0975
 x2               2   165.089955 82.544978 78.08 <.0001
  Nonlinear       1     2.729978  2.729978  2.58 0.1289
 TOTAL NONLINEAR  2     4.992265  2.496133  2.36 0.1284
 REGRESSION       4   358.084106 89.521026 84.68 <.0001
 ERROR           15    15.857043  1.057136             


Frank Harrell

Oct 16, 2015, 9:09:05 AM
to regmod
You are right - my apologies.  What anova is doing with ols and fit.mult.impute is getting an estimate of sigma from each completed dataset, then using the average sigma to compute the sums of squares along with the Rubin's-rule variances from multiple imputation.  The functions should have been averaging the squared sigmas and then taking the square root; I'll fix that in the future.  I would just have anova print the correct chi-square statistics using the "test" option.  The following code helps to understand a bit of this.

require(rms)
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- x1 + x2 + rnorm(100)
x1[1:10] <- NA
a <- aregImpute(~ y + x1 + x2)
f <- fit.mult.impute(y ~ x1 + x2, ols, a, data=data.frame(x1,x2,y),
                     n.impute=3, fit.reps=TRUE)
## Show how fit.mult.impute estimates sigma^2
s <- 0
for(i in 1 : 3) s <- s + f$fits[[i]]$stats['Sigma']
c(s / 3, f$stats['Sigma'])

anova(f, test='Chisq')
## Make sure the chi-squares and sums of squares are not from one of the models
for(i in 1 : 3) print(anova(f$fits[[i]], test='Chisq'))
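[Editor's note: the two pooling rules for sigma mentioned above do disagree; here is a tiny illustration with hypothetical per-imputation sigma values.]

```r
## Averaging sigmas is not the same as averaging sigma^2 and
## taking the square root (the root mean square), since sqrt is concave.
sigmas <- c(1.0, 1.2, 1.5)   # hypothetical per-imputation sigma estimates
mean(sigmas)                 # simple average of sigmas
sqrt(mean(sigmas^2))         # intended pooling: average sigma^2, then sqrt
```

The root-mean-square value is always at least as large as the simple average, so sums of squares computed from the averaged sigma are slightly too small.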




Nimish Thakore

Oct 16, 2015, 1:05:58 PM
to regmod
Got it, thanks a lot!
I won't bother you anymore.
Nimish