(1) SS calculation for fit.mult.impute and (2) df's for rcs


Nimish Thakore

Oct 10, 2015, 5:03:29 PM
to regmod
Dear Professor Harrell,
 
Thank you for directing me to this group.
 
I have two questions:
 
1. I am fitting ols models on multiply imputed datasets created with aregImpute, using fit.mult.impute. Comparing fits and checking non-linearity (with pol/rcs) using the anova function applied to the pooled fit is indeed very convenient. It is easy to understand the estimation of single coefficients and the testing of contrasts of multiple coefficients (C*beta) using the vcov obtained from Rubin's rules. However, I do not understand how anova computes sums of squares from multiply imputed datasets. Does it use the means of the different Y vectors and X matrices over all imputations to get Y-Yhat? Also, for model comparisons, is AIC (or logLik) taken only from the last imputation, or is it, with some justification, "averaged" for that model over all imputed datasets?

2. Restricted cubic spline df's: why are only 2 df expended by rcs(x,3) in the anova table, when there should be 3 for 3 knots? My counting: 2*2 parameters for the linear segments at each end + 4*(k-1) for the cubic segments between the k knots - 3*k continuity restrictions at the knots = k. Is it because a constant term is absorbed by the intercept?
 
This discrepancy came up with a model that had the following formula: y~x1+x2+rcs(x3,3)+rcs(x4,3). The anova table showed 1, 1, 2, 2 df for x1:x4 respectively. The above logic (absorbing one constant term of each rcs into the intercept) should mean 1+1+1+2+2 = 7 parameters, but the difference between AIC and -2*logLik was 8*2 = 16...
 
I do apologize for asking these naive questions.
 
Sincerely
Nimish Thakore

Frank Harrell

Oct 11, 2015, 9:24:20 AM
to regmod
These are good questions.  To make the anova, summary, and contrast methods work, the approach has to be general.  Upon multiple imputation, fit.mult.impute stores the new Rubin variance-covariance matrix, and downstream methods such as anova use it, ignoring sums of squares.  anova for OLS fits run through fit.mult.impute uses the classical type of Wald statistic and tests against chi-square distributions rather than F distributions, unlike when you do anova(ols()).
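[Editor's note: Rubin's rules are simple to sketch. The following is a hedged base-R illustration with a hypothetical helper function, not the actual fit.mult.impute internals; it assumes a list of per-imputation coefficient vectors and covariance matrices.]

```r
## Sketch of Rubin's rules for pooling m imputation fits.
## betas: list of coefficient vectors; vcovs: list of their
## covariance matrices (hypothetical inputs).
pool_rubin <- function(betas, vcovs) {
  m    <- length(betas)
  bbar <- Reduce(`+`, betas) / m       # pooled coefficient estimates
  W    <- Reduce(`+`, vcovs) / m       # average within-imputation variance
  B    <- Reduce(`+`, lapply(betas, function(b)
            tcrossprod(b - bbar))) / (m - 1)  # between-imputation variance
  V    <- W + (1 + 1/m) * B            # total (Rubin) variance
  list(coef = bbar, vcov = V)
}
```

A Wald chi-square test of H0: C beta = 0 then uses X2 = (C bbar)' [C V C']^-1 (C bbar), with df equal to the number of rows of C; this uses only the pooled coefficients and covariance matrix, with no sums of squares involved.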

For the following, ignore the intercept, because that goes with the whole model.  An ordinary cubic spline function with k knots has a linear term plus a regular quadratic term plus a regular cubic term plus k other cubed-term coefficients at the knots (one nonlinear term per knot), for a total of k+3 parameters.  For restricted cubic splines (natural splines), the linear tail restrictions remove the ordinary square and cube terms on the left and 2 nonlinear terms on the right, for a total of k-1 parameters.  So what remains is a linear term and k-2 nonlinear terms representing differences in cubes.
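[Editor's note: a quick way to see the k-1 count is to inspect the basis matrix itself, e.g. with Hmisc's rcspline.eval, asking it to include the linear term via inclx.]

```r
library(Hmisc)
x <- seq(0, 10, length.out = 200)
## A restricted cubic spline with k knots has k-1 basis columns:
## the linear term plus k-2 nonlinear terms.
ncol(rcspline.eval(x, nk = 3, inclx = TRUE))  # 2 columns for 3 knots
ncol(rcspline.eval(x, nk = 5, inclx = TRUE))  # 4 columns for 5 knots
```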

Nimish Thakore

Oct 11, 2015, 6:10:27 PM
to regmod
Thanks a lot, Dr. Harrell. 

1. So we can safely ignore the reported SS in the anova table obtained by anova(fit.mult.impute(...)). Basically the p-values are from Wald statistics using the Rubin-pooled parameter covariance matrix and df.

2. I just realized the df discrepancy example I mentioned had nothing to do with rcs at all. logLik (and the AIC function built on it) shows an extra df when used with an lm() object because of the additional estimation of sigma^2 under classical (ML) theory, while anova (the ols method) ignores that parameter estimation. Sorry.
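[Editor's note: this is easy to verify in base R with any built-in dataset.]

```r
## lm() counts sigma^2 as an extra estimated parameter in logLik/AIC,
## beyond the regression coefficients.
fit <- lm(mpg ~ wt + hp, data = mtcars)
length(coef(fit))        # 3 regression coefficients (intercept, wt, hp)
attr(logLik(fit), "df")  # 4 = 3 coefficients + sigma^2
```

So a model with 7 regression parameters plus the intercept carries df = 8 in logLik, and AIC - (-2*logLik) = 2*8 = 16, matching the discrepancy above.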

Frank Harrell

Oct 12, 2015, 2:39:10 PM
to regmod
anova(fit.mult.impute(...)) does not report sums of squares, for the reason that once you use multiple imputation the analysis, even with OLS, is no longer "classical".  The p-values are as you described.

Nimish Thakore

Oct 15, 2015, 7:45:44 PM
to regmod
Dear Dr. Harrell,
I hate to be a pest, and this may not be the place to write,
but anova(fit.mult.impute(...)) does report sums of squares - see the made-up example below.
Maybe I am making a terrible coding error.
Sorry
Nimish


> set.seed(2)
> x1<-runif(20)*3
> x2<-runif(20)*5
> y<-x1^2+2*x2+rnorm(20)

> x1m<-x1
> x1m[runif(20)<.2]<-NA
> x2m<-x2
> x2m[runif(20)<.2]<-NA

> DATA<-data.frame(y,x1m,x2m)
> DATA
           y       x1m       x2m
1   9.017440 0.5546468 3.3094938
2   7.115533 2.1071221 1.9377477
3  12.916858        NA 4.1844459
4   3.713839 0.5041558 0.7525072
5  11.495155        NA 1.7363612
6  10.447331 2.8304249 2.4438662
7   2.119844 0.3874769 0.7462343
8   9.225800 2.5003464 1.7853130
9  12.390016 1.4040555        NA
10  4.335696 1.6499512        NA
11  3.592121 1.6580222        NA
12  2.479019 0.7166843 0.8232112
13 14.383510 2.2815399 4.0509607
14  8.698716 0.5424603 4.3443052
15  5.844425 1.2158465 2.5714088
16 12.233207        NA 3.1359814
17 15.298496 2.9291955 4.2221450
18  2.405096        NA        NA
19  7.893892        NA 3.3361282
20  1.308782 0.2249383 0.7523488


> DATA.mi<-aregImpute(~y+x1m+x2m,DATA, n.impute=5,nk=0)
Iteration 8 
> DATA.mi

Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~y + x1m + x2m, data = DATA, n.impute = 5, 
    nk = 0)

n: 20 p: 3 Imputations: 5   nk: 0 

Number of NAs:
  y x1m x2m 
  0   5   4 

    type d.f.
y      l    1
x1m    l    1
x2m    l    1

Transformation of Target Variables Forced to be Linear

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
  x1m   x2m 
0.876 0.482 

> DATA.mi$imputed
$y
NULL

$x1m
        [,1]      [,2]      [,3]      [,4]      [,5]
3  0.5546468 0.5546468 0.5546468 0.5546468 2.8304249
5  2.9291955 0.5546468 0.5546468 0.5546468 0.5546468
16 2.8304249 2.2815399 2.2815399 2.9291955 2.9291955
18 0.7166843 0.3874769 0.2249383 0.7166843 0.3874769
19 0.7166843 0.7166843 0.2249383 0.7166843 0.7166843

$x2m
        [,1]      [,2]      [,3]      [,4]      [,5]
9  4.2221450 3.3094938 4.1844459 4.1844459 1.7363612
10 0.7523488 0.7523488 0.8232112 0.8232112 0.7525072
11 0.7523488 3.3094938 0.7523488 0.7523488 0.7462343
18 0.7462343 0.7462343 0.7525072 0.7523488 0.8232112


> f<-fit.mult.impute(y~rcs(x1,3)+rcs(x2,3),ols, DATA.mi,data=DATA)

Variance Inflation Factors Due to Imputation:

Intercept        x1       x1'        x2       x2' 
        1         1         1         1         1 

Rate of Missing Information:

Intercept        x1       x1'        x2       x2' 
        0         0         0         0         0 

d.f. for t-distribution for Tests of Single Coefficients:

Intercept        x1       x1'        x2       x2' 
      Inf       Inf       Inf       Inf       Inf 

The following fit components were averaged over the 5 model fits:

  fitted.values stats linear.predictors 

> f

Linear Regression Model

fit.mult.impute(formula = y ~ rcs(x1, 3) + rcs(x2, 3), fitter = ols, 
    xtrans = DATA.mi, data = DATA)
                Model Likelihood     Discrimination    
                   Ratio Test           Indexes        
Obs       20    LR chi2     63.21    R2       0.958    
sigma 1.0282    d.f.            4    R2 adj   0.946    
d.f.      15    Pr(> chi2) 0.0000    g        5.085    

Residuals

    Min      1Q  Median      3Q     Max 
-1.1229 -0.6170 -0.1657  0.4354  1.9382 

          Coef   S.E.   t    Pr(>|t|)
Intercept 0.5935 0.8554 0.69 0.4984  
x1        1.4888 0.7384 2.02 0.0620  
x1'       1.5526 0.8784 1.77 0.0975  
x2        1.2824 0.5104 2.51 0.0239  
x2'       1.0822 0.6734 1.61 0.1289  



> anova(f)
                Analysis of Variance          Response: y 

 Factor          d.f. Partial SS MS        F     P     
 x1               2    96.789733 48.394867 45.78 <.0001
  Nonlinear       1     3.302695  3.302695  3.12 0.0975
 x2               2   165.089955 82.544978 78.08 <.0001
  Nonlinear       1     2.729978  2.729978  2.58 0.1289
 TOTAL NONLINEAR  2     4.992265  2.496133  2.36 0.1284
 REGRESSION       4   358.084106 89.521026 84.68 <.0001
 ERROR           15    15.857043  1.057136             


Frank Harrell

Oct 16, 2015, 9:09:05 AM
to regmod
You are right - my apologies.  What anova is doing with ols and fit.mult.impute is getting an estimate of sigma from each completed dataset, then using the average sigma to compute the sums of squares along with the Rubin's-rule variances from multiple imputation.  The functions should have been averaging the squared sigmas and then taking the square root; I'll fix that in the future.  I would just have anova print the correct chi-square statistics using the "test" option.  The following code helps to understand a bit of this.

require(rms)
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- x1 + x2 + rnorm(100)
x1[1:10] <- NA
a <- aregImpute(~ y + x1 + x2)
f <- fit.mult.impute(y ~ x1 + x2, ols, a, data=data.frame(x1,x2,y),
                     n.impute=3, fit.reps=TRUE)
## Show how fit.mult.impute estimates sigma^2
s <- 0
for(i in 1 : 3) s <- s + f$fits[[i]]$stats['Sigma']
c(s / 3, f$stats['Sigma'])

anova(f, test='Chisq')
## Make sure the chi-squares and sums of squares are not from one of the models
for(i in 1 : 3) print(anova(f$fits[[i]], test='Chisq'))
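[Editor's note: the two pooling rules for sigma mentioned above do disagree; here is a tiny illustration with hypothetical per-imputation sigma values.]

```r
## Averaging sigmas is not the same as averaging sigma^2 and
## taking the square root (the root mean square), since sqrt is concave.
sigmas <- c(1.0, 1.2, 1.5)   # hypothetical per-imputation sigma estimates
mean(sigmas)                 # simple average of sigmas
sqrt(mean(sigmas^2))         # intended pooling: average sigma^2, then sqrt
```

The root-mean-square value is always at least as large as the simple average, so sums of squares computed from the averaged sigma are slightly too small.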




Nimish Thakore

Oct 16, 2015, 1:05:58 PM
to regmod
Got it, thanks a lot!
I won't bother you anymore.
Nimish