Aggregate cfa.mi results with anova() - fit.indices


Ulrich Schroeders

Jun 19, 2017, 3:29:14 AM
to lavaan
Dear lavaan/semTools-supporters-maintainers,

Actually, I do not know whether it's a semTools or a lavaan problem. I estimated a model with the WLSMV estimator and 15 imputations:
fit.bg.mi <- cfa.mi(mod.bg.mi, data=dat.list, chi="all", 
                    ordered = items.gf, estimator = "WLSMV", 
                    missing="pairwise")
summary(fit.bg.mi, standardized=TRUE)
lavaan::anova(fit.bg.mi, test = "D2", indices=TRUE)

The pooling of the parameter estimates and their SEs looks fine,
but I get some weird results for CFI, TLI, etc., even though the fit indices look fine when the model is fit to a single imputed data set.

Any suggestions?

Thanks in advance for your help, kind regards
Ulrich

Ulrich Schroeders

Jun 19, 2017, 3:31:30 AM
to lavaan
Sorry, I should have added the versions. So, here they are:
"This is lavaan 0.6-1.1137
This is semTools 0.4-15.910".

Kind regards
Ulrich

Terrence Jorgensen

Jun 19, 2017, 7:23:18 AM
to lavaan
lavaan::anova(fit.bg.mi, test = "D2", indices=TRUE)

fit.bg.mi is not an object of class lavaan but of class lavaan.mi, so lavaan::anova() will not dispatch to the correct anova() method for it.  Since semTools is loaded, you should not need to specify the package.  The anova() function will find the appropriate method for the class of the object, when it exists.

If anova(fit.bg.mi, test = "D2", indices = TRUE) still returns fit indices that seem strange, I don't know how to evaluate whether they are actually strange.  The only thing to do would be to calculate CFI yourself using the D2 method to pool chi-squared statistics for both the hypothesized and null models.  If you also check the CFI using scaled/shifted statistics, note that for the cfi.scaled version, semTools currently scales/shifts the pooled naïve chi-squared statistic using the average scale/shift parameters across imputations.  If you do end up checking these for your data and can show that your calculations do not reproduce what semTools claims to do, please share your script and some data with me in a private email so that I can track down the problem.
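The manual check described here can be sketched in R. This is a hedged illustration, not semTools' exact internals: the d2_chisq() and cfi_pooled() helpers are my own names, and all numeric inputs (chi-squared values, df) are invented for the example. The D2 formula follows Li, Meng, Raghunathan, and Rubin (1991).

```r
# D2 pooling (Li, Meng, Raghunathan, & Rubin, 1991): combine the m
# chi-squared statistics (each with 'df' degrees of freedom) into one
# F statistic, then rescale back to the chi-squared metric.
d2_chisq <- function(w, df) {
  m    <- length(w)
  ariv <- (1 + 1/m) * var(sqrt(w))  # average relative increase in variance
  D2   <- (mean(w)/df - (m + 1)/(m - 1) * ariv) / (1 + ariv)
  max(D2, 0) * df                   # pooled statistic on the chi-squared scale
}

# CFI from the pooled statistics of the hypothesized and null models
cfi_pooled <- function(chisq_H, df_H, chisq_0, df_0) {
  1 - max(chisq_H - df_H, 0) / max(chisq_0 - df_0, chisq_H - df_H, 0)
}

# made-up statistics from 5 imputations (df values are also made up)
w_H <- c(310, 295, 320, 305, 300)       # hypothesized model, df = 84
w_0 <- c(2150, 2300, 2200, 2250, 2100)  # null (baseline) model, df = 105

cfi_pooled(d2_chisq(w_H, 84), 84, d2_chisq(w_0, 105), 105)
```

In practice you would replace the made-up vectors with the chi-squared statistics extracted from the fit to each imputed data set.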

Terrence D. Jorgensen
Postdoctoral Researcher, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Ulrich Schroeders

Jun 19, 2017, 7:58:48 AM
to lavaan
Dear Terrence,

thank you for the quick reply. Please forget about the "lavaan::"; the results are the same with the plain anova() function.
Why do I think the results are strange? First, if I run the analyses on one data set only, the results are far better, i.e., cfi.scaled = 0.952 for both the first and the second imputed data set.
Second, if I run my analyses in Mplus, I get what appear to be correctly pooled CFIs. Maybe not the best argument...
I also checked chisq.scaling.factor and baseline.chisq.scaling.factor, respectively; they are identical across the imputed data sets, so the naive scaling factor shouldn't be the problem.

Any other suggestions or time for sharing? :-)

Kind regards
Ulrich

Terrence Jorgensen

Jun 21, 2017, 11:17:11 AM
to lavaan
Second, if I run my analyses in Mplus, I get what appear to be correctly pooled CFIs. Maybe not the best argument...

Mplus does not pool CFI, "correctly" or otherwise.  It reports an average and a distribution across imputations.  That simply describes variability across imputations; it does not pool the values with that variability in mind.  semTools, on the other hand, tries to do something logically consistent, although there has been no study showing whether it "works" (I think we would need an objective criterion to evaluate that, which may be hard to define for fit indices).  Specifically, semTools calculates CFI based on the (single) pooled chi-squared statistic -- the same one you see for the test of model fit.  Compared to D1 and D3, D2 is a notoriously poor pooling method (sometimes yielding Type I error rates above or below the nominal alpha level, depending on the data), so that might explain why CFI looks weird when calculated from D2 statistics.  Unfortunately, D3 is only available with continuous data, and D1 is only available if you can specify constraints on an unrestricted model's parameters to form the restricted model, so D1 is not a good approach if you want to evaluate overall model fit.  Of course, fit indices aren't necessarily helpful in that arena either, depending on whom you talk to.

Instead of relying on CFI, you could evaluate local misfit using the resid(fit, type = "cor") method, to see which thresholds and polychoric correlations are most poorly reproduced by the model in a standardized metric.

I also checked chisq.scaling.factor and baseline.chisq.scaling.factor, respectively; they are identical across the imputed data sets, so the naive scaling factor shouldn't be the problem.

That does not get at what I was suggesting.  I was suggesting that you manually calculate the D2 statistic yourself (for both the hypothesized and null models) from the naïve chi-squared values across imputations; then "robustify" those using the average scale/shift parameters; and finally use those values to calculate CFI.  That way, you can verify whether there is a bug in semTools, or whether D2-based fit indices are just not useful for evaluating model fit with multiply imputed data.
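In code, that verification might look like the following. All names and numbers are hypothetical, and the "robustify" step shown is only the simple scaling correction (divide by the mean scaling factor), not lavaan's full mean-and-variance (scaled-and-shifted) adjustment:

```r
# Pool the naive chi-squared values with D2, then apply the average
# scaling factor across imputations as a rough "robustified" statistic.
# All inputs are invented; in practice you would collect them from the
# fit to each imputed data set (e.g., via fitMeasures()).
d2_chisq <- function(w, df) {
  m    <- length(w)
  ariv <- (1 + 1/m) * var(sqrt(w))  # average relative increase in variance
  max((mean(w)/df - (m + 1)/(m - 1) * ariv) / (1 + ariv), 0) * df
}

naive_H <- c(610, 640, 625, 630, 615)       # naive chi-squared, hypothesized model
naive_0 <- c(5200, 5350, 5150, 5300, 5250)  # naive chi-squared, null model
scale_H <- c(1.92, 1.95, 1.90, 1.94, 1.93)  # per-imputation scaling factors
scale_0 <- c(2.01, 2.05, 1.98, 2.03, 2.00)

robust_H <- d2_chisq(naive_H, 84)  / mean(scale_H)  # df = 84 (assumed)
robust_0 <- d2_chisq(naive_0, 105) / mean(scale_0)  # df = 105 (assumed)

# CFI from the robustified pooled statistics
1 - max(robust_H - 84, 0) / max(robust_0 - 105, robust_H - 84, 0)
```

Comparing this hand-rolled value against what anova() reports is the check being described: a mismatch would suggest a bug, a match would suggest the D2-based index itself is the issue.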

robi...@ipn.uni-kiel.de

Jun 22, 2017, 4:11:14 AM
to lavaan

I would disagree somewhat with the general statement that D2 is a "poor method". With large sample sizes (say, N > 200) and a sufficiently large number of multiply imputed data sets (say, more than 20, although this depends on the fraction of missing information), D2 can perform quite comparably to D1 and D3.

Alexander

Terrence Jorgensen

Jun 23, 2017, 6:39:21 AM
to lavaan
With large sample sizes (say, N > 200) and a sufficiently large number of multiply imputed data sets (say, more than 20, although this depends on the fraction of missing information), D2 can perform quite comparably to D1 and D3.

Thanks for clarifying what I meant by "depending on the data."  Admittedly, my advice is based on only a couple of studies comparing (some of) these statistics.  Namely, the original source for D2 (Li, Meng, Raghunathan, & Rubin, 1991) only advised using it as a screening statistic, not relying on it for inference.  But in a more recent study (to which I assume you are referring, as you are a coauthor), Grund et al. (2016) found that D2 did perform well with moderate-to-large N and low FMI, at least in the context of pooling ANOVA results (i.e., testing mean structure), which are models that do not require large samples anyway.

I'd love to see some research involving more complex analyses of covariance structure (CFAs and SEMs) that compare these pooling methods, especially in situations when the statistics being pooled already do not follow their expected distributions (e.g., scaled/shifted statistics for categorical indicators might not be chi-squared distributed until N is very large, depending on how symmetric the thresholds are; Bandalos, 2014).  Even the Mplus technical paper only looks at D3 (because that is the only one they implement).  If you are aware of any already published, please share links/citations with us (or perhaps you and Oliver are already planning a follow-up study?).  I'd be happy to update my opinion if I had more information.  In the meantime, I'll express appropriate skepticism about D2 instead of claiming outright that it is probably untrustworthy.

Terrence Jorgensen

Mar 9, 2018, 9:11:00 PM
to lavaan
Why do I think the results are strange? First, if I run the analyses on one data set only, the results are far better, i.e., cfi.scaled = 0.952 for both the first and the second imputed data set.

Someone else posted about the same issue on the semTools GitHub page and provided data for a reproducible example.  So I looked further into this, and I still found no evidence that this is a software problem.  (I did fix another bug related to the shift parameters when there are multiple groups, but you were running a single-group model, so that shouldn't have affected your analysis.)

I attached a script and data so that you can follow my logic, and feel free to investigate other details I might be overlooking.  Note that it relies on semTools >= 0.4-15.915, so install the latest development version if necessary:

devtools::install_github("simsem/semTools/semTools")

Recall that there is no theoretical justification for assuming that CFI or RMSEA calculated from pooled chi-squared statistics will look anything like the average CFI or RMSEA across imputations (although that heuristic might work out often in practice).  In the script, I try to show why by comparing the scaled/shifted chi-squared statistic (WLSMV) of the target and baseline models in each imputation (and their average across imputations) to the pooled statistics returned by anova().  This is a nice extreme example with 50% missing data (and fractions of missing information as high as 90% for some polychorics and thresholds) that really shows what happens when the models not only fit very poorly, but model fit varies widely across imputations.

In this example, the chisq(.scaled) values are in the range of 6000 (11,000) across 5 imputations, but the baseline.chisq(.scaled) are in the range of 400,000 (200,000) across 5 imputations.  So the target model appears to fit a LOT better relative to the baseline model, making the incremental fit indices very high in each imputation (e.g., CFI ~ 0.95), even though the RMSEA(.scaled) is in the very-poor range of 0.13 (0.17).  Now, because there is so much between-imputation variance in the test statistics, the pooled statistics (using the D2 method) are much lower than the average across imputations:
  • chisq = 1353 (vs. 6000)
  • chisq.scaled = 2442 (vs. 11,000)
  • baseline.chisq = 5070 (vs. 400,000)
  • baseline.chisq.scaled = 2583 (vs. 200,000)
Note that the chisq.scaling.factor was 0.56, whereas the baseline.chisq.scaling.factor was almost 2, which explains why the scaled tests above make the target model look worse but the baseline model look better.  Consequently, the scaled RMSEA from pooled stats fell to 0.083 (mediocre fit, but much better than the average 0.17 of individual imputations), reflecting the big drop in the pooled chi-squared for the target model.  But relative to the even bigger drop in the baseline model's chi-squared stats, the scaled CFI from pooled stats in this example is only around 0.05, reflecting the similarity between the scaled pooled stats of the target and baseline models.
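To make the arithmetic concrete, plugging the pooled scaled statistics quoted above into the standard CFI formula reproduces a value near 0.05. The degrees of freedom below are placeholders invented for the sketch, not the actual model's df:

```r
# CFI from the D2-pooled scaled statistics quoted above.
# df_H and df_0 are assumed values, not from the real model.
chisq_H <- 2442; df_H <- 100  # pooled chisq.scaled
chisq_0 <- 2583; df_0 <- 120  # pooled baseline.chisq.scaled

1 - max(chisq_H - df_H, 0) / max(chisq_0 - df_0, chisq_H - df_H, 0)
# about 0.049 with these assumed df: the target and baseline statistics
# are so similar that the incremental fit index collapses toward zero
```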

At the end of the script, I calculate measures of variability across imputations, then manually calculate the D2 pooled statistic (asymptotic/chi-squared version, by multiplying the pooled F stat by the numerator df) from the information available, to check that it matches the semTools::anova() output.  It is not proof there is no bug at all, but this gives you an idea what you can inspect in your own model/data to see why your pooled results are so different from the average across imputations. 

Again, I'm open to looking at this another way.  Any further discussion can only help organize thoughts, which hopefully will turn into a paper providing much-needed guidance in this area.
TESTrunMI.R
data.txt