Someone else posted about the same issue on the semTools GitHub page and provided data for a reproducible example. So I looked into this further, and I still found no evidence that this is a software problem. (I did fix another bug related to the shift parameters when there are multiple groups, but you were running a single-group model, so that shouldn't have affected your analysis.)
I've attached a script and data so that you can follow my logic; feel free to investigate other details I might be overlooking. Note that it relies on semTools >= 0.4-15.915, so install the latest development version if necessary:
devtools::install_github("simsem/semTools/semTools")
Recall that there is no theoretical justification for assuming that CFI or RMSEA calculated from pooled chi-squared statistics will look anything like the average CFI or RMSEA across imputations (although that heuristic might often work out in practice). In the script, I try to show why by comparing the scaled/shifted chi-squared statistic (WLSMV) of the target and baseline models in each imputation (and their average across imputations) to the pooled statistics returned by anova(). This is a nice extreme example with 50% missing data (and fractions of missing information as high as 90% for some polychorics and thresholds) that really shows what happens when the models not only fit very poorly but also vary widely in fit across imputations.
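To make the dependence on the chi-squared statistics concrete, here is a minimal sketch using standard single-group formulas for CFI and RMSEA (as typically computed from a test statistic, its df, and N). The df and N below are hypothetical; only the chi-squared magnitudes mirror this example, and fitFromChisq is just an illustrative helper, not a semTools function:

```r
## Hypothetical sketch: fit indices computed directly from chi-squared
## statistics. df, base.df, and N are made up; chisq magnitudes mimic
## the example discussed below.
fitFromChisq <- function(chisq, df, base.chisq, base.df, N) {
  c(cfi   = 1 - max(chisq - df, 0) / max(chisq - df, base.chisq - base.df, 0),
    rmsea = sqrt(max(chisq - df, 0) / (df * (N - 1))))
}
fitFromChisq(chisq = 6000, df = 500, base.chisq = 4e5, base.df = 600, N = 1000)
## An enormous baseline chi-squared yields a high CFI even though the
## absolute fit (RMSEA) is very poor.
```

Because both indices are nonlinear functions of the chi-squared statistics, averaging the indices across imputations is not the same as plugging pooled statistics into these formulas.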
In this example, the chisq(.scaled) values are in the range of 6000 (11,000) across 5 imputations, but the baseline.chisq(.scaled) are in the range of 400,000 (200,000) across 5 imputations. So the target model appears to fit a LOT better relative to the baseline model, making the incremental fit indices very high in each imputation (e.g., CFI ~ 0.95), even though the RMSEA(.scaled) is in the very-poor range of 0.13 (0.17). Now, because there is so much between-imputation variance in the test statistics, the pooled statistics (using the D2 method) are much lower than the average across imputations:
- chisq = 1353 (vs. 6000)
- chisq.scaled = 2442 (vs. 11,000)
- baseline.chisq = 5070 (vs. 400,000)
- baseline.chisq.scaled = 2583 (vs. 200,000)
Note that the chisq.scaling.factor was 0.56, whereas the baseline.chisq.scaling.factor was almost 2, which explains why the scaled tests above make the target model look worse but the baseline model look better. Consequently, the scaled RMSEA from pooled stats fell to 0.083 (mediocre fit, but much better than the average 0.17 across individual imputations), reflecting the big drop in the pooled chi-squared for the target model. But relative to the even bigger drop in the baseline model's chi-squared stats, the scaled CFI from pooled stats in this example is only around 0.05, reflecting the similarity between the scaled pooled statistics of the target and baseline models.
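As a quick sanity check on those scaling factors: a scaled statistic is (ignoring the small shift parameter) roughly the naive statistic divided by its scaling factor, and the pooled values above are consistent with that:

```r
## Scaled statistic ~ naive statistic / scaling factor (up to the shift):
1353 / 0.56  # ~2416, near the pooled chisq.scaled of 2442
5070 / 2     # ~2535, near the pooled baseline.chisq.scaled of 2583
```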
At the end of the script, I calculate measures of variability across imputations, then manually calculate the D2 pooled statistic (the asymptotic/chi-squared version, obtained by multiplying the pooled F statistic by its numerator df) from the available information, to check that it matches the semTools::anova() output. This isn't proof that there is no bug at all, but it gives you an idea of what to inspect in your own model/data to see why your pooled results are so different from the average across imputations.
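In case it helps to see the mechanics outside the script, here is a minimal sketch of the D2 pooling calculation as it is commonly stated (Li, Meng, Raghunathan, & Rubin, 1991). The chi-squared values and df below are made up, and poolD2 is just an illustrative helper, not a semTools function:

```r
## D2 pooling sketch: d = per-imputation chi-squared values, k = their df.
poolD2 <- function(d, k) {
  m <- length(d)
  r <- (1 + 1/m) * var(sqrt(d))    # relative increase in variance (ARIV)
  D2 <- (mean(d)/k - r * (m + 1)/(m - 1)) / (1 + r)  # pooled F statistic
  c(F = D2, chisq = D2 * k)        # asymptotic chi-squared = F * numerator df
}
## Large between-imputation variance drags the pooled statistic far
## below the simple average of the chi-squared values:
d <- c(40, 50, 60, 70, 80)  # hypothetical per-imputation chi-squareds
poolD2(d, k = 10)           # pooled chisq is far below the mean of 60
```

The denominator term (1 + r) is what penalizes between-imputation variability, which is exactly why pooled statistics can be so much smaller than the average in this example.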
Again, I'm open to looking at this another way. Any further discussion can only help organize my thoughts, which I hope will turn into a paper providing much-needed guidance in this area.