Attempting EFA in lavaan with multiply imputed datasets from mice


Bryan Stiles

Mar 16, 2025, 3:21:06 PM
to lavaan
I'm very new to mice and to using multiple imputation (MI) to handle missing data. I have an imputed dataset ("imp2final" below) that I am attempting to use for an EFA. I have not had much luck finding guidance on how to do this, and the only package I've come across (the mifa package) seems to only do PCA, which is not aligned with my aims.

It occurred to me that I could fit EFAs within lavaan, and I have successfully been using lavaan.mi to fit some other CFAs with another dataset. I attempted this code, fitting a simple 1-factor EFA:

f1 <- '
  efa("efa")*f1 =~ SCS_1rev + SCS_2 + SCS_3 + SCS_4rev + SCS_5 +
                   SCS_6 + SCS_7 + SCS_8rev + SCS_9rev + SCS_10 +
                   SCS_11rev + SCS_12rev
'
efa_f1 <- cfa.mi(model = f1,
                 data = imp2final,
                 rotation = "oblimin",
                 estimator = "WLSMV",
                 ordered = TRUE)

And received this error message:
Error in lavaan.mi::lavaan.mi(model = f1, data = imp2final, rotation = "oblimin",  :
  efa()* blocks detected in model= syntax. (Un)rotated factors cannot be matched across imputations, threatening validity of pooled results.
Instead, EFA/ESEM should be conducted on a single set of (pre)pooled summary statistics (see the ?poolSat help page).

I would greatly appreciate any advice on how to troubleshoot the error I seem to be making, and/or clarity on whether this approach using lavaan.mi to fit EFAs is appropriate.

Bryan Stiles

Mar 16, 2025, 8:15:57 PM
to lavaan
Alternatively, I am wondering if I could subset the variables I intend to use for the EFA from the imputed datasets, use the poolSat() function from lavaan.mi to obtain a pooled polychoric covariance matrix, and then conduct a parallel analysis and EFA on that? For example:

# Load required packages
library(mice)       # complete()
library(lavaan.mi)  # poolSat()
library(psych)      # fa.parallel(), fa()

# Subsetting the variables needed from the mids object
columns_to_subset <- c("SCS_1rev", "SCS_2", "SCS_3", "SCS_4rev", "SCS_5",
                       "SCS_6", "SCS_7", "SCS_8rev", "SCS_9rev", "SCS_10",
                       "SCS_11rev", "SCS_12rev")
impSubset1 <- lapply(complete(imp2final, "all"), function(df) df[, columns_to_subset])

# Pooling to obtain a summary covariance matrix
prePooledData <- poolSat(impSubset1, ordered = TRUE)
prePooledData$sample.cov  # seems to run fine

# Conducting parallel analysis on the pooled covariance matrix
fa.parallel(prePooledData$sample.cov, n.obs = 250, fa = "fa", n.iter = 500)

# Running the EFA
efa_result3 <- fa(prePooledData$sample.cov, nfactors = 3, fm = "wls", rotate = "oblimin")

Curious if folks think this is an appropriate use of the poolSat() function? It seems to run fine, but I'm not sure whether it is procedurally correct.

Terrence Jorgensen

Mar 17, 2025, 4:47:22 AM
to lavaan
I would greatly appreciate any advice on how to troubleshoot the error

I tried to make the error message sufficiently informative.  It describes both the problem and the solution.
 
Error in lavaan.mi::lavaan.mi(model = f1, data = imp2final, rotation = "oblimin",  :
  efa()* blocks detected in model= syntax. (Un)rotated factors cannot be matched across imputations, threatening validity of pooled results.
Instead, EFA/ESEM should be conducted on a single set of (pre)pooled summary statistics (see the ?poolSat help page).

So the answer to your second question:

I am wondering if I could subset from the imputed datasets the variables I intend to use for EFA, use the poolSat() function

is "yes".  There are references on the help page you can read for details, and there is example syntax showing you how it works. 
It is meant to provide a set of arguments that are all passed to lavaan(data=), so replace fa() with lavaan's efa() in your syntax.  

If you only pass the pooled covariance matrix, then your SEs and test statistics cannot be trusted because they ignore the additional uncertainty due to missing data.  The poolSat() function also returns the NACOV= and WLS.V= arguments, which pool both the within- and between-imputation sampling (co)variances (read the help-page references for details).
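For intuition, the within- and between-imputation pooling described above follows Rubin's rules: the total sampling variance of a pooled statistic is the average within-imputation variance plus an inflated between-imputation variance. A minimal base-R sketch for a single summary statistic (the numbers are invented for illustration; poolSat() applies this machinery to the full set of summary statistics at once):

```r
## Rubin's rules for one pooled statistic across m imputations
## (illustrative values only; not lavaan.mi internals)
est <- c(0.52, 0.48, 0.55, 0.50, 0.51)   # point estimates from m = 5 imputations
se  <- c(0.06, 0.07, 0.06, 0.065, 0.06)  # their standard errors

m <- length(est)
pooled_est <- mean(est)           # pooled point estimate
W <- mean(se^2)                   # average within-imputation variance
B <- var(est)                     # between-imputation variance
Tvar <- W + (1 + 1/m) * B         # total variance (Rubin's rules)
pooled_se <- sqrt(Tvar)

pooled_est
pooled_se
```

Passing only the pooled covariance matrix amounts to using sqrt(W) alone, which understates the uncertainty; that is the missing-data penalty the NACOV= and WLS.V= arguments carry forward.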

I think you can still just use fa.parallel() for your parallel analysis, since you don't draw inferences from it (it only establishes a baseline for determining how many factors efa() should extract).

Terrence D. Jorgensen    (he, him, his)
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
http://www.uva.nl/profile/t.d.jorgensen

Bryan Stiles

Mar 17, 2025, 10:56:05 AM
to lavaan
Dr. Jorgensen, thank you for your helpful guidance. 

In attempting to pass the pooled summary statistics from poolSat() to efa() via lavaan's data= argument, I get the following error:

columns_to_subset <- c("SCS_1rev", "SCS_2", "SCS_3", "SCS_4rev", "SCS_5",
                       "SCS_6", "SCS_7", "SCS_8rev", "SCS_9rev", "SCS_10",
                       "SCS_11rev", "SCS_12rev")
impSubset1 <- lapply(complete(imp2final, "all"), function(df) df[, columns_to_subset])
prePooledData <- poolSat(impSubset1, ordered = TRUE)
fit <- efa(data = prePooledData,
           nfactors = 3,
           rotation = "oblimin")

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  arguments imply differing number of rows: 12, 1, 114, 48


Is this somehow stemming from poolSat()? Or does something need to be modified within efa()?

I appreciate your continued help.
Bryan

Terrence Jorgensen

Mar 18, 2025, 6:31:08 AM
to lavaan
Without access to your data, I'm not sure what the problem could be.  The ?poolSat help-page example works fine with efa(), using lavaan.mi_0.1-0 and lavaan_0.6-19:

data(HS20imps) # import a list of 20 imputed data sets

## fit saturated model to imputations, pool those results
impSubset1 <- lapply(HS20imps, "[", i = paste0("x", 1:9)) # only modeled variables
(prePooledData <- poolSat(impSubset1))

efa(prePooledData, nfactors = 3)

If you can post your own prePooledData object in an RDS file, I might be able to track down a problem by running the syntax you provided.

saveRDS(prePooledData, file = "Bryan.rds") 

Bryan Stiles

Mar 18, 2025, 1:16:59 PM
to lavaan
Dr. Jorgensen, here is an RDS file of my prePooledData object. Thank you for looking into this. 
Bryan.rds

Terrence Jorgensen

Mar 19, 2025, 5:27:14 AM
to lavaan

Dr. Jorgensen, here is an RDS file of my prePooledData object.

Thanks.  I have no problem running your syntax on this object, after installing lavaan and lavaan.mi from CRAN (just to be sure I am using those versions).

> library(lavaan.mi)
Loading required package: lavaan
This is lavaan 0.6-19
lavaan is FREE software! Please report any bugs.
 
###################################################################
This is lavaan.mi 0.1-0
See the README file on github.com/TDJorgensen/lavaan.mi
for a table comparing it with deprecated semTools features.
###################################################################
> prePooledData <- readRDS("~/Downloads/Bryan.rds")
> fit <- efa(data = prePooledData, nfactors = 3, rotation = "oblimin")
> summary(fit)

This is lavaan 0.6-19 -- running exploratory factor analysis

  Estimator                                       DWLS
  Rotation method                      OBLIMIN OBLIQUE
  Oblimin gamma                                      0
  Rotation algorithm (rstarts)                GPA (30)
  Standardized metric                             TRUE
  Row weights                                     None

  Number of observations                           250

Fit measures:
                chisq df pvalue   cfi rmsea
  nfactors = 3 77.194 33      0 0.948 0.085

Eigenvalues correlation matrix:

    ev1     ev2     ev3     ev4     ev5     ev6     ev7     ev8     ev9    ev10    ev11    ev12
  5.057   1.462   0.876   0.777   0.731   0.700   0.509   0.496   0.445   0.399   0.289   0.258

Standardized loadings: (* = significant at 1% level)

              f1      f2      f3       unique.var   communalities
SCS_1rev       .   0.617*                   0.511           0.489
SCS_2      0.575*              .            0.541           0.459
SCS_3      0.684*                           0.462           0.538
SCS_4rev       .*  0.345*      .            0.565           0.435
SCS_5      0.603*              .            0.707           0.293
SCS_6      0.465*  0.302*                   0.577           0.423
SCS_7      0.676*                           0.514           0.486
SCS_8rev           0.862*                   0.303           0.697
SCS_9rev       .   0.549*      .            0.486           0.514
SCS_10     0.593*      .                    0.663           0.337
SCS_11rev                  0.966*           0.059           0.941
SCS_12rev      .           0.600*           0.489           0.511

                              f1    f2    f3 total
Sum of sq (obliq) loadings 2.599 1.899 1.624 6.121
Proportion of total        0.425 0.310 0.265 1.000
Proportion var             0.217 0.158 0.135 0.510
Cumulative var             0.217 0.375 0.510 0.510

Factor correlations: (* = significant at 1% level)

       f1      f2      f3
f1  1.000                
f2  0.453*  1.000        
f3  0.491*  0.614*  1.000

> sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.7.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] lavaan.mi_0.1-0 lavaan_0.6-19   

Bryan Stiles

Mar 19, 2025, 11:33:18 AM
to lavaan
Hi Dr. Jorgensen, that worked! I think I was running an old version of lavaan.

A couple of remaining questions. Is it safe to assume that the CFI and RMSEA being returned here are the robust CFI and robust RMSEA? I'm not sure how comparable these values are to what I would get running a CFA in lavaan, which provides non-robust, scaled, and robust estimates.

Lastly, how does one interpret a loading denoted with " .* "? Does this indicate that the loading is below a certain default cutoff that efa() uses, but is still significantly different from zero at the .01 level?

Thank you again for all your help, I really appreciate it.


Terrence Jorgensen

Mar 20, 2025, 7:21:45 AM
to lavaan
Is it safe to assume that the CFI and RMSEA that are being returned here are robust CFI and robust RMSEA?

You can run fitMeasures() to look for those, which will be explicitly labeled "cfi.robust" and "rmsea.robust".  I just adapted the ?efa help-page example to check, and it looks like the robust ones are in the summary().
 
how does one interpret a loading that is denoted with " .* "? 

See ?summary.efaList to read about the cutoff=, dot.cutoff=, and alpha.level= arguments.
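For intuition, my reading of those documented defaults (cutoff = 0.3, dot.cutoff = 0.1, alpha.level = 0.01) is roughly the following. This show_loading() helper is a made-up caricature for illustration, not lavaan's actual printing code:

```r
## Caricature of how a single standardized loading might be displayed,
## given the cutoff=, dot.cutoff=, and alpha.level= defaults.
## (show_loading() is a hypothetical helper, not a lavaan function.)
show_loading <- function(est, pval,
                         cutoff = 0.3, dot.cutoff = 0.1, alpha.level = 0.01) {
  sig <- if (pval < alpha.level) "*" else ""      # flag significance
  if (abs(est) >= cutoff) {
    paste0(format(round(est, 3)), sig)            # large enough: print the value
  } else if (abs(est) >= dot.cutoff) {
    paste0(".", sig)                              # small but nonzero: print a dot
  } else {
    ""                                            # tiny: suppress entirely
  }
}

show_loading(0.617, 0.001)   # "0.617*"
show_loading(0.15,  0.004)   # ".*"  <- small loading, still significant
show_loading(0.05,  0.600)   # ""
```

On this reading, " .* " means the loading's absolute value fell between dot.cutoff= and cutoff=, but its p-value was below alpha.level= — check ?summary.efaList for the authoritative behavior.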

Bryan Stiles

Apr 10, 2025, 10:58:28 AM
to lavaan
Dr. Jorgensen, 

I am coming back to this discussion with a question. I originally fit a series of a priori CFA models using the lavaan.mi() approach, and in the presence of poor model fit/misspecification, ran through the procedure elaborated earlier in this thread, using poolSat() to derive a pooled set of summary statistics on which an EFA could be conducted. However, it occurred to me that this might present a problem if the CFA models fit with lavaan.mi() and the EFA model fit with poolSat() and efa() are drawing on essentially different mean and covariance structures. From your perspective, if someone were using CFA and EFA this way with multiply imputed datasets, would it be ideal to retain only one inferential approach rather than mixing them?

Thank you for your perspective.

Best,
Bryan

Bryan Stiles

Apr 10, 2025, 12:34:08 PM
to lavaan
P.S.: I ask because fitting the WLSMV-estimated CFA with the two methods yields relatively similar parameter estimates (e.g., factor loadings and variances), but the fit indices, especially the robust CFI/TLI/RMSEA, differ considerably.

Fitting a two-factor model with lavaan.mi() 
# lavaan model for two first-order factors
cfa2 <- '
  pos =~ NA*SCS_SF_q2 + SCS_SF_q6 + SCS_SF_q5 + SCS_SF_q10 + SCS_SF_q3 + SCS_SF_q7
  neg =~ NA*SCS_SF_q11rev + SCS_SF_q12rev + SCS_SF_q4rev + SCS_SF_q8rev + SCS_SF_q1rev + SCS_SF_q9rev
  pos ~~ 1*pos
  neg ~~ 1*neg'
cfa2out <- cfa.mi(cfa2, data = imp2final,
                  ordered = c("SCS_SF_q1rev", "SCS_SF_q2", "SCS_SF_q3", "SCS_SF_q4rev",
                              "SCS_SF_q5", "SCS_SF_q6", "SCS_SF_q7", "SCS_SF_q8rev",
                              "SCS_SF_q9rev", "SCS_SF_q10", "SCS_SF_q11rev", "SCS_SF_q12rev"),
                  estimator = "WLSMV")
summary(cfa2out, standardized = TRUE, rsquare = TRUE, fit.measures = TRUE, test = "D2")

lavaan.mi object fit to 20 imputed data sets using:
 - lavaan    (0.6-19)
 - lavaan.mi (0.1-0)
See class?lavaan.mi help page for available methods.

Convergence information:
The model converged on 20 imputed data sets.
Standard errors were available for all imputations.

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        61

  Number of observations                           170

Model Test User Model:

                                                Standard      Scaled
  Test statistic                                 103.339     123.851
  Degrees of freedom                                  53          53
  P-value                                          0.000       0.000
  Average scaling correction factor                            0.984
  Average shift parameter                                     18.806
    simple second-order correction                                  
  Pooling method                                      D2            
    Pooled statistic                          “standard”            
    “scaled.shifted” correction applied            AFTER     pooling

Model Test Baseline Model:

  Test statistic                               757.618     406.820
  Degrees of freedom                                66          66
  P-value                                        0.000       0.000
  Scaling correction factor                                  2.029

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.927       0.792
  Tucker-Lewis Index (TLI)                       0.909       0.741
                                                                 
  Robust Comparative Fit Index (CFI)                         0.854
  Robust Tucker-Lewis Index (TLI)                            0.818

Root Mean Square Error of Approximation:

  RMSEA                                          0.075       0.089
  90 Percent confidence interval - lower         0.053       0.069
  90 Percent confidence interval - upper         0.096       0.109
  P-value H_0: RMSEA <= 0.050                    0.031       0.001
  P-value H_0: RMSEA >= 0.080                    0.367       0.777
                                                                 
  Robust RMSEA                                               0.111
  90 Percent confidence interval - lower                     0.092
  90 Percent confidence interval - upper                     0.130
  P-value H_0: Robust RMSEA <= 0.050                         0.000
  P-value H_0: Robust RMSEA >= 0.080                         0.996

Standardized Root Mean Square Residual:

  SRMR                                           0.095       0.095


Fitting with the poolSat() method:

## fit saturated model to imputations, pool those results
columns_to_subset <- c("SCS_SF_q1rev", "SCS_SF_q2", "SCS_SF_q3", "SCS_SF_q4rev", "SCS_SF_q5", "SCS_SF_q6",
                       "SCS_SF_q7", "SCS_SF_q8rev", "SCS_SF_q9rev", "SCS_SF_q10", "SCS_SF_q11rev", "SCS_SF_q12rev")

impSubset1 <- lapply(complete(imp2final, "all"), function(df) df[, columns_to_subset])
prePooledData <- poolSat(impSubset1, ordered = TRUE)

# lavaan model for two first-order factors
cfa2.pooled <- '
  pos  =~ NA*SCS_SF_q2 + SCS_SF_q6 + SCS_SF_q5 + SCS_SF_q10 + SCS_SF_q3 + SCS_SF_q7
  neg =~ NA*SCS_SF_q11rev + SCS_SF_q12rev + SCS_SF_q4rev + SCS_SF_q8rev + SCS_SF_q1rev + SCS_SF_q9rev
 
  pos ~~ 1*pos
  neg ~~ 1*neg'
cfa2.pooled.out <- cfa(cfa2.pooled, data=prePooledData, estimator = "WLSMV")
summary(cfa2.pooled.out, rsquare=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 28 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        61

  Number of observations                           170

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                91.157     113.371
  Degrees of freedom                                53          53
  P-value (Chi-square)                           0.001       0.000
  Scaling correction factor                                  0.952
  Shift parameter                                           17.659
    simple second-order correction                                

Model Test Baseline Model:

  Test statistic                              1419.303     742.264
  Degrees of freedom                                66          66
  P-value                                        0.000       0.000
  Scaling correction factor                                  2.001

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.972       0.911
  Tucker-Lewis Index (TLI)                       0.965       0.889
                                                                 
  Robust Comparative Fit Index (CFI)                         0.187
  Robust Tucker-Lewis Index (TLI)                           -0.013

Root Mean Square Error of Approximation:

  RMSEA                                          0.065       0.082
  90 Percent confidence interval - lower         0.042       0.061
  90 Percent confidence interval - upper         0.088       0.103
  P-value H_0: RMSEA <= 0.050                    0.132       0.008
  P-value H_0: RMSEA >= 0.080                    0.146       0.585
                                                                 
  Robust RMSEA                                               0.255
  90 Percent confidence interval - lower                     0.227
  90 Percent confidence interval - upper                     0.284
  P-value H_0: Robust RMSEA <= 0.050                         0.000
  P-value H_0: Robust RMSEA >= 0.080                         1.000

Standardized Root Mean Square Residual:

  SRMR                                           0.091       0.091

Terrence Jorgensen

Apr 10, 2025, 8:51:20 PM
to lavaan
From your perspective, if someone were using CFA and EFA this way with multiple imputed datasets, would it be ideal to retain only one inferential approach rather than mixing them?

Personally, I would prefer the consistency.  Asymptotically, it probably doesn't matter.  

fit indices, especially the Robust CFI/TLI/RMSEA, differ considerably

Indeed, the baseline model in particular seems to fit much worse in one method than the other.  The methods should perform similarly in large samples and for models that fit reasonably well.  Neither of those holds in your case, so it is less predictable how different you should expect the results of each method to be.

Bryan Stiles

Apr 12, 2025, 5:38:24 PM
to lavaan
I see; thank you. Out of curiosity, is there an approach for EFA that would mimic the procedure used in something like cfa.mi(), where an SEM is fitted to each imputed dataset and the results are then pooled? I recall the error message I received when trying to fit the EFA via cfa.mi(), and I wonder if one could fit separate EFA models to each imputed dataset and then apply Rubin's rules to pool the model parameters (e.g., rotated factor loadings). Similarly for the parallel analysis (I suppose running separate parallel analyses on each imputed dataset, though I imagine this could become problematic if the suggested number of factors varies across imputations).

Terrence Jorgensen

Apr 13, 2025, 3:35:22 PM
to lavaan
I wonder if one could fit separate EFA models to each imputed dataset and then apply Rubin's rules to pool the model parameters (e.g., rotated factor loadings). 

You can, but it is not in general certain that "factor 1" in the first imputation is the same extracted factor as "factor 1" in any other imputation.  Extracted factors can "switch places" across imputations, just as they can across replications in a simulation study.  The same problem occurs with latent classes.  The higher the fraction of missing information, the more frequently switching can occur.
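To make the switching problem concrete: before Rubin's rules could be applied to rotated loadings, each imputation's loading matrix would first have to be aligned to a common reference (columns reordered and signs flipped). A rough base-R sketch of a greedy alignment, purely for illustration (align_loadings() is a made-up helper, not a lavaan.mi function):

```r
## Align one imputation's loading matrix to a reference by greedily
## matching columns on absolute correlation, then fixing column signs.
## (Illustrative only; real ESEM alignment is more involved.)
align_loadings <- function(L, ref) {
  k <- ncol(ref)
  perm <- integer(k)
  used <- logical(k)
  for (j in seq_len(k)) {                 # best unused match per reference factor
    cors <- abs(cor(ref[, j], L))
    cors[used] <- -Inf
    perm[j] <- which.max(cors)
    used[perm[j]] <- TRUE
  }
  out <- L[, perm, drop = FALSE]
  for (j in seq_len(k))                   # flip signs to agree with reference
    if (cor(ref[, j], out[, j]) < 0) out[, j] <- -out[, j]
  out
}

ref <- cbind(c(.7, .6, .1, .0), c(.0, .1, .8, .7))  # reference loadings
L2  <- cbind(-ref[, 2], ref[, 1])                   # columns switched, one sign flipped
aligned <- align_loadings(L2, ref)
all.equal(aligned, ref, check.attributes = FALSE)   # TRUE: order and signs recovered
```

Even with such a step, alignment can fail when factors are weakly determined, which is one reason conducting the EFA on a single set of pooled summary statistics sidesteps the problem entirely.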