Attempting EFA in lavaan with multiply imputed datasets from mice


Bryan Stiles

Mar 16, 2025, 3:21:06 PM
to lavaan
I'm very new to mice and to using multiple imputation (MI) to handle missing data. I have an imputed dataset ("imp2final" below) that I am attempting to use for an EFA. I have not had much luck finding guidance on how to do this, and the only package I've come across (the mifa package) seems to only do PCA, which is not aligned with my aims.

It occurred to me that I could fit EFAs within lavaan, and I have successfully been using lavaan.mi to fit some other CFAs with another dataset. I attempted this code, fitting a simple 1-factor EFA:

f1 <- '
  efa("efa")*f1 =~ SCS_1rev + SCS_2 + SCS_3 + SCS_4rev + SCS_5 +
                   SCS_6 + SCS_7 + SCS_8rev + SCS_9rev + SCS_10 +
                   SCS_11rev + SCS_12rev
'
efa_f1 <- cfa.mi(model = f1,
                 data = imp2final,
                 rotation = "oblimin",
                 estimator = "WLSMV",
                 ordered = TRUE)

And received this error message:
Error in lavaan.mi::lavaan.mi(model = f1, data = imp2final, rotation = "oblimin",  :
  efa()* blocks detected in model= syntax. (Un)rotated factors cannot be matched across imputations, threatening validity of pooled results.
Instead, EFA/ESEM should be conducted on a single set of (pre)pooled summary statistics (see the ?poolSat help page).

I would greatly appreciate any advice on how to troubleshoot the error I seem to be making, and/or clarity on whether this approach using lavaan.mi to fit EFAs is appropriate.

Bryan Stiles

Mar 16, 2025, 8:15:57 PM
to lavaan
Alternatively, I am wondering if I could subset the variables I intend to use for the EFA from the imputed datasets, use the poolSat() function from lavaan.mi to obtain a pooled polychoric covariance matrix, and then conduct a parallel analysis and EFA on that? For example:

# Load required packages
library(mice)       # complete()
library(lavaan.mi)  # poolSat()
library(psych)      # fa.parallel(), fa()

# Subsetting the variables needed from the mids object
columns_to_subset <- c("SCS_1rev", "SCS_2", "SCS_3", "SCS_4rev", "SCS_5",
                       "SCS_6", "SCS_7", "SCS_8rev", "SCS_9rev", "SCS_10",
                       "SCS_11rev", "SCS_12rev")
impSubset1 <- lapply(complete(imp2final, "all"), function(df) df[, columns_to_subset])

# Pooling to obtain a summary covariance matrix
prePooledData <- poolSat(impSubset1, ordered = TRUE)
prePooledData$sample.cov  # seems to run fine

# Conducting parallel analysis on the pooled covariance matrix
fa.parallel(prePooledData$sample.cov, n.obs = 250, fa = "fa", n.iter = 500)

# Running the EFA
efa_result3 <- fa(prePooledData$sample.cov, nfactors = 3, fm = "wls", rotate = "oblimin")

Curious if folks think this is an appropriate use of the poolSat() function? It seems to run fine, but I'm not sure whether it is procedurally correct.

Terrence Jorgensen

Mar 17, 2025, 4:47:22 AM
to lavaan
I would greatly appreciate any advice on how to troubleshoot the error

I tried to make the error message sufficiently informative.  It describes both the problem and the solution.
 
Error in lavaan.mi::lavaan.mi(model = f1, data = imp2final, rotation = "oblimin",  :
  efa()* blocks detected in model= syntax. (Un)rotated factors cannot be matched across imputations, threatening validity of pooled results.
Instead, EFA/ESEM should be conducted on a single set of (pre)pooled summary statistics (see the ?poolSat help page).

So the answer to your second question:

I am wondering if I could subset from the imputed datasets the variables I intend to use for EFA, use the poolSat() function

is "yes".  There are references on the help page you can read for details, and there is example syntax showing you how it works. 
It is meant to provide a set of arguments that are all passed to lavaan(data=), so replace fa() with lavaan's efa() in your syntax.  

If you only pass the pooled covariance matrix, then your SEs and test statistics cannot be trusted because they ignore the additional uncertainty due to missing data.  The poolSat() function also returns the NACOV= and WLS.V= arguments, which pool both the within- and between-imputation sampling (co)variances (read the help-page references for details).
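For intuition, the within- and between-imputation pooling described above follows Rubin's rules: the total sampling variance of a pooled statistic is the average within-imputation variance plus an inflated between-imputation variance. A minimal base-R sketch for a single summary statistic (the numbers are invented for illustration; poolSat() applies this machinery to the full set of summary statistics at once):

```r
## Rubin's rules for one pooled statistic across m imputations
## (illustrative values only; not lavaan.mi internals)
est <- c(0.52, 0.48, 0.55, 0.50, 0.51)   # point estimates from m = 5 imputations
se  <- c(0.06, 0.07, 0.06, 0.065, 0.06)  # their standard errors

m <- length(est)
pooled_est <- mean(est)           # pooled point estimate
W <- mean(se^2)                   # average within-imputation variance
B <- var(est)                     # between-imputation variance
Tvar <- W + (1 + 1/m) * B         # total variance (Rubin's rules)
pooled_se <- sqrt(Tvar)

pooled_est
pooled_se
```

Passing only the pooled covariance matrix amounts to using sqrt(W) alone, which understates the uncertainty; that is the missing-data penalty the NACOV= and WLS.V= arguments carry forward.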

I think you can still just use fa.parallel() for your parallel analysis, since you don't draw inferences from it (it only establishes a baseline for determining how many factors efa() should extract).

Terrence D. Jorgensen    (he, him, his)
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
http://www.uva.nl/profile/t.d.jorgensen

Bryan Stiles

Mar 17, 2025, 10:56:05 AM
to lavaan
Dr. Jorgensen, thank you for your helpful guidance. 

In attempting to pass the pooled summary statistics from poolSat() to efa() via lavaan's data= argument, I get the following error:

columns_to_subset <- c("SCS_1rev", "SCS_2", "SCS_3", "SCS_4rev", "SCS_5",
                       "SCS_6", "SCS_7", "SCS_8rev", "SCS_9rev", "SCS_10",
                       "SCS_11rev", "SCS_12rev")
impSubset1 <- lapply(complete(imp2final, "all"), function(df) df[, columns_to_subset])
prePooledData <- poolSat(impSubset1, ordered = TRUE)
fit <- efa(data = prePooledData,
           nfactors = 3,
           rotation = "oblimin")

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  arguments imply differing number of rows: 12, 1, 114, 48


Is this somehow stemming from poolSat()? Or does something need to be modified within efa()?

I appreciate your continued help.
Bryan

Terrence Jorgensen

Mar 18, 2025, 6:31:08 AM
to lavaan
Without access to your data, I'm not sure what the problem could be.  The ?poolSat help-page example works fine with efa(), using lavaan.mi_0.1-0 and lavaan_0.6-19:

data(HS20imps) # import a list of 20 imputed data sets

## fit saturated model to imputations, pool those results
impSubset1 <- lapply(HS20imps, "[", i = paste0("x", 1:9)) # only modeled variables
(prePooledData <- poolSat(impSubset1))

efa(prePooledData, nfactors = 3)

If you can post your own prePooledData object in an RDS file, I might be able to track down a problem by running the syntax you provided.

saveRDS(prePooledData, file = "Bryan.rds") 

Bryan Stiles

Mar 18, 2025, 1:16:59 PM
to lavaan
Dr. Jorgensen, here is an RDS file of my prePooledData object. Thank you for looking into this. 
Bryan.rds

Terrence Jorgensen

Mar 19, 2025, 5:27:14 AM
to lavaan

Dr. Jorgensen, here is an RDS file of my prePooledData object.

Thanks.  I have no problem running your syntax on this object, after installing lavaan and lavaan.mi from CRAN (just to be sure I am using those versions).

> library(lavaan.mi)
Loading required package: lavaan
This is lavaan 0.6-19
lavaan is FREE software! Please report any bugs.
 
###################################################################
This is lavaan.mi 0.1-0
See the README file on github.com/TDJorgensen/lavaan.mi
for a table comparing it with deprecated semTools features.
###################################################################
> prePooledData <- readRDS("~/Downloads/Bryan.rds")
> fit <- efa(data = prePooledData, nfactors = 3, rotation = "oblimin")
> summary(fit)

This is lavaan 0.6-19 -- running exploratory factor analysis

  Estimator                                       DWLS
  Rotation method                      OBLIMIN OBLIQUE
  Oblimin gamma                                      0
  Rotation algorithm (rstarts)                GPA (30)
  Standardized metric                             TRUE
  Row weights                                     None

  Number of observations                           250

Fit measures:
                chisq df pvalue   cfi rmsea
  nfactors = 3 77.194 33      0 0.948 0.085

Eigenvalues correlation matrix:

    ev1     ev2     ev3     ev4     ev5     ev6     ev7     ev8     ev9    ev10    ev11    ev12
  5.057   1.462   0.876   0.777   0.731   0.700   0.509   0.496   0.445   0.399   0.289   0.258

Standardized loadings: (* = significant at 1% level)

              f1      f2      f3       unique.var   communalities
SCS_1rev       .   0.617*                   0.511           0.489
SCS_2      0.575*              .            0.541           0.459
SCS_3      0.684*                           0.462           0.538
SCS_4rev       .*  0.345*      .            0.565           0.435
SCS_5      0.603*              .            0.707           0.293
SCS_6      0.465*  0.302*                   0.577           0.423
SCS_7      0.676*                           0.514           0.486
SCS_8rev           0.862*                   0.303           0.697
SCS_9rev       .   0.549*      .            0.486           0.514
SCS_10     0.593*      .                    0.663           0.337
SCS_11rev                  0.966*           0.059           0.941
SCS_12rev      .           0.600*           0.489           0.511

                              f1    f2    f3 total
Sum of sq (obliq) loadings 2.599 1.899 1.624 6.121
Proportion of total        0.425 0.310 0.265 1.000
Proportion var             0.217 0.158 0.135 0.510
Cumulative var             0.217 0.375 0.510 0.510

Factor correlations: (* = significant at 1% level)

       f1      f2      f3
f1  1.000                
f2  0.453*  1.000        
f3  0.491*  0.614*  1.000

> sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.7.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] lavaan.mi_0.1-0 lavaan_0.6-19   

Bryan Stiles

Mar 19, 2025, 11:33:18 AM
to lavaan
Hi Dr. Jorgensen, that worked! I think I was running an old version of lavaan.

A couple of remaining questions. Is it safe to assume that the CFI and RMSEA being returned here are the robust CFI and robust RMSEA? I'm not sure how comparable these values are to what I would get running a CFA in lavaan, which provides non-robust, scaled, and robust estimates.

Lastly, how does one interpret a loading denoted with " .* "? Does this indicate that the loading is below a certain default cutoff that efa() uses, but is still significantly different from zero at the .01 level?

Thank you again for all your help, I really appreciate it.


Terrence Jorgensen

Mar 20, 2025, 7:21:45 AM
to lavaan
Is it safe to assume that the CFI and RMSEA that are being returned here are robust CFI and robust RMSEA?

You can run fitMeasures() to look for those, which will be explicitly labeled "cfi.robust" and "rmsea.robust".  I just adapted the ?efa help-page example to check, and it looks like the robust ones are in the summary().
 
how does one interpret a loading that is denoted with " .* "? 

See ?summary.efaList to read about the cutoff=, dot.cutoff=, and alpha.level= arguments.
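For intuition, my reading of those documented defaults (cutoff = 0.3, dot.cutoff = 0.1, alpha.level = 0.01) is roughly the following. This show_loading() helper is a made-up caricature for illustration, not lavaan's actual printing code:

```r
## Caricature of how a single standardized loading might be displayed,
## given the cutoff=, dot.cutoff=, and alpha.level= defaults.
## (show_loading() is a hypothetical helper, not a lavaan function.)
show_loading <- function(est, pval,
                         cutoff = 0.3, dot.cutoff = 0.1, alpha.level = 0.01) {
  sig <- if (pval < alpha.level) "*" else ""      # flag significance
  if (abs(est) >= cutoff) {
    paste0(format(round(est, 3)), sig)            # large enough: print the value
  } else if (abs(est) >= dot.cutoff) {
    paste0(".", sig)                              # small but nonzero: print a dot
  } else {
    ""                                            # tiny: suppress entirely
  }
}

show_loading(0.617, 0.001)   # "0.617*"
show_loading(0.15,  0.004)   # ".*"  <- small loading, still significant
show_loading(0.05,  0.600)   # ""
```

On this reading, " .* " means the loading's absolute value fell between dot.cutoff= and cutoff=, but its p-value was below alpha.level= — check ?summary.efaList for the authoritative behavior.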

Bryan Stiles

Apr 10, 2025, 10:58:28 AM
to lavaan
Dr. Jorgensen, 

I am coming back to this discussion with a question. I originally fit a series of a priori CFA models using the lavaan.mi() approach, and in the presence of poor model fit/misspecification, ran through the procedure elaborated earlier in this thread, using poolSat() to derive a pooled set of summary statistics on which an EFA could be conducted. However, it occurred to me that this might present a problem if the CFA models fit with lavaan.mi() and the EFA model fit with poolSat() and efa() are drawing on essentially different mean and covariance structures. From your perspective, if someone were using CFA and EFA this way with multiply imputed datasets, would it be ideal to retain only one inferential approach rather than mixing them?

Thank you for your perspective.

Best,
Bryan

Bryan Stiles

Apr 10, 2025, 12:34:08 PM
to lavaan
P.S.: I ask because fitting the WLSMV-estimated CFA with the two methods yields relatively similar parameter estimates (e.g., factor loadings and variances), but the fit indices, especially the robust CFI/TLI/RMSEA, differ considerably.

Fitting a two-factor model with lavaan.mi() 
# lavaan model for two first-order factors
cfa2 <- '
  pos =~ NA*SCS_SF_q2 + SCS_SF_q6 + SCS_SF_q5 + SCS_SF_q10 + SCS_SF_q3 + SCS_SF_q7
  neg =~ NA*SCS_SF_q11rev + SCS_SF_q12rev + SCS_SF_q4rev + SCS_SF_q8rev + SCS_SF_q1rev + SCS_SF_q9rev
  pos ~~ 1*pos
  neg ~~ 1*neg'
cfa2out <- cfa.mi(cfa2, data = imp2final,
                  ordered = c("SCS_SF_q1rev", "SCS_SF_q2", "SCS_SF_q3", "SCS_SF_q4rev",
                              "SCS_SF_q5", "SCS_SF_q6", "SCS_SF_q7", "SCS_SF_q8rev",
                              "SCS_SF_q9rev", "SCS_SF_q10", "SCS_SF_q11rev", "SCS_SF_q12rev"),
                  estimator = "WLSMV")
summary(cfa2out, standardized = TRUE, rsquare = TRUE, fit.measures = TRUE, test = "D2")

lavaan.mi object fit to 20 imputed data sets using:
 - lavaan    (0.6-19)
 - lavaan.mi (0.1-0)
See class?lavaan.mi help page for available methods.

Convergence information:
The model converged on 20 imputed data sets.
Standard errors were available for all imputations.

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        61

  Number of observations                           170

Model Test User Model:

                                                Standard      Scaled
  Test statistic                                 103.339     123.851
  Degrees of freedom                                  53          53
  P-value                                          0.000       0.000
  Average scaling correction factor                            0.984
  Average shift parameter                                     18.806
    simple second-order correction                                  
  Pooling method                                      D2            
    Pooled statistic                          “standard”            
    “scaled.shifted” correction applied            AFTER     pooling

Model Test Baseline Model:

  Test statistic                               757.618     406.820
  Degrees of freedom                                66          66
  P-value                                        0.000       0.000
  Scaling correction factor                                  2.029

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.927       0.792
  Tucker-Lewis Index (TLI)                       0.909       0.741
                                                                 
  Robust Comparative Fit Index (CFI)                         0.854
  Robust Tucker-Lewis Index (TLI)                            0.818

Root Mean Square Error of Approximation:

  RMSEA                                          0.075       0.089
  90 Percent confidence interval - lower         0.053       0.069
  90 Percent confidence interval - upper         0.096       0.109
  P-value H_0: RMSEA <= 0.050                    0.031       0.001
  P-value H_0: RMSEA >= 0.080                    0.367       0.777
                                                                 
  Robust RMSEA                                               0.111
  90 Percent confidence interval - lower                     0.092
  90 Percent confidence interval - upper                     0.130
  P-value H_0: Robust RMSEA <= 0.050                         0.000
  P-value H_0: Robust RMSEA >= 0.080                         0.996

Standardized Root Mean Square Residual:

  SRMR                                           0.095       0.095


Fitting with the poolSat() method:

## fit saturated model to imputations, pool those results
columns_to_subset <- c("SCS_SF_q1rev", "SCS_SF_q2", "SCS_SF_q3", "SCS_SF_q4rev", "SCS_SF_q5", "SCS_SF_q6",
                       "SCS_SF_q7", "SCS_SF_q8rev", "SCS_SF_q9rev", "SCS_SF_q10", "SCS_SF_q11rev", "SCS_SF_q12rev")

impSubset1 <- lapply(complete(imp2final, "all"), function(df) df[, columns_to_subset])
prePooledData <- poolSat(impSubset1, ordered = TRUE)

# lavaan model for two first-order factors
cfa2.pooled <- '
  pos  =~ NA*SCS_SF_q2 + SCS_SF_q6 + SCS_SF_q5 + SCS_SF_q10 + SCS_SF_q3 + SCS_SF_q7
  neg =~ NA*SCS_SF_q11rev + SCS_SF_q12rev + SCS_SF_q4rev + SCS_SF_q8rev + SCS_SF_q1rev + SCS_SF_q9rev
 
  pos ~~ 1*pos
  neg ~~ 1*neg'
cfa2.pooled.out <- cfa(cfa2.pooled, data=prePooledData, estimator = "WLSMV")
summary(cfa2.pooled.out, rsquare=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 28 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        61

  Number of observations                           170

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                91.157     113.371
  Degrees of freedom                                53          53
  P-value (Chi-square)                           0.001       0.000
  Scaling correction factor                                  0.952
  Shift parameter                                           17.659
    simple second-order correction                                

Model Test Baseline Model:

  Test statistic                              1419.303     742.264
  Degrees of freedom                                66          66
  P-value                                        0.000       0.000
  Scaling correction factor                                  2.001

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.972       0.911
  Tucker-Lewis Index (TLI)                       0.965       0.889
                                                                 
  Robust Comparative Fit Index (CFI)                         0.187
  Robust Tucker-Lewis Index (TLI)                           -0.013

Root Mean Square Error of Approximation:

  RMSEA                                          0.065       0.082
  90 Percent confidence interval - lower         0.042       0.061
  90 Percent confidence interval - upper         0.088       0.103
  P-value H_0: RMSEA <= 0.050                    0.132       0.008
  P-value H_0: RMSEA >= 0.080                    0.146       0.585
                                                                 
  Robust RMSEA                                               0.255
  90 Percent confidence interval - lower                     0.227
  90 Percent confidence interval - upper                     0.284
  P-value H_0: Robust RMSEA <= 0.050                         0.000
  P-value H_0: Robust RMSEA >= 0.080                         1.000

Standardized Root Mean Square Residual:

  SRMR                                           0.091       0.091

Terrence Jorgensen

Apr 10, 2025, 8:51:20 PM
to lavaan
From your perspective, if someone were using CFA and EFA this way with multiple imputed datasets, would it be ideal to retain only one inferential approach rather than mixing them?

Personally, I would prefer the consistency.  Asymptotically, it probably doesn't matter.  

fit indices, especially the Robust CFI/TLI/RMSEA, differ considerably

Indeed, the baseline model in particular seems to fit much worse in one method than the other.  The methods should perform similarly in large samples and for models that fit reasonably well.  Neither of those holds in your case, so it is less predictable how different you should expect the results of each method to be.

Bryan Stiles

Apr 12, 2025, 5:38:24 PM
to lavaan
I see; thank you. Out of curiosity, is there an approach for EFA that would mimic the procedure used in something like cfa.mi(), where an SEM is fitted to each imputed dataset and the results are then pooled? I recall the error message I received when trying to fit the EFA via cfa.mi(), and I wonder if one could fit separate EFA models to each imputed dataset and then apply Rubin's rules to pool the model parameters (e.g., rotated factor loadings). Similarly for the parallel analysis (I suppose running separate parallel analyses on each imputed dataset, though I imagine this could become problematic if the suggested number of factors varies across imputations).

Terrence Jorgensen

Apr 13, 2025, 3:35:22 PM
to lavaan
I wonder if one could fit separate EFA models to each imputed dataset and then apply Rubin's rules to pool the model parameters (e.g., rotated factor loadings). 

You can, but it is not in general certain that "factor 1" in the first imputation is the same extracted factor as "factor 1" in any other imputation.  Extracted factors can "switch places" across imputations, just as they can across replications in a simulation study.  The same problem occurs with latent classes.  The higher the fraction of missing information, the more frequently switching can occur.
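To make the switching problem concrete: before Rubin's rules could be applied to rotated loadings, each imputation's loading matrix would first have to be aligned to a common reference (columns reordered and signs flipped). A rough base-R sketch of a greedy alignment, purely for illustration (align_loadings() is a made-up helper, not a lavaan.mi function):

```r
## Align one imputation's loading matrix to a reference by greedily
## matching columns on absolute correlation, then fixing column signs.
## (Illustrative only; real ESEM alignment is more involved.)
align_loadings <- function(L, ref) {
  k <- ncol(ref)
  perm <- integer(k)
  used <- logical(k)
  for (j in seq_len(k)) {                 # best unused match per reference factor
    cors <- abs(cor(ref[, j], L))
    cors[used] <- -Inf
    perm[j] <- which.max(cors)
    used[perm[j]] <- TRUE
  }
  out <- L[, perm, drop = FALSE]
  for (j in seq_len(k))                   # flip signs to agree with reference
    if (cor(ref[, j], out[, j]) < 0) out[, j] <- -out[, j]
  out
}

ref <- cbind(c(.7, .6, .1, .0), c(.0, .1, .8, .7))  # reference loadings
L2  <- cbind(-ref[, 2], ref[, 1])                   # columns switched, one sign flipped
aligned <- align_loadings(L2, ref)
all.equal(aligned, ref, check.attributes = FALSE)   # TRUE: order and signs recovered
```

Even with such a step, alignment can fail when factors are weakly determined, which is one reason conducting the EFA on a single set of pooled summary statistics sidesteps the problem entirely.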