Pooling factor scores across imputed data sets?


garc...@unlv.nevada.edu

Aug 8, 2018, 8:38:52 PM
to lavaan
Hello,

I have recently fit several CFA models across a set of 30 imputed data sets. Is there a way to pool factor scores across these data sets? As I understand it, the lavPredict() function only works on lavaan objects. Is there a similar function for lavaan.mi objects? If not (which is my worry), does anyone have suggestions on how I might calculate these factor scores?

Thank you in advance for any responses!
Breanna Garcia 

Mauricio Garnier-Villarreal

Aug 9, 2018, 2:01:42 AM
to lavaan
There is no equivalent function for factor scores for lavaan.mi objects. There are rules on how to pool parameters and standard errors with multiple imputations. I would see the factor scores as parameters, and you could follow those rules to pool them. For parameters, the pooled factor score would be the average across imputations.

This is something you would have to do on your own with a loop: estimate the CFA with each imputed data set, save the factor scores from each one, and then average them.
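A minimal sketch of the averaging step described above, written in plain Python as a language-neutral illustration. It assumes the per-imputation factor scores have already been extracted (in R, for example, by calling lavPredict() on each fitted lavaan object) and that cases appear in the same order in every imputation. The function name and the toy values are hypothetical.

```python
from statistics import mean


def average_factor_scores(scores_by_imputation):
    """Average each case's factor score across imputations.

    scores_by_imputation: list of m lists, one per imputed data set,
    each holding one factor score per case in the same case order.
    """
    n_cases = len(scores_by_imputation[0])
    if any(len(scores) != n_cases for scores in scores_by_imputation):
        raise ValueError("every imputation must contain the same cases")
    # zip(*...) regroups the scores so each tuple holds one case's m scores.
    return [mean(case_scores) for case_scores in zip(*scores_by_imputation)]


# Hypothetical scores for 3 cases from m = 3 imputed data sets.
toy_scores = [
    [0.10, -0.50, 1.20],
    [0.20, -0.40, 1.00],
    [0.30, -0.60, 1.10],
]
pooled = average_factor_scores(toy_scores)
```

This only produces point values per case; it carries no information about how much the scores vary across imputations.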

Terrence Jorgensen

Aug 9, 2018, 3:49:11 AM
to lavaan
> There are rules on how to pool parameters and standard errors with multiple imputations. I would see the factor scores as parameters, and you could follow those rules to pool them. For parameters, the pooled factor score would be the average across imputations.

I would hesitate to see it that way unless factor scores are only linear functions of the parameters, in which case the average of the transformed values is the same as the transformation of the average.  For instance, the fitted() and resid() methods for lavaan.mi objects do not simply average the model-implied moments or residuals across imputations, because those are not estimated parameters.  Instead, the model-implied moments and residuals are calculated from the pooled parameters, as they would be if using complete data (and those 2 methods do not yield the same result).
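The linearity point can be checked numerically. The sketch below is a toy Python illustration with hypothetical estimates, not lavaan output. It compares "average the transformed values" with "transform the average" for one linear and one nonlinear function.

```python
from statistics import mean


def mean_of_transformed(estimates, transform):
    """Apply the transformation per imputation, then average."""
    return mean(transform(e) for e in estimates)


def transform_of_mean(estimates, transform):
    """Average the estimates first, then apply the transformation once."""
    return transform(mean(estimates))


def linear(x):
    return 2.0 * x + 1.0


def nonlinear(x):
    return x ** 2


# Hypothetical estimates of one parameter from m = 3 imputations.
estimates = [1.0, 2.0, 3.0]

linear_gap = mean_of_transformed(estimates, linear) - transform_of_mean(estimates, linear)
nonlinear_gap = mean_of_transformed(estimates, nonlinear) - transform_of_mean(estimates, nonlinear)
```

For the linear function the two orders agree, so the gap is zero. For the squared function the gap equals the spread of the estimates around their mean, which is why the order of pooling and transforming matters for quantities that are not linear in the parameters.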

I haven't written a predict() method for lavaan.mi objects because I don't support treating factor scores as though they were observed values that can be used in a subsequent analysis.  Failing to take the uncertainty of estimated factor scores into account yields biased SEs and test statistics, so error rates can be greatly inflated.  You can, however, adjust those SEs by using a 2-step estimation process.


Yves is working on implementing that method in lavaan, but I think the new multilevel features are taking priority right now.  Once that is available, I would actually recommend using the 2-step factor-score regression analysis on each imputed data set, then pooling those estimates and (correct!) SEs and test statistics.  I would make sure runMI() can incorporate such an analysis, but I can't know when it would be available.
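For the pooling step, here is a minimal Python sketch of the standard multiple-imputation rules (Rubin's rules) for one estimate and its SE. It assumes the per-imputation estimates and SEs are already available as plain lists. The function name and the toy numbers are hypothetical, and pooling of test statistics is not covered.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputations using Rubin's rules.

    Returns (pooled_estimate, pooled_se).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and SEs from at least 2 imputations")
    pooled_estimate = mean(estimates)
    # Within-imputation variance: average of the squared SEs.
    within = mean(se ** 2 for se in std_errors)
    # Between-imputation variance: sample variance of the estimates (m - 1 denominator).
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Hypothetical regression slope and SE from m = 3 imputed data sets.
est, se = pool_rubin([0.40, 0.50, 0.60], [0.10, 0.10, 0.10])
```

The pooled SE is larger than any single-imputation SE here because the between-imputation spread of the estimates is added to the average sampling variance.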

Terrence D. Jorgensen
Postdoctoral Researcher, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Inga

Sep 15, 2023, 5:43:56 AM
to lavaan

Dear lavaan community,

I would like to follow up on this question as I am encountering a similar issue (tbh the issue relates to other threads as well, but I couldn’t come up with an “aggregate” solution). I am attempting to fit a CFA with one latent factor, using a combination of continuous and binary indicator variables. The dataset also contains missing values. My objective is to obtain a factor score that I can subsequently use in analyses to examine its relationship with disease progression. I am aware that this approach is debatable.

Given the data characteristics, my preference was to estimate the model with WLSMV and multiple imputation. As a sanity check, I also estimated the same model structure on the complete datasets for comparison, and the results aligned well with each other. However, I am currently facing the challenge of deriving factor scores, as plausible values and SEs appear to be unattainable with models containing categorical data. Are there new solutions implemented in lavaan that I may not have come across yet, or would I need to rely on blavaan for this (as here: https://groups.google.com/g/lavaan/c/P5n1XILPo0M/m/n6u5EwgYBAAJ)? I’m not yet experienced with this Bayesian approach and I would highly appreciate some guidance.

 Another option I have been contemplating is the aggregation of the categorical indicator variables, resulting in a risk factor sum score—a widely used practice in my field. With this approach, I might be able to transform the categorical indicators to exhibit continuous characteristics. I could then address the missing data in the dataset using MLR/FIML in the CFA when combining it with the other continuous indicators. However, at present, this appears to be more of a "last resort" or compromise solution.

 I would greatly appreciate any advice and suggestions. Thank you in advance for your valuable feedback! If details are too hazy here, I’m happy to provide more information of course.

Terrence Jorgensen

Sep 15, 2023, 6:41:54 AM
to lavaan

> would I need to rely on blavaan for this (as here: https://groups.google.com/g/lavaan/c/P5n1XILPo0M/m/n6u5EwgYBAAJ)?

Yes, that is what I would suggest
 

> I’m not yet experienced with this Bayesian approach and I would highly appreciate some guidance.

The nice part there is that you can estimate the CFA in blavaan using incomplete data, assuming you can incorporate any auxiliary variables as saturated correlates to justify the MAR assumption.  But if your data are already imputed, you could also fit the CFA using one chain for each imputation, and pool all those for a single posterior distribution.  That latter method isn't automated in blavaan though.
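The "pool all those for a single posterior" step could look roughly like the Python sketch below. It assumes the posterior draws of one quantity (a parameter or one case's factor score) have already been extracted from each per-imputation chain as plain lists of equal length. How to extract them from blavaan is not shown, and the function names and toy draws are hypothetical.

```python
from statistics import mean, quantiles


def pool_posterior_draws(draws_by_imputation):
    """Concatenate the draws from every per-imputation chain into one sample."""
    pooled = []
    for draws in draws_by_imputation:
        pooled.extend(draws)
    return pooled


def summarize(draws):
    """Posterior mean and a central 95% interval from the pooled draws."""
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%; keep the first and last.
    cuts = quantiles(draws, n=40, method="inclusive")
    return {"mean": mean(draws), "lower": cuts[0], "upper": cuts[-1]}


# Hypothetical draws of one factor score from chains fit to 2 imputed data sets.
toy_draws = [
    [0.1, 0.2, 0.3],
    [0.3, 0.4, 0.5],
]
pooled_draws = pool_posterior_draws(toy_draws)
summary = summarize(pooled_draws)
```

Using the same number of retained draws per imputation gives each imputed data set equal weight in the pooled sample. A real analysis would use many more draws than this toy input.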

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Stas Kolenikov

Sep 15, 2023, 12:10:18 PM
to lav...@googlegroups.com
Inga,

if you have access to Stata, you can do your analysis as a combination of -mi- for missing value imputation / plausible values, and -gsem- for modelling of any type of response data.

Doing Bayesian stuff properly takes two semesters of graduate statistics to understand what's going on. Of course you can always throw your data and your syntax into a black box, and with some luck you will get some numbers in the output. But there is a fair amount of both fine-tuning the MCMC algorithms and diagnostics that has to be done for the answers to be trustworthy. Importantly, proper Bayesian computing produces convergence red flags with multiple imputations / plausible values... because it wants to see infinitely many draws rather than 5 draws. So things are not what they seem.
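One way such a red flag could arise, offered as an assumption rather than a description of what any particular software reports: if each chain is run on a different imputed data set and the chains are then compared with the classic Gelman-Rubin statistic, differences between imputations look like disagreement between chains. The Python sketch below uses hypothetical toy chains.

```python
from math import sqrt
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    # Average within-chain variance.
    within = mean(variance(chain) for chain in chains)
    # Between-chain variance, scaled by the chain length.
    between = n * variance([mean(chain) for chain in chains])
    pooled_var = (n - 1) / n * within + between / n
    return sqrt(pooled_var / within)


# Two chains that agree, e.g. both fit to the same data.
agreeing = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
# Two chains centered on different values, e.g. fit to different imputed data sets.
shifted = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]

rhat_agreeing = gelman_rubin(agreeing)
rhat_shifted = gelman_rubin(shifted)
```

The shifted chains give a value far above the conventional cutoffs near 1, even though each chain may have converged for its own imputed data set. The shift here is exaggerated to make the effect visible.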

-- Stas Kolenikov, PhD, PStat (ASA, SSC)  @StatStas 
-- Principal Statistician, NORC @NORCnews
-- Opinions stated in this email are mine only, and do not reflect the position of my employer
-- http://stas.kolenikov.name
 

