How can I make my ML & DWLS & MLR & WLSMV SEM as reproducible as possible without sharing data?

Charly Marie

Oct 1, 2024, 8:28:33 AM

to lavaan

Hi all,

I am comparing different models, with proprietary data that cannot be shared, even de-identified.

I still want to make my results as transparent and reproducible as possible. I have read the lavaan group and know that I can make the analyses reproducible by sharing some components of my model. But I am a bit confused as to what I should report for each type of estimator and how to access it using lavaan.

Specifically:

Some of my main SEMs are fit using continuous data with maximum likelihood;
Some of my other main SEMs are fit using ordered data with DWLS;
Robustness checks include WLSMV and MLR;

Can you help me with this?

Thanks a lot.

Charly Marie

Terrence Jorgensen

Oct 2, 2024, 8:17:40 AM

to lavaan

> I have read the lavaan group and know that I can make the analyses reproducible by sharing some components of my model. But I am a bit confused as to what I should report for each type of estimator and how to access it using lavaan.

Wow, I am writing a grant proposal right now, to make an open-science tool for researchers to do exactly this. Now I can link to this public post in my proposal, to show the need for this is not in my imagination :-)

> Some of my main SEMs are fit using continuous data with maximum likelihood;

If you analyze complete data (i.e., not using missing = "FIML"), then you can provide the covariance matrix (and sample means, if relevant) that lavaan fit the model to. You can obtain summary statistics from your fitted model:

STATS <- lavInspect(fit, "sampstat")

These can be passed to the sample.cov= (and sample.mean=, if relevant) argument.

You also need your sample size(s) to pass to the sample.nobs= argument.

N <- lavInspect(fit, "nobs")
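As a minimal sketch of the full round trip, the example below fits a CFA to lavaan's built-in HolzingerSwineford1939 data (a stand-in for the proprietary data), extracts the summary statistics, and refits the model without raw data. The model syntax and the object names are illustrative and not part of the original analysis. One assumption worth checking in your own setup: for ML, lavInspect() appears to return the covariance matrix already rescaled by (N - 1)/N, so the refit sets sample.cov.rescale = FALSE to avoid rescaling it a second time.

```r
library(lavaan)

# Illustrative three-factor CFA; HolzingerSwineford1939 ships with lavaan
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

# Original analysis with raw (complete) data
fit <- cfa(model, data = HolzingerSwineford1939, meanstructure = TRUE)

# Summary statistics that can be shared instead of the raw data
STATS <- lavInspect(fit, "sampstat")  # list with $cov and $mean
N     <- lavInspect(fit, "nobs")

# Reproduce the analysis from the summary statistics alone
fit_repro <- cfa(model,
                 sample.cov  = STATS$cov,
                 sample.mean = STATS$mean,
                 sample.nobs = N,
                 meanstructure = TRUE,
                 # assumption: STATS$cov is already rescaled by (N - 1)/N
                 sample.cov.rescale = FALSE)

# Compare point estimates and the chi-squared statistic side by side
print(cbind(original = coef(fit), reproduced = coef(fit_repro)))
print(c(original   = unname(fitMeasures(fit, "chisq")),
        reproduced = unname(fitMeasures(fit_repro, "chisq"))))
```

If the two columns differ by a small constant factor in the variance estimates, the rescaling option is the first thing to check.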

An analysis of incomplete data is not currently fully reproducible without raw data. I think in principle it could be, but that would require some software development (which is what my grant proposal is about).

Note that if you used multiple imputation of incomplete data, you could then use complete-data methods for the analyses. Those can be fully reproduced by using pooled summary statistics. See the lavaan.mi package's poolSat() function, and read the help-page references and examples to see how it works.

> Some of my other main SEMs are fit using ordered data with DWLS

In this case, the summary statistics (obtained using the same syntax as above for continuous data) will include thresholds for any ordinal outcomes. In order to reproduce the results, you would need to add an "attribute" that tells lavaan which thresholds belong to which observed variables. You can extract the thresholds and assign the necessary attribute as follows, then pass the result to the sample.th= argument:

## if there is only 1 group:

THR <- STATS$th

## if there are multiple groups:

THR <- sapply(STATS, "[[", i = "th", simplify = FALSE)

attr(THR, "th.idx") <- lavInspect(fit, "th.idx")

Note that the same sapply() trick can be used to create a list of covariance matrices (or mean vectors) for those arguments when fitting a multigroup model.
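A minimal sketch of that multigroup case, again using lavaan's built-in HolzingerSwineford1939 data with "school" as the grouping variable. The model and object names are illustrative. As an assumption, sample.cov.rescale = FALSE is set in the refit because the extracted ML covariance matrices appear to be rescaled by (N - 1)/N already.

```r
library(lavaan)

model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

# Original two-group analysis with raw data
fit_mg <- cfa(model, data = HolzingerSwineford1939,
              group = "school", meanstructure = TRUE)

# With multiple groups, "sampstat" returns one list element per group
STATS <- lavInspect(fit_mg, "sampstat")

# Pull the same component out of every group's element
COV  <- sapply(STATS, "[[", i = "cov",  simplify = FALSE)
MEAN <- sapply(STATS, "[[", i = "mean", simplify = FALSE)
N    <- lavInspect(fit_mg, "nobs")  # one sample size per group

# Refit from the per-group summary statistics
fit_mg_repro <- cfa(model,
                    sample.cov  = COV,
                    sample.mean = MEAN,
                    sample.nobs = N,
                    meanstructure = TRUE,
                    # assumption: matrices in COV are already rescaled
                    sample.cov.rescale = FALSE)

print(cbind(original = coef(fit_mg), reproduced = coef(fit_mg_repro)))
```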

> Robustness checks include WLSMV and MLR;

You mean robust corrections for SEs and test statistics.

MLR is only necessary for incomplete data analyzed using FIML, which I said above is not yet reproducible without raw data. With complete data, you should use MLM, which is reproducible (and more efficient, working better in small samples than MLR). The data are analyzed as numeric, so you need to provide the summary statistics mentioned above (sample.cov= and sample.mean=), along with their robust asymptotic covariance matrix.

NACOV <- lavInspect(fit, "gamma")

Your MLM results should be reproducible using these additional arguments:

estimator = "ML"
se = "robust.sem"
test = "satorra.bentler"
NACOV = NACOV
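Put together, an MLM reproduction might look like the sketch below. It uses the built-in HolzingerSwineford1939 data as a stand-in, and the model and object names are illustrative. Two assumptions are made explicit: meanstructure = TRUE is set in both fits so that the dimensions of NACOV match the statistics being analyzed, and sample.cov.rescale = FALSE is set because the extracted covariance matrix appears to be rescaled already.

```r
library(lavaan)

model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

# Original analysis: ML estimates with Satorra-Bentler corrections
fit_mlm <- cfa(model, data = HolzingerSwineford1939,
               estimator = "MLM", meanstructure = TRUE)

# Everything needed to reproduce it without raw data
STATS <- lavInspect(fit_mlm, "sampstat")
N     <- lavInspect(fit_mlm, "nobs")
NACOV <- lavInspect(fit_mlm, "gamma")  # asymptotic covariance of the sample statistics

# Refit from summary statistics, requesting the same robust corrections
fit_mlm_repro <- cfa(model,
                     sample.cov  = STATS$cov,
                     sample.mean = STATS$mean,
                     sample.nobs = N,
                     meanstructure = TRUE,
                     sample.cov.rescale = FALSE,  # assumption: already rescaled
                     estimator = "ML",
                     se        = "robust.sem",
                     test      = "satorra.bentler",
                     NACOV     = NACOV)

# Compare robust SEs and the scaled test statistic
print(cbind(original   = parameterEstimates(fit_mlm)$se,
            reproduced = parameterEstimates(fit_mlm_repro)$se))
print(c(original   = unname(fitMeasures(fit_mlm, "chisq.scaled")),
        reproduced = unname(fitMeasures(fit_mlm_repro, "chisq.scaled"))))
```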

For DWLS estimation with robust corrections (which is what WLSMV requests), you also need NACOV to adjust SEs and tests, along with the diagonal weight matrix used during estimation:

W <- lavInspect(fit, "WLS.V")

Your DWLS results should be reproducible using these additional arguments:

estimator = "DWLS"
se = "robust.sem"
test = "scaled.shifted"
NACOV = NACOV
WLS.V = W
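As a sketch of the extraction side only, the example below collects every object listed above into one list and saves it to a file that could be shared in place of the raw data. The ordinal data are hypothetical: they are made by cutting the continuous HolzingerSwineford1939 items into three ordered categories, purely so the example runs. The list element names and the file name are illustrative choices, not a lavaan convention.

```r
library(lavaan)

# Hypothetical ordinal data: cut each continuous item into 3 ordered categories
items <- paste0("x", 1:9)
dat <- as.data.frame(lapply(HolzingerSwineford1939[items], function(x) {
  cut(x, breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
      include.lowest = TRUE, labels = FALSE)
}))

model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

# Original analysis: DWLS estimates with robust SEs and a scaled-shifted test
fit <- cfa(model, data = dat, ordered = items, estimator = "WLSMV")

# Thresholds need the attribute that maps them to observed variables
STATS <- lavInspect(fit, "sampstat")
THR   <- STATS$th
attr(THR, "th.idx") <- lavInspect(fit, "th.idx")

# Bundle everything a reader would need to refit the model
bundle <- list(
  sample.cov  = STATS$cov,                  # polychoric correlations
  sample.th   = THR,                        # thresholds with th.idx attribute
  sample.nobs = lavInspect(fit, "nobs"),
  NACOV       = lavInspect(fit, "gamma"),
  WLS.V       = lavInspect(fit, "WLS.V")
)

# Save to a file (a temporary path here; use a real path when sharing)
path <- file.path(tempdir(), "dwls_summary_stats.rds")
saveRDS(bundle, path)
str(readRDS(path), max.level = 1)
```

Saving with saveRDS() rather than as plain text keeps the "th.idx" attribute attached to the thresholds.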

Also note that all the methods described above require setting conditional.x=FALSE when you have exogenous observed variables. That is usually the default, unless you have categorical endogenous variables (i.e., using DWLS), so in the latter case you need to explicitly set that argument. Making results reproducible when conditional.x=TRUE would require further software development.
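To see which setting a fitted model actually used, the options can be inspected, as in the sketch below. The data are hypothetical (three HolzingerSwineford1939 items cut into ordered categories, with ageyr as an exogenous observed predictor), and the model is illustrative only.

```r
library(lavaan)

# Hypothetical data: three ordinal indicators plus a numeric covariate
dat <- HolzingerSwineford1939
ord <- c("x1", "x2", "x3")
for (v in ord) {
  dat[[v]] <- cut(dat[[v]],
                  breaks = quantile(dat[[v]], probs = c(0, 1/3, 2/3, 1)),
                  include.lowest = TRUE, labels = FALSE)
}

model <- '
  visual =~ x1 + x2 + x3
  visual ~ ageyr
'

# Default settings with categorical outcomes and an exogenous covariate
fit_default <- sem(model, data = dat, ordered = ord)
print(lavInspect(fit_default, "options")$conditional.x)

# Explicitly request the joint (unconditional) analysis, so that the
# summary statistics describe all observed variables, including ageyr
fit_joint <- sem(model, data = dat, ordered = ord, conditional.x = FALSE)
print(lavInspect(fit_joint, "options")$conditional.x)
```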

Good luck,

Terrence D. Jorgensen (he, him, his)
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
http://www.uva.nl/profile/t.d.jorgensen

Charly Marie

Oct 2, 2024, 9:50:53 AM

to lavaan

Thanks, this is exactly what I thought was missing: all the statistics one needs to report, and how to report them, centralized in one place, for different types of models.

Feel free to link this post to your proposal, and thank you for doing so. Your project would definitely advance (open) science :).
