Measurement invariance in lavaan - several questions


Andrija

Mar 28, 2026, 6:37:05 PM
to lavaan
Hi,

I hope this message finds you in good health.

My colleague and I are comparing UK and Montenegro data for a 15-item scale with a 5-factor solution (3 items per factor).

I analyzed the data in lavaan using WLSMV, with std.lv = TRUE in all models, theta parameterization, and the variables declared ordinal. The results supported configural, metric, and scalar invariance, but not strict invariance.

My colleague analyzed the data in Mplus and metric invariance was not reached (theta parameterization, DIFFTEST). So it is not a small difference in results; it changes the analysis completely.

On top of everything, I have learned that robust indices should be used, since they are superior to scaled ones (Mplus only produces scaled ones), but that comparing CFI and RMSEA is not the way to determine invariance. I implemented RMSEA_D, but I am still not sure how to interpret it.

There are several things I am not sure about:
1) Is there an official explanation on how to do measurement invariance for ordinal variables using WLSMV in lavaan?

2) Is there an official recommendation on which indices or statistics to use to determine configural, metric, scalar, and strict invariance? (I would also like to hear the official order for WLSMV in lavaan: thresholds, loadings, residuals, ...) Is lavTestLRT() enough with its chi-square difference? Should RMSEA_D also be used? What are the cutoffs for RMSEA_D?

3) Which statistics and indices should be reported in any case? If delta CFI is not the way to compare models, should we even report robust CFI (which is not available for the scalar model in jamovi and JASP, but which I do get when I run the code in RStudio)?

4) lavaan and Mplus yield the same result for configural invariance, but different results for the next levels: the number of free parameters differs (probably the constraints), and the chi-square is different, along with the df. What is the logic behind that, and which approach is better, since the results are opposite?

I am sorry if I am asking about something that is available somewhere, but I couldn't find a complete explanation anywhere. I have watched this tutorial, but the papers do not seem to follow it; everyone uses delta CFI. I really need structure, clear answers, and clarification of the Mplus/lavaan differences.

Also, thank you guys for contributing to the science.

Kind regards,
Andrija


Yves Rosseel

Mar 29, 2026, 8:57:42 AM
to lav...@googlegroups.com
Is it possible for you to share the R/Mplus code and the dataset? That
would make it easier to pinpoint the discrepancy.

In any case: are the models that you fitted for metric (and scalar)
invariance the same across programs? You can check the
free/fixed parameters in lavaan by typing

lavInspect(fit, "free")

and in Mplus by adding 'tech1' to the 'Output:' section of the input.

Since version 0.6-20, lavaan has used the 'Wu and Estabrook' approach to
specify models with metric/scalar invariance in the categorical setting.
Perhaps this is the source of the discrepancy?

Yves.

Terrence Jorgensen

Apr 1, 2026, 6:37:56 AM
to lavaan
My colleague analyzed the data in Mplus and metric invariance was not reached

If they did not test equality of loadings AFTER equating thresholds, then that test is meaningless.
 
I implemented RMSEA-D, but still not sure about how to interpret it.

It is the average amount of increased misfit per parameter constraint, which is why it was originally called the "root deterioration per restriction" (RDR).  I do not find that (or any global fit index, really) to be a useful interpretation, because the same value could be due to (a) several small misspecifications or (b) one large misspecification.  Translated to the context of testing invariance, RMSEA_D is not capable of distinguishing between (a) all items being nearly/approximately invariant and (b) most items being perfectly invariant while 1 or 2 items exhibit large DIF (differential item functioning).

If you are looking for a more interpretable effect size to describe your model's misfit (to pair with the test of perfect fit provided by the chi-squared test), then I usually recommend inspecting lavResiduals(), which reveals which specific pairwise relationships your model fails to capture closely.  But those residuals don't provide clear guidance when invariance doesn't hold, in which case I find EPC-interest more useful.



You can search the archives of this forum to find my past examples showing how to use lavTestScore() to obtain these.
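As a minimal sketch of those two diagnostics (the object name `fit.strict` is a placeholder, not from this thread; it stands for any constrained multigroup lavaan model):

```r
## Hedged sketch: fit.strict is a placeholder for a fitted multigroup
## lavaan model with cross-group equality constraints.
library(lavaan)

## Correlation residuals: which observed pairwise relationships
## does the constrained model fail to reproduce closely?
lavResiduals(fit.strict)

## Score tests for the model's equality constraints, with expected
## parameter changes (epc = TRUE); large EPCs flag items that may
## exhibit DIF.
lavTestScore(fit.strict, epc = TRUE)
```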


1) Is there an official explanation on how to do measurement invariance for ordinal variables using WLSMV in lavaan?
 
The Wu & Estabrook (2016) article is what I consider the current cornerstone:


This article provided a decent practical tutorial:


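In code, the Wu & Estabrook identification constraints are generated by semTools::measEq.syntax(). A minimal sketch, where `mod`, `dat`, the item names, and the grouping variable `country` are placeholders for your own data:

```r
## Sketch only: mod, dat, items, and "country" are placeholder names.
library(lavaan)
library(semTools)

## Placeholder two-factor fragment; extend to all 5 factors / 15 items.
mod <- ' f1 =~ y1 + y2 + y3
         f2 =~ y4 + y5 + y6 '
items <- paste0("y", 1:6)

## Configural model: Wu & Estabrook (2016) identification,
## no cross-group equality constraints.
fit.config <- measEq.syntax(configural.model = mod, data = dat,
                            ordered = items, parameterization = "theta",
                            ID.fac = "std.lv", ID.cat = "Wu.Estabrook.2016",
                            group = "country", return.fit = TRUE)

## Equal thresholds (proposition 4); use
## group.equal = c("thresholds", "loadings") for proposition 7, etc.
fit.thresh <- measEq.syntax(configural.model = mod, data = dat,
                            ordered = items, parameterization = "theta",
                            ID.fac = "std.lv", ID.cat = "Wu.Estabrook.2016",
                            group = "country", group.equal = "thresholds",
                            return.fit = TRUE)

## Nested-model comparison:
lavTestLRT(fit.config, fit.thresh)
```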

2) Is there an official recommendation on which indices or statistics to use to determine configural, metric, scalar, and strict invariance? (I would also like to hear the official order for WLSMV in lavaan: thresholds, loadings, residuals, ...) Is lavTestLRT() enough with its chi-square difference? Should RMSEA_D also be used? What are the cutoffs for RMSEA_D?

Comparing nested models with lavTestLRT() gives you a statistical test of the H0 of perfect equivalence.
Any NHST should be accompanied by a measure of effect size, which RMSEA_D could serve as (but again, its interpretation has quite limited usefulness).  EPC-interest is a more targeted type of effect size, but it is geared toward revealing how much impact DIF has on other model parameters, so it is useful when you are drawing inferences using CFA/SEM (e.g., to compare latent means, instead of running a t test on scale composites).  If your goals are more about scale development, then you probably want to understand why an item functions differently, so you need to identify what is important to quantify that answers your research question.
 

3) Which statistics and indices should be reported in any case? If delta CFI is not the way to compare models, should we even report robust CFI (which is not available for the scalar model in jamovi and JASP, but which I do get when I run the code in RStudio)?

I wouldn't use JASP or jamovi if they don't provide what you get when you run the model in R.
Journals tend to expect fit indices when reporting SEMs, so you probably want to report some.  But that doesn't mean you have to base your decision on them.  Cutoff criteria don't turn them into tests; they are just suggested ranges for interpreting the "effect size" of misfit as small or large (like interpreting correlations using Cohen's guidelines: .10, .30, or .50).

Terrence D. Jorgensen    (he, him, his)
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
http://www.uva.nl/profile/t.d.jorgensen


Andrija

Apr 2, 2026, 5:06:36 AM
to lavaan
Hi,

Thanks.

So, if this is what I got:


> lavTestLRT(fit_config, fit_thresh)

Scaled Chi-Squared Difference Test (method = "satorra.2000")

lavaan->lavTestLRT():
   lavaan NOTE: The "Chisq" column contains standard test statistics, not the robust test that
   should be reported per model. A robust difference test is a function of two standard (not
   robust) statistics.

            Df AIC BIC  Chisq Chisq diff RMSEA Df diff Pr(>Chisq)  
fit_config 160         325.77                                      
fit_thresh 190         344.70     40.753     0      30    0.09108 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> lavTestLRT(fit_thresh, fit_th_load)

Scaled Chi-Squared Difference Test (method = "satorra.2000")

lavaan->lavTestLRT():
   lavaan NOTE: The "Chisq" column contains standard test statistics, not the robust test that
   should be reported per model. A robust difference test is a function of two standard (not
   robust) statistics.

             Df AIC BIC Chisq Chisq diff    RMSEA Df diff Pr(>Chisq)  
fit_thresh  190         344.7                                         
fit_th_load 200         367.7     16.258 0.070703      10    0.09248 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> lavTestLRT(fit_th_load, fit_strict)

Scaled Chi-Squared Difference Test (method = "satorra.2000")

lavaan->lavTestLRT():
   lavaan NOTE: The "Chisq" column contains standard test statistics, not the robust test that
   should be reported per model. A robust difference test is a function of two standard (not
   robust) statistics.

             Df AIC BIC  Chisq Chisq diff   RMSEA Df diff Pr(>Chisq)    
fit_th_load 200         367.70                                          
fit_strict  215         425.57     62.414 0.10485      15  9.655e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

then I can say that there is invariance between groups at the configural, threshold, and threshold + loadings levels, while there isn't for "strict" invariance (residuals)?

Chi-square is enough, even though everyone says it is too sensitive?

In Svetina et al. (2019), they basically say that authors should figure it out based on several propositions from other authors. Also, they use scaled CFI (probably because the robust one isn't available), but you said it doesn't make sense.

This is the only thing that seems blurry.

Kind regards,
Andrija

Andrija

Apr 2, 2026, 6:48:57 PM
to lavaan
OK, so I would like to recap and make it easier for all of us:

1) I should do a chi-square difference test and choose one of the recommended criteria for differences in fit indices to back it up. If the chi-square difference is significant, but the model keeps good indices and the difference in indices is lower than allowed, then I can reject the chi-square difference test and declare invariance?

2) the levels of invariance for ordinal data: configural, thresholds (proposition 4), thresholds + loadings (proposition 7), thresholds + loadings + intercepts (proposition 11). Wu and Estabrook say: "For example, since for continuous outcomes the invariance of loadings and intercepts guarantees the comparison of both factor means and variances, for ordered polytomous data one can do the same comparison with invariant thresholds, loadings, and intercepts" - but Svetina et al. (2019) stop at proposition 7, making it sound like comparing means is allowed if invariance exists at that level? Somehow the intercepts are fixed before that, so the proposition 11 level is basically not reachable?

3) I should use delta parametrization instead of theta? 

Kind regards,
Andrija

Terrence Jorgensen

Apr 3, 2026, 4:32:34 AM
to lavaan
then I can say that there is invariance between groups at the configural, threshold, and threshold + loadings levels, while there isn't for "strict" invariance (residuals)?

You should test equality of intercepts after equality of loadings, and before equality of residual variances.
That's the order I show you on the ?measEq.syntax help page.
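Expressed as a measEq.syntax() sketch (where `mod`, `dat`, `items`, and `country` are placeholder names, as on the help-page examples), the order amounts to stepping through cumulative group.equal sets:

```r
## Cumulative constraint sets, in order (sketch with placeholder names):
##   configural:   no group.equal constraints
##   + thresholds: "thresholds"
##   + loadings:   c("thresholds", "loadings")
##   + intercepts: c("thresholds", "loadings", "intercepts")
##   + residuals:  c("thresholds", "loadings", "intercepts", "residuals")
library(lavaan)
library(semTools)

## e.g., the scalar-level model (thresholds + loadings + intercepts):
fit.scalar <- measEq.syntax(configural.model = mod, data = dat,
                            ordered = items, parameterization = "theta",
                            ID.fac = "std.lv", ID.cat = "Wu.Estabrook.2016",
                            group = "country",
                            group.equal = c("thresholds", "loadings",
                                            "intercepts"),
                            return.fit = TRUE)
```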

Chi-square is enough, even though everyone says it is too sensitive?

Not everyone, just people who want to ignore evidence that requires them to consider that their assumptions might not be met.  In most contexts, we enjoy rejecting the H0, in which case nobody complains about a test statistic's sensitivity (i.e., power).  That being said, we also need to place emphasis on effect size in all those contexts, as complementary information to an H0 test.  Even if the H0 is not exactly true, it might be nearly true (i.e., a negligible effect size).

A more legitimate argument against basing a decision solely on the test statistic is that the H0 of perfect model fit does not need to be exactly true for a model to make useful predictions.  In the context of testing invariance, meaningful comparisons can probably still be made under partial or approximate invariance.  Unfortunately, (differences in) fit indices often lack a theoretically meaningful interpretation as effect sizes.  That's why I recommended other approaches in my previous reply.

In Svetina et al. (2019), they basically say that authors should figure it out based on several propositions from other authors. Also, they use scaled CFI (probably because the robust one isn't available), but you said it doesn't make sense.

There is nothing about the change in CFI (regardless of scaling) that communicates information about why (or to what degree) invariance might not hold exactly.

1) I should do a chi-square difference test and choose one of the recommended criteria for differences in fit indices to back it up. If the chi-square difference is significant, but the model keeps good indices and the difference in indices is lower than allowed, then I can reject the chi-square difference test and declare invariance?

No, you don't reject a test.  The test allows you to decide whether to reject a H0.  That decision should be supplemented with information about how false the H0 appears to be.  Differences in fit indices don't really communicate that, and the rule-of-thumb cutoffs often fail to capture important/impactful DIF, a problem that counter-intuitively gets worse in larger samples.

https://doi.org/10.1037/met0000152 (continuous indicators)


2) the levels of invariance for ordinal data: configural, thresholds (proposition 4), thresholds + loadings (proposition 7), thresholds + loadings + intercepts (proposition 11).

Right, that's the order, if you want to use CFA to compare latent means.

but Svetina et al. (2019) stop at the proposition 7, making it sound like comparing means is allowed

Did they compare latent means from that model?  I don't think so...

3) I should use delta parametrization instead of theta? 

Only if you are interested in testing equality of residual variances.
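For reference, the parameterization is just an argument to cfa(); a sketch with placeholder names (`mod`, `dat`, `items`):

```r
## Sketch: mod, dat, and items are placeholder names.
## Under "theta", residual variances are explicit model parameters;
## under "delta", scale factors are modeled instead.
library(lavaan)

fit.theta <- cfa(mod, data = dat, ordered = items, group = "country",
                 parameterization = "theta", estimator = "WLSMV")
fit.delta <- cfa(mod, data = dat, ordered = items, group = "country",
                 parameterization = "delta", estimator = "WLSMV")
```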