Measurement invariance using WLSMV: lower chi-square in a more restrictive model (DWLS)?


Luka Komidar

Jun 6, 2019, 6:28:41 AM
to lavaan
Hi everyone,

I'm conducting a simple measurement invariance analysis across gender for a 6-factor model. The data are ordinal (a 3-point response scale), so I'm using WLSMV. I understand that the robust (WLSMV) chi-square can be lower in a more restrictive model (e.g. a lower chi^2 in the scalar model than in the metric model) due to scaling adjustments. However, in my case, the DWLS chi^2 itself is lower in the scalar model than in the metric model. This of course leads to a negative chi^2 difference and problems when comparing the nested models. Does anyone have any idea how that is possible? Also, Mplus returns very different results (e.g. lavaan suggested that metric invariance holds, while Mplus gave the opposite result) - I've pasted the two outputs from lavaan and Mplus below.

lavaan MI testing

(I'm posting the results of LRT with the default method, since I just want to present the strangely behaving (DWLS) chi^2)
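The comparisons below were produced with calls like these (the fit object names are placeholders for my fitted lavaan models):

```r
library(lavaan)

## Nested-model comparisons; for WLSMV fits, lavTestLRT()
## defaults to the "satorra.2000" scaled difference test
lavTestLRT(fit.conf, fit.metric)    # metric vs. configural
lavTestLRT(fit.metric, fit.scalar)  # scalar vs. metric
```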

METRIC vs. CONFIGURAL

Scaled Chi Square Difference Test (method = "satorra.2000")

               Df AIC BIC  Chisq Chisq diff Df diff Pr(>Chisq)
child_conf   2038         3655.8
child_metric 2079         3994.7     48.835      41     0.1872

SCALAR vs. METRIC

Scaled Chi Square Difference Test (method = "satorra.2000")

               Df AIC BIC  Chisq Chisq diff Df diff Pr(>Chisq)
child_metric 2079         3994.7
child_scalar 2120         3856.9     -139.9      41          1

Mplus

Invariance Testing

                   Number of                   Degrees of
     Model        Parameters      Chi-Square    Freedom     P-Value

     Configural          402        3905.570      2038       0.0000
     Metric              361        3956.441      2079       0.0000
     Scalar              275        3892.857      2165       0.0000

                                               Degrees of
     Models Compared              Chi-Square    Freedom     P-Value

     Metric against Configural        91.933        41       0.0000
     Scalar against Configural       172.653       127       0.0044
     Scalar against Metric           104.125        86       0.0893

Terrence Jorgensen

Jun 6, 2019, 7:37:40 AM
to lavaan
negative chi^2 difference and problems when comparing the nested models. Does anyone have any idea how that is possible?

Judging from your change in df between models, I'm guessing your models are not nested.  You didn't provide all your syntax, but I'm guessing you made the common mistake of constraining loadings, then constraining thresholds.  With 3-category indicators, your 2 thresholds are only sufficient to identify the indicator's latent intercept and (residual) variance, so your configural model should simply constrain thresholds to equality and estimate loadings and intercepts.  Then you constrain loadings, then intercepts, as you would with continuous data.  See the help page:

?measEq.syntax


And find details in the Wu & Estabrook paper in the References.
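For example, a minimal sketch of that sequence (the one-factor model, item names, data frame "dat", and grouping variable "gender" are all hypothetical placeholders):

```r
library(lavaan)
library(semTools)

## Hypothetical model with 3-category ordinal indicators
mod <- ' f1 =~ x1 + x2 + x3 + x4 '
items <- paste0("x", 1:4)

## Baseline: thresholds constrained to equality. With only 2 thresholds
## per item, this model fits identically to the configural model
## (see Wu & Estabrook, 2016), so it serves as the starting point.
syntax.thr <- measEq.syntax(configural.model = mod, data = dat,
                            ordered = items,
                            ID.cat = "Wu.Estabrook.2016",
                            group = "gender",
                            group.equal = "thresholds")
fit.thr <- cfa(as.character(syntax.thr), data = dat, ordered = items,
               group = "gender", estimator = "WLSMV")

## Next step: additionally constrain loadings (then intercepts, etc.)
syntax.load <- measEq.syntax(configural.model = mod, data = dat,
                             ordered = items,
                             ID.cat = "Wu.Estabrook.2016",
                             group = "gender",
                             group.equal = c("thresholds", "loadings"))
fit.load <- cfa(as.character(syntax.load), data = dat, ordered = items,
                group = "gender", estimator = "WLSMV")

## Compare the properly nested models
lavTestLRT(fit.thr, fit.load)
```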

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Luka Komidar

Jun 6, 2019, 7:49:28 AM
to lavaan
Thanks for your answer! I'll check the measEq.syntax help page.

Judging from your change in df between models, I'm guessing your models are not nested.

I've used the group.equal argument in lavaan's cfa() function, i.e. group.equal = c("loadings") for the metric model and group.equal = c("loadings", "thresholds") for the scalar model. If I understand correctly, one should not rely on this argument when using WLSMV on 3-category data? Btw, in Mplus I also used the shorthand syntax MODEL = CONFIGURAL METRIC SCALAR to obtain the results I pasted in the first post (again with WLSMV, with the items declared as categorical).
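Concretely, the calls looked like this ("mod", "dat", and "items" are placeholders for my model syntax, data frame, and vector of ordinal item names):

```r
library(lavaan)

## The sequence of models I compared
fit.conf   <- cfa(mod, data = dat, group = "gender", ordered = items,
                  estimator = "WLSMV")
fit.metric <- cfa(mod, data = dat, group = "gender", ordered = items,
                  estimator = "WLSMV", group.equal = "loadings")
fit.scalar <- cfa(mod, data = dat, group = "gender", ordered = items,
                  estimator = "WLSMV",
                  group.equal = c("loadings", "thresholds"))
```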

Terrence Jorgensen

Jun 6, 2019, 8:08:35 AM
to lavaan
in Mplus, I've also used the shortened syntax MODEL = CONFIGURAL METRIC SCALAR to obtain the results I've pasted in the first post (and also used WLSMV on items defined as categorical).

Yes, your results are consistent with how they recommend testing invariance, which is why that sequence of models is so popular. Even they recommend simultaneously constraining loadings and thresholds, though, and Mplus doesn't even provide the option of freeing intercepts of the latent item responses. The Wu & Estabrook article is the first to thoroughly illuminate a very confusing set of identification issues, about which different programmers (Mplus, LISREL) and other methodologists (Roger Millsap) have provided radically different, even contradictory, advice, all of which rests on major assumptions that users remain unaware of. I designed that function to simplify the issues a bit (well, at least automate the complex choices). Ultimately, it should provide a less restrictive set of tests than any other standard advice, but there is yet to be a simulation demonstrating that. I hope it helps in your case. If you do follow Wu & Estabrook's advice (using the default arguments), be prepared to cite their choices and defend them to a reviewer who thinks Muthén can do no wrong ;-)

Luka Komidar

Jun 7, 2019, 4:12:25 AM
to lavaan
Thanks for a very illuminating answer, I've already started reading the Wu & Estabrook paper!

cheers,
Luka