categorical variables: lavaan and Mplus

520 views
Skip to first unread message

char86

unread,
Aug 2, 2019, 12:11:27 PM8/2/19
to lavaan

Hi, 

 

I have a dataset containing ordinal variables, which was shared with me when I attended on Mplus course recently. I know what the answers should be in Mplus, but I cannot get the same result using lavaan. 

 

The ordinal variables are: TEAMWORK1, TEAMWORK2, SOCSUP1, SOCSUP2, JOBSAT1 AND JOBSAT2. 

 

I specify the model as follows:


HW.model <- "
  # create endogenous variables
  tmwork =~ 1*TEAMWORK1 + TEAMWORK2
  socsup =~ 1*SOCSUP1 + SOCSUP2
  jobsat =~ 1*JOBSAT1 + JOBSAT2
  # residual variances observed variables
  TEAMWORK1 ~~ TEAMWORK1
  TEAMWORK2 ~~ TEAMWORK2
  SOCSUP1 ~~ SOCSUP1
  SOCSUP2 ~~ SOCSUP2
  JOBSAT1 ~~ JOBSAT1
  JOBSAT2 ~~ JOBSAT2
  # factor covariances
  tmwork ~~ tmwork
  tmwork ~~ socsup
  tmwork ~~ jobsat
  socsup ~~ socsup
  socsup ~~ jobsat
  jobsat ~~ jobsat
"


If I make the following call, then I can get the same results in Mplus (in both cases, a underlying continuous distribution would be assumed):


fit <- lavaan(HW.model, data = healthworkers)
summary
(fit, fit.measures = TRUE)


However, if I change the fit specification to (with the aim of correctly specifying the variables as being categorical):


healthworkers[,c("TEAMWORK1", "TEAMWORK2", "SOCSUP1", "SOCSUP2", "JOBSAT1", "JOBSAT2")] <-
  lapply
(healthworkers[,c("TEAMWORK1", "TEAMWORK2", "SOCSUP1", "SOCSUP2", "JOBSAT1", "JOBSAT2")], ordered)

fit
<- lavaan(HW.model, data = healthworkers,
              parameterization
= "theta",
              ordered
= c("TEAMWORK1", "TEAMWORK2",
                         
"SOCSUP1", "SOCSUP2",
                         
"JOBSAT1", "JOBSAT2"))


I get:

  Estimator                                       DWLS      Robust

 
Model Fit Test Statistic                       7.311      17.038
 
Degrees of freedom                                28          28
  P
-value (Chi-square)                           1.000       0.948
 
Scaling correction factor                                  1.415
 
Shift parameter                                           11.870


This contrast with the Mplus output which is:


Chi-Square Test of Model Fit

         
Value                             19.211*
         
Degrees of Freedom                     6
          P
-Value                           0.0038
*   The chi-square value for MLM, MLMV, MLR, ULSMV, WLSM and WLSMV cannot be used
   
for chi-square difference testing in the regular way.  MLM, MLR and WLSM
    chi
-square difference testing is described on the Mplus website.  MLMV, WLSMV,
   
and ULSMV difference testing is done using the DIFFTEST option.


I do not understand what specifications I need for the fit in lavaan to make R do the same thing as Mplus. I inferred from a presentation by Yves Rosseel, which I found online, that setting the estimator argument in lavaan to:


estimator="WLSMV"

would prompt lavaan to analyse the data in the same way as Mplus, but it does not make a difference: I still get diverging results in Mplus and lavaan. 


Any ideas or insights on this issue would be most welcome. I have also attached the original SPSS file for the data.


Many thanks, 


healthworkersfree.dat

Emmanuel W

unread,
Aug 2, 2019, 12:34:51 PM8/2/19
to lavaan
I think that your variables are not categorical when you use Mplus (see Degrees of freedom).

char86

unread,
Aug 5, 2019, 5:51:56 AM8/5/19
to lavaan
Thanks for your contribution, Emmanuel. I use the command:
CATEGORICAL =
      SOCSUP1 SOCSUP2 TMWORK1 TMWORK2 JOBSAT1 JOBSAT2
;
under
 VARIABLE:
in Mplus. So I think they are. I also do not understand why Mplus and lavaan report vastly different degrees of freedom. 

Emmanuel W

unread,
Aug 5, 2019, 7:44:55 AM8/5/19
to lavaan
I don't use MPlus but according to this link (http://web.pdx.edu/~newsomj/semclass/ho_categorical.pdf), degrees of freedom should be the same. 

Nickname

unread,
Aug 6, 2019, 9:03:51 AM8/6/19
to lavaan
Char86,
Here is a strategy that you can follow to solve your problem:

1. Compare the list of estimated parameters to confirm that they are the same for both the Lavaan and Mplus output.  If you find a difference, then you have not fit the same model using each program.  If Lavaan reports threshold parameters and Mplus does not, then you have not successfully instructed Mplus to treat the variables as categorical.

2. If the problem is with treating variables as categorical, then run an example from the software documentation to confirm that you obtain thresholds.  If so, then compare your script to the sample script to determine why the example produces the intended result but your script does not.  (You can further check the diagnosis of the problem by running the model for continuous variables in Lavaan and confirming that the output matches the current Mplus output.)

Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/

char86

unread,
Aug 22, 2019, 9:40:03 AM8/22/19
to lavaan

Keith,

 

Thank you for your suggestions. I am confident that the problem arises because the variables are not being treated as categorical rather than model specification. When I pretend the variables are continuous both Mplus and lavaan outputs agree. 

 

Having a sample R script for some example categorical data that is known to work would be ideal. I have searched around for a good example, but I have not found good one. Did you have a particular example and script in mind?


A side note (but related): I do not understand why categorical exogenous and endogenous variables need to be treated differently (see: http://lavaan.ugent.be/tutorial/cat.html). 

 

My variables TEAMWORK1, TEAMWORK2 etc are all exogenous variables, but if all data is assumed to be continuous these would need to be converted to categorical variables regardless of whether they are exogenous or endogenous. What am I missing? I have read the page for categorical variables on the lavaan site multiple times but I still do not get it.

 

Charlie 

Terrence Jorgensen

unread,
Aug 23, 2019, 5:10:10 AM8/23/19
to lavaan

 If I make the following call, then I can get the same results in Mplus (in both cases, a underlying continuous distribution would be assumed):


fit <- lavaan(HW.model, data = healthworkers)


No, this does not assume an underlying continuous distribution, unless your variables are already stored as class c("ordered","factor") in your data.frame

class(healthworkers$TEAMWORK1)

If not, then lavaan will assume the observed (not latent) distribution is continuous.  To make the assumption of a latent normal item response, you need to use the ordered= argument to name any endogenous variables that are stored as class "numeric".

I am confident that the problem arises because the variables are not being treated as categorical rather than model specification. When I pretend the variables are continuous both Mplus and lavaan outputs agree. 


lavaan does a great job converging on the same solution as Mplus when data are (treated as) continuous IN BOTH PROGRAMS.  It is harder to reproduce Mplus results when data are declared ordinal IN BOTH PROGRAMS, but the results I've seen typically only diverge in the third decimal place.

Having a sample R script for some example categorical data that is known to work would be ideal.


This script reproduces the Mplus User Guide example 5.16 found online:  https://www.statmodel.com/usersguide/chap5/ex5.16.html

myData <- read.table("http://www.statmodel.com/usersguide/chap5/ex5.16.dat")
names
(myData) <- c("u1","u2","u3","u4","u5","u6","x1","x2","x3","g")


mod5
.16 <- ' # measurement invariance, except for u3
  f1 =~ u1 + u2 + c(l3, l3b)*u3
  f2 =~ u4 + u5 + u6
# mimic
  f1 + f2 ~ x1 + x2 + x3
# equal thresholds, but free u3|1 in second group
  u3 | c(u3, u3b)*t1
# fix scale of u3* to 1 in second group
  u3 ~*~ c(1, 1)*u3
'

fit5
.16 <- cfa(mod5.16, data = myData, ordered = paste0("u", 1:6),
           
# mimic = "Mplus",   # optional, still not exactly identical
           
group = "g", group.equal = c("loadings","thresholds"))
summary
(fit, fit.measures = TRUE, rsquare = TRUE)


 

A side note (but related): I do not understand why categorical exogenous and endogenous variables need to be treated differently (see: http://lavaan.ugent.be/tutorial/cat.html). 


Because exogenous means we don't make assumptions about its distribution.  

If you assume there is a normal latent response underlying a single observed ordered variable, and you want to estimate the effect of that latent response (as a predictor) on another outcome, then the latent response is endogenous (i.e., you are modeling it, rather than taking it as it is), then you can model the latent response by defining it as a latent variable with the ordinal variable as the only indicator.  Using cfa() should automate sensible default identification restrictions, but I would still include them in your script: 
  • fix the loading to 1 
  • fix the residual variance to 0. For this, you need to use parameterization="theta")
  • estimate only the latent variance

If you want to treat a single ordinal variable as exogenous AND assume its effect on any outcome is linear (in the sense that any 1-category increase leads to the same expected change in an outcome, so only one slope is necessary to estimate), then just treat the observed distribution as continuous.  If you do not feel safe making the linearity assumption (e.g., if there is no reason to think categories are "equally spaced" in the sense that any 1-category increase means the same thing as any other 1-category increase), then make dummy codes to allow slopes to differ for different 1-category increases.

 

My variables TEAMWORK1, TEAMWORK2 etc are all exogenous variables


No they aren't.  They are indicators of common factors, which makes then endogenous by definition.  Indicators are outcomes, explained by their common predictor (the common factor / latent construct).

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

char86

unread,
Aug 23, 2019, 12:05:46 PM8/23/19
to lavaan
Thank you for the correction and clarification. By using the following syntax:

targetColumns <- c("TEAMWORK1","TEAMWORK2","SOCSUP1","SOCSUP2","JOBSAT1","JOBSAT2")


HW
.modelB <- "
  # measure endogenous variables

  tmwork =~ 1*TEAMWORK1 + TEAMWORK2
  socsup =~ 1*SOCSUP1 + SOCSUP2
  jobsat =~ 1*JOBSAT1 + JOBSAT2
"

fit
<- cfa(HW.modelB,
           data
= healthworkers,
           ordered
= targetColumns,
           parameterization
= "theta")


summary
(fit, fit.measures = TRUE)

I get a chi square result that is much closer to the Mplus result:

Estimator                                       DWLS      Robust
 
Model Fit Test Statistic                       7.311      19.178
 
Degrees of freedom                                 6           6

But I want to use the lavaan() function which requires explicit model definition. If I use lavaan(), then I do not get the same result:

HW.modelA <- "
  # measure endogenous variables

  tmwork =~ 1*TEAMWORK1 + TEAMWORK2
  socsup =~ 1*SOCSUP1 + SOCSUP2
  jobsat =~ 1*JOBSAT1 + JOBSAT2
  # residual (co)variances observed variables

  TEAMWORK1 ~~ TEAMWORK1
  TEAMWORK2 ~~ TEAMWORK2
  SOCSUP1 ~~ SOCSUP1
  SOCSUP2 ~~ SOCSUP2
  JOBSAT1 ~~ JOBSAT1
  JOBSAT2 ~~ JOBSAT2
  # factor (co)variances
  tmwork ~~ tmwork + socsup + jobsat
  socsup ~~ socsup + jobsat
  jobsat ~~ jobsat
"

healthworkers_copy
<- healthworkers
fit
<- lavaan(HW.modelA,
           data
= healthworkers_copy,
           ordered
= targetColumns,
           parameterization
= "theta")


summary
(fit, fit.measures = TRUE)


Estimator                                       DWLS      Robust
 
Model Fit Test Statistic                       7.311      17.038
 
Degrees of freedom                                28          28
 

fixing the residual variances to 0 only causes the result to diverge further. 
How can the model specification for the lavaan function be changed to make the outputs similar if not the same?

Thank you again for your help!

Nickname

unread,
Aug 24, 2019, 9:53:04 AM8/24/19
to lavaan
Char86,
  You forgot the thresholds in your lavaan specification.  To see this, try the following with each fit (you might want to give them different names so that they do not overwrite one another).  In one case you will see positive numbers indicating free parameters and in the other case you will see zeros indicating fixed parameters.

lavInspect(fit, what='free')

  If you add something like the following to your second model specification, the df should match.

# Thresholds
TEAMWORK1
| t1
TEAMWORK2
| t1
SOCSUP1
| t1
SOCSUP2
| t1
JOBSAT1
| t1
JOBSAT2
| t1

  However, you need to adjust the number of thresholds to be consistent with your data (I used binary data).
See the model.syntax help file for more details.

Terrence Jorgensen

unread,
Aug 25, 2019, 8:40:50 AM8/25/19
to lavaan
I want to use the lavaan() function which requires explicit model definition

Then check the call that lavaan uses when you use the cfa() function, to see what arguments it sets by default in cfa() that are turned off by default using lavaan().

fit <- cfa(...)
as.call(lavInspect(fit, "call"))

That is how it automatically frees threshold estimates, as Keith pointed out that you omitted.  Also, you freely estimate the residual variances, which were only identified because your thresholds were fixed to zero.  When you estimate your thresholds, you will need to specify the residual variances are fixed to 1 for identification.

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

char86

unread,
Sep 27, 2019, 11:10:15 AM9/27/19
to lavaan
Thank you both Keith and Terrence. This is already helping me a lot. I do not have much information on the thresholds you are referring to. In which articles could I find out more about these thresholds?

Nickname

unread,
Sep 28, 2019, 8:42:00 AM9/28/19
to lavaan
char86,
  One option:

Edwards, M. C., Wirth, R. J., Houts, C. R. & Nue, X. (2012).  Categorical data in the structural equation modeling framework.  In R. H. Hoyle (Ed.), Handbook of Structural Equation Modeling (pp. 195-208).  New York: Guilford.

char86

unread,
Oct 2, 2019, 10:10:31 AM10/2/19
to lavaan
Thanks for the suggestion, Keith. I read the chapter attentively and although it was generally relevant and interesting, I could not find anything to elaborate on what these thresholds are or how I might specify them in lavaan (while at the same time understand what I am doing). 

I will continue to search and read more generally about sem. If I find what I am looking for I will share it here. In the meantime, if anyone finds a source that might be useful please post it here!

Terrence Jorgensen

unread,
Oct 2, 2019, 11:41:51 AM10/2/19
to lavaan
Thanks for the suggestion, Keith. I read the chapter attentively and although it was generally relevant and interesting, I could not find anything to elaborate on what these thresholds are

They are explicitly defined on pp. 198-199.  See also https://doi.org/10.1080/10705510701758406

or how I might specify them in lavaan

I don't know if you are testing invariance, but it might still help to walk through the help-page examples of semTools::measEq.syntax() to see how to write syntax for models with thresholds.  Also see my lecture slides:  https://www.uva.nl/binaries/content/documents/personalpages/j/o/t.d.jorgensen/en/tab-three/tab-three/cpitem%5B2%5D/asset?1556027804259

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

char86

unread,
Jun 23, 2020, 1:44:15 PM6/23/20
to lavaan
Thank you for the continued guidance. I made some progress reproducing the models you specify in your slides and I managed to use the lavaan() function with the data set I refer to above, which then produced an output that is closer to one produced by Mplus. The main issue was that I was not specifying constraints for the mean and variance of the first order latent variables, so thresholds could not be estimated.

How can this approach be applied if the data is non-normal? The results I have shared so far is just practice data for me. The real data I am interested in analysing is actually very non-normal. In this case, is it appropriate to apply the constraints mean = 0 and var = 1

Terrence Jorgensen

unread,
Jun 25, 2020, 4:45:16 AM6/25/20
to lavaan
How can this approach be applied if the data is non-normal?

Categorical data are already not normal.  Do you mean the underlying latent item-response (LIR) is not normal?  You can generate a continuous LIR using any distribution you want (e.g., runif), but once data are discretized, lavaan could not know anything about the true distribution of the LIR.  It can only assume LIRs are normal.  If you are savvy enough to program your own likelihood functions, you could use OpenMx to specify alternative distributions of the LIR.
 
The real data I am interested in analysing is actually very non-normal. In this case, is it appropriate to apply the constraints mean = 0 and var = 1

All continuous distributions have means and variances, and the same identification conditions would apply.  But again, lavaan is only capable of assuming normality of LIRs.

char86

unread,
Jun 25, 2020, 5:20:39 AM6/25/20
to lavaan
Well, the categorical data could be suggesting that the underlying latent item-response is non-normal. Either that or the LIR is normal but the participants give skewed responses. I am not sure which way to look at it.

Terrence Jorgensen

unread,
Jun 26, 2020, 5:46:55 PM6/26/20
to lavaan
Well, the categorical data could be suggesting that the underlying latent item-response is non-normal. Either that or the LIR is normal but the participants give skewed responses. I am not sure which way to look at it.

Unfortunately, that is the point.  There is no empirical way to distinguish between a normal LIR with asymmetric thresholds and a nonnormal LIR with symmetric thresholds.
Reply all
Reply to author
Forward
0 new messages