Discrepancy between values set when generating data and values obtained in subsequent analyses

73 views
Skip to first unread message

gaia...@usal.es

unread,
Apr 5, 2019, 4:41:11 AM4/5/19
to lavaan
Dear users,
Im using lavaan package to create a dataset for the simplest mediation possible (1 exogenous variable, 1 mediator, 1 dependent varible, DV; NO latent variable). As a reference I take the values and structure of Figure 3.5 of Grace's 2006 book. Then, the data generated have been analyzed under the same lavaan and under MPlus. 

I simulated 3 datasets where: 1) all variables were continuous and normally distributed; 2) Mediator is a binary variable and; 3) reponse vble is a binary variable.
Results:
1) Simplest mediation with only continuous variables: There is no discrepancy between the values set when generating data and values obtained when analysed in MPlus and lavaan
2) Simplest mediation with binary mediator It doesn’t discrepancy between set and abotained values both in lavaan and MPlus
3) Simplest mediation with binary DV Data generated in lavaan works in MPlus, not in lavaan. In this case, it can be that the differences between lavaan and MPlus can be on the estimator. Lavaan cannot work with MLR or estimators for binary variables

Copy below the commands for the 3 cases. Sorry for the length!

1.       Simplest mediation with all continuousvariablez It works

#Generated data (I use the example from Grace 2006, p. 46, Fig. 3.5)

N <- 10000

x <- rnorm(N)

y1 <- 0.55*x + rnorm(N)

y2 <- 0.24*x + 0.47*y1 + rnorm(N)

LinearMedData <- data.frame(x, y1,y2)

write.xlsx(LinearMedData, "C:/Users/gaiarrido/Desktop/LinearMedData.xlsx") # write sheet

 

#Data analysis

LinearMedMod <- 'y1~a*x

y2~b*x+ c*y1

'

fitLinearMedMod <- sem(LinearMedMod, data = LinearMedData)

summary(fitLinearMedMod)

Regressions:

                   Estimate  Std.Err  z-value  P(>|z|)

  y1 ~                                               

    x          (a)    0.548    0.010   55.612    0.000

  y2 ~                                                

    x          (b)    0.244    0.011   21.539    0.000

    y1         (c)    0.456    0.010   45.379    0.000

 

Variances:    Estimate  Std.Err  z-value  P(>|z|)

   .y1                0.981    0.014   70.711    0.000

   .y2                0.990    0.014   70.711    0.000

 

same values in MPlus (LinearMedData.csv is the data generated in lavaan)

DATA: FILE IS LinearMedData.csv ;

VARIABLE:        NAMES sort x y1 y2;

                 USEVARIABLES x y1 y2;

                

ANALYSIS: ESTIMATOR = ML;

MODEL:    y1 on x;

          y2 on y1;

          y2 on x;

OUTPUT: SAMPSTAT

        STDYX

        CINTERVAL

        TECH3;


Results 

                    Estimate       S.E.  Est./S.E.    P-Value

 Y1       ON

    X                  0.548      0.010     55.611      0.000

 Y2       ON

    Y1                 0.456      0.010     45.378      0.000

    X                  0.244      0.011     21.542      0.000

 Intercepts

    Y1                 0.002      0.010      0.230      0.818

    Y2                -0.011      0.010     -1.080      0.280

 Residual Variances

    Y1                 0.981      0.014     70.709      0.000

    Y2                 0.990      0.014     70.710      0.000

 

2.       Simplest mediation with binary mediator 

#Generated data

N <- 10000

x <- rnorm(N)

y1dummy <- inv.logit(0.55* x) > runif(N)

#I have to do a trick to create a dummy variable depending on other when generating dataset cause lavaan cannot work still with dummy, binary or categorical variables. In words of Terrence: It will be probit by default (logit unavailable) if you declare the outcome as categorical using the ordered= argument

y2 <- 0.24*x + 0.47*y1 + rnorm(N)

LogitMedData <- data.frame(x, y1dummy,y2)

write.xlsx(LogitMedData, "C:/Users/gaiarrido/Desktop/LogitMedData.xlsx") # write sheet

 

#Data analysis

LogitMedMod <- 'y1dummy~a*x

y2~b*x+ c*y1dummy

'

fitLogitMedMod <- sem(LogitMedMod, data = LogitMedData, estimator="WLSMV" , ordered="y1dummy")

summary(fitLogitMedMod)

Regressions:

                   Estimate  Std.Err  z-value  P(>|z|)

  y1dummy ~                                           

    x          (a)    0.345    0.013   25.990    0.000

  y2 ~                                               

    x          (b)    0.498    0.012   41.974    0.000

    y1dummy    (c)   -0.013    0.014   -0.940    0.347


Same two problems here

1)      I have to say that is ordered, not dummy cause lavaan cannot work still with dummy

2)      I cannot use estimator MLR

 

In MPlus

DATA: FILE IS LogitMedData.csv ;

VARIABLE:        NAMES sort x y1dummy y2;

                 USEVARIABLES x y1dummy y2;

                 CATEGORICAL ARE y1dummy;

ANALYSIS:   TYPE = random;

            ALGORITHM = INTEGRATION;

            ESTIMATOR = MLR;

MODEL:    y1dummy on x;

          y2 on y1dummy;

          y2 on x;

OUTPUT: SAMPSTAT

        STDYX

        CINTERVAL

        TECH3;


Results

                                     Estimate       S.E.  Est./S.E.    P-Value

 Y1DUMMY    ON

    X                                0.558          0.022     25.433      0.000

 Y2         ON

    Y1DUMMY              -0.020          0.023     -0.898      0.369

    X                                0.496          0.011     43.987      0.000

 

3.       Simplest mediation with binary DV

#Generated data

N <- 10000

x <- rnorm(N)

y1 <- 0.55*x + rnorm(N)

y2dummy <- inv.logit(0.24*x+0.47*y1)> runif(N) #In the forum told me that this is the good option

#I have to do a trick to create a dummy variable depending on other when generating dataset cause lavaan cannot work still with dummy, binary or categorical variables. In words of Terrence: It will be probit by default (logit unavailable) if you declare the outcome as categorical using the ordered= argument

LogitDvData <- data.frame(x, y1,y2dummy)

write.xlsx(LogitDvData, "C:/Users/gaiarrido/Desktop/LogitDvData.xlsx") # write sheet

 

#Data analysis

LogitDvMod <- 'y1~a*x

y2dummy~b*x+ c*y1

'

fitLogitDvMod <- sem(LogitDvMod, data = LogitDvData, estimator="WLSMV" , ordered="y2dummy")

summary(fitLogitDvMod)

Regressions:

                   Estimate  Std.Err  z-value  P(>|z|)

  y1 ~                                               

    x          (a)    0.548    0.010   55.807    0.000

  y2dummy ~                                           

    x          (b)    0.148    0.014   10.340    0.000

    y1         (c)    0.267    0.012   22.401    0.000

 

Same two problems here

3)      I have to say that is ordered, not dummy cause lavaan cannot work still with dummy

4)      I cannot use estimator MLR

 

In MPlus

DATA: FILE IS LogitDvData.csv ;

VARIABLE:        NAMES sort x y1 y2dummy;

                 USEVARIABLES x y1 y2dummy;

                 CATEGORICAL ARE y2dummy;

ANALYSIS:   TYPE = random;

            ALGORITHM = INTEGRATION;

            ESTIMATOR = MLR;

MODEL:    y1 on x;

          y2dummy on y1;

          y2dummy on x;

OUTPUT: SAMPSTAT

        STDYX

        CINTERVAL

        TECH3;

Results are good

                    Estimate       S.E.  Est./S.E.    P-Value

 Y1         ON

    X                  0.548      0.010     55.402      0.000

 Y2DUMMY    ON

    Y1                 0.452      0.022     20.323      0.000

    X                  0.251      0.024     10.338      0.000

Terrence Jorgensen

unread,
Apr 5, 2019, 5:59:40 AM4/5/19
to lavaan
FYI, I didn't see a single question in you entire post.

I have to do a trick to create a dummy variable depending on other when generating dataset cause lavaan cannot work still with dummy, binary or categorical variables.


Yes it can.  If the variables are binary or ordered, it will essentially replace linear regression with probit regression.  That is the same thing Mplus does by default, although Mplus also has the option to use a logit link if you want to use a much slower numerical algorithm for obtaining ML estimates.

In words of Terrence: It will be probit by default (logit unavailable) if you declare the outcome as categorical using the ordered= argument


Exactly.

1)      I have to say that is ordered, not dummy cause lavaan cannot work still with dummy


Yes it can, the same way that Mplus treats it.  I think you did not understand when I told you that binary data are equivalently described as nominal or ordinal.  Take sex for example (although not truly binary).  If you code it such that 1 = male, then it is ordered in the sense that 1 indicates more "male-sexed-ness" than 0.  If you code it such that 1 = female, then it is ordered in the sense that 1 indicates more "female-sexed-ness" than 0.  Because you do not use additional (nominal) categories, there is no practical distinction between treating these 2 categories as ordered or unordered.  

2)      I cannot use estimator MLR 


That is only a big advantage of using a logit link if you want to use FIML to handle missing data.  In lavaan, you can only set missing="pairwise" with the probit link, which assumes MCAR data rather than MAR.  Or you can use multiple imputation, which (like FIML) assumes MAR.  

But whether you have missing or complete data, using DWLS estimation with a probit link also comes with robust SEs (but unlike the logit link with MLR, the probit link also provides a robust test statistic and fit indices).  Logit and probit links provide notoriously similar results, so there is no substantive reason to prefer a logit except that it might be more familiar to you.

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
 

MARIO GARRIDO ESCUDERO

unread,
Apr 5, 2019, 12:35:18 PM4/5/19
to lav...@googlegroups.com
Dear Dr. Jorgensen,
thanks a lot for your reply and sorry. It seems that I copy the draft and not the last version I wanted to post. That's why no questions and the references to your words. My apologies.

My main question is why, when I include one binary variable, I dont get in the analyses stage the values I defined in data generation . As in the case 2 & 3. While in case 1, with only continuous variables, I see no discrepancy. I try to explain myself:

Regarding binary variables and the discreepancy between the generated data and the subsequent analyses for lavaan, while concordance in MPlus: 
1. I created a dataset with one dummy variable, is it well defined?

N <- 10000

x <- rnorm(N)

y1dummy <- inv.logit(0.55* x) > runif(N)

y2 <- 0.24*x + 0.47*y1dummy + rnorm(N)

LogitMedData <- data.frame(x, y1dummy,y2)    

2. I create a model where the dummy variable is the mediator

LogitMedMod <- 'y1dummy~a*x

y2~b*x+ c*y1dummy

'

fitLogitMedMod <- sem(LogitMedMod, data = LogitMedData, ordered="y1dummy")


3. I run the model in lavaan and the path coefficients are not the ones defined in step 1.
summary(fitLogitMedMod)    

Regressions:        Estimate  Std.Err  z-value  P(>|z|)

  y1dummy ~                                          

    x          (a)         0.355       0.013   26.719    0.000

  y2 ~                                               

    x          (b)         0.215       0.011   19.284    0.000

    y1dummy  (c)    0.275       0.013   21.205    0.000    

4. However, when I run the same model with the same data in MPlus, the path coefficients are the ones defined in step 1
QUESTION: are those differences attributable only to the use of different estimators (DWLS in lavaan and MLR in MPlus)?

In MPlus

DATA: FILE IS LogitMedData.csv ;

VARIABLE:        NAMES sort x y1dummy y2;

                 USEVARIABLES x y1dummy y2;

                 CATEGORICAL ARE y1dummy;

ANALYSIS:   TYPE = random;

            ALGORITHM = INTEGRATION;

            ESTIMATOR = MLR;

MODEL:    y1dummy on x;

          y2 on y1dummy;

          y2 on x;


Results 

                                    Estimate       S.E.    Est./S.E.    P-Value

 Y1DUMMY    ON

    X                                 0.576        0.022     25.973      0.000

 Y2         ON

       X                              0.253        0.010     24.416      0.000

 Y1DUMMY                   0.446        0.021     21.234      0.000    




--
You received this message because you are subscribed to the Google Groups "lavaan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lavaan+un...@googlegroups.com.
To post to this group, send email to lav...@googlegroups.com.
Visit this group at https://groups.google.com/group/lavaan.
For more options, visit https://groups.google.com/d/optout.

Mauricio Garnier-Villarreal

unread,
Apr 5, 2019, 4:47:13 PM4/5/19
to lavaan
Mario

The difference is the link function.
lavaan uses the probit link function, while Mplus is using the logit link function.

If you want to transform it you can multiple the probit slope for 1.6 to get a close estimate of the equivalent logit version https://statmodeling.stat.columbia.edu/2006/06/06/take_logit_coef/?fbclid=IwAR1B4npbUC4Ng7w4IU6CRBG02QH0YLh7kj_wZD9DjtIQggGrOoaHxxqY-d8

You can save the slopes in logit metric

a_logit := a*1.6
c_logit := c*1.6


Also, a clarification of what Mplus is doing. Mplus is NOT using MLR, it is using Marginal Maximum Likelihood (MML), this is part of the confusion caused by their shortcuts. When ordered variables are included, and MLR Mplus will use MML which is the ML estimator when using categorical variables, for example is the most common estimator with IRT
To unsubscribe from this group and stop receiving emails from it, send an email to lav...@googlegroups.com.

MARIO GARRIDO ESCUDERO

unread,
Apr 7, 2019, 7:51:33 AM4/7/19
to lav...@googlegroups.com
Thanks so much Mauricio. This is exactly what I needed to know. Now things are clear for me

To unsubscribe from this group and stop receiving emails from it, send an email to lavaan+un...@googlegroups.com.

Mauricio Garnier-Villarreal

unread,
Apr 7, 2019, 5:39:01 PM4/7/19
to lavaan
Mario

Actually, as Gelman stated in his post, the factor for the relation between logit and probit models is not fixed, but it varies. I did a further test, and the "ideal" ration between them vary in function of the absolute value of the slope. Where the ratio ranges from 1.6 to 1.87 (as you can see in the plot). Given my results I would suggest to use 1.7 as the factor to relate logit and probit regression slopes. As using this the average difference between the logistic slope and the transformed probit would be 0.02, while using the 1.6 the average difference would be 0.05.
I got these results by looking at the ratio beteween the slopes over 50000 replications treating the logistic slope as random from a Normal(mean=0.55, sd=2)

My suggestion:
a_logit := a*1.7
c_logit := c*1.7


Terrence Jorgensen

unread,
Apr 8, 2019, 7:26:54 AM4/8/19
to lavaan
My suggestion:
a_logit := a*1.7
c_logit := c*1.7

Or you could simply simulate binary data from the probit model:

y1dummy <- pnorm(0.55* x) > runif(N)

Reply all
Reply to author
Forward
0 new messages