GOF indices for a probit model with ordinal covariates in lavaan

107 views
Skip to first unread message

lemos.mari...@gmail.com

unread,
Jun 28, 2021, 12:20:43 PM6/28/21
to lavaan

I am trying to compute a probit regression model where all the covariates are categorical. Here are the model specification, declaration of ordered categorical variables, and call to "sem":

model <- 'CHG.UNAFFILIATED ~ CHG.UNBELIEVER + CONFIDENCE.CHURCHES + POWER.CHURCHES + REL.RELEVANCE.TODAY + REL.REPRESENTS.PAST + NUM.UNAFFILIATED.PARENTS + ATTEND.11.12 + SEX + AGE.GROUP + DEGREE + HHCHILDR + URBRURAL'

ordered <- c("CHG.UNAFFILIATED","CHG.UNBELIEVER","CONFIDENCE.CHURCHES","POWER.CHURCHES","REL.RELEVANCE.TODAY","REL.REPRESENTS.PAST","ATTEND.11.12","SEX","AGE.GROUP","DEGREE","URBRURAL","NUM.UNAFFILIATED.PARENTS","HHCHILDR")

> fit <- sem(model, data = df, ordered = ordered, std.lv = TRUE, estimator = "WLSMV")

Warning message:

In lav_data_full(data = data, group = group, cluster = cluster,  :

  lavaan WARNING: exogenous variable(s) declared as ordered in data: CHG.UNBELIEVER CONFIDENCE.CHURCHES POWER.CHURCHES REL.RELEVANCE.TODAY REL.REPRESENTS.PAST NUM.UNAFFILIATED.PARENTS ATTEND.11.12 SEX AGE.GROUP DEGREE HHCHILDR URBRURAL


When compared with the results of the “glm” function with probit link the estimates regression coefficients, standard errors and p-values look OK:

> summary(fit)

lavaan 0.6-8 ended normally after 45 iterations

 

  Estimator                                       DWLS

  Optimization method                           NLMINB

  Number of model parameters                        13

                                                     

                                                  Used       Total

  Number of observations                           928        1334

                                                                  

Model Test User Model:

                                              Standard      Robust

  Test Statistic                                 0.000       0.000

  Degrees of freedom                                 0           0

 

Parameter Estimates:

 

  Standard errors                           Robust.sem

  Information                                 Expected

  Information saturated (h1) model        Unstructured

 

Regressions:

                     Estimate  Std.Err  z-value  P(>|z|)

  CHG.UNAFFILIATED ~                                   

    CHG.UNBELIEVER      0.864    0.150    5.740    0.000

    CONFIDENCE.CHU      0.032    0.081    0.397    0.692

    POWER.CHURCHES      0.004    0.089    0.045    0.964

    REL.RELEVANCE.     -0.009    0.102   -0.090    0.928

    REL.REPRESENTS     -0.216    0.094   -2.296    0.022

    NUM.UNAFFILIAT      1.105    0.084   13.179    0.000

    ATTEND.11.12        0.255    0.056    4.539    0.000

    SEX                 0.135    0.117    1.154    0.248

    AGE.GROUP           0.456    0.105    4.340    0.000

    DEGREE             -0.264    0.115   -2.305    0.021

    HHCHILDR           -0.671    0.319   -2.106    0.035

    URBRURAL           -0.045    0.047   -0.940    0.347

 

Intercepts:

                   Estimate  Std.Err  z-value  P(>|z|)

   .CHG.UNAFFILIAT    0.000                          

 

Thresholds:

                   Estimate  Std.Err  z-value  P(>|z|)

    CHG.UNAFFILIAT    3.312    0.645    5.137    0.000

 

Variances:

                   Estimate  Std.Err  z-value  P(>|z|)

   .CHG.UNAFFILIAT    1.000                          

 

Scales y*:

                   Estimate  Std.Err  z-value  P(>|z|)

    CHG.UNAFFILIAT    1.000          

 However, the intercept and variance of the single response variable (CHG.UNAFFILIATED) are fixed to one and zero, respectively, and when I try to obtain the fit indices usually reported I obtain the following:

> round(inspect(fit, what = "fit")[indices],3)

chisq  df  pvalue cfi rmsea rmsea.ci.lower rmsea.ci.upper  srmr

0      1   1      1   0     0              0               0

 

How can I specify the model so that it provides measures of the overall model misfit like the Chisq and RMSEA in this situation?

Thank you very much for your attention.

Sincerely,

 

Carlos M. Lemos

Pat Malone

unread,
Jun 28, 2021, 1:25:11 PM6/28/21
to lav...@googlegroups.com
Carlos,

Two things.

1) In SEM, exogenous variables are treated as interval, even if they are not continuous. That is why lavaan rejects declaring them as categorical. This is generally fine for continuous and binary variables, but not for multi-category discrete variables, whether ordinal or nominal. For those, you need to create dummy or effect codes.

2) For your other question, your model is saturated and necessarily fits perfectly. So you are getting fit measures, but they look odd because there are no degrees of freedom that would you give you basis to reject the model.

As a side note, regression procedures outside SEM almost always use saturated models. Since your model is a univariate multiple regression, your results should be parallel to those of glm (aside from differences introduced by the estimator or missing data handling).

Pat

--
You received this message because you are subscribed to the Google Groups "lavaan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lavaan+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lavaan/5159863c-c387-4c55-8727-1576010666f0n%40googlegroups.com.


--
Patrick S. Malone, PhD
Sr Research Statistician, FAR HARBΦR
This message may contain confidential information; if you are not the intended recipient please notify the sender and delete the message.

Carlos Lemos

unread,
Jun 29, 2021, 5:23:26 AM6/29/21
to lav...@googlegroups.com

Dear Pat,

 

Thank you very much for your attention and prompt reply.

About 1), I thought that “lavaan” was handling the ordinal covariates correctly because the program issued a warning and not an error message.

About 2), I noticed that the model is saturated, but my problem is precisely on how to overcome this situation (of course, the model is not predicting the response exactly). With “glm”, the same model is not saturated, and I get the deviance D, from which I can derive other measures of model misfit. In the “lavaan” model, the variance and intercept of the response variable are both fixed. “lavaan” works from the variance-covariance matrix, which is different from “glm”.

I want to compare this basic probit regression model with both "glm" and one or two SEM based on the same variables, to see whether the SEMs provide evidence on the plausibility of the causal hypotheses embodied in them. For this, I need a lavaan probit model that yields measures of overall misfit.

I will try your suggestion of generating dummy variables for the ordinal covariates, testing first with one or two variables, and see if this change leads to a model that is not saturated.

Once again, many thanks for your time and attention,

 

Carlos M. Lemos


Pat Malone

unread,
Jun 29, 2021, 8:40:20 AM6/29/21
to lav...@googlegroups.com
Carlos,

Since your lavaan model as specified is a generalized linear multiple regression, the only way to introduce misfit would be to constrain one or more of the regression coefficients. This seems pretty arbitrary.

And getting a deviance statistic from glm() does not mean the model is not saturated. You didn't share your glm() function call, but unless you went out of your way (such as my previous paragraph), it is saturated. If you ask for fit measures from lavaan, you'll also see -2LL, AIC, etc., even from a saturated model.

Pat

Carlos Lemos

unread,
Jun 29, 2021, 1:19:37 PM6/29/21
to lav...@googlegroups.com

Dear Pat,

Thank you very much for your time and patience in hanging on and helping me! 

Here is the call to “glm” (I used probit link so that the results could be more closely compared with those by “lavaan”):

> fit <- glm(model, family = binomial(link = "probit"), data = na.omit(df))

 

The formula is as follows, with the same data frame and variables:

> fit$formula

[1] "CHG.UNAFFILIATED  ~  CHG.UNBELIEVER + CONFIDENCE.CHURCHES + POWER.CHURCHES + REL.RELEVANCE.TODAY + REL.REPRESENTS.PAST + NUM.UNAFFILIATED.PARENTS + ATTEND.11.12 + SEX + AGE.GROUP + DEGREE + HHCHILDR + URBRURAL"

 

The model computed by “glm” has 28 parameters

> fit$rank

[1] 28

and is not saturated because

> fit$df.residual

[1] 703

i.e. there are a lot more data points than coefficients (parameters) to be estimated. With this, I can get several GOF indices, for example:

> with(fit,c(deviance,aic))

[1] 559.2648 615.2648

 

Returning to “lavaan”, I inspected the solution (also named "fit") in my first post:

> fitted(fit)

$res.cov

                 CHG.UN

CHG.UNAFFILIATED 1    

 

$res.int

CHG.UNAFFILIATED

               0

 

$res.th

CHG.UNAFFILIATED|t1

              3.312

 

$res.slopes

                 CHG.UN CONFID POWER. REL.REL REL.REP NUM.UN ATTEND   SEX AGE.GR DEGREE HHCHIL URBRUR

CHG.UNAFFILIATED  0.864  0.032  0.004  -0.009  -0.216  1.105  0.255 0.135  0.456 -0.264 -0.671 -0.045

 

$cov.x

                         CHG.UN CONFID POWER. REL.REL REL.REP NUM.UN ATTEND SEX    AGE.GR DEGREE HHCHIL URBRUR

CHG.UNBELIEVER            0.188                                                                              

CONFIDENCE.CHURCHES      -0.008  0.656                                                                       

POWER.CHURCHES            0.011  0.287  0.570                                                                

REL.RELEVANCE.TODAY       0.012  0.058  0.033  0.369                                                         

REL.REPRESENTS.PAST       0.006  0.150  0.129  0.132   0.442                                                 

NUM.UNAFFILIATED.PARENTS  0.058  0.004 -0.002 -0.017  -0.018   0.615                                         

ATTEND.11.12              0.115 -0.103 -0.035 -0.063  -0.091   0.184  0.935                                  

SEX                       0.013 -0.008 -0.019 -0.007  -0.027  -0.013  0.015  0.234                            

AGE.GROUP                 0.009 -0.031 -0.020 -0.032  -0.033   0.060  0.017 -0.011  0.282                    

DEGREE                   -0.008 -0.065 -0.024  0.007  -0.008  -0.018  0.003  0.015  0.018  0.240             

HHCHILDR                  0.002  0.001  0.001 -0.002   0.000   0.003  0.009 -0.001 -0.002 -0.001  0.024      

URBRURAL                  0.012 -0.069 -0.025  0.074   0.025  -0.085 -0.012  0.001 -0.075  0.090 -0.010  1.268

 

$mean.x

          CHG.UNBELIEVER      CONFIDENCE.CHURCHES           POWER.CHURCHES      REL.RELEVANCE.TODAY      REL.REPRESENTS.PAST NUM.UNAFFILIATED.PARENTS

                   1.250                    2.087                    1.909                    2.722                    2.571                    2.505

            ATTEND.11.12                      SEX                AGE.GROUP                   DEGREE                 HHCHILDR                 URBRURAL

                   2.078                    1.371                    2.097                    1.602                    1.025                    2.724

 

and found that the only free parameters are the threshold of the response variable and the regression coefficients. Using inspect(fit, what = “free”), I found that the whole variance-covariance is fixed. I guess that is the reason why lavaan estimates the regression coefficients, but cannot compute indices like chisq, rmsea, etc. in this situation. I will see if I can free for example the variance of the response variable, or other parameters, in a meaningful way.

Once again, many thanks for your time and help,

Carlos M. Lemos

 

 

 

 

 


Pat Malone

unread,
Jun 29, 2021, 1:32:44 PM6/29/21
to lav...@googlegroups.com
You're welcome.

And if it helps, df means something completely different in the two analyses. df in SEM packages doesn't depend on N. And as I said, you can get things like AIC from a saturated model in lavaan by asking for fit measures. D and AIC are not measures of absolute fit--you can't reject a model with them. They're only useful for relative fit.

Pat

Reply all
Reply to author
Forward
0 new messages