Degrees of freedom and fit measures with simple regressions in lavaan?


AM

May 6, 2023, 3:41:22 PM
to lavaan
Hi,

As part of a bigger project, I am using some simple regressions in lavaan, and I am confused by the fit measures that lavaan outputs. Try this very simple model:

sem("mpg ~ cyl", mtcars)

The output says:

lavaan 0.6.15 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         2

  Number of observations                            32

Model Test User Model:

  Test statistic                                 0.000
  Degrees of freedom                                 0

I am trying to understand why the chi-square, as well as the degrees of freedom, is 0. I want to obtain the equivalent of the F-test from simple regression, which I think should be the chi-square divided by the degrees of freedom. However, that does not seem to be possible here. Why?

Edward Rigdon

May 6, 2023, 4:57:19 PM
to lav...@googlegroups.com
The regression F test is a test of variance explained in the outcome variable. The CFA / SEM chi-square test is a test of consistency between the empirical covariance matrix and the model-implied covariance matrix, across all variables in the model. The two tests have almost nothing in common. CFA / SEM degrees of freedom for a simple regression are almost always 0, because such models are almost always saturated. There are 3 nonredundant elements in the covariance matrix, but lavaan here excludes 1 by default, because it treats the predictor as "fixed" and exogenous, as opposed to random and endogenous. So, with only two elements being fitted and two free parameters in the model, degrees of freedom = 2 - 2 = 0.
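A minimal sketch of that counting, using the model from the original question (parTable() and fitMeasures() are standard lavaan helpers):

library(lavaan)

# Refit the model from the original question
fit <- sem("mpg ~ cyl", mtcars)

# Rows with free > 0 are estimated: the slope and the residual variance
# of mpg. The variance of cyl appears in the table but is fixed, because
# fixed.x = TRUE by default (the predictor is treated as fixed).
parTable(fit)

# Two fitted moments minus two free parameters: df = 0
fitMeasures(fit, c("npar", "df"))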


Shu Fai Cheung

May 7, 2023, 10:21:27 AM
to lav...@googlegroups.com
Ed's explanation is very clear. I would like to explore another perspective, a model comparison one. Please correct me if I got anything wrong.

First, I will use a dataset with two predictors and one outcome variable for the discussion. I set the sample size to 10000 just to make the values being compared more similar, as shown later. The population regression coefficients are zero, so the p-values are far from zero even with such a large sample.
library(lavaan)
#> This is lavaan 0.6-15
#> lavaan is FREE software! Please report any bugs.
n <- 10000
set.seed(85471)
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rnorm(n)
dat <- data.frame(x1, x2, y)
This is the usual OLS regression result, using x1 and x2 to predict y:
lm_out <- lm(y ~ x1 + x2, dat)
summary(lm_out)
#> 
#> Call:
#> lm(formula = y ~ x1 + x2, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.1187 -0.6718 -0.0015  0.6708  3.9153 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.013791   0.009957  -1.385    0.166
#> x1          -0.005108   0.009907  -0.516    0.606
#> x2           0.005309   0.009915   0.535    0.592
#> 
#> Residual standard error: 0.9956 on 9997 degrees of freedom
#> Multiple R-squared:  5.516e-05,  Adjusted R-squared:  -0.0001449 
#> F-statistic: 0.2757 on 2 and 9997 DF,  p-value: 0.759
In addition to interpreting the test of the R-square as a test of the variance explained (the hypothesis that the population R-square is zero), we can also treat it as a test comparing two models: one with all predictors present, and one with the intercept only, fitted by y ~ 1 in lm(). The intercept-only model is equivalent to a model with the two regression coefficients fixed to zero (and so x1 and x2 are, so to speak, "not in the model"):
lm_out0 <- lm(y ~ 1, dat)
summary(lm_out0)
#> 
#> Call:
#> lm(formula = y ~ 1, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.1224 -0.6712 -0.0017  0.6733  3.9284 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.013856   0.009955  -1.392    0.164
#> 
#> Residual standard error: 0.9955 on 9999 degrees of freedom

anova(lm_out0, lm_out)
#> Analysis of Variance Table
#> 
#> Model 1: y ~ 1
#> Model 2: y ~ x1 + x2
#>   Res.Df    RSS Df Sum of Sq      F Pr(>F)
#> 1   9999 9910.2                           
#> 2   9997 9909.6  2   0.54668 0.2757  0.759
The F statistic and p-value in the ANOVA test above are identical to the F statistic and p-value of the R-square in the first model, with x1 and x2 included.

In SEM, a conceptually (though not numerically) equivalent test is also a model comparison test. First, this is the model using x1 and x2 to predict y, which is saturated, with df = 0:
fit <- sem("y ~ x1 + x2", dat)
summary(fit)
#> lavaan 0.6.15 ended normally after 6 iterations
#> 
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of model parameters                         3
#> 
#>   Number of observations                         10000
#> 
#> Model Test User Model:
#>                                                       
#>   Test statistic                                 0.000
#>   Degrees of freedom                                 0
#> 
#> Parameter Estimates:
#> 
#>   Standard errors                             Standard
#>   Information                                 Expected
#>   Information saturated (h1) model          Structured
#> 
#> Regressions:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>   y ~                                                 
#>     x1               -0.005    0.010   -0.516    0.606
#>     x2                0.005    0.010    0.536    0.592
#> 
#> Variances:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>    .y                 0.991    0.014   70.711    0.000
For this covariance matrix with three variables, we can fit a model "without predictors" or, more precisely, a model with the coefficients of all predictors fixed to zero:
fit0 <- sem("y ~ 0 * x1 + 0 * x2", dat)
summary(fit0)
#> lavaan 0.6.15 ended normally after 7 iterations
#> 
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of model parameters                         1
#> 
#>   Number of observations                         10000
#> 
#> Model Test User Model:
#>                                                       
#>   Test statistic                                 0.552
#>   Degrees of freedom                                 2
#>   P-value (Chi-square)                           0.759
#> 
#> Parameter Estimates:
#> 
#>   Standard errors                             Standard
#>   Information                                 Expected
#>   Information saturated (h1) model          Structured
#> 
#> Regressions:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>   y ~                                                 
#>     x1                0.000                           
#>     x2                0.000                           
#> 
#> Variances:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>    .y                 0.991    0.014   70.711    0.000
This model is not a saturated model. The model df is 2, and the model chi-square p-value is .759.

Though redundant, just for illustration, we can do a chi-square difference test to compare these two models:
lavTestLRT(fit, fit0)
#> 
#> Chi-Squared Difference Test
#> 
#>      Df   AIC   BIC  Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
#> fit   0 28294 28316 0.0000                                    
#> fit0  2 28291 28298 0.5516    0.55165     0       2     0.7589
This is redundant because we know in advance that the model chi-square p-value will be equal to the p-value of the chi-square difference test comparing the model with all coefficients fixed to zero against a saturated model. Nevertheless, it illustrates that, in SEM, we can indeed get a test that is conceptually equivalent to the F-test of the R-square in regression, when both are viewed as tests comparing two models.

Numerically, they are not equivalent even with such a large sample size, because OLS estimation is used in regression and ML is used in SEM. However, the difference is minor in this case:
lavTestLRT(fit, fit0)[2, 8]
#> [1] 0.7589457
anova(lm_out0, lm_out)[2, 6]
#> [1] 0.7590085
Two notes. First, nothing here is new from me. I vaguely recall a thread in this group discussing a related issue, something like testing an increase in R-square in SEM; what I presented above is basically what I recalled from that discussion. Second, I did not include the mean structure in the SEM above, so, strictly speaking, the models compared in SEM are not exactly the same two models as in the ANOVA, but this is not an issue here.

Regards,
Shu Fai Cheung (張樹輝)



Keith Markus

May 7, 2023, 11:36:43 AM
to lavaan
Here is a result that I did not anticipate.

Out of curiosity, I ran the original model in lavaan but using the ULS estimator and including the intercept term to see how similar the results would be. For this model at least, ULS produced smaller standard errors than OLS. That was not too surprising; I did not have strong expectations either way.

However, the curious result is that the chi-square difference test reports a p-value, but the p-value prints as NA for the model with the zero effect coefficient, despite that model having 1 degree of freedom. Does anyone know the reason for that?

data(mtcars)
summary(mtcars)
require(lavaan)
# Fit regression model
semFit <- sem("mpg ~ cyl", mtcars, estimator='uls', meanstructure=TRUE)
# Fit null model with zero weight for predictor
semFitNull <- sem("mpg ~ 0*cyl", mtcars, estimator='uls', meanstructure=TRUE)
# Test the difference in fit
anova(semFit, semFitNull)
# Because the nesting model is saturated, this
#  is the same as the fit for the nested model.
lavInspect(semFitNull, what='fit')[3:5]
# For comparison
regFit <- lm(mpg ~ cyl, data=mtcars)
summary(regFit)
summary(semFit)
summary(semFitNull)


Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/

Edward Rigdon

May 7, 2023, 1:32:51 PM
to lav...@googlegroups.com
I don't "know," but with the estimator changed to "ml", you get a p-value (0.000) and a much smaller chi-square. It could be that the 1.000 p-value for the saturated model under "uls" is just plugged in once the degrees of freedom are found to be 0.
--Ed R. 
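A minimal sketch of that check, refitting Keith's two models with lavaan's default ML estimator:

library(lavaan)

# Same two models as in Keith's post, but with the default ML estimator
semFitML <- sem("mpg ~ cyl", mtcars, meanstructure = TRUE)
semFitNullML <- sem("mpg ~ 0*cyl", mtcars, meanstructure = TRUE)

# Both the difference test and the restricted model's own fit
# now report a chi-square based p-value
anova(semFitML, semFitNullML)
fitMeasures(semFitNullML, c("chisq", "df", "pvalue"))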


Shu Fai Cheung

May 7, 2023, 2:54:37 PM
to lav...@googlegroups.com
I also do not know the reason, but I would like to try to find out, as an exercise for myself.

In the summary of semFitNull, the row label of the NA for p-value is "P-value (Unknown)". It seems that the following lines do the printing:


"P-value (Unknown)" is printed because refdistr == "unknown", which is an element of the slot TEST in the output of sem():

> semFitNull@test
$standard
$standard$test
[1] "standard"
$standard$stat
[1] 2608.109
$standard$stat.group
[1] 2608.109
$standard$df
[1] 1
$standard$refdistr
[1] "unknown"
$standard$pvalue
[1] NA

As shown above, the p-value ($standard$pvalue) has already been set to NA in the call to sem(). It is not computed again when calling summary(). I am not 100% certain, but I believe the p-value is set to NA and refdistr set to "unknown" in the lines below, when the estimator is "ULS":


So, the p-value is "designed" to be NA in this case.
For the chi-square difference test conducted by anova(), I guess there is a p-value because the p-value is computed from the difference in chi-square, and this function does not check whether refdistr is "unknown", as summary() does (to be precise, lav_test_print(), which is called by summary()):
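Not the lavaan source, but a rough by-hand version of that idea, using Keith's semFit and semFitNull and treating the difference in the ULS statistics as chi-square distributed (exactly the step that is questionable when refdistr is "unknown"):

# Difference in the stored test statistics and degrees of freedom
chisq_diff <- fitMeasures(semFitNull, "chisq") - fitMeasures(semFit, "chisq")
df_diff <- fitMeasures(semFitNull, "df") - fitMeasures(semFit, "df")

# A p-value from the chi-square distribution, without any check on refdistr
pchisq(chisq_diff, df = df_diff, lower.tail = FALSE)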

By the way, the p-value of the saturated model with estimator = "ULS" is also NA, though not printed. As shown below, refdistr is also "unknown", just like the 1-df model:

> semFit@test
$standard
$standard$test
[1] "standard"
$standard$stat
[1] 1.317358e-14
$standard$stat.group
[1] 1.317358e-14
$standard$df
[1] 0
$standard$refdistr
[1] "unknown"
$standard$pvalue
[1] NA

It seems that lav_test_print(), called by summary(), automatically suppresses printing the p-value when df is 0, and so we do not know that it is also NA:

The above explanation is based on my understanding of the lines above. There may be things that I overlooked.

Regards,
Shu Fai Cheung (張樹輝)

Shu Fai Cheung (張樹輝)

May 7, 2023, 3:10:00 PM
to lavaan
Oh ... a typo. The slot I mentioned is test, lowercase, not TEST.

-- Shu Fai

Shu Fai Cheung

May 7, 2023, 3:19:23 PM
to lav...@googlegroups.com
I forgot the following part:

> lavInspect(semFitNull, what='fit')[3:5]
   chisq       df   pvalue
2608.109    1.000       NA

The p-value above is NA because lavInspect() calls fitMeasures() when what = "fit", and fitMeasures() simply retrieves the stored p-value, like summary().
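The same stored values can also be retrieved with fitMeasures() directly, e.g.:

# Same chisq, df, and (NA) p-value as lavInspect(semFitNull, what = 'fit')[3:5]
fitMeasures(semFitNull, c("chisq", "df", "pvalue"))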

Regards,
Shu Fai Cheung (張樹輝)

Keith Markus

May 8, 2023, 11:23:07 AM
to lavaan
Thanks Ed and Shu Fai.  I do not follow the statistical literature closely enough to know the current state of play.  It was my impression that ULS had been rescued from the dustbin of history by threshold models for ordinal variables and was receiving renewed interest during the past decade or so.  That seems all the more remarkable if ULS does not support chi-square tests.  This is probably literally the first time that I have ever fit a model using ULS, so it is unfamiliar territory for me.