Ed's explanation is very clear. I would like to explore another perspective, a model comparison one. Please correct me if I got anything wrong.
First, I will use a dataset with two predictors and one outcome variable for discussion. I set the sample size to 10000 just to make the values to be compared more similar, as shown later. The population regression coefficients are zero, so the p-values are far from zero even with such a large sample.
library(lavaan)
#> This is lavaan 0.6-15
#> lavaan is FREE software! Please report any bugs.
n <- 10000
set.seed(85471)
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rnorm(n)
dat <- data.frame(x1, x2, y)
This is the usual OLS regression result, using x1 and x2 to predict y:
lm_out <- lm(y ~ x1 + x2, dat)
summary(lm_out)
#>
#> Call:
#> lm(formula = y ~ x1 + x2, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.1187 -0.6718 -0.0015 0.6708 3.9153
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.013791 0.009957 -1.385 0.166
#> x1 -0.005108 0.009907 -0.516 0.606
#> x2 0.005309 0.009915 0.535 0.592
#>
#> Residual standard error: 0.9956 on 9997 degrees of freedom
#> Multiple R-squared: 5.516e-05, Adjusted R-squared: -0.0001449
#> F-statistic: 0.2757 on 2 and 9997 DF, p-value: 0.759
In addition to interpreting the test of the R-square as a test of the variance explained (the hypothesis that the population R-square is zero), we can also treat it as a test comparing two models: one with all predictors present, and one with the intercept only, fitted with y ~ 1 in lm(). The intercept-only model is equivalent to a model with the two regression coefficients fixed to zero (so x1 and x2 are, so to speak, "not in the model"):
lm_out0 <- lm(y ~ 1, dat)
summary(lm_out0)
#>
#> Call:
#> lm(formula = y ~ 1, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.1224 -0.6712 -0.0017 0.6733 3.9284
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.013856 0.009955 -1.392 0.164
#>
#> Residual standard error: 0.9955 on 9999 degrees of freedom
anova(lm_out0, lm_out)
#> Analysis of Variance Table
#>
#> Model 1: y ~ 1
#> Model 2: y ~ x1 + x2
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 9999 9910.2
#> 2 9997 9909.6 2 0.54668 0.2757 0.759
The F statistic and p-value in the ANOVA table above are identical to the F statistic and p-value for the R-square in the first model, with x1 and x2 included.
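To check this identity programmatically (a small add-on, not part of the original output): summary() stores the overall F statistic, pf() converts it to a p-value, and that p-value matches the one from anova():

```r
# Recreate the data and both fits, then compare the two p-values.
set.seed(85471)
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rnorm(n)
dat <- data.frame(x1, x2, y)
lm_out <- lm(y ~ x1 + x2, dat)
lm_out0 <- lm(y ~ 1, dat)
# Overall F test as stored by summary(): value, numdf, dendf
fstat <- summary(lm_out)$fstatistic
p_f <- pf(fstat["value"], fstat["numdf"], fstat["dendf"],
          lower.tail = FALSE)
# Model comparison p-value from the ANOVA table
p_anova <- anova(lm_out0, lm_out)[2, "Pr(>F)"]
all.equal(unname(p_f), p_anova)
#> [1] TRUE
```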
A conceptually, though not numerically, equivalent test in SEM is also a model comparison test. First, this is the model using x1 and x2 to predict y, which is saturated, with df = 0:
fit <- sem("y ~ x1 + x2", dat)
summary(fit)
#> lavaan 0.6.15 ended normally after 6 iterations
#>
#> Estimator ML
#> Optimization method NLMINB
#> Number of model parameters 3
#>
#> Number of observations 10000
#>
#> Model Test User Model:
#>
#> Test statistic 0.000
#> Degrees of freedom 0
#>
#> Parameter Estimates:
#>
#> Standard errors Standard
#> Information Expected
#> Information saturated (h1) model Structured
#>
#> Regressions:
#> Estimate Std.Err z-value P(>|z|)
#> y ~
#> x1 -0.005 0.010 -0.516 0.606
#> x2 0.005 0.010 0.536 0.592
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|)
#> .y 0.991 0.014 70.711 0.000
For this covariance matrix of three variables, we can fit a model "without predictors" or, more precisely, a model with the coefficients of all predictors fixed to zero:
fit0 <- sem("y ~ 0 * x1 + 0 * x2", dat)
summary(fit0)
#> lavaan 0.6.15 ended normally after 7 iterations
#>
#> Estimator ML
#> Optimization method NLMINB
#> Number of model parameters 1
#>
#> Number of observations 10000
#>
#> Model Test User Model:
#>
#> Test statistic 0.552
#> Degrees of freedom 2
#> P-value (Chi-square) 0.759
#>
#> Parameter Estimates:
#>
#> Standard errors Standard
#> Information Expected
#> Information saturated (h1) model Structured
#>
#> Regressions:
#> Estimate Std.Err z-value P(>|z|)
#> y ~
#> x1 0.000
#> x2 0.000
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|)
#> .y 0.991 0.014 70.711 0.000
This model is not a saturated model. The model df is 2, and the model chi-square p-value is .759.
Though redundant, just for illustration, we can do a chi-square difference test to compare these two models:
lavTestLRT(fit, fit0)
#>
#> Chi-Squared Difference Test
#>
#> Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
#> fit 0 28294 28316 0.0000
#> fit0 2 28291 28298 0.5516 0.55165 0 2 0.7589
This is redundant because we know in advance that the model chi-square p-value will equal the p-value of the chi-square difference test comparing the model with all coefficients fixed to zero against the saturated model. Nevertheless, it illustrates that, in SEM, we can indeed get a test that is conceptually equivalent to the F test of the R-square in regression, when both are viewed as tests comparing two models.
Numerically, they are not equivalent even with such a large sample, because OLS estimation is used in regression while ML is used in SEM. However, the difference is minor in this case:
lavTestLRT(fit, fit0)[2, 8]
#> [1] 0.7589457
anova(lm_out0, lm_out)[2, 6]
#> [1] 0.7590085
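The closeness is no coincidence. In this simple setup (a single outcome, fixed regressors, and lavaan's defaults of normal-theory ML with fixed.x = TRUE), the chi-square difference statistic reduces to n * log(RSS0 / RSS1), where RSS0 and RSS1 are the residual sums of squares of the two OLS fits, so both tests are built from the same ingredients. A sketch, under those assumptions:

```r
# Recreate the data and compute the residual sums of squares of
# the intercept-only and full OLS models.
set.seed(85471)
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rnorm(n)
dat <- data.frame(x1, x2, y)
rss1 <- sum(resid(lm(y ~ x1 + x2, dat))^2)  # model with predictors
rss0 <- sum(resid(lm(y ~ 1, dat))^2)        # intercept-only model
# Likelihood ratio chi-square from the ML (biased, /n) variance
# estimates; reproduces, up to rounding, the 0.55165 chi-square
# difference reported by lavTestLRT() above.
n * log(rss0 / rss1)
```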
Two notes. First, nothing here is new from me. I vaguely recall a thread in this group discussing a related issue, something like testing an R-square increase in SEM; what I presented above is basically what I recalled from that discussion. Second, I did not include the mean structure in the SEM above, so, strictly speaking, the models compared in SEM are not the same two models as in the ANOVA, but this is not an issue here.
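For completeness, the mean structure can be added so that both SEM models, like the two lm() models, estimate an intercept for y; lavaan's meanstructure = TRUE argument does this. In this setup the added mean part is saturated in both models, so the chi-square difference test should be unchanged (a sketch, not run in the session above):

```r
library(lavaan)
# Same data as before
set.seed(85471)
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rnorm(n)
dat <- data.frame(x1, x2, y)
# Same two models, now with the intercept of y estimated freely
fit_m <- sem("y ~ x1 + x2", dat, meanstructure = TRUE)
fit0_m <- sem("y ~ 0 * x1 + 0 * x2", dat, meanstructure = TRUE)
lavTestLRT(fit_m, fit0_m)
```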
Regards,
Shu Fai Cheung (張樹輝)