A quick demo of the problem Jeremy explained in the first reply:
``` r
library(lavaan)
#> This is lavaan 0.6-19
#> lavaan is FREE software! Please report any bugs.
library(psych)
#>
#> Attaching package: 'psych'
#> The following object is masked from 'package:lavaan':
#>
#> cor2cov
mod_pop <-
"
f =~ .7 * x1 + .7 * x2 + .7 * x3 + .7 * x4
"
# "empirical = TRUE" is optional. Used here for demo
dat <- simulateData(model = mod_pop,
sample.nobs = 10000,
standardized = TRUE,
empirical = TRUE,
seed = 2345)
mod <-
"
f =~ x1 + x2 + x3 + x4
"
fit <- cfa(mod, dat, std.lv = TRUE)
parameterEstimates(fit)
#> lhs op rhs est se z pvalue ci.lower ci.upper
#> 1 f =~ x1 0.70 0.01 70.849 0 0.681 0.719
#> 2 f =~ x2 0.70 0.01 70.849 0 0.681 0.719
#> 3 f =~ x3 0.70 0.01 70.849 0 0.681 0.719
#> 4 f =~ x4 0.70 0.01 70.849 0 0.681 0.719
#> 5 x1 ~~ x1 0.51 0.01 52.223 0 0.491 0.529
#> 6 x2 ~~ x2 0.51 0.01 52.223 0 0.491 0.529
#> 7 x3 ~~ x3 0.51 0.01 52.223 0 0.491 0.529
#> 8 x4 ~~ x4 0.51 0.01 52.223 0 0.491 0.529
#> 9 f ~~ f 1.00 0.00 NA NA 1.000 1.000
# Divide into two groups using the scale scores
dat$score <- rowMeans(dat[, 1:4])
score_mean <- mean(dat$score)
score_sd <- sd(dat$score)
# Try changing the cutoff and see what happens
cutoff <- score_mean + .5 * score_sd
dat$group <- ifelse(dat$score < cutoff,
"lower",
"upper")
head(dat)
#> x1 x2 x3 x4 score group
#> 1 0.6132076 -0.3032094 -0.05372075 -1.3581287 -0.2754628 lower
#> 2 -0.2562410 -2.2107157 -0.32678316 -0.2849683 -0.7696771 lower
#> 3 1.4765655 1.3920863 0.18409890 0.5586193 0.9028425 upper
#> 4 -0.4380003 -0.3737520 0.82165120 1.0398746 0.2624434 lower
#> 5 -0.2400874 -1.0359636 -0.85575084 -1.3957762 -0.8818945 lower
#> 6 0.0552557 0.1245552 0.46427104 -0.5391987 0.0262208 lower
table(dat$group)
#>
#> lower upper
#> 6922 3078
# Full sample correlations and distribution
round(cor(dat[, 1:4]), 2)
#> x1 x2 x3 x4
#> x1 1.00 0.49 0.49 0.49
#> x2 0.49 1.00 0.49 0.49
#> x3 0.49 0.49 1.00 0.49
#> x4 0.49 0.49 0.49 1.00
describe(dat, range = FALSE)
#> vars n mean sd skew kurtosis se
#> x1 1 10000 0.00 1.00 -0.05 0.04 0.01
#> x2 2 10000 0.00 1.00 -0.01 0.03 0.01
#> x3 3 10000 0.00 1.00 -0.01 -0.02 0.01
#> x4 4 10000 0.00 1.00 -0.04 -0.04 0.01
#> score 5 10000 0.00 0.79 -0.04 0.04 0.01
#> group* 6 10000 1.31 0.46 0.83 -1.31 0.00
# Compare the correlations
round(cor(dat[dat$group == "lower", 1:4]), 2)
#> x1 x2 x3 x4
#> x1 1.00 0.25 0.26 0.25
#> x2 0.25 1.00 0.25 0.26
#> x3 0.26 0.25 1.00 0.26
#> x4 0.25 0.26 0.26 1.00
round(cor(dat[dat$group == "upper", 1:4]), 2)
#> x1 x2 x3 x4
#> x1 1.00 0.07 0.06 0.11
#> x2 0.07 1.00 0.05 0.06
#> x3 0.06 0.05 1.00 0.07
#> x4 0.11 0.06 0.07 1.00
# Compare the distributions
describe(dat[dat$group == "lower", ], range = FALSE)
#> vars n mean sd skew kurtosis se
#> x1 1 6922 -0.40 0.83 -0.29 0.12 0.01
#> x2 2 6922 -0.40 0.82 -0.23 0.22 0.01
#> x3 3 6922 -0.40 0.83 -0.21 0.10 0.01
#> x4 4 6922 -0.39 0.84 -0.20 0.05 0.01
#> score 5 6922 -0.40 0.55 -0.85 0.72 0.01
#> group* 6 6922 1.00 0.00 NaN NaN 0.00
describe(dat[dat$group == "upper", ], range = FALSE)
#> vars n mean sd skew kurtosis se
#> x1 1 3078 0.89 0.73 0.19 0.12 0.01
#> x2 2 3078 0.91 0.72 0.20 0.21 0.01
#> x3 3 3078 0.90 0.72 0.23 0.09 0.01
#> x4 4 3078 0.88 0.72 0.22 0.01 0.01
#> score 5 3078 0.90 0.40 1.08 1.04 0.01
#> group* 6 3078 1.00 0.00 NaN NaN 0.00
# Multigroup CFA
fit_gp <- cfa(mod, dat, group = "group")
(est <- parameterEstimates(fit_gp))
#> lhs op rhs block group est se z pvalue ci.lower ci.upper
#> 1 f =~ x1 1 1 1.000 0.000 NA NA 1.000 1.000
#> 2 f =~ x2 1 1 0.984 0.046 21.589 0.000 0.895 1.074
#> 3 f =~ x3 1 1 0.999 0.046 21.652 0.000 0.908 1.089
#> 4 f =~ x4 1 1 1.033 0.048 21.737 0.000 0.940 1.126
#> 5 x1 ~~ x1 1 1 0.522 0.012 43.690 0.000 0.499 0.546
#> 6 x2 ~~ x2 1 1 0.512 0.012 43.879 0.000 0.489 0.535
#> 7 x3 ~~ x3 1 1 0.513 0.012 43.446 0.000 0.490 0.536
#> 8 x4 ~~ x4 1 1 0.526 0.012 42.776 0.000 0.502 0.550
#> 9 f ~~ f 1 1 0.174 0.011 15.507 0.000 0.152 0.195
#> 10 x1 ~1 1 1 -0.397 0.010 -39.553 0.000 -0.416 -0.377
#> 11 x2 ~1 1 1 -0.404 0.010 -40.764 0.000 -0.424 -0.385
#> 12 x3 ~1 1 1 -0.402 0.010 -40.421 0.000 -0.422 -0.383
#> 13 x4 ~1 1 1 -0.392 0.010 -38.679 0.000 -0.412 -0.372
#> 14 f ~1 1 1 0.000 0.000 NA NA 0.000 0.000
#> 15 f =~ x1 2 2 1.000 0.000 NA NA 1.000 1.000
#> 16 f =~ x2 2 2 0.626 0.163 3.845 0.000 0.307 0.944
#> 17 f =~ x3 2 2 0.619 0.162 3.832 0.000 0.303 0.936
#> 18 f =~ x4 2 2 1.011 0.263 3.844 0.000 0.496 1.527
#> 19 x1 ~~ x1 2 2 0.478 0.020 23.577 0.000 0.439 0.518
#> 20 x2 ~~ x2 2 2 0.503 0.015 33.778 0.000 0.474 0.532
#> 21 x3 ~~ x3 2 2 0.502 0.015 33.880 0.000 0.473 0.531
#> 22 x4 ~~ x4 2 2 0.468 0.020 22.953 0.000 0.428 0.508
#> 23 f ~~ f 2 2 0.056 0.017 3.249 0.001 0.022 0.090
#> 24 x1 ~1 2 2 0.892 0.013 67.680 0.000 0.866 0.918
#> 25 x2 ~1 2 2 0.909 0.013 69.577 0.000 0.883 0.934
#> 26 x3 ~1 2 2 0.905 0.013 69.365 0.000 0.879 0.930
#> 27 x4 ~1 2 2 0.882 0.013 67.496 0.000 0.856 0.907
#> 28 f ~1 2 2 0.000 0.000 NA NA 0.000 0.000
# Compare the loadings
cbind(est[1:4, "est"],
est[15:18, "est"])
#> [,1] [,2]
#> [1,] 1.0000000 1.0000000
#> [2,] 0.9842726 0.6255983
#> [3,] 0.9985078 0.6193028
#> [4,] 1.0328542 1.0112341
```
<sup>Created on 2025-08-16 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>

The demo uses an unbalanced split (mean + .5 SD as the cutoff), so one group has more cases than the other, which I believe is what happens in your dataset.
The CFA model is the true model in this example. However, when the sample is split in two using the scale scores, the correlations within each group are smaller than those in the full sample (a restriction-of-range effect), and some factor loadings differ substantially between the groups.
For this multivariate normal dataset, there is a cutoff value that yields similar factor loadings across the two groups. However, the within-group correlations will still be smaller than those in the full sample.
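For example, a symmetric split at the mean should give the two groups similar loadings (a sketch, not run here, reusing the `dat`, `mod`, and `score_mean` objects from above; `group2` is a name I introduce for illustration):

``` r
# Re-split at the mean so the two groups are (roughly) the same size,
# then refit the multigroup model and compare the loadings
dat$group2 <- ifelse(dat$score < score_mean, "lower", "upper")
fit_gp2 <- cfa(mod, dat, group = "group2")
est2 <- parameterEstimates(fit_gp2)
est2[est2$op == "=~", c("group", "lhs", "rhs", "est")]
# The within-group correlations, however, stay attenuated:
round(cor(dat[dat$group2 == "lower", 1:4]), 2)
round(cor(dat[dat$group2 == "upper", 1:4]), 2)
```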
A side issue: even though the full sample is multivariate normal, the two subsamples are not, as the skewness and kurtosis values for the two groups above show.
This demo assumes that the CFA model is the true model and that there is "no group", that is, no heterogeneous subpopulations, for this model. This assumption may not be sensible theoretically, but this is what we assume when we fit the CFA model to the full sample. If this is indeed the model you believe in, then the sample should not be split into two. The latent factor behind the items is assumed to be continuous in this model.
The problem demonstrated above occurred because you used the factor, or the scale mean based on it, to create the two groups. If you have a third variable to divide the sample into two groups, such as clinical diagnosis, then you can do a multigroup analysis and examine measurement invariance. The issue above may still be present to some degree, because the latent factor in the CFA model may, or should, still be related to that third variable. However, in this case the full-sample model is no longer relevant, because the two samples are considered to come from two different populations. Indeed, with this third variable, it may not make sense to analyze the full sample using a single-group model even if measurement invariance can be established, because the factor means may differ.
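The invariance testing could be sketched as follows (not run here; this assumes `group` were a genuine third variable such as diagnosis, rather than one derived from the scores as in the demo):

``` r
# The usual nested sequence: configural, metric, then scalar invariance
fit_config <- cfa(mod, dat, group = "group")
fit_metric <- cfa(mod, dat, group = "group",
                  group.equal = "loadings")
fit_scalar <- cfa(mod, dat, group = "group",
                  group.equal = c("loadings", "intercepts"))
# Likelihood ratio tests comparing the nested models
lavTestLRT(fit_config, fit_metric, fit_scalar)
```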
If you do not have another variable to form the groups, but think that, theoretically, there are two heterogeneous populations underlying the data, and these populations may differ in factor means, factor loadings, etc., then perhaps what you want to try is a finite mixture model.
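As a first look at possible latent classes, one could fit a Gaussian mixture to the indicators, for example with the mclust package (a sketch, not run here; note this is not a full factor mixture model, which would need dedicated software such as OpenMx or Mplus):

``` r
library(mclust)
# BIC for mixtures with different numbers of components and
# covariance structures
bic <- mclustBIC(dat[, 1:4])
summary(bic)  # best solutions by BIC
plot(bic)
```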
My two cents.
-- Shu Fai