Different parameterization when using categorical items


AM

Mar 25, 2023, 8:41:00 AM
to lavaan
Hi, I am struggling to understand the parameterization of the lavaan output when using categorical manifest variables. I would like to start by showing, with continuous indicators, the part I understand, and then, with categorical indicators, the part I don't.

Continuous indicators

For continuous indicators, let's use this script to manually generate the data and analyze it with lavaan:

trait <- rnorm(500)

error1 <- rnorm(500)
error2 <- rnorm(500)
error3 <- rnorm(500)
error4 <- rnorm(500)

X1 <- 1.1*trait + error1
X2 <- 1.3*trait + error2
X3 <- 1.5*trait + error3
X4 <- 1.7*trait + error4

data <- cbind(X1, X2, X3, X4)

model <- '
X =~ X1 + X2 + X3 + X4
X ~~ X
'

fit <- cfa(model, data, std.lv = T)

summary(fit)

This is the output:

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  X =~                                                
    X1                1.126    0.059   19.126    0.000
    X2                1.349    0.064   21.081    0.000
    X3                1.544    0.067   23.129    0.000
    X4                1.763    0.076   23.057    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    X                 1.000                          
   .X1                0.954    0.071   13.350    0.000
   .X2                0.967    0.079   12.274    0.000
   .X3                0.842    0.081   10.403    0.000
   .X4                1.114    0.106   10.485    0.000


So as you can see, the values in the output closely match the values used in the data generation. I set the factor loadings to 1.1, 1.3, 1.5 and 1.7, and lavaan successfully estimated them as 1.13, 1.35, 1.54 and 1.76.

Categorical indicators

For this we will use almost exactly the same script, but add a step that converts the continuous manifest variables to binary ones: we choose a threshold for each and code whether the value exceeds it (via as.numeric(X > threshold)).

trait <- rnorm(500)

error1 <- rnorm(500)
error2 <- rnorm(500)
error3 <- rnorm(500)
error4 <- rnorm(500)

X1 <- 1.1*trait + error1
X2 <- 1.3*trait + error2
X3 <- 1.5*trait + error3
X4 <- 1.7*trait + error4

X1 <- as.numeric(X1 > -1.3)
X2 <- as.numeric(X2 > -0.6)
X3 <- as.numeric(X3 > 1.4)
X4 <- as.numeric(X4 > 2.1)

model <- '
X =~ X1 + X2 + X3 + X4
X ~~ X
'

data <- cbind(X1, X2, X3, X4)

fit <- cfa(model, data, std.lv = T, ordered = c("X1", "X2", "X3", "X4"))

summary(fit)


Let's now look at the output:

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  X =~                                                
    X1                0.730    0.067   10.854    0.000
    X2                0.812    0.062   13.143    0.000
    X3                0.794    0.055   14.422    0.000
    X4                0.898    0.058   15.520    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .X1                0.000                          
   .X2                0.000                          
   .X3                0.000                          
   .X4                0.000                          
    X                 0.000                          

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)
    X1|t1            -0.813    0.063  -12.829    0.000
    X2|t1            -0.353    0.057   -6.152    0.000
    X3|t1             0.793    0.063   12.580    0.000
    X4|t1             1.146    0.072   15.962    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    X                 1.000                          
   .X1                0.466                          
   .X2                0.340                          
   .X3                0.370                          
   .X4                0.194                          

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)
    X1                1.000                          
    X2                1.000                          
    X3                1.000                          
    X4                1.000 


This time the values in the output do not seem to closely match the factor loadings specified in the data-generating script. The same applies to the thresholds: they also seem quite far off from the values used in the data generation.

However, I believe the model wasn't badly estimated; lavaan just uses a different parameterization when dealing with categorical variables.

I can't figure out how to convert from one parameterization to the other, though. And I really need to do this, because I am setting up a simulation study, and if the parameters aren't on the same scale there is no way for me to assess the bias of the model. Can anyone provide formulae which would allow me to convert between the two parameterizations?

Or if the problem is something else and I am understanding this wrong, please let me know what my mistake is and how I can correct it. Many thanks!

Shu Fai Cheung

Mar 25, 2023, 9:40:00 PM
to lav...@googlegroups.com
I think what you need to compare with are the population standardized loadings. This is an illustration based on your code, with set.seed() to make the results reproducible and a larger n (50000):
library(lavaan)
#> This is lavaan 0.6-15
#> lavaan is FREE software! Please report any bugs.

set.seed(870431)
n <- 50000

trait <- rnorm(n)

error1 <- rnorm(n)
error2 <- rnorm(n)
error3 <- rnorm(n)
error4 <- rnorm(n)


X1 <- 1.1*trait + error1
X2 <- 1.3*trait + error2
X3 <- 1.5*trait + error3
X4 <- 1.7*trait + error4

X1 <- as.numeric(X1 > -1.3)
X2 <- as.numeric(X2 > -0.6)
X3 <- as.numeric(X3 > 1.4)
X4 <- as.numeric(X4 > 2.1)

model <- '
X =~ X1 + X2 + X3 + X4
X ~~ X
'

data <- cbind(X1, X2, X3, X4)

fit <- cfa(model, data, std.lv = T, ordered = c("X1", "X2", "X3", "X4"))

# Verify that the loadings are equal to the standardized loadings
est <- subset(parameterEstimates(fit, standardized = TRUE), op == "=~")
est
#>   lhs op rhs   est    se       z pvalue ci.lower ci.upper std.lv std.all
#> 1   X =~  X1 0.746 0.006 116.278      0    0.733    0.758  0.746   0.746
#> 2   X =~  X2 0.790 0.005 148.368      0    0.780    0.801  0.790   0.790
#> 3   X =~  X3 0.829 0.005 152.552      0    0.819    0.840  0.829   0.829
#> 4   X =~  X4 0.858 0.006 143.289      0    0.846    0.870  0.858   0.858
#>   std.nox
#> 1   0.746
#> 2   0.790
#> 3   0.829
#> 4   0.858

# Population standardized loadings before dichotomization
loading_original <- c(1.1, 1.3, 1.5, 1.7)
std_loading_original <- loading_original / sqrt(loading_original^2 + 1)
round(std_loading_original, 3)
#> [1] 0.740 0.793 0.832 0.862
Hope this helps.

Regards,
Shu Fai Cheung (張樹輝)



AM

Mar 26, 2023, 6:30:54 AM
to lavaan
Thank you very much for your help! Could I please just follow up with two questions to make my understanding of this better?

1) Where does the formula for the original standardized loading come from? I thought that the formula is supposed to be original_loading * sqrt(trait_variance) / sqrt(item_variance).   

2) Is the same method used to make the thresholds standardized?

Shu Fai Cheung

Mar 26, 2023, 7:11:13 AM
to lav...@googlegroups.com
1) Where does the formula for the original standardized loading come from? I thought that the formula is supposed to be original_loading * sqrt(trait_variance) / sqrt(item_variance).   

You are right. I omitted the trait variance because it is 1 in the population (and I was lazy). Sorry for the confusion. (This is also why I used "sqrt(loading_original^2 + 1)" rather than the usual form: I made use of the convenient fact that the trait variance and all the error variances equal 1.)
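For readers who want the general form spelled out, here is a small sketch (the function name is mine, not from the thread) that reduces to the shortcut above when the trait variance and all error variances are 1:

```r
# General standardized loading: lambda * sd(trait) / sd(item),
# where var(item) = lambda^2 * var(trait) + var(error).
std_loading <- function(lambda, trait_var = 1, error_var = 1) {
  lambda * sqrt(trait_var) / sqrt(lambda^2 * trait_var + error_var)
}

# With trait_var = error_var = 1 this reduces to lambda / sqrt(lambda^2 + 1):
round(std_loading(c(1.1, 1.3, 1.5, 1.7)), 3)
#> [1] 0.740 0.793 0.832 0.862
```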

> 2) Is the same method used to make the thresholds standardized?

I think so. I rarely interpret the thresholds. Using the same example as in my previous post, the population thresholds (with X1 to X4 standardized) and the results from lavaan are pretty close.
# Check threshold
est_t <- subset(parameterEstimates(fit, standardized = TRUE), op == "|")
est_t

#>   lhs op rhs    est    se        z pvalue ci.lower ci.upper std.lv std.all
#> 6  X1  |  t1 -0.874 0.006 -135.318      0   -0.886   -0.861 -0.874  -0.874
#> 7  X2  |  t1 -0.357 0.006  -62.206      0   -0.368   -0.346 -0.357  -0.357
#> 8  X3  |  t1  0.774 0.006  123.703      0    0.762    0.787  0.774   0.774
#> 9  X4  |  t1  1.066 0.007  153.786      0    1.052    1.079  1.066   1.066
#>   std.nox
#> 6  -0.874
#> 7  -0.357
#> 8   0.774
#> 9   1.066
# Population thresholds with Xs standardized
threshold_original <- c(-1.3, -0.6, 1.4, 2.1)
sd_x <- sqrt(loading_original^2 + 1)
round(threshold_original / sd_x, 3)
#> [1] -0.874 -0.366  0.777  1.065
round(est_t$est, 3)
#> [1] -0.874 -0.357  0.774  1.066
(I only included the new lines. Please combine the code above with that in the previous post to reproduce the results.)

Regards,
Shu Fai Cheung (張樹輝)

AM

Mar 26, 2023, 7:18:48 AM
to lavaan
Thank you! I really appreciate your help!

Keith Markus

Mar 26, 2023, 9:22:45 AM
to lavaan
AM,
Maybe this was clear from the context, but I thought it might be helpful to point out why your thresholds were scaled differently. The parameterization used by lavaan scales the unobserved continuous indicators as z scores. You simulated each of them as a weighted sum of two z scores: loading*trait + error. Such a sum is not itself distributed as a z score, even when the two components are orthogonal: for X1 the variance is 1.1^2 + 1 = 2.21, not 1. Your simulated continuous indicator variable is a composite, and the variance of a composite is the sum of all the elements of the covariance matrix of its (weighted) components. This is why Shu Fai suggested looking at the standardized estimates.
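That point is easy to check with a quick simulation (my own sketch, not code from the thread):

```r
set.seed(1)
z1 <- rnorm(1e6)  # trait, a z score
z2 <- rnorm(1e6)  # error, a z score

# An unweighted sum of two orthogonal z scores has variance 1 + 1 = 2 ...
var(z1 + z2)        # close to 2

# ... and a weighted composite like X1 = 1.1*trait + error1 has
# variance 1.1^2 + 1 = 2.21, so it is not a z score either
var(1.1 * z1 + z2)  # close to 2.21
```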

Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/

AM

Mar 26, 2023, 2:46:46 PM
to lavaan
Thank you everyone for your help and advice. I have been going through this this afternoon and I am still not sure I fully understand the formulae. I am mostly still confused by the term sqrt(loading_original^2 + 1). Would someone be so kind as to write out how this term is derived by simplifying some more general expression? I would greatly appreciate it, thanks.

Terrence Jorgensen

Mar 30, 2023, 5:43:50 AM
to lavaan
To expand on what Keith wrote, you generated independent normal trait and errors with SD = variance = 1, so each indicator's total variance is lambda^2 + 1 (for X1, that is 1.1^2 + 1 = 2.21). But if an error had been correlated with the trait, the total variance would also include the covariance s_12: lambda^2 + 1 + 2*lambda*s_12. This is a linear-algebra result, if you want to look further into that.
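That linear-algebra result is a quadratic form, Var(w'x) = w' Sigma w; this is my own illustration, not code from the thread:

```r
# Composite variance as a quadratic form: Var(w'x) = w' Sigma w
w <- c(1.1, 1)              # weights: loading on the trait, 1 on the error

Sigma <- diag(2)            # trait and error uncorrelated, each with variance 1
drop(t(w) %*% Sigma %*% w)  # 1.1^2 + 1 = 2.21

# With a trait-error covariance s12 = 0.3, the covariance term enters twice:
Sigma2 <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
drop(t(w) %*% Sigma2 %*% w) # 1.1^2 + 1 + 2*1.1*0.3 = 2.87
```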

When you discretized them with thresholds c(-1.3, -0.6, 1.4, 2.1), those were applied to normal distributions with mean = 0 and sd = sqrt(lambda^2 + 1), which for X1 can be visualized like so:

curve(dnorm(x, mean = 0, sd = sqrt(1.1^2 + 1)), from = -6, to = 6)
abline(v = -1.3)


If you intended those to be standardized thresholds (z scores), then you would have to apply the z-score formula: subtract the mean (0) and divide by each indicator's own SD:

## Same graph for X1, only the x-axis scale changes:
zThresh <- c(-1.3, -0.6, 1.4, 2.1) / sqrt(c(1.1, 1.3, 1.5, 1.7)^2 + 1)
curve(dnorm(x, mean = 0, sd = 1), from = -3, to = 3)
abline(v = zThresh[1])

So the parameterization is arbitrary, but standardized thresholds would have to be applied to standard-normal variables (i.e., with total variance = 1, not lambda^2 + 1). That would require setting loadings to be < 1, and setting your error variances to 1 minus the explained variance (i.e., lambda * psi * lambda). Because your factor variance is 1, the explained variance reduces to lambda^2, so error variances should be 1 - lambda^2 to simulate standard-normal indicators. For example:

error1 <- rnorm(500, sd = sqrt(1 - loading1^2))


still confused by the term  sqrt(loading_original^2 + 1)  Would someone be so kind to write out how this term is created by simplifying from some more general term? 

That follows from the above. When you use the theta parameterization, you are setting (by default) the residual variances of the latent responses to 1. Because that is what your population values happened to be, using parameterization = "theta" would be consistent with your population. In this case, the explained variance is still lambda^2 and the residual (unexplained) variance is 1, so the total variance is their sum, lambda^2 + 1, and the SD is its square root.
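A minimal sketch of that check (assuming lavaan is installed; it re-simulates the data from earlier in the thread rather than reusing objects, with a large n so the estimates should land near the generating values):

```r
library(lavaan)

set.seed(870431)
n <- 50000
trait <- rnorm(n)
lams <- c(1.1, 1.3, 1.5, 1.7)
taus <- c(-1.3, -0.6, 1.4, 2.1)

# Dichotomize each indicator at its threshold, as in the original script
X <- mapply(function(l, t) as.numeric(l * trait + rnorm(n) > t), lams, taus)
colnames(X) <- paste0("X", 1:4)

model <- 'X =~ X1 + X2 + X3 + X4'
fit_theta <- cfa(model, data = as.data.frame(X), std.lv = TRUE,
                 ordered = colnames(X), parameterization = "theta")

# With residual variances fixed to 1 (matching the population), the
# loadings should come back near the generating values 1.1, 1.3, 1.5, 1.7
round(coef(fit_theta)[1:4], 2)
```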

A second option is to use the delta parameterization, but make the output in your original post comparable to your population parameters by applying a different identification constraint on the total variance of the latent responses y*. You can specify the scaling factors with the ~*~ operator; a scaling factor is 1 / SD. It is 1 by default under the delta parameterization, but you could make each one consistent with your population by fixing it to 1 / sqrt(lambda^2 + 1):

model <- '
X =~ X1 + X2 + X3 + X4
X1 ~*~ 0.6727*X1  # 1 / sqrt(1.1^2 + 1)
X2 ~*~ 0.6097*X2  # 1 / sqrt(1.3^2 + 1)
X3 ~*~ 0.5547*X3  # 1 / sqrt(1.5^2 + 1)
X4 ~*~ 0.5070*X4  # 1 / sqrt(1.7^2 + 1)
'


Terrence D. Jorgensen    (he, him, his)
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
http://www.uva.nl/profile/t.d.jorgensen

