Different parameterization when using categorical items


AM

Mar 25, 2023, 8:41:00 AM
to lavaan
Hi, I am struggling to understand the parameterization of the lavaan output when using categorical manifest variables. I would like to start by showing, with continuous indicators, the part I understand, and then, with categorical indicators, the part I don't.

Continuous indicators

For continuous indicators, let's use this script to manually generate the data and analyze it with lavaan:

trait <- rnorm(500)

error1 <- rnorm(500)
error2 <- rnorm(500)
error3 <- rnorm(500)
error4 <- rnorm(500)

X1 <- 1.1*trait + error1
X2 <- 1.3*trait + error2
X3 <- 1.5*trait + error3
X4 <- 1.7*trait + error4

data <- cbind(X1, X2, X3, X4)

model <- '
X =~ X1 + X2 + X3 + X4
X ~~ X
'

fit <- cfa(model, data, std.lv = T)

summary(fit)

This is the output:

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  X =~                                                
    X1                1.126    0.059   19.126    0.000
    X2                1.349    0.064   21.081    0.000
    X3                1.544    0.067   23.129    0.000
    X4                1.763    0.076   23.057    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    X                 1.000                          
   .X1                0.954    0.071   13.350    0.000
   .X2                0.967    0.079   12.274    0.000
   .X3                0.842    0.081   10.403    0.000
   .X4                1.114    0.106   10.485    0.000


So as you can see, the values in the output closely match the values used in the data generation. I set the factor loadings to 1.1, 1.3, 1.5 and 1.7, and lavaan successfully estimated them as 1.13, 1.35, 1.54 and 1.76.

Categorical indicators

For this we will use almost exactly the same script, but add a step that converts the continuous manifest variables to binary ones: we choose a threshold for each and code whether the value exceeds it (via as.numeric(X > threshold)).

trait <- rnorm(500)

error1 <- rnorm(500)
error2 <- rnorm(500)
error3 <- rnorm(500)
error4 <- rnorm(500)

X1 <- 1.1*trait + error1
X2 <- 1.3*trait + error2
X3 <- 1.5*trait + error3
X4 <- 1.7*trait + error4

X1 <- as.numeric(X1 > -1.3)
X2 <- as.numeric(X2 > -0.6)
X3 <- as.numeric(X3 > 1.4)
X4 <- as.numeric(X4 > 2.1)

model <- '
X =~ X1 + X2 + X3 + X4
X ~~ X
'

data <- cbind(X1, X2, X3, X4)

fit <- cfa(model, data, std.lv = T, ordered = c("X1", "X2", "X3", "X4"))

summary(fit)


Let's now look at the output:

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  X =~                                                
    X1                0.730    0.067   10.854    0.000
    X2                0.812    0.062   13.143    0.000
    X3                0.794    0.055   14.422    0.000
    X4                0.898    0.058   15.520    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .X1                0.000                          
   .X2                0.000                          
   .X3                0.000                          
   .X4                0.000                          
    X                 0.000                          

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)
    X1|t1            -0.813    0.063  -12.829    0.000
    X2|t1            -0.353    0.057   -6.152    0.000
    X3|t1             0.793    0.063   12.580    0.000
    X4|t1             1.146    0.072   15.962    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    X                 1.000                          
   .X1                0.466                          
   .X2                0.340                          
   .X3                0.370                          
   .X4                0.194                          

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)
    X1                1.000                          
    X2                1.000                          
    X3                1.000                          
    X4                1.000 


This time the values in the output do not seem to closely match the factor loadings specified in the data-generating script. The same applies to the thresholds: they also seem quite far off from the values used in the data generation.

However, I believe the model wasn't badly estimated; lavaan just uses a different parameterization when dealing with categorical variables.

I can't figure out how to convert from one parameterization to the other, though. And I really need to do this, because I am setting up a simulation study, and if the parameters aren't on the same scale there is no way for me to assess the bias of the model. Can anyone provide formulae which would allow me to convert between the two parameterizations?

Or if the problem is something else and I am understanding this wrong, please let me know what my mistake is and how I can correct it. Many thanks!

Shu Fai Cheung

Mar 25, 2023, 9:40:00 PM
to lav...@googlegroups.com
I think what you need to compare with are the population standardized loadings. This is an illustration based on your code, with set.seed() to make the results reproducible and a larger n (50000):
library(lavaan)
#> This is lavaan 0.6-15
#> lavaan is FREE software! Please report any bugs.

set.seed(870431)
n <- 50000

trait <- rnorm(n)

error1 <- rnorm(n)
error2 <- rnorm(n)
error3 <- rnorm(n)
error4 <- rnorm(n)


X1 <- 1.1*trait + error1
X2 <- 1.3*trait + error2
X3 <- 1.5*trait + error3
X4 <- 1.7*trait + error4

X1 <- as.numeric(X1 > -1.3)
X2 <- as.numeric(X2 > -0.6)
X3 <- as.numeric(X3 > 1.4)
X4 <- as.numeric(X4 > 2.1)

model <- '
X =~ X1 + X2 + X3 + X4
X ~~ X
'

data <- cbind(X1, X2, X3, X4)

fit <- cfa(model, data, std.lv = T, ordered = c("X1", "X2", "X3", "X4"))

# Verify that the loadings are equal to the standardized loadings
est <- subset(parameterEstimates(fit, standardized = TRUE), op == "=~")
est
#>   lhs op rhs   est    se       z pvalue ci.lower ci.upper std.lv std.all
#> 1   X =~  X1 0.746 0.006 116.278      0    0.733    0.758  0.746   0.746
#> 2   X =~  X2 0.790 0.005 148.368      0    0.780    0.801  0.790   0.790
#> 3   X =~  X3 0.829 0.005 152.552      0    0.819    0.840  0.829   0.829
#> 4   X =~  X4 0.858 0.006 143.289      0    0.846    0.870  0.858   0.858
#>   std.nox
#> 1   0.746
#> 2   0.790
#> 3   0.829
#> 4   0.858

# Population standardized loadings before dichotomization
loading_original <- c(1.1, 1.3, 1.5, 1.7)
std_loading_original <- loading_original / sqrt(loading_original^2 + 1)
round(std_loading_original, 3)
#> [1] 0.740 0.793 0.832 0.862
Hope this helps.

Regards,
Shu Fai Cheung (張樹輝)



AM

Mar 26, 2023, 6:30:54 AM
to lavaan
Thank you very much for your help! Could I please just follow up with two questions to make my understanding of this better?

1) Where does the formula for the original standardized loading come from? I thought that the formula is supposed to be original_loading * sqrt(trait_variance) / sqrt(item_variance).   

2) Is the same method used to make the thresholds standardized?

Shu Fai Cheung

Mar 26, 2023, 7:11:13 AM
to lav...@googlegroups.com
1) Where does the formula for the original standardized loading come from? I thought that the formula is supposed to be original_loading * sqrt(trait_variance) / sqrt(item_variance).   

You are right. I omitted the trait variance because it is 1 in the population (and I was lazy). Sorry for the confusion. (This is also why I used "sqrt(loading_original^2 + 1)" rather than the usual form: I made use of the convenient fact that the trait variance and all the error variances equal 1.)
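For readers who want the general form spelled out, here is a small sketch (the function name is mine, not from the thread) that reduces to the shortcut above when the trait variance and all error variances are 1:

```r
# General standardized loading: lambda * sd(trait) / sd(item),
# where var(item) = lambda^2 * var(trait) + var(error).
std_loading <- function(lambda, trait_var = 1, error_var = 1) {
  lambda * sqrt(trait_var) / sqrt(lambda^2 * trait_var + error_var)
}

# With trait_var = error_var = 1 this reduces to lambda / sqrt(lambda^2 + 1):
round(std_loading(c(1.1, 1.3, 1.5, 1.7)), 3)
#> [1] 0.740 0.793 0.832 0.862
```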

> 2) Is the same method used to make the thresholds standardized?

I think so. I rarely interpret the thresholds. Using the same example as in my previous post, the population thresholds (with X1 to X4 standardized) and the results from lavaan are pretty close.
# Check threshold
est_t <- subset(parameterEstimates(fit, standardized = TRUE), op == "|")
est_t

#>   lhs op rhs    est    se        z pvalue ci.lower ci.upper std.lv std.all
#> 6  X1  |  t1 -0.874 0.006 -135.318      0   -0.886   -0.861 -0.874  -0.874
#> 7  X2  |  t1 -0.357 0.006  -62.206      0   -0.368   -0.346 -0.357  -0.357
#> 8  X3  |  t1  0.774 0.006  123.703      0    0.762    0.787  0.774   0.774
#> 9  X4  |  t1  1.066 0.007  153.786      0    1.052    1.079  1.066   1.066
#>   std.nox
#> 6  -0.874
#> 7  -0.357
#> 8   0.774
#> 9   1.066
# Population thresholds with Xs standardized
threshold_original <- c(-1.3, -0.6, 1.4, 2.1)
sd_x <- sqrt(loading_original^2 + 1)
round(threshold_original / sd_x, 3)
#> [1] -0.874 -0.366  0.777  1.065
round(est_t$est, 3)
#> [1] -0.874 -0.357  0.774  1.066
(I only included the new lines. Please combine the code above with that in the previous post to reproduce the results.)

Regards,
Shu Fai Cheung (張樹輝)

AM

Mar 26, 2023, 7:18:48 AM
to lavaan
Thank you! I really appreciate your help!

Keith Markus

Mar 26, 2023, 9:22:45 AM
to lavaan
AM,
Maybe this was clear from the context, but I thought it might be helpful to point out why your thresholds were scaled differently. The parameterization used by lavaan scales the unobserved continuous indicators as z scores. You simulated each of them as a weighted sum of two z scores: loading*trait + error. Such a sum is not itself distributed as a z score, even when the two components are orthogonal: for X1 the variance is 1.1^2 + 1 = 2.21, not 1. Your simulated continuous indicator variable is a composite, and the variance of a composite is the sum of all the elements of the covariance matrix of its (weighted) components. This is why Shu Fai suggested looking at the standardized estimates.
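That point is easy to check with a quick simulation (my own sketch, not code from the thread):

```r
set.seed(1)
z1 <- rnorm(1e6)  # trait, a z score
z2 <- rnorm(1e6)  # error, a z score

# An unweighted sum of two orthogonal z scores has variance 1 + 1 = 2 ...
var(z1 + z2)        # close to 2

# ... and a weighted composite like X1 = 1.1*trait + error1 has
# variance 1.1^2 + 1 = 2.21, so it is not a z score either
var(1.1 * z1 + z2)  # close to 2.21
```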

Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/

AM

Mar 26, 2023, 2:46:46 PM
to lavaan
Thank you everyone for your help and advice. I have been going through this this afternoon and I am still not sure I fully understand the formulae. I am mostly still confused by the term sqrt(loading_original^2 + 1). Would someone be so kind as to write out how this term is derived by simplifying some more general expression? I would greatly appreciate it, thanks.

Terrence Jorgensen

Mar 30, 2023, 5:43:50 AM
to lavaan
To expand on what Keith wrote, you generated independent normal trait and errors with SD = variance = 1, so each indicator's total variance is lambda^2 + 1 (for X1, that is 1.1^2 + 1 = 2.21). But if an error had been correlated with the trait, the total variance would also include the covariance s_12: lambda^2 + 1 + 2*lambda*s_12. This is a linear-algebra result, if you want to look further into that.
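That linear-algebra result is a quadratic form, Var(w'x) = w' Sigma w; this is my own illustration, not code from the thread:

```r
# Composite variance as a quadratic form: Var(w'x) = w' Sigma w
w <- c(1.1, 1)              # weights: loading on the trait, 1 on the error

Sigma <- diag(2)            # trait and error uncorrelated, each with variance 1
drop(t(w) %*% Sigma %*% w)  # 1.1^2 + 1 = 2.21

# With a trait-error covariance s12 = 0.3, the covariance term enters twice:
Sigma2 <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
drop(t(w) %*% Sigma2 %*% w) # 1.1^2 + 1 + 2*1.1*0.3 = 2.87
```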

When you discretized them with thresholds c(-1.3, -0.6, 1.4, 2.1), those were applied to normal distributions with mean = 0 and sd = sqrt(lambda^2 + 1), which for X1 can be visualized like so:

curve(dnorm(x, mean = 0, sd = sqrt(1.1^2 + 1)), from = -6, to = 6)
abline(v = -1.3)


If you intended those to be standardized thresholds (z scores), then you would have to apply the z-score formula: subtract the mean (0) and divide by each indicator's own SD:

## Same graph for X1, only the x-axis scale changes:
zThresh <- c(-1.3, -0.6, 1.4, 2.1) / sqrt(c(1.1, 1.3, 1.5, 1.7)^2 + 1)
curve(dnorm(x, mean = 0, sd = 1), from = -3, to = 3)
abline(v = zThresh[1])

So the parameterization is arbitrary, but standardized thresholds would have to be applied to standard-normal variables (i.e., with total variance = 1, not lambda^2 + 1). That would require setting loadings to be < 1, and setting your error variances to 1 minus the explained variance (i.e., lambda * psi * lambda). Because your factor variance is 1, the explained variance reduces to lambda^2, so error variances should be 1 - lambda^2 to simulate standard-normal indicators. For example:

error1 <- rnorm(500, sd = sqrt(1 - loading1^2))


still confused by the term  sqrt(loading_original^2 + 1)  Would someone be so kind to write out how this term is created by simplifying from some more general term? 

That follows from the above. When you use the theta parameterization, you are setting (by default) the residual variances of the latent responses to 1. Because that is what your population values happened to be, using parameterization = "theta" would be consistent with your population. In this case, the explained variance is still lambda^2 and the residual (unexplained) variance is 1, so the total variance is their sum, lambda^2 + 1, and the SD is its square root.
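A minimal sketch of that check (assuming lavaan is installed; it re-simulates the data from earlier in the thread rather than reusing objects, with a large n so the estimates should land near the generating values):

```r
library(lavaan)

set.seed(870431)
n <- 50000
trait <- rnorm(n)
lams <- c(1.1, 1.3, 1.5, 1.7)
taus <- c(-1.3, -0.6, 1.4, 2.1)

# Dichotomize each indicator at its threshold, as in the original script
X <- mapply(function(l, t) as.numeric(l * trait + rnorm(n) > t), lams, taus)
colnames(X) <- paste0("X", 1:4)

model <- 'X =~ X1 + X2 + X3 + X4'
fit_theta <- cfa(model, data = as.data.frame(X), std.lv = TRUE,
                 ordered = colnames(X), parameterization = "theta")

# With residual variances fixed to 1 (matching the population), the
# loadings should come back near the generating values 1.1, 1.3, 1.5, 1.7
round(coef(fit_theta)[1:4], 2)
```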

A second option is to use the delta parameterization, but make the output in your original post comparable to your population parameters by applying a different identification constraint on the total variance of the latent responses y*. You can specify the scaling factors with the ~*~ operator; a scaling factor is 1 / SD. It is 1 by default under the delta parameterization, but you could make each one consistent with your population by fixing it to 1 / sqrt(lambda^2 + 1):

model <- '
X =~ X1 + X2 + X3 + X4
X1 ~*~ 0.6727*X1  # 1 / sqrt(1.1^2 + 1)
X2 ~*~ 0.6097*X2  # 1 / sqrt(1.3^2 + 1)
X3 ~*~ 0.5547*X3  # 1 / sqrt(1.5^2 + 1)
X4 ~*~ 0.5070*X4  # 1 / sqrt(1.7^2 + 1)
'


Terrence D. Jorgensen    (he, him, his)
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
http://www.uva.nl/profile/t.d.jorgensen

