SAM approach with categorical indicators


JAIME ANDRES GAVIRIA BEDOYA

Apr 3, 2024, 4:46:18 PM
to lavaan
Hi everyone.

I'm a PhD student.

After rereading the Rosseel and Loh (2022) paper about the SAM approach, I realized it was developed assuming continuous indicators of the latent variables. So I was wondering whether this approach would make sense when my items are categorical:

1) Run an SEM model with binary items and their corresponding thresholds, which are assumed to lie on a continuous scale.
2) Then, estimate the parameters with the SAM approach.

The issue I found is about the interpretation of the coefficients. Any suggestions?

Daniel Morillo Cuadrado

Apr 12, 2024, 9:10:02 AM
to lav...@googlegroups.com
Hi Jaime,

to my understanding it does make sense, but:

1) Thresholds are indeed on a continuous scale. What you estimate there, I assume, are thresholds and tetrachoric correlations (or polychoric ones if the indicators are polytomous rather than dichotomous), assuming latent utilities: each observed variable has a corresponding underlying "latent variable", assumed to be a continuous, normally distributed random variable, for which those thresholds and correlations are computed.
2) Here I take "the parameters" to mean "the structural parameters", obtained after applying an estimation method to the measurement model(s) derived from the previous thresholds/correlations.

About your question: It depends on what you mean by "the coefficients". In principle, the interpretation should not differ from the one you would make with any other estimation method within the SEM framework. Maybe I'm not fully getting what you mean, sorry.

In any case, your problem (using the SAM approach with binary items) could be a research question in itself.

Best,
Daniel

--
Daniel Morillo, Ph.D.
GitHub | ORCID


"The information contained herein is for the exclusive use of the intended person or entity. Its use, copying, downloading, distribution, modification, and/or total or partial reproduction is strictly prohibited without the express permission of Universidad de Antioquia, as its content may be confidential and/or contain privileged material. If you received this information in error, please contact the sender immediately and delete this material from your computer. Universidad de Antioquia is not responsible for the information contained in this communication; the person directly responsible is whoever signs it or its author."

UdeA

--
You received this message because you are subscribed to the Google Groups "lavaan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lavaan+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lavaan/7b83def8-a5d1-440d-933a-0de1cbe95a58n%40googlegroups.com.

Daniel Morillo Cuadrado

Apr 12, 2024, 9:10:09 AM
to lav...@googlegroups.com
Btw, Rosseel and Loh (2022) explicitly give you a clue about where to look for what you need:

For related work involving categorical indicators, or item response theory (IRT) models, see Hoshino and Bentler (2013), Lu et al. (2005), and Wang et al. (2019).

(At the end of the very first paragraph; sorry I can't provide a page number, I don't have access to the published version so I'm using a preprint)
--
Daniel Morillo, Ph.D.
GitHub | ORCID

Yves Rosseel

Apr 12, 2024, 9:13:36 AM
to lav...@googlegroups.com
Hello Jaime,

I can confirm that the SAM approach works with categorical indicators.
The interpretation of the coefficients does not change: they are on the
same scale as if you would have used the sem() function.

Yves.
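
A minimal sketch of the comparison described above, assuming a recent lavaan version in which sam() accepts the `ordered` argument (as used later in this thread). The simulated data, the helper `make_item()`, and all variable names are hypothetical and only serve to fit the same model with sem() and sam() and look at the structural coefficient side by side.

```r
library(lavaan)

set.seed(1)
n <- 2000

# Two latent variables; f2 is generated from f1 with slope 0.5,
# and the residual variance is chosen so that Var(f2) = 1.
f1 <- rnorm(n)
f2 <- 0.5 * f1 + rnorm(n, sd = sqrt(0.75))

# Hypothetical helper: a binary item obtained by dichotomizing the
# latent response y* = 0.8 * f + e at zero (Var(y*) = 0.64 + 0.36 = 1).
make_item <- function(f) as.integer(0.8 * f + rnorm(length(f), sd = 0.6) > 0)

dat <- data.frame(
  x1 = make_item(f1), x2 = make_item(f1), x3 = make_item(f1), x4 = make_item(f1),
  y1 = make_item(f2), y2 = make_item(f2), y3 = make_item(f2), y4 = make_item(f2)
)
items <- names(dat)

model <- '
  f1 =~ x1 + x2 + x3 + x4
  f2 =~ y1 + y2 + y3 + y4
  f2 ~ f1
'

# Same model, same identification defaults, two estimation strategies
fit.sem <- sem(model, data = dat, ordered = items)
fit.sam <- sam(model, data = dat, ordered = items, sam.method = "local")

# Structural coefficient from each fit; both are on the same scale
b.sem <- coef(fit.sem)["f2~f1"]
b.sam <- coef(fit.sam)["f2~f1"]
print(c(sem = unname(b.sem), sam = unname(b.sam)))
```

The two estimates will not be identical, because the estimators differ, but they can be read and compared in the same way.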

JAIME ANDRES GAVIRIA BEDOYA

Apr 22, 2024, 9:24:05 PM
to lavaan
Thank you very much Dr Rosseel

I appreciate your answer.

Jacob H

Dec 6, 2024, 1:32:43 AM
to lavaan
Hi,
If categorical variables now work with SAM, how would one obtain the standard errors? From my understanding, the FSR, Bartlett, and ML approaches only work for continuous items. Do categorical variables only work for point estimates?
Best,
Jacob

yros...@gmail.com

Dec 9, 2024, 6:10:00 AM
to lav...@googlegroups.com
The sam() function supports latent variables with categorical indicators. (Not observed categorical variables that are part of the structural model). And it does produce two-step corrected standard errors. The logic behind the computation of those standard errors is the same as what we do when we have latent variables with numeric indicators (See Appendix C of the SAM paper). Only the computation of the first-step variance-covariance matrix ('\Sigma_{11}' in equation C1) is different (taking the categorical nature of the indicators into account), but everything else is the same.

Yves.
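
For orientation, the generic form of a two-step (pseudo-ML) variance correction is sketched below. The notation is an assumption on my part and should be checked against equation C1 of the SAM paper: \theta_1 collects the first-step (measurement) parameters, \theta_2 the structural parameters, I_{22} is the information block for \theta_2, I_{21} = I_{12}' is the cross block, and \Sigma_{11} is the first-step variance-covariance matrix mentioned above.

```latex
% Generic two-step corrected covariance matrix of the structural parameters.
% Only \Sigma_{11} changes when the indicators are categorical.
\widehat{\mathrm{Var}}(\hat{\theta}_2)
  = I_{22}^{-1}
  + I_{22}^{-1} \, I_{21} \, \Sigma_{11} \, I_{12} \, I_{22}^{-1}
```

The first term is the naive second-step variance; the second term adds the uncertainty carried over from the first step.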

Jacob H

Dec 10, 2024, 1:05:49 AM
to lavaan
Thank you Yves, going back to your paper ('\Sigma_{11}') I see this now.

Also, is the mapping matrix ML the correct specification for the SAM model with categorical items? I see the Bartlett scores, FSR, and the Croon correction in the paper's appendix. I also see how you connect local SAM with the ML mapping matrix to Croon's correction. However, I am unsure how that carries over to categorical items.
Also note: I get some warnings running simulations with binary variables in the SAM using OLS as the structural argument struc.args = list(estimator = "OLS")
for instance example 1 code in your paper:
# using the sam() function -- local SAM
fit.lsam <- sam(model, data = Data, sam.method = "local", estimator = "ML", struc.args = list(estimator = "OLS"))
Warning messages:
1: lavaan->lav_lavaan_step11_estoptim(): Model estimation FAILED! Returning starting values.
2: lavaan->lav_lavaan_step11_estoptim(): Model estimation FAILED! Returning starting values.
3: lavaan->lav_lavaan_step15_baseline(): estimation of the baseline model failed.

I am guessing this is nothing to be worried about? Is this just overriding lavaan to do OLS instead of ML estimation?

You mentioned you are in the process of writing a paper about SAM with categorical—I am looking forward to reading it.
Best,
Jacob

Yves Rosseel

Jan 3, 2025, 5:55:20 AM
to lav...@googlegroups.com
(a very late reply)

On 12/10/24 04:13, Jacob H wrote:
> Also, is the mapping matrix ML the correct specification for the SAM
> model with categorical items?

That is a very good question. It is certainly a convenient choice, and
perhaps closer in spirit to the IRT/ML approach, than the WLSMV approach
that we use for the CFA blocks in the first stage.

I can tell you that it seems to work very well in simulation studies, in
the sense that the resulting point estimates of the structural
parameters seem to be consistent.

The general idea is that once we have estimated the parameters in LAMBDA
and THETA (using, say, WLSMV), we don't need to worry any longer about
the categorical nature of the indicators.

But that only works if everything in the structural part can be
considered continuous. If the structural part contains some categorical
endogenous variables, then the current implementation will fail.

> Also note: I get some warnings running simulations with binary variables
> in the SAM using OLS as the structural argument struc.args =
> list(estimator = "OLS")

"OLS" is not supported (but that should trigger an error/warning). But
estimator = "ULS" does work.

> You mentioned you are in the process of writing a paper about SAM with
> categorical

I think I already have too many plans for 2025. I hope this one makes
it, but I cannot make any promises.

Yves.
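
A small sketch of the `struc.args = list(estimator = "ULS")` variant mentioned above. It uses the PoliticalDemocracy data that ships with lavaan and a deliberately tiny model of my own choosing (not the paper's example 1), so treat it as an illustration of the argument rather than a reproduction of the simulation in question.

```r
library(lavaan)

model <- '
  # measurement part
  ind60 =~ x1 + x2 + x3
  dem60 =~ y1 + y2 + y3 + y4
  # structural part
  dem60 ~ ind60
'

# Local SAM with the default estimator for the structural part
fit.ml <- sam(model, data = PoliticalDemocracy, sam.method = "local")

# Local SAM with ULS for the structural part ("OLS" is not a supported value)
fit.uls <- sam(model, data = PoliticalDemocracy, sam.method = "local",
               struc.args = list(estimator = "ULS"))

# Compare the structural coefficient across the two fits
b.ml  <- coef(fit.ml)["dem60~ind60"]
b.uls <- coef(fit.uls)["dem60~ind60"]
print(c(ML = unname(b.ml), ULS = unname(b.uls)))
```

With a structural part this small the two estimators should give very similar point estimates; larger, overidentified structural models can differ more.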

Jacob H

Mar 3, 2025, 6:30:46 AM
to lavaan
Hi Yves,

Thanks for the reply, and sorry for my own late response; I only recently came back to SAM estimation for categorical variables.


> The general idea is that once we have estimated the parameters in LAMBDA and THETA (using, say, WLSMV), we don't need to worry any longer about the categorical nature of the indicators

I don't understand why this is the case when the mapping matrix is applied to the categorical measures: M[y-v-e]. I thought the loadings in Lambda were estimated from the latent-response measurement model y* = \lambda \eta + error, where y* is the unobserved continuous variable underlying the categorical y.
If the mapping matrix is then mapped onto the categorical y then this would cause issues no? Are you mapping it onto the unobserved y*?

Best,
Jacob

Yves Rosseel

Mar 3, 2025, 6:34:16 AM
to lav...@googlegroups.com
> If the mapping matrix is then mapped onto the categorical y then this
> would cause issues no? Are you mapping it onto the unobserved y*?

We are mapping onto the unobserved y*. In the formula

M [S - \Theta] M'

the 'S' is the matrix of polychoric (or tetrachoric) correlations. It is
an estimate of Var(y*).

Yves.
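
The formula M [S - \Theta] M' can be illustrated numerically in base R. This is a toy sketch under stated assumptions: the loadings are made up, the residual variances are set so that Var(y*) = 1 (a delta-type scaling, my assumption), S is the model-implied matrix standing in for an estimated polychoric correlation matrix, and M is the ML mapping matrix (Lambda' Theta^-1 Lambda)^-1 Lambda' Theta^-1 from the SAM paper.

```r
# Made-up standardized loadings for four ordinal indicators of one factor
lambda <- c(0.8, 0.7, 0.6, 0.5)
Lambda <- matrix(lambda, ncol = 1)
Psi    <- matrix(1, 1, 1)        # Var(eta) used to generate S
Theta  <- diag(1 - lambda^2)     # residual variances so that Var(y*) = 1

# Stand-in for the polychoric correlation matrix: model-implied Var(y*)
S <- Lambda %*% Psi %*% t(Lambda) + Theta

# ML mapping matrix: M = (Lambda' Theta^-1 Lambda)^-1 Lambda' Theta^-1
Theta.inv <- solve(Theta)
M <- solve(t(Lambda) %*% Theta.inv %*% Lambda) %*% t(Lambda) %*% Theta.inv

# Because M %*% Lambda is the identity, this recovers Psi
Var.eta <- M %*% (S - Theta) %*% t(M)
print(Var.eta)
```

Nothing in this computation touches the observed categories: once Lambda, Theta, and the polychoric S are available, everything happens on the y* scale.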

Jacob H

Mar 3, 2025, 2:59:32 PM
to lavaan
Thank you! That makes sense!

> But that only works if everything in the structural part can be considered continuous. If the structural part contains some categorical endogenous variables, then the current implementation will fail.


Do you mean if I have all categorical measures for my factors and I want to run a structural equation:

wage ~ factor + continuous observed variables  (such as family income) + dummy variables (gender, location fixed effects)

Are you saying that this will fail? Is this because it will fail when we add the exogenous variables to S and take the polychoric correlations?

This may be a stupid question, but couldn't one compute a heterogeneous correlation matrix, consisting of Pearson correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables?


Thank you so much for responding

Yves Rosseel

Mar 5, 2025, 4:11:22 AM
to lav...@googlegroups.com
> Do you mean if I have all categorical measures for my factors and I want
> to run a structural equation:
>
> wage ~ factor + continuous observed variables  (such as family income) +
> dummy variables (gender, location fixed effects)
>
> Are you saying that this will fail?

This should work, as long as 'wage' is continuous. It will (currently)
fail if you have an observed but categorical *dependent* variable in the
structural part.

But for the dummy variables, you may have to switch on the 'fixed.x =
TRUE' option for the structural part:

fit <- sam( ...., struc.args = list(fixed.x = TRUE), ...)

Because it is currently set to FALSE per default.

Yves.

Jacob H

Mar 5, 2025, 10:39:59 PM
to lavaan
Hi Yves,

Thank you for the reply again. I am simulating some data and getting errors with your suggestions. Below are the simulated data I use and the errors/warnings I'm getting. I wanted to bring this to your attention; it could be entirely my error as well.

------------------- simulated data------------------------------
# Load necessary libraries
library(MASS)
library(lavaan)

set.seed(12345)
n <- 20000

# 1. Single latent factor
F1 <- rnorm(n, mean=0, sd=1)

# 2. Exogenous variables
X1 <- rnorm(n, mean=50, sd=10)
X2 <- rnorm(n, mean=100, sd=15)
D1 <- rbinom(n, size=1, prob=0.5)
D2 <- rbinom(n, size=1, prob=0.5)
D3 <- rbinom(n, size=1, prob=0.5)

# Function to create one ordinal indicator
generate_ordered <- function(F, lambda, tau1, tau2) {
  eta <- lambda * F
  prob1 <- pnorm(tau1 - eta)
  prob2 <- pnorm(tau2 - eta) - prob1
  prob3 <- 1 - pnorm(tau2 - eta)
  probs <- cbind(prob1, prob2, prob3)
 
  apply(probs, 1, function(p) sample(1:3, size=1, prob=p))
}

# Define loadings and thresholds for each indicator
lambda <- c(1.0, 0.8, 0.6, 0.9)
tau1 <- c(-0.5, -0.4, -0.3, -0.2)
tau2 <- c(0.5, 0.6, 0.7, 0.8)

# Generate exactly four ordinal indicators
Y1 <- generate_ordered(F1, lambda[1], tau1[1], tau2[1])
Y2 <- generate_ordered(F1, lambda[2], tau1[2], tau2[2])
Y3 <- generate_ordered(F1, lambda[3], tau1[3], tau2[3])
Y4 <- generate_ordered(F1, lambda[4], tau1[4], tau2[4])

# 3. Outcome "wage"
intercept <- 10
beta_F  <- 2.0
beta_X1 <- 0.5
beta_X2 <- -0.3
beta_D1 <- 1.0
beta_D2 <- -1.5
beta_D3 <- 0.8

error <- rnorm(n, mean=0, sd=5)
wage <- intercept + beta_F*F1 + beta_X1*X1 + beta_X2*X2 +
  beta_D1*D1 + beta_D2*D2 + beta_D3*D3 + error

# Combine into data frame
simulated_data <- data.frame(
  wage, X1, X2, D1, D2, D3,
  Y1, Y2, Y3, Y4
)
simulated_data[,c("Y1", "Y2", "Y3", "Y4")] <- simulated_data[,c("Y1", "Y2", "Y3", "Y4")] - 1

head(simulated_data)

------------------- Estimating SAM with Dummy Variables------------------------------
model <- '
  # Measurement model (ordinal indicators loading on F)
  F =~ Y1 + Y2 + Y3 + Y4
 
  wage ~ F + X1 + X2 + D1 + D2 + D3
 
  '

fit.lsam <- sam(model, data = simulated_data, sam.method = "local", mm.args = list(estimator = "DWLS"), ordered = c("Y1", "Y2", "Y3", "Y4"))
Warning: lavaan->muthen1984(): trouble constructing W matrix; used generalized inverse for A11 submatrix


------------------- Estimating SAM with Dummy Variables fixed.x=T------------------------------
model <- '
  # Measurement model (ordinal indicators loading on F)
  F =~ Y1 + Y2 + Y3 + Y4
 
  wage ~ F + X1 + X2 + D1 + D2 + D3
 
  '

fit.lsam <- sam(model, data = simulated_data, sam.method = "local", struc.args = list(fixed.x = TRUE), mm.args = list(estimator = "DWLS"), ordered = c("Y1", "Y2", "Y3", "Y4"))

Error: lavaan->lav_samplestats_from_moments(): the (D)WLS estimator is only available with full data or with a user-provided WLS.V

------------------- Estimating SAM without Dummy Variables------------------------------

model <- '
  # Measurement model (ordinal indicators loading on F)
  F =~ Y1 + Y2 + Y3 + Y4
 
  wage ~ F + X1 + X2
 
  '

fit.lsam <- sam(model, data = simulated_data, sam.method = "local", mm.args = list(estimator = "DWLS"), ordered = c("Y1", "Y2", "Y3", "Y4"))

RESULT: Works fine no errors

Side note: recovers the parameters well 


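If I estimate "S" using a heterogeneous correlation matrix, would this work? For example, to deal with the dummy variables:
library(polycor)  # provides hetcor()
N <- nrow(Data)
# hetcor() returns a list; the matrix itself is in $correlations.
# Only factor columns are treated as ordinal, so ordinal items must be factors.
S <- hetcor(Data)$correlations * (N - 1L)/N # adjust for biased sample covariance matrix
# M (mapping matrix) and Theta (residual variances) come from the measurement step
Var.eta <- M %*% (S - Theta) %*% t(M)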
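
As a side note on computing such a mixed matrix: lavaan's own lavCor() returns Pearson, polyserial, and polychoric correlations in one matrix when the ordinal columns are named in `ordered`. The sketch below uses hypothetical simulated variables (x, o1, o2) and only shows how to obtain the matrix; it does not show that plugging it into the local SAM formula resolves the problem with the dummy covariates.

```r
library(lavaan)

set.seed(42)
n <- 1000
z <- rnorm(n)  # common source of correlation among the three variables

dat <- data.frame(
  # continuous variable
  x  = 0.6 * z + rnorm(n, sd = 0.8),
  # three-category ordinal variable
  o1 = as.integer(cut(0.7 * z + rnorm(n), c(-Inf, -0.5, 0.5, Inf))),
  # binary variable
  o2 = as.integer(cut(0.5 * z + rnorm(n), c(-Inf, 0, Inf)))
)

# Pearson for numeric pairs, polyserial for numeric-ordinal pairs,
# polychoric (tetrachoric) for ordinal-ordinal pairs
R <- lavCor(dat, ordered = c("o1", "o2"))
print(R)
```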

Yves Rosseel

Mar 15, 2025, 3:22:24 PM
to lav...@googlegroups.com
Hello Jacob,

Sorry for the long delay. I did figure out why it doesn't work (yet) for
your example. It has to do with the covariates. The problem is that
'conditional.x = TRUE' (which is what we need in this case, as you have
binary covariates) does not work yet.

I need to fix this, but that may take a few more weeks...

Yves.