Interpretation of latent variable scales + factor score computation by lavPredict()


Laurens Reumers

Sep 29, 2022, 10:10:10 AM
to lavaan

Hi everyone!

After searching the internet and the lavaan group for a while, I had to conclude I wasn't going to find the answer to this question. I can imagine that other people have had, or will have, the same one. It's a bit of a lengthy text, so my apologies for that; I have tried to highlight the key bits of the issue as well as I could. My two-part question in short: a) how can the (non-standardised) scales of latent variables (with the factor loading of the first indicator fixed to 1) be interpreted, and b) what information does lavPredict() use to compute factor scores?

I have run a longitudinal SEM (a cross-lagged panel model, N=4389) with a mix of latent variables (with ordinal indicators) and observed continuous variables. I have a running model without errors (and with good fit) in lavaan, with equality constraints on factor loadings, intercepts, thresholds and regression coefficients. Now I want to take the coefficients from the model and use them to make some very simple simulation runs.
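
For the simulation, my rough plan is to extract the estimated parameter matrices from the fitted object. A minimal sketch of what I mean (using fit.model.1 from the example code further below):

est <- lavInspect(fit.model.1, what = "est")
est$lambda   # factor loadings
est$beta     # regression coefficients among the latent variables
est$psi      # latent (residual) (co)variances
est$tau      # thresholds of the ordinal indicators

# Or the full parameter table, with the estimates in the 'est' column:
pe <- parameterEstimates(fit.model.1)
head(pe)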

However, I found that I don't understand two important things. The first is how the scales of the latent variables should be interpreted. I fixed the factor loadings of the first indicators at 1, which should mean that this provides the latent variables with their scales; but I have found that they do not have the same scale as this first indicator. For example, the (five) indicators of the first latent variable are all on a scale from 0 to 5, with an average somewhere between about 3 and 4. The latent variable, estimated with lavPredict() (method="EBM", type="lv"), seems to have a minimum of about -3 and a maximum of about 3.5, with a mean of about 0.28. This is quite stable over different time points.
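
For context, this is roughly how I looked at those ranges (a sketch, using the objects from the example code further below):

fs <- lavPredict(fit.model.1, type = "lv", method = "EBM")
apply(fs, 2, summary)            # min/mean/max per latent variable
apply(fs, 2, sd, na.rm = TRUE)   # spread per latent variable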

How can the scale of a latent variable then be interpreted when it is determined by the first indicator, which is in this case ordinal? It should be relative to this first indicator somehow, but I can't quite work out in what way exactly. Is it simply that, because the latent variable is continuous, it does not fit exactly within that six-point scale, and parts of the tails of its distribution land outside it? And how are the factor scores interpreted: what are they relative to? If an individual in my data scores a -1 on the LV, for example, relative to what should I interpret that -1?

The second thing I have a question about concerns the lavPredict() function and how it works. I have ordinal indicators, so I have to use either ML or EBM. They give me vastly different output, however. I decided the ranges given by EBM made much more sense, but I don't know whether that is a valid conclusion. ML gives highly fluctuating estimates (-680 to 435, with mean 10.34 in the first year; -11 to 54, mean 0.58 in the last year) that don't really seem to make any sense. Is there any apparent reason why this would be the case? And is simply using EBM (as this seems to work) a good way to go?
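
To illustrate the difference, this is roughly how I compared the two methods (a sketch):

fs.ml  <- lavPredict(fit.model.1, type = "lv", method = "ML")
fs.ebm <- lavPredict(fit.model.1, type = "lv", method = "EBM")
round(cbind(ML  = colMeans(fs.ml,  na.rm = TRUE),
            EBM = colMeans(fs.ebm, na.rm = TRUE)), 2)
plot(fs.ml[, "LV2015"], fs.ebm[, "LV2015"],
     xlab = "ML factor score", ylab = "EBM factor score")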

Another thing is that EBM computes latent variable scores for all cases, even for cases that have no data at all on the indicators of a particular LV. That made little sense to me, and made me wonder how EBM actually produces its factor scores. Is any of the original data on the observed indicators used to obtain the factor scores? Or does lavPredict() work in some entirely different way, using just the fitted model?
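
This is how I noticed it (a sketch; ind.2015 is just the set of indicators of the first LV at the first time point):

ind.2015 <- c("ov1_2015", "ov2_2015", "ov3_2015", "ov4_2015", "ov5_2015")
all.na <- rowSums(!is.na(dataset[, ind.2015])) == 0
table(all.na)                   # cases with no observed data on these indicators
fs <- lavPredict(fit.model.1, type = "lv", method = "EBM")
summary(fs[all.na, "LV2015"])   # EBM still returns scores for these cases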


Thanks in advance for any answers to my questions; they have been bugging me for a while now. Again, apologies for the lengthy text.


Possibly superfluous example code (sorry if I made any typos; I did not run this exact one, of course):

model.1 <- '
LV2015 =~ 1*ov1_2015 + lmbd1*ov2_2015 + lmbd2*ov3_2015 + lmbd3*ov4_2015 + lmbd4*ov5_2015
LV2016 =~ 1*ov1_2016 + lmbd1*ov2_2016 + lmbd2*ov3_2016 + lmbd3*ov4_2016 + lmbd4*ov5_2016
LV2017 =~ 1*ov1_2017 + lmbd1*ov2_2017 + lmbd2*ov3_2017 + lmbd3*ov4_2017 + lmbd4*ov5_2017
LV2018 =~ 1*ov1_2018 + lmbd1*ov2_2018 + lmbd2*ov3_2018 + lmbd3*ov4_2018 + lmbd4*ov5_2018
LV2019 =~ 1*ov1_2019 + lmbd1*ov2_2019 + lmbd2*ov3_2019 + lmbd3*ov4_2019 + lmbd4*ov5_2019

#####################################
# Parts omitted from this example code (insert about here):
#  - Second latent variable
#  - Covariances between same indicators across time
#  - Equality constraints on LV intercepts and thresholds
#  - Structural model
#####################################
'

fit.model.1 <- sem(model.1, data = dataset, estimator = "WLSMV",
                   missing = "pairwise", parameterization = "delta",
                   ordered = c("ov1_2015", "ov1_2016", "ov1_2017", "ov1_2018", "ov1_2019",
                               "ov2_2015", "ov2_2016", "ov2_2017", "ov2_2018", "ov2_2019",
                               "ov3_2015", "ov3_2016", "ov3_2017", "ov3_2018", "ov3_2019",
                               "ov4_2015", "ov4_2016", "ov4_2017", "ov4_2018", "ov4_2019",
                               "ov5_2015", "ov5_2016", "ov5_2017", "ov5_2018", "ov5_2019"))

factor.scores.1 <- lavPredict(fit.model.1, type = "lv", method = "EBM")
factor.scores.1
summary(factor.scores.1)

Terrence Jorgensen

Oct 8, 2022, 4:18:14 AM
to lavaan

how can the (non-standardised) scales of latent variables (with the factor loading of the first indicator fixed to 1) be interpreted?

 This is an excellent explanatory paper:


How can the scale of a latent variable then be interpreted when it is determined by the first indicator, which is in this case ordinal?
 
Ah, then the scales of the indicators (which are latent responses underlying the observed discrete categories) are just as arbitrary as the scales of the latent common factors.  They are determined (by default) by setting the latent responses to have intercepts = 0 and marginal variances = 1 (the delta parameterization), although setting parameterization = "theta" instead sets residual variances = 1.  Alternatively, you could instead fix the first 2 thresholds to 0 and 1 (like LISREL does) and freely estimate the latent intercepts/variances.  The scale is still arbitrary, because we haven't used any scale of measurement to observe the unobserved variables, but the interpretation of "1 unit" would change to the distance between the first 2 thresholds.
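
In lavaan syntax, the two alternatives look roughly like this. A minimal sketch on a hypothetical one-factor model (items y1-y3 in a data frame toydata, each with at least 3 categories), not your actual model:

library(lavaan)

# (a) theta parameterization: latent-response residual variances fixed to 1
toy <- ' f =~ 1*y1 + y2 + y3 '
fit.theta <- sem(toy, data = toydata, ordered = c("y1", "y2", "y3"),
                 estimator = "WLSMV", parameterization = "theta")

# (b) LISREL-style scaling: fix the first 2 thresholds of each item,
#     then free the latent-response intercepts and residual variances
toy.lisrel <- '
  f =~ 1*y1 + y2 + y3
  y1 | 0*t1 + 1*t2
  y2 | 0*t1 + 1*t2
  y3 | 0*t1 + 1*t2
  y1 ~ NA*1        # free latent-response intercepts
  y2 ~ NA*1
  y3 ~ NA*1
  y1 ~~ NA*y1      # free latent-response residual variances
  y2 ~~ NA*y2
  y3 ~~ NA*y3
'
fit.lisrel <- sem(toy.lisrel, data = toydata, ordered = c("y1", "y2", "y3"),
                  estimator = "WLSMV", parameterization = "theta")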

what information does lavPredict() use to compute factor scores?  Is any of the original data on the observed indicators used to obtain the factor scores?

lavPredict() returns the factor-score estimates, attempting to respect the estimated factor means/(co)variances.  But no method is perfect; they all return different factor-score estimates (even with continuous indicators).  With ordinal indicators, fixing the first factor loading links the common-factor scale to the latent-response scale, which is determined by its own identification constraints.  They are arbitrary, so I would recommend just setting std.lv = TRUE to have factor scores that are (approximately) z scores.
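
For example, a minimal sketch (this assumes you drop the 1* constraints on the first loadings in model.1, since explicitly fixed values take precedence over std.lv):

# all 25 ordinal indicator names, as in the example code
ov.names <- c(outer(paste0("ov", 1:5, "_"), 2015:2019, paste0))
fit.std <- sem(model.1, data = dataset, estimator = "WLSMV",
               missing = "pairwise", parameterization = "delta",
               std.lv = TRUE, ordered = ov.names)
fs.std <- lavPredict(fit.std, type = "lv", method = "EBM")
apply(fs.std, 2, summary)   # scores now approximately on a z-score metric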

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
