Manifest Exogenous Variables: Modeling as Random or Fixed?


Justin Paulsen

unread,
May 16, 2023, 4:45:40 PM5/16/23
to lavaan

I initially sent the message below to Yves, and he suggested I post it in the discussion group.

Using lavaan, I am reproducing SEM results produced through Mplus. The model has a single latent outcome variable predicted by a series of latent and manifest (primarily demographic) variables in a cross-sectional dataset. It appears that the primary difference between the models is that by default Mplus treats manifest exogenous variables as random variables and the syntax I’ve used in lavaan treats them as fixed variables. I am wondering what the implications of that modeling decision are. I’d be interested in feedback on the ideas behind lavaan/Mplus defaulting to fixed or random manifest exogenous variables and what an analyst should consider when deciding how to model manifest exogenous variables. Thank you!

Yves' response included this: "The advantage of treating exogenous covariates as fixed is that we do not need to assume normality for them. Therefore, they can be binary variables (i.e., dummy variables), just like in regression. In addition, you need fewer 'free' parameters, as we do not estimate the (co)variances of those covariates.

The disadvantage is that you cannot handle missing values in those 'fixed' covariates."
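
A minimal sketch of these points (hypothetical data and variable names, not from the thread): the same regression fitted twice in lavaan, with a binary dummy covariate, differs only in whether the covariate (co)variances are counted as free parameters.

```r
# Hypothetical illustration: a binary dummy covariate treated as
# fixed vs. random. With fixed.x = TRUE, no normality assumption is
# made about d, and its (co)variances are not free parameters.
library(lavaan)
set.seed(1234)
n <- 200
d <- rbinom(n, 1, .5)                  # binary dummy covariate
x <- rnorm(n)                          # continuous covariate
y <- .3 * d + .5 * x + rnorm(n)
dat <- data.frame(y, d, x)
fit_fixed <- sem("y ~ d + x", dat, fixed.x = TRUE)
fit_free  <- sem("y ~ d + x", dat, fixed.x = FALSE)
# fit_fixed: 3 free parameters (two slopes, one residual variance)
# fit_free:  6 free parameters (adds var(d), var(x), cov(d, x))
```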

I'd be interested in any additional thoughts the community might have.

Shu Fai Cheung (張樹輝)

unread,
May 16, 2023, 10:15:17 PM5/16/23
to lavaan
I am also interested in this topic, partly because some programs do not treat the manifest exogenous variables as fixed. For example, AMOS does not do this automatically; we need to do it manually. Moreover, treating x-variables as fixed or random can affect some aspects of the free parameters present in both settings, such as the SEs and CIs of the standardized regression coefficients computed by the delta method.

But I would like to ask about one issue regarding Mplus first. As far as I know, it also treats manifest exogenous variables as fixed, just like lavaan. For example, for the following model, the variances and covariances of x1 and x2 are not free parameters, as in lavaan with default settings:

TITLE: Linear regression
DATA: FILE IS data.dat;
VARIABLE: NAMES ARE x1 x2 y;
MODEL: y ON x1 x2;

May I know why Mplus treats your manifest exogenous variables as random?

-- Shu Fai

Shu Fai Cheung

unread,
May 17, 2023, 4:58:25 AM5/17/23
to lav...@googlegroups.com
One implication of fixed.x = TRUE/FALSE concerns the standard errors of the standardized solution. Given that the discussion involves Mplus, I would like to share something I've just found about Mplus.

First, let me use lavaan for illustration:

This is a 2-IV-1-DV toy dataset. The three variables are multivariate normal in the population. So normality is not an issue here.
library(lavaan)
#> This is lavaan 0.6-15
#> lavaan is FREE software! Please report any bugs.
set.seed(857105)
n <- 100
b1 <- .11
b2 <- .60
rho <- .40
x1 <- rnorm(n)
x2 <- rho * x1 + rnorm(n, 0, sqrt(1 - rho^2))
sd_e <- sqrt(1 - (b1^2 + b2^2))
y <- b1 * x1 + b2 * x2 + rnorm(n, 0, sd_e)
y2 <- rnorm(n)
x1 <- 2 * x1 + 10
x2 <- 3 * x2 + 20
y <- 40 + y
dat <- data.frame(x1, x2, y)
head(dat)
#>          x1       x2        y
#> 1  9.724924 21.03961 40.58337
#> 2 10.639568 20.36350 41.27968
#> 3  8.683245 20.94754 40.10021
#> 4 14.726216 26.91184 41.56065
#> 5 11.179734 16.45349 40.21687
#> 6  8.124515 22.32706 39.50138
I fitted the model twice, with fixed.x TRUE in one model and FALSE in another. I used a saturated regression model (the true model) so we can focus on parameter estimation without worrying about misspecification:
mod <- "y ~ x1 + x2"
fit_fixedx_true <- sem(mod, dat, fixed.x = TRUE)
fit_fixedx_false <- sem(mod, dat, fixed.x = FALSE)
parameterEstimates(fit_fixedx_true)
#>   lhs op rhs   est    se     z pvalue ci.lower ci.upper
#> 1   y  ~  x1 0.099 0.051 1.917  0.055   -0.002    0.199
#> 2   y  ~  x2 0.188 0.034 5.557  0.000    0.122    0.254
#> 3   y ~~   y 0.814 0.115 7.071  0.000    0.588    1.040
#> 4  x1 ~~  x1 4.320 0.000    NA     NA    4.320    4.320
#> 5  x1 ~~  x2 3.520 0.000    NA     NA    3.520    3.520
#> 6  x2 ~~  x2 9.986 0.000    NA     NA    9.986    9.986
parameterEstimates(fit_fixedx_false)
#>   lhs op rhs   est    se     z pvalue ci.lower ci.upper
#> 1   y  ~  x1 0.099 0.051 1.917  0.055   -0.002    0.199
#> 2   y  ~  x2 0.188 0.034 5.557  0.000    0.122    0.254
#> 3   y ~~   y 0.814 0.115 7.071  0.000    0.588    1.040
#> 4  x1 ~~  x1 4.320 0.611 7.071  0.000    3.122    5.517
#> 5  x1 ~~  x2 3.520 0.745 4.724  0.000    2.059    4.980
#> 6  x2 ~~  x2 9.986 1.412 7.071  0.000    7.218   12.754
standardizedSolution(fit_fixedx_true)
#>   lhs op rhs est.std    se     z pvalue ci.lower ci.upper
#> 1   y  ~  x1   0.177 0.091 1.946  0.052   -0.001    0.355
#> 2   y  ~  x2   0.513 0.080 6.395  0.000    0.356    0.670
#> 3   y ~~   y   0.608 0.068 8.905  0.000    0.474    0.742
#> 4  x1 ~~  x1   1.000 0.000    NA     NA    1.000    1.000
#> 5  x1 ~~  x2   0.536 0.000    NA     NA    0.536    0.536
#> 6  x2 ~~  x2   1.000 0.000    NA     NA    1.000    1.000
standardizedSolution(fit_fixedx_false)
#>   lhs op rhs est.std    se     z pvalue ci.lower ci.upper
#> 1   y  ~  x1   0.177 0.092 1.932  0.053   -0.003    0.357
#> 2   y  ~  x2   0.513 0.084 6.140  0.000    0.349    0.677
#> 3   y ~~   y   0.608 0.076 7.985  0.000    0.459    0.757
#> 4  x1 ~~  x1   1.000 0.000    NA     NA    1.000    1.000
#> 5  x1 ~~  x2   0.536 0.071 7.519  0.000    0.396    0.676
#> 6  x2 ~~  x2   1.000 0.000    NA     NA    1.000    1.000
fixed.x = TRUE or FALSE does not affect the point estimates, SEs, and CIs for y ~ x1 and y ~ x2 (and also y ~~ y), as expected.

However, the choice of fixed.x does affect the delta-method SEs and CIs for the standardized coefficients, though only slightly in this example. This is also a known phenomenon. When fixed.x = FALSE, the estimated sampling variances and covariances of x1 and x2 are also used in computing the SEs of the standardized y ~ x1 and y ~ x2, hence the difference.

Nothing new at this point. However, when I tried to reproduce the results in Mplus (the reverse of what Justin did), something strange (to me) happened.

I saved the data to a text file (data_2ivs.dat) and ran the following in Mplus:

TITLE: Linear regression
DATA: FILE IS data_2ivs.dat;
VARIABLE: NAMES ARE x1 x2 y1;
ANALYSIS: MODEL = NOMEANSTRUCTURE;
          INFORMATION = EXPECTED;
MODEL: y1 ON x1 x2;
OUTPUT: STDYX CINTERVAL;

I added NOMEANSTRUCTURE and INFORMATION = EXPECTED just to make sure that the results are comparable to those in lavaan (no mean structure in lavaan, by default).

I do not know how to easily simulate fixed.x = FALSE in Mplus. Therefore, the following results (part of them) are based on the default (treating manifest exogenous variables as fixed, as far as I know):

Number of Free Parameters                        3
                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value
 Y1       ON
    X1                 0.099      0.051      1.917      0.055
    X2                 0.188      0.034      5.557      0.000

The number of free parameters is 3, identical to the lavaan model. The parameter estimates of the two regression coefficients are also identical. The output does not have the estimates of x1 and x2 variances and covariance because they are not model parameters. This confirms that, like lavaan, Mplus treats the observed x-variables as fixed, at least for this model.

These are the results for the standardized solution (requested by STDYX):

STDYX Standardization
                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value
 Y1       ON
    X1                 0.177      0.092      1.932      0.053
    X2                 0.513      0.084      6.140      0.000

This part is strange. The estimates are identical to those in lavaan (est.std). However, the standard errors are identical to the lavaan results with fixed.x = FALSE.

I am not familiar with Mplus. Did I read the results correctly? Is this behavior intended?

Regards,
Shu Fai Cheung (張樹輝)



Edward Rigdon

unread,
May 17, 2023, 10:44:27 AM5/17/23
to lav...@googlegroups.com
Perhaps I misunderstand, but fixed.x implies to me that x values are selected rather than sampled. So x variables that are normally distributed are, by definition, not "fixed." I think the effect should be similar to that of stratified sampling, where the stratification reduces overall sampling variance by ensuring that the stratification variables are represented consistent with the population.
If that is correct, then you might see the real difference in using fixed.x = TRUE when the covariates are, say, dichotomous. If you have x1 and x2 and 100 observations, x1 could be 50 0's followed by 50 1's. Then you could control the correlation between x1 and x2 by the relative proportion of x2 = 0 when x1 = 0 vs. when x1 = 1.
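
That dichotomous design can be sketched as follows (a hypothetical construction, not code from the thread): with x1 fixed as 50 0's and 50 1's, the x1-x2 correlation is set by allocation rather than by sampling.

```r
# Hypothetical sketch: x1 is 50 0's followed by 50 1's; the x1-x2
# correlation is controlled by how many x2 = 1 cases fall in each half.
x1 <- rep(c(0, 1), each = 50)
# When x1 = 0: 10 of 50 cases have x2 = 1; when x1 = 1: 40 of 50 do.
x2 <- c(rep(c(0, 1), times = c(40, 10)),
        rep(c(0, 1), times = c(10, 40)))
cor(x1, x2)  # 0.6 by construction, identical in every replication
```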

Shu Fai Cheung

unread,
May 17, 2023, 1:18:31 PM5/17/23
to lav...@googlegroups.com
As far as I understand, fixed.x = TRUE means "treating" the x-variables as fixed. So you are right that the x-variables are not truly "fixed" in my example.

This is related to what I have been wondering. In many applications of SEM, x-variables are rarely truly fixed, but they are treated as fixed by default, in both lavaan and Mplus (to my knowledge). However, as Yves (quoted in the original post) said, treating the x-variables as fixed has some advantages, such as being able to handle dichotomous variables.

By the way, the following code generates samples in which the x-variables are truly fixed: they were drawn once from a multivariate normal distribution and are then held constant across samples. I believe fixed.x = TRUE can be used correctly in these samples, while fixed.x = FALSE cannot.
set.seed(857105)
n <- 100
b1 <- .11
b2 <- .60
rho <- .40
x1 <- rnorm(n)
x2 <- rho * x1 + rnorm(n, 0, sqrt(1 - rho^2))
gen_data <- function(n, x1, x2) {

    sd_e <- sqrt(1 - (b1^2 + b2^2))
    y <- b1 * x1 + b2 * x2 + rnorm(n, 0, sd_e)
    y2 <- rnorm(n)
    x1 <- 2 * x1 + 10
    x2 <- 3 * x2 + 20
    y <- 40 + y
    data.frame(x1, x2, y)
  }
set.seed(415145)
head(gen_data(100, x1 = x1, x2 =x2))
#>          x1       x2        y
#> 1  9.724924 21.03961 39.61278
#> 2 10.639568 20.36350 41.29758
#> 3  8.683245 20.94754 39.84185
#> 4 14.726216 26.91184 40.64332
#> 5 11.179734 16.45349 39.77236
#> 6  8.124515 22.32706 40.58855
head(gen_data(100, x1 = x1, x2 =x2))
#>          x1       x2        y
#> 1  9.724924 21.03961 39.47348
#> 2 10.639568 20.36350 40.07889
#> 3  8.683245 20.94754 39.16790
#> 4 14.726216 26.91184 41.00427
#> 5 11.179734 16.45349 38.46486
#> 6  8.124515 22.32706 39.93895
head(gen_data(100, x1 = x1, x2 =x2))
#>          x1       x2        y
#> 1  9.724924 21.03961 39.01510
#> 2 10.639568 20.36350 38.79520
#> 3  8.683245 20.94754 40.02613
#> 4 14.726216 26.91184 40.83357
#> 5 11.179734 16.45349 38.23792
#> 6  8.124515 22.32706 41.32508
Regards,
Shu Fai Cheung (張樹輝)


Maria

unread,
May 18, 2023, 4:59:59 AM5/18/23
to lavaan
Hi,

I was confused by this posting at the beginning, too. Anyway, I would like to try to answer the question above; hopefully it is right.

I would decide based on theoretical arguments whether the variables should be treated as fixed or as random. If they are just covariates, I would treat them as fixed. If there are good (theoretical or empirical) reasons to treat them as random, I would treat them as random. For example, if I assumed that effects (e.g., regression coefficients) differ between school classes or other groups stored in one of these predictor variables, I would treat these variables as random in my analysis. I think treating variables as fixed is easier computationally. I do not know why the default differs between lavaan and Mplus. I assume it is because someone with multilevel data would be aware of his/her data structure and could explicitly decide to treat variables as random in lavaan.

Hopefully this is right, or am I missing something?

Best,
Maria

Shu Fai Cheung

unread,
May 18, 2023, 7:52:07 AM5/18/23
to lav...@googlegroups.com
I would like to clarify one issue first, in case I am wrong. To my understanding, Mplus and lavaan both treat manifest exogenous variables as fixed. I illustrated in an earlier post that this is the case for a regression model.


OP mentioned that the model of concern has latent variables. I thought maybe Mplus and lavaan behave differently in this case. Therefore, I created the following dataset.

library(lavaan)
set.seed(857105)
n <- 500
x1 <- rnorm(n)
fx <- rnorm(n)
fx1 <- .7 * fx + rnorm(n, 0, .51)
fx2 <- .7 * fx + rnorm(n, 0, .51)
fx3 <- .7 * fx + rnorm(n, 0, .51)
fy <- .5 * fx + .4 * x1 + rnorm(n, 0, sqrt(1 - .5^2 - .4^2))
fy1 <- .7 * fy + rnorm(n, 0, .51)
fy2 <- .7 * fy + rnorm(n, 0, .51)
fy3 <- .7 * fy + rnorm(n, 0, .51)

dat <- data.frame(x1, fx1, fx2, fx3, fy1, fy2, fy3)
head(dat)
# write.table(dat, "with_latent.dat", col.names = FALSE, row.names = FALSE)


The dataset was written to with_latent.dat for the Mplus run presented below.

I fitted the following model, with a latent factor (fy) predicted by an observed variable (x1):
mod <- "
fx =~ fx1 + fx2 + fx3
fy =~ fy1 + fy2 + fy3
fy ~ fx + x1
"
fit <- sem(mod, dat)
fit
#> lavaan 0.6.15 ended normally after 27 iterations
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of model parameters                        14
As expected, x1 is treated as fixed. Its variance is fixed:
parameterEstimates(fit)

#>    lhs op rhs   est    se      z pvalue ci.lower ci.upper
#> 17  x1 ~~  x1 1.124 0.000     NA     NA    1.124    1.124
I fitted the same model in Mplus:

TITLE: Latent and x variables
DATA: FILE IS with_latent.dat;
VARIABLE: NAMES ARE x1 fx1 fx2 fx3 fy1 fy2 fy3;

ANALYSIS: MODEL = NOMEANSTRUCTURE;
          INFORMATION = EXPECTED;
MODEL: fx by fx1@1 fx2 fx3;
       fy by fy1@1 fy2 fy3;
       fy on fx x1;
OUTPUT: STDYX CINTERVAL;

Mplus, like lavaan, treated x1 as fixed. The number of free parameters is the same, and the output does not have the variance of x1:

MODEL FIT INFORMATION
Number of Free Parameters                       14

So, for this model, Mplus, like lavaan, also treats the manifest exogenous variable as fixed.

But I only use Mplus occasionally. Therefore, there may be cases in which they behave differently in this regard. If I misunderstood how Mplus works, I would like to be corrected.

Regards,
Shu Fai Cheung (張樹輝)


Shu Fai Cheung

unread,
May 18, 2023, 8:04:27 AM5/18/23
to lav...@googlegroups.com
Regarding Mplus, I need to highlight one thing. For the model with latent variables from my earlier post, the CIs Mplus formed for the *standardized* solution (STDYX) are identical to those in lavaan with fixed.x set to *FALSE*:


Therefore, for this aspect of the results, Mplus does behave differently from lavaan in the fixed.x setting. This is strange, and I am not sure whether it is intentional. It would be great if members familiar with Mplus could comment on this.

I hope I am not off-topic. One issue raised in the first post is a difference in behavior between the two programs, which differs from what I know. Regardless of the behavior of Mplus, we can still discuss the main concern: the implications of treating x-variables as fixed or not. (There are other SEM programs that do not do what lavaan does anyway, e.g., AMOS.)

Regards,
Shu Fai Cheung (張樹輝)

Justin Paulsen

unread,
May 18, 2023, 10:12:07 AM5/18/23
to lavaan
Thanks, all, for your thoughts on this. I'm mostly interested in the discussion on the theoretical considerations for defining manifest exogenous variables as random or fixed, so thanks for sharing thoughts related to that.

Shu Fai - one follow-up note about the programming. I failed to mention in the original post that the manifest exogenous variables include missing values. So, the Mplus model uses MLR and specifies that the variances of the manifest exogenous variables be estimated, to invoke FIML estimation. That is why the variables are being treated as random. Lavaan provides the fiml.x option for missingness, which handles missing data while still treating the manifest exogenous variables as fixed. My apologies for the insufficient detail in the initial post!

Shu Fai Cheung

unread,
May 18, 2023, 10:27:59 AM5/18/23
to lav...@googlegroups.com
Thanks a lot for your clarification! I am not aware of that behavior of Mplus, which is interesting. I learned a new thing about Mplus. Thanks!

Regards,
Shu Fai Cheung (張樹輝)

Shu Fai Cheung (張樹輝)

unread,
May 18, 2023, 9:53:49 PM5/18/23
to lavaan
When treating x-variables (manifest exogenous variables) as fixed (fixed across possible samples), we treat the sample as one of many possible samples in which the x-variables have the same values. There are settings in which this makes sense.

Treating x-variables as fixed does have its advantages (as you quoted in your first post), and treating x-variables as random and using ML has its disadvantages (e.g., we need to make distributional assumptions about the x-variables). However, there are many situations in which it is difficult to justify treating x-variables as fixed. For example, in many observational studies, cross-sectional and longitudinal alike, the x-variables are also random variables. We don't want to treat a sample as one of many possible samples in which these x-variables are fixed, and restrict our generalization only to the chance combination of x-variables in one sample.

Maybe it is not only about whether x-variables should be treated as fixed, but also about how to do SEM right when we *have to* treat x-variables as random?

-- Shu Fai

Shu Fai Cheung (張樹輝)

unread,
May 18, 2023, 10:04:53 PM5/18/23
to lavaan
I found one thing in your case interesting: it uses MLR. As far as I understand, with missing = "fiml.x" and estimator = "MLR", lavaan still allows us to decide the setting of fixed.x (TRUE or FALSE). Based on your clarification, Mplus will use something similar to fixed.x = FALSE when we ask for MLR and missing data are present. Is there no choice in Mplus, except manually fixing the variances, covariances, and means of the x-variables to their sample values?

Did you compare the Mplus results with the results of lavaan with fixed.x manually set to FALSE and missing = "fiml.x"? Are the results the same, or very similar?

Perhaps your case highlights one advantage of lavaan over Mplus: it allows users to easily decide whether to treat manifest exogenous variables as fixed or random when missing data are present, fiml.x is used, and MLR is requested.
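
As a hedged sketch of that comparison (the data and model here are mine, not Justin's, and I rely on lavaan accepting both fixed.x settings with missing = "fiml.x" as described above), the two settings could be fitted side by side:

```r
# Hypothetical sketch: missing values in an x-variable, MLR requested,
# missing = "fiml.x", with fixed.x chosen by the user.
library(lavaan)
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- .4 * x1 + rnorm(n)
y  <- .2 * x1 + .5 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)
dat$x1[sample(n, 20)] <- NA            # missing data in a covariate
fit_fixed <- sem("y ~ x1 + x2", dat, estimator = "MLR",
                 missing = "fiml.x", fixed.x = TRUE)
fit_free  <- sem("y ~ x1 + x2", dat, estimator = "MLR",
                 missing = "fiml.x", fixed.x = FALSE)
# Comparing parameterEstimates(fit_free) with the Mplus output would
# show whether Mplus's default matches the fixed.x = FALSE setting.
```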

-- Shu Fai


Shu Fai Cheung (張樹輝)

unread,
May 19, 2023, 10:34:33 AM5/19/23
to lavaan
Just found that the latest edition of Kline's book has a box on this topic (Topic Box 9.2, p. 136), though only a small box with two paragraphs. It seems to be new, as I can't recall reading this box in the previous editions.

Kline, R. B. (2023). Principles and practice of structural equation modeling (Fifth edition). The Guilford Press. https://www.guilford.com/books/Principles-and-Practice-of-Structural-Equation-Modeling/Rex-Kline/9781462551910

-- Shu Fai


Justin Paulsen

unread,
May 19, 2023, 12:43:57 PM5/19/23
to lavaan
Super helpful, Shu Fai! I just checked, and this isn't addressed in the 4th edition. Thank you for sharing this!

Keith Markus

unread,
May 21, 2023, 9:50:32 AM5/21/23
to lavaan
All,
Sorry about the delay posting this; I had a busy week. Also, as a caveat, I do not yet have Kline's 5th edition.

As I understand it, fixed.x has nothing to do with clustered data. The FALSE option gives you traditional single-level SEM, whereas the TRUE option omits the exogenous variances and covariances from the model, simply taking them as read directly from the sample data.

My understanding of the motivation for this is that the saturated exogenous part of the model impacts null-model fit, and therefore comparative model fit. fixed.x allows us to evaluate comparative fit without this influence. Here is a simple illustration. In this example, there is an omitted path from y to w that introduces misspecification into the model. We would like to catch that specification error in our analysis. x and z are exogenous variables with three free parameters impacted by fixed.x. The exogenous associations are deliberately large and the endogenous associations deliberately small.

# Simple example:
set.seed(123456)
simModel <- '
  w ~ .1*y
  y ~ .1*x + .1*z
  y~~1*y
  x~~1*x
  z~~1*z
  x~~.9*z
  w~~.5*w
' # end simModel

myData2 <- simulateData(model=simModel,
                        sample.nobs=1000)
round(cor(myData2),3)

fixedModel <- '
  y ~ x + z
  y ~~ y
  w ~~ w
' # end fixedModel

freeModel <- '
  y ~ x + z
  y ~~ y
  w ~~ w
  x ~~ x
  z ~~ z
  x ~~ z
' #end freeModel

freeFit2 <- lavaan(model=freeModel,
                   data=myData2,
                   fixed.x=FALSE)
fixedFit2 <- lavaan(model=fixedModel,
                    data=myData2,
                    fixed.x = TRUE)
summary(freeFit2,
        fit.measures=TRUE)
summary(fixedFit2,
        fit.measures=TRUE)

The model chi-square is identical for both fits. However, the output below shows that the baseline model chi-squares are very different. As a result, when we use traditional methods the comparative fit indices seem fine, but when we use fixed.x these same indices clearly indicate a problem. Note that 4 variables produce 6 covariances, and thus the null model has 6 degrees of freedom. fixed.x removes one of these, namely cov(x,z).

fixed.x = FALSE:
Model Test Baseline Model:

  Test statistic                              1758.838
  Degrees of freedom                                 6
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.993
  Tucker-Lewis Index (TLI)                       0.987

fixed.x = TRUE:
Model Test Baseline Model:

  Test statistic                                51.990
  Degrees of freedom                                 5
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.754
  Tucker-Lewis Index (TLI)                       0.591

Estimation is not my area of expertise, but the differences in SEs noticed by Shu Fai may reflect the fact that the parameter covariance matrices are of different sizes for the two models. Perhaps these differences impact the variances on the main diagonal due to being jointly optimized.

Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/
