Mediation analysis with discrete numbers and binary variables


Haimei Yu

Aug 1, 2023, 3:22:01 AM
to lavaan
Hi everyone, 

Since my data is not normally distributed, I am not sure if I have used the model correctly, so I want to check here.
 
I want to perform a mediation analysis with three variables, X, M, and Y, where 
X is a binary variable (I set it as a factor variable, as it represents two different conditions in our experiment); 
M is a binary variable (I set it as an ordered variable, as 1 represents reward, 0 represents no reward); 
Y is a discrete integer (curiosity rated on a 7-point Likert scale). 

I used the following code: 
library(lavaan)
Data <- data.frame(X = X, Y = Y, M = M)
model <- ' # direct effect
             Y ~ c*X
           # mediator
             M ~ a*X
             Y ~ b*M
           # indirect effect (a*b)
             ab := a*b
           # total effect
             total := c + (a*b)
         '
fit <- sem(model, data=Data, estimator='WLSMVS')
summary(fit, nd=5)

Is this the correct way to do it? 

Thank you so much for your help!
Haimei 

Keith Markus

Aug 2, 2023, 9:53:11 AM
to lavaan
Haimei,
Others may have a different perspective on this.  However, here is mine.

If you believe that your binary indicators represent underlying continuous variables then your approach could be justified.  However, weighted least squares is a large sample technique that can give misleading results or even no result at all in smaller samples.  Moreover, your analysis is squeezing a square peg into a round hole.  You are using a model that estimates a conditional expected value of a variable somewhere between 0 and 1 which is almost always a value that will never occur in the data because the observed scores are all either 0 or 1 rather than being continuously distributed around their expected value.  Adjustments do not make your binary data continuous, they just adjust the standard errors.  

So, if I were analyzing your data, I would instead declare M and Y, the mediator and outcome, as ordinal and make use of a threshold model for these.  You can then test mediation with the underlying latent variables.  In my view this provides more interpretable parameter estimates using a model better suited to your data.
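
A minimal sketch of what this could look like in lavaan, offered as an illustration rather than a definitive analysis. The data are simulated stand-ins (the sample size, effect sizes, and cut points are all made up), X is kept as a numeric 0/1 dummy because it is exogenous, and M and Y are declared through the ordered argument so that thresholds are estimated and the paths refer to the underlying latent responses.

```r
library(lavaan)

set.seed(123)
n <- 2000                                   # hypothetical sample size

# Simulated stand-in data (NOT the poster's data)
X     <- rbinom(n, 1, 0.5)                  # binary condition, numeric 0/1
Mstar <- 0.8 * X + rnorm(n)                 # latent response behind M
M     <- as.integer(Mstar > 0.4)            # observed binary mediator
Ystar <- 0.3 * X + 0.5 * Mstar + rnorm(n)   # latent response behind Y
Y     <- as.integer(cut(Ystar,
           breaks = c(-Inf, -1.5, -0.75, 0, 0.75, 1.5, 2.25, Inf)))  # 7 categories
Data  <- data.frame(X = X, M = M, Y = Y)

model <- '
  # mediator
  M ~ a*X
  # outcome: direct effect and effect of the mediator
  Y ~ c*X + b*M
  # indirect and total effects
  ab    := a*b
  total := c + (a*b)
'

# Declaring M and Y as ordered makes lavaan fit a threshold (probit-type) model
fit <- sem(model, data = Data, ordered = c("M", "Y"), estimator = "WLSMV")

pe <- parameterEstimates(fit)
print(pe[pe$op == ":=", c("lhs", "est", "se", "pvalue")])
```

The estimates of a, b, and c here are on the scale of the latent responses, so ab is an indirect effect among the underlying continuous variables, not among the observed 0/1 and 1 to 7 scores.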

If it is not plausible that your variables represent underlying continuous variables, then you might look into the mediation package in R which I believe allows for the case of ordinal mediators and outcomes using a nonparametric bootstrap procedure.  


One thing to bear in mind is that, even if we assume that your variables are measured without error, associations and thus estimated causal effects can be lower among binary variables because they contain much less information about the state of the observed cases and are thus less sensitive indicators.  If we are dealing with a qualitative phenomenon like an electrical circuit being open or closed causing a light to be on or off, then fine.  If we are dealing with a coarse binary representation of a more subtle real-world causal process (not necessarily continuous), then expect your causal estimates to be correspondingly coarse grained.  This is less an analysis issue than a research design issue.
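
The loss of information from coarse binary coding can be illustrated with a small simulation. This is only a sketch under assumed conditions: two standard normal variables with a true correlation of 0.5, each split at zero. None of the numbers come from the data discussed in this thread.

```r
set.seed(42)
n <- 10000

# Two continuous variables with a population correlation of 0.5 (assumed)
x <- rnorm(n)
y <- 0.5 * x + rnorm(n, sd = sqrt(1 - 0.5^2))

# Coarse binary versions of the same variables (split at zero)
x_bin <- as.integer(x > 0)
y_bin <- as.integer(y > 0)

r_cont <- cor(x, y)          # association between the continuous scores
r_bin  <- cor(x_bin, y_bin)  # association after dichotomizing both

print(c(continuous = r_cont, dichotomized = r_bin))
```

For bivariate normal variables split at their medians, the correlation between the binary versions is (2/pi) * asin(rho) in theory, which is about 0.33 when rho is 0.5, so the dichotomized association should come out noticeably smaller than the continuous one.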

Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/



Haimei Yu

Aug 3, 2023, 2:23:48 AM
to lavaan
Thank you so much for your suggestion. It is indeed impossible to consider our X and M as continuous variables. 
I tried the mediation package before, but it did not work. The problem is that my data are also repeated measures. Y was measured for each subject over 160 trials, and those trials cover the 2*2 possible combinations of the two binary variables. We want to make use of each trial's data, not only the averaged data. In that case, mediate() can only take the output of lmer(), not glmer(). But lmer() requires the data to be normally distributed.
On the other hand, lavaan has multilevel SEM, but it requires the data to be continuous. So none of these seems to fit our current dataset...

Keith Markus

Aug 4, 2023, 10:03:02 AM
to lavaan
Haimei,
I am not the best person to answer your question.  One option I might investigate would be whether you could run two multilevel regression models for binary outcomes (e.g., logistic) and then bootstrap the standard error of the estimate of the indirect effect.  A similar approach was once proposed for ordinary least squares regression.  You might consider describing your research design in a post to a general forum like Cross Validated to see what suggestions you get for ways to test your hypotheses.
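
A rough sketch of the general idea of bootstrapping an indirect effect from two regression models, with several simplifications that should be read as assumptions. To stay in base R, it uses single-level glm() and lm() fits as stand-ins for the multilevel models and handles the repeated measures by resampling whole subjects (a cluster bootstrap). It also treats Y as numeric, which may not suit a 7-point rating. The data, the variable subj, and the helper functions indirect() and boot_once() are hypothetical.

```r
set.seed(1)

# Simulated stand-in data: 30 subjects x 40 trials (NOT the poster's data)
n_subj  <- 30
n_trial <- 40
N       <- n_subj * n_trial
subj <- rep(seq_len(n_subj), each = n_trial)
u    <- rnorm(n_subj, sd = 0.5)[subj]            # subject-level shift
X    <- rbinom(N, 1, 0.5)                        # condition varies within subject
M    <- rbinom(N, 1, plogis(-0.5 + 1.0 * X + u)) # binary mediator
Y    <- pmin(7, pmax(1, round(4 + 0.3 * X + 0.8 * M + u + rnorm(N))))
dat  <- data.frame(subj = subj, X = X, M = M, Y = Y)

# Indirect effect on the scale of Y:
#   a = change in P(M = 1) when X goes from 0 to 1 (logistic model)
#   b = change in Y per unit of M, holding X fixed (linear model)
indirect <- function(d) {
  fit_m <- glm(M ~ X, family = binomial, data = d)
  fit_y <- lm(Y ~ X + M, data = d)
  p <- predict(fit_m, newdata = data.frame(X = c(0, 1)), type = "response")
  a <- unname(p[2] - p[1])
  b <- unname(coef(fit_y)["M"])
  a * b
}

# One bootstrap replicate: resample subjects with replacement, keep all their trials
boot_once <- function() {
  ids <- sample(unique(dat$subj), replace = TRUE)
  d   <- do.call(rbind, lapply(ids, function(i) dat[dat$subj == i, ]))
  indirect(d)
}

est  <- indirect(dat)
boot <- replicate(500, boot_once())
ci   <- quantile(boot, c(0.025, 0.975))          # percentile interval

print(est)
print(ci)
```

Resampling subjects rather than individual trials keeps each subject's trials together, so the dependence within subjects is reflected in the bootstrap distribution. Replacing the two fits inside indirect() with multilevel models would bring the sketch closer to the suggestion above, at the cost of longer run times.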