CFA Model and Indirect effects


Jordane Boudesseul

Mar 7, 2017, 9:48:48 PM
to lavaan

I'm trying to fit this model with 2 mediators and 1 moderator: 



 
Here is my syntax :

library(lavaan)

model <- '
  # regressions
  DV1 ~ MED1 + MED2
  MED1 ~ MOD1 + IV
  MED2 ~ MOD1 + IV

  # (residual) variances and covariances
  DV1 ~~ DV1
  MED1 ~~ MED1
  MED2 ~~ MED2
  IV ~~ IV
  MOD1 ~~ MOD1
  MED1 ~~ MED2
  IV ~~ DV1
  MOD1 ~~ MED1
  MOD1 ~~ MED2
  DV1 ~~ MOD1
'
fit <- cfa(model, data=DF, missing='fiml', group.label=DF$CITY)
summary(fit)

1) I recoded gender with dummy coding and the IV with ordinal coding, but the DV and MED2 are raw frequency data. Is this a problem for fitting the model? Or do I need to at least recode them as ordinal? Because it does not give me SEs for those variables. Also, the IV is pretty skewed, should I use the MLM estimator?

2) I want to control for clustering by the city where participants live (80 different cities), but R is warning me:

"Warning messages:
1: In lav_data_full(data = data, group = group, group.label = group.label,  
  lavaan WARNING: `group.label' argument will be ignored if `group' argument is missing
2: In lavaan::lavaan(model = model, data = DF, missing = "fiml", group.label = DF$CITY,  :
  lavaan WARNING: syntax contains parameters involving exogenous covariates; switching to fixed.x = FALSE"

Group involves exogenous covariates? Sorry I do not get that point.

3) Finally, once the model has been fitted, Rosseel (2012) states that we can evaluate indirect effects in mediation analysis. But isn't using a CFA/SEM model actually an alternative to mediation analysis? Or are they complementary?

Thanks for the help guys,

j

Terrence Jorgensen

Mar 8, 2017, 4:29:21 AM
to lavaan
1) I recoded gender with dummy coding and the IV with ordinal coding

Your syntax and path diagram have no labels, so I can't tell how these variables operate in your model.  But if they are both exogenous predictors, then dummy codes are necessary for any kind of categorical variable (binary, ordinal, nominal).


Also, you say you have a moderator, but there are no product terms in your path diagram or syntax.  So neither predictor is a moderator, just a covariate.

but the DV and MED2 are raw frequency data. Is this a problem for fitting the model?

If they are not approximately continuous, then they should be treated as counts. Mplus is the only SEM software I am aware of that allows count outcomes.  If there are not very many categories, you could treat the count outcome as ordered.


Also the IV is pretty skewed, should I use the MLM estimator? 

If you set fixed.x = TRUE (the default option: ?lavOptions), no assumptions are made about the distribution of exogenous predictors.

2) I want to control for clustering by the city where participants live (80 different cities), but R is warning me:

"Warning messages:
1: In lav_data_full(data = data, group = group, group.label = group.label,  
  lavaan WARNING: `group.label' argument will be ignored if `group' argument is missing

You did not tell lavaan which column in DF is the grouping variable. You do not need to specify "group.label" unless you want to specify that the cities appear in a particular order in the output.

cfa(model, data=DF, missing='fiml', group = "CITY")

2: In lavaan::lavaan(model = model, data = DF, missing = "fiml", group.label = DF$CITY,  :
  lavaan WARNING: syntax contains parameters involving exogenous covariates; switching to fixed.x = FALSE"

Group involves exogenous covariates? Sorry I do not get that point.

The message is not about group, it is about exogenous covariates (IV and MOD1).  You do not need to specify that they (co)vary in the syntax.  Using fixed.x = TRUE tells lavaan to just use their observed sample statistics, so that you don't need to assume they are normally distributed.
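As a minimal sketch of that point (using simulated stand-in data rather than the original DF, with made-up effect sizes), leaving the exogenous (co)variances out of the syntax lets lavaan keep its default fixed.x = TRUE:

```r
library(lavaan)

# Simulated stand-in data: variable names mirror the thread, values are invented.
set.seed(1)
n  <- 300
DF <- data.frame(IV = rnorm(n), MOD1 = rbinom(n, 1, 0.5))
DF$MED1 <- 0.4 * DF$IV + 0.3 * DF$MOD1 + rnorm(n)
DF$MED2 <- 0.3 * DF$IV - 0.2 * DF$MOD1 + rnorm(n)
DF$DV1  <- 0.5 * DF$MED1 + 0.4 * DF$MED2 + rnorm(n)

# No "IV ~~ IV", "MOD1 ~~ MOD1", or covariances involving IV/MOD1:
# nothing in the syntax refers to parameters of the exogenous covariates.
model <- '
  DV1  ~ MED1 + MED2
  MED1 ~ MOD1 + IV
  MED2 ~ MOD1 + IV
  MED1 ~~ MED2      # residual covariance between the two mediators
'

fit <- sem(model, data = DF)

# Check which setting lavaan actually used
lavInspect(fit, "options")$fixed.x
```

With this syntax the "switching to fixed.x = FALSE" warning should not appear, because the exogenous variances and covariances are taken from the sample statistics instead of being estimated.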

3) Finally, once the model has been fitted, Rosseel (2012) states that we can evaluate indirect effects in mediation analysis. But isn't using a CFA/SEM model actually an alternative to mediation analysis? Or are they complementary?

You can conduct mediation analysis in many ways.  SEM is one of the better frameworks because all paths are easily estimated simultaneously.  If you have categorical outcomes, though, it gets tricky.  See some later slides in this presentation:


You can also find a lot of advice about mediation on SEMNET:


FYI, there is no measurement model in your syntax or diagram, so this is a path analysis, not a CFA.  But the cfa() and sem() functions both call lavaan() with the same default settings, so that detail is inconsequential.  I was just confused by "CFA" in the subject.
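To make the answer to question 3 concrete, here is a rough sketch of how indirect effects can be defined inside a path model with lavaan's ":=" operator. The data are simulated, and the labels a1, a2, b1, b2 and the names ind1, ind2, total_ind are arbitrary choices for illustration, not anything from the original model:

```r
library(lavaan)

# Simulated stand-in data (names follow the thread, values are invented)
set.seed(2)
n  <- 500
DF <- data.frame(IV = rnorm(n), MOD1 = rbinom(n, 1, 0.5))
DF$MED1 <- 0.5 * DF$IV + 0.2 * DF$MOD1 + rnorm(n)
DF$MED2 <- 0.3 * DF$IV + 0.1 * DF$MOD1 + rnorm(n)
DF$DV1  <- 0.4 * DF$MED1 + 0.3 * DF$MED2 + rnorm(n)

model <- '
  # a paths (IV -> mediators) and b paths (mediators -> DV), with labels
  MED1 ~ a1 * IV + MOD1
  MED2 ~ a2 * IV + MOD1
  DV1  ~ b1 * MED1 + b2 * MED2
  MED1 ~~ MED2

  # indirect effects defined as products of the labeled paths
  ind1      := a1 * b1
  ind2      := a2 * b2
  total_ind := a1 * b1 + a2 * b2
'

fit_sem <- sem(model, data = DF)
fit_cfa <- cfa(model, data = DF)   # same defaults, so same estimates

# Rows with op ":=" hold the defined (indirect) effects
pe  <- parameterEstimates(fit_sem)
ind <- pe[pe$op == ":=", c("lhs", "est", "se", "pvalue")]
ind

# The two wrappers should agree on every free parameter
all.equal(coef(fit_sem), coef(fit_cfa))
```

The standard errors printed here are the default delta-method ones; bootstrapped intervals can be requested with se = "bootstrap" when fitting, at the cost of a longer run time.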

Terrence D. Jorgensen
Postdoctoral Researcher, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Jordane Boudesseul

Mar 8, 2017, 9:44:34 PM
to lavaan
Thanks for your feedback, Terrence. I've responded inline below:
 
1) I recoded gender with dummy coding and the IV with ordinal coding

Your syntax and path diagram have no labels, so I can't tell how these variables operate in your model.  But if they are both exogenous predictors, then dummy codes are necessary for any kind of categorical variable (binary, ordinal, nominal).

Sorry about that: the IV and MOD1 (gender) are indeed exogenous, while the mediators and DV are endogenous. And yes, I did dummy-code the exogenous predictors!


Also, you say you have a moderator, but there are no product terms in your path diagram or syntax.  So neither predictor is a moderator, just a covariate.

You're right, my first regression line should include:

DV~MOD1*MED1+MOD1*MED2+IV

Do you recall a way to display R-squared values for this (to evaluate how much variance the full model explains)?

but the DV and MED2 are raw frequency data. Is this a problem for fitting the model?

If they are not approximately continuous, then they should be treated as counts. Mplus is the only SEM software I am aware of that allows count outcomes.  If there are not very many categories, you could treat the count outcome as ordered.

What do you mean by "approximately continuous"? Is there a statistical way to check for that?


Also the IV is pretty skewed, should I use the MLM estimator? 

If you set fixed.x = TRUE (the default option: ?lavOptions), no assumptions are made about the distribution of exogenous predictors.

2) I want to control for clustering by the city where participants live (80 different cities), but R is warning me:

"Warning messages:
1: In lav_data_full(data = data, group = group, group.label = group.label,  
  lavaan WARNING: `group.label' argument will be ignored if `group' argument is missing

You did not tell lavaan which column in DF is the grouping variable. You do not need to specify "group.label" unless you want to specify that the cities appear in a particular order in the output.

cfa(model, data=DF, missing='fiml', group = "CITY")

2: In lavaan::lavaan(model = model, data = DF, missing = "fiml", group.label = DF$CITY,  :
  lavaan WARNING: syntax contains parameters involving exogenous covariates; switching to fixed.x = FALSE"

Group involves exogenous covariates? Sorry I do not get that point.

The message is not about group, it is about exogenous covariates (IV and MOD1).  You do not need to specify that they (co)vary in the syntax.  Using fixed.x = TRUE tells lavaan to just use their observed sample statistics, so that you don't need to assume they are normally distributed.

Yes, now the error message refers to something more specific:

"Error in lav_samplestats_from_data(lavdata = lavdata, missing = lavoptions$missing,  : 
  lavaan ERROR: data contains only a single observation in group 9
In addition: There were 50 or more warnings (use warnings() to see the first 50) "

I guess lavaan assumes there are at least 2 observations per group?

3) Finally, once the model has been fitted, Rosseel (2012) states that we can evaluate indirect effects in mediation analysis. But isn't using a CFA/SEM model actually an alternative to mediation analysis? Or are they complementary?

You can conduct mediation analysis in many ways.  SEM is one of the better frameworks because all paths are easily estimated simultaneously.  If you have categorical outcomes, though, it gets tricky.  See some later slides in this presentation:


You can also find a lot of advice about mediation on SEMNET:


FYI, there is no measurement model in your syntax or diagram, so this is a path analysis, not a CFA.  But the cfa() and sem() functions both call lavaan() with the same default settings, so that detail is inconsequential.  I was just confused by "CFA" in the subject.

Absolutely. Is there a way to change the subject to "path analysis"? That would shed light on what's really inside the thread. Thanks in advance for your help!

Terrence Jorgensen

Mar 9, 2017, 4:28:58 AM
to lavaan
my first regression line should include:

DV~MOD1*MED1+MOD1*MED2+IV

Term-expansion with the asterisk operator (*) works like that in formula objects (?formula), but lavaan does not use formula objects.  This is because SEMs model several regression equations simultaneously.  So you need to explicitly include the product term as an additional variable in your ?model.syntax, like so:

myData$med1xG <- myData$MED1 * myData$MOD1
myData$med2xG <- myData$MED2 * myData$MOD1

model <- '
  # regressions
  DV ~ MOD1 + MED1 + MED2 + med1xG + med2xG + IV
...
'
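For what it's worth, here is a small self-contained sketch of the same idea on simulated data (the data and effect sizes are invented, and only the outcome equation is fitted). It compares the hand-made product terms in lavaan with the automatic term expansion that lm() performs on a formula object:

```r
library(lavaan)

# Simulated stand-in data; MOD1 plays the role of the dummy-coded moderator
set.seed(3)
n <- 400
myData <- data.frame(IV = rnorm(n), MOD1 = rbinom(n, 1, 0.5))
myData$MED1 <- 0.4 * myData$IV + rnorm(n)
myData$MED2 <- 0.3 * myData$IV + rnorm(n)
myData$DV   <- 0.3 * myData$MED1 + 0.2 * myData$MED2 +
               0.3 * myData$MOD1 * myData$MED1 + rnorm(n)

# lavaan needs the product terms as explicit columns in the data
myData$med1xG <- myData$MED1 * myData$MOD1
myData$med2xG <- myData$MED2 * myData$MOD1

model <- '
  DV ~ MOD1 + MED1 + MED2 + med1xG + med2xG + IV
'
fit <- sem(model, data = myData)

# In a formula object, "*" expands to main effects plus interaction
lm_fit <- lm(DV ~ MOD1 * MED1 + MOD1 * MED2 + IV, data = myData)

# The interaction slopes from the two approaches should match closely
coef(fit)[c("DV~med1xG", "DV~med2xG")]
coef(lm_fit)[c("MOD1:MED1", "MOD1:MED2")]
```

In lavaan syntax, by contrast, an asterisk in front of a variable is read as a modifier on that term (for example a label or a fixed value), which is why MOD1*MED1 does not create an interaction there.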


Do you recall a way to display R-squared values for this (to evaluate how much variance the full model explains)?

R-squared for each endogenous variable is the variance explained by its predictors, and you can request that from the summary() or parameterEstimates() output using the argument "rsquare = TRUE".
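A quick sketch of the three places where R-squared can be requested, again on simulated stand-in data (names follow the thread, values are invented):

```r
library(lavaan)

# Simulated stand-in data
set.seed(4)
n  <- 300
DF <- data.frame(IV = rnorm(n), MOD1 = rbinom(n, 1, 0.5))
DF$MED1 <- 0.5 * DF$IV + 0.3 * DF$MOD1 + rnorm(n)
DF$MED2 <- 0.4 * DF$IV - 0.2 * DF$MOD1 + rnorm(n)
DF$DV1  <- 0.5 * DF$MED1 + 0.4 * DF$MED2 + rnorm(n)

model <- '
  DV1  ~ MED1 + MED2
  MED1 ~ MOD1 + IV
  MED2 ~ MOD1 + IV
  MED1 ~~ MED2
'
fit <- sem(model, data = DF)

# 1. Printed at the end of the summary output
summary(fit, rsquare = TRUE)

# 2. Appended to the parameter table as rows with op == "r2"
pe <- parameterEstimates(fit, rsquare = TRUE)
pe[pe$op == "r2", c("lhs", "est")]

# 3. Extracted directly as a named vector
r2 <- lavInspect(fit, "rsquare")
r2
```

Each value refers to one endogenous variable (here DV1, MED1, MED2), so there is one R-squared per regression equation rather than a single number for the whole model.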

What do you mean by "approximately continuous"? Is there a statistical way to check for that?

Not in a hypothesis-testing way, but there are practical guidelines you might be able to find with a little Googling.  I think a Poisson distribution is approximately normalish when the mean is as little as 10 or 15.  Binomial (not binary) variables are approximately normal when they are based on at least 30 trials.  But continuity should be good enough, even if not normal, because there are robust estimators that adjust for excess kurtosis.  Ordinal data can be treated as continuous when there are at least 5 categories, or at least 7 if the distributions are quite skewed.  Here is some helpful reading, although it is about ordinal data, not counts per se.
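These rules of thumb are easy to tabulate for each variable. The helper below is only an illustrative sketch: describe_scale() is a made-up name, and the cutoffs simply restate the guidelines above (5 or 7 categories, a count mean of about 10), so they are not formal tests:

```r
# Summarise how "continuous-like" a variable looks, using the rough
# guidelines from the text. Missing values are dropped first.
describe_scale <- function(x) {
  x <- x[!is.na(x)]
  n_cat <- length(unique(x))
  list(
    n_categories           = n_cat,           # number of distinct observed values
    mean                   = mean(x),
    at_least_5_categories  = n_cat >= 5,      # guideline for roughly symmetric ordinal data
    at_least_7_categories  = n_cat >= 7,      # guideline when distributions are skewed
    count_mean_at_least_10 = mean(x) >= 10    # guideline for Poisson-like counts
  )
}

# Two simulated count variables: one with a low mean, one with a high mean
set.seed(5)
low_counts  <- rpois(500, lambda = 1)
high_counts <- rpois(500, lambda = 15)

str(describe_scale(low_counts))
str(describe_scale(high_counts))
```

A histogram of each variable (hist(x)) is a useful complement, since a variable can pass these counts-of-categories checks and still pile up heavily at zero.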


Yes, now the error message refers to something more specific:

"Error in lav_samplestats_from_data(lavdata = lavdata, missing = lavoptions$missing,  : 
  lavaan ERROR: data contains only a single observation in group 9
In addition: There were 50 or more warnings (use warnings() to see the first 50) "

I guess lavaan assumes there are at least 2 observations per group?

At least two observations are necessary to calculate (co)variance, because there is no (co)variability in a single number.  But you need a lot more than 2 observations per group in SEM.  How many cities do you have?  If it is more than 10, a multilevel SEM is probably the framework you want.  lavaan does not yet provide multilevel functionality, but it will eventually (probably slowly) introduce such features:


Here are a couple of articles that provide a good conceptual introduction to MSEM:



Absolutely. Is there a way to change the subject to "path analysis"? That would shed light on what's really inside the thread. Thanks in advance for your help!

I don't think that's necessary.  You can click "Edit subject" above a reply editor/window, but I think that would just start a new thread rather than continue linking to the original thread.
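Coming back to the single-observation error: a quick way to see which cities are too small for a multigroup model is to tabulate the grouping variable before fitting. This is a sketch on a made-up data frame, and min_n is an arbitrary illustrative threshold, not a lavaan rule:

```r
# Hypothetical stand-in for the real data: three larger cities plus one
# city (coded 9) that contributes a single respondent.
DF <- data.frame(CITY = c(rep(1:3, times = c(40, 35, 25)), 9))

# Number of observations per city
sizes <- table(DF$CITY)
sizes

# Cities with fewer than 2 observations, for which no (co)variance exists
small_cities <- names(sizes)[sizes < 2]
small_cities

# Keep only cities that reach an (arbitrary) minimum group size
min_n <- 20
keep  <- names(sizes)[sizes >= min_n]
DF_multigroup <- DF[DF$CITY %in% keep, , drop = FALSE]
table(DF_multigroup$CITY)
```

Dropping small groups changes the sample, of course, so with many small clusters this is more of a diagnostic than a solution.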

Jordane Boudesseul

Mar 23, 2017, 11:29:34 AM
to lavaan
my first regression line should include:

DV~MOD1*MED1+MOD1*MED2+IV

Term-expansion with the asterisk operator (*) works like that in formula objects (?formula), but lavaan does not use formula objects.  This is because SEMs model several regression equations simultaneously.  So you need to explicitly include the product term as an additional variable in your ?model.syntax, like so:

myData$med1xG <- myData$MED1 * myData$MOD1
myData$med2xG <- myData$MED2 * myData$MOD1

model <- '
  # regressions
  DV ~ MOD1 + MED1 + MED2 + med1xG + med2xG + IV
...
'



Thanks for that, Terrence! So if I put something like DV ~ IV + MOD1 + MOD2, for example, my moderators are just covariates, right? But does that mean my IV is also one? Or is there a way to specify it as an IV? A priori, should they play the same role in the model?
 
Do you recall a way to display R-squared values for this (to evaluate how much variance the full model explains)?

R-squared for each endogenous variable is the variance explained by its predictors, and you can request that from the summary() or parameterEstimates() output using the argument "rsquare = TRUE".

Yes, parameterEstimates() displays many interesting details, but not R-squared (neither does summary() with that argument, which is bizarre).

What do you mean by "approximately continuous"? Is there a statistical way to check for that?

Not in a hypothesis-testing way, but there are practical guidelines you might be able to find with a little Googling.  I think a Poisson distribution is approximately normalish when the mean is as little as 10 or 15.  Binomial (not binary) variables are approximately normal when they are based on at least 30 trials.  But continuity should be good enough, even if not normal, because there are robust estimators that adjust for excess kurtosis.  Ordinal data can be treated as continuous when there are at least 5 categories, or at least 7 if the distributions are quite skewed.  Here is some helpful reading, although it is about ordinal data, not counts per se.


Ah, interesting, that could be a problem in my model. Many of my variables are still skewed even after recoding into 7 categories, and others were already coded from 1 to 3 or even binary. I should be very careful when interpreting the results, then.

Yes, now the error message refers to something more specific:

"Error in lav_samplestats_from_data(lavdata = lavdata, missing = lavoptions$missing,  : 
  lavaan ERROR: data contains only a single observation in group 9
In addition: There were 50 or more warnings (use warnings() to see the first 50) "

I guess lavaan assumes there are at least 2 observations per group?

At least two observations are necessary to calculate (co)variance, because there is no (co)variability in a single number.  But you need a lot more than 2 observations per group in SEM.  How many cities do you have?  If it is more than 10, a multilevel SEM is probably the framework you want.  lavaan does not yet provide multilevel functionality, but it will eventually (probably slowly) introduce such features:


Here are a couple of articles that provide a good conceptual introduction to MSEM:



Thanks for the papers! I have around 50 cities but no prediction for the level-2, I just want to control for the clustering effect. Can I do that without damaging the model?

Terrence Jorgensen

Mar 24, 2017, 11:15:48 AM
to lavaan
So if I put something like DV ~ IV + MOD1 + MOD2 for example, my moderators are just covariates right?

Yes.

But does that mean my IV is also one? Or is there a way to specify it as an IV? A priori, should they play the same role in the model?

There is no difference between an independent variable, covariate, predictor, explanatory variable... These are all just conceptual terms we use to describe their role in our theory.  As far as the model is concerned, they are all just right-hand-side variables.

no prediction for the level-2, I just want to control for the clustering effect. 

In that case, you can look into the lavaan.survey package.  But I don't think it has support for categorical outcomes.

Jordane Boudesseul

Mar 26, 2017, 7:11:56 PM
to lavaan
Thanks Terrence!

Also, when I run the analysis, there is indeed an error with the group term:

"Error in estimate.moments.EM(Y = X[[g]], Mp = Mp[[g]], Yp = missing.[[g]],  : 
  lavaan ERROR: Sigma_22.inv cannot be inverted"

I'm wondering if this is because of the number of groups or if it's the nature of the groups themselves? (they are identified by a number).

Terrence Jorgensen

Mar 29, 2017, 8:28:53 AM
to lavaan
"Error in estimate.moments.EM(Y = X[[g]], Mp = Mp[[g]], Yp = missing.[[g]],  : 
  lavaan ERROR: Sigma_22.inv cannot be inverted"

I'm wondering if this is because of the number of groups or if it's the nature of the groups themselves? (they are identified by a number).

The grouping indicator can be numeric/integer, that's not a problem.  

library(lavaan)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

class(HolzingerSwineford1939$sex) # integer
fit <- cfa(HS.model, data = HolzingerSwineford1939, group = "sex")
summary(fit, fit.measures=TRUE)

I don't know what caused the error though, and it is hard to guess without seeing a full script (and maybe data).