endogeneity in lavaan


Ned Kock

Nov 13, 2017, 4:33:03 PM
to lavaan
I created a dataset of N=10,000 with four factors (F1 ... F4) that is contaminated with endogeneity, with the following baseline path coefficients:

B (baseline model)
0 0 0 0
0.400000000000000 0 0 0
0.400000000000000 0.400000000000000 0 0
0 0.400000000000000 0.400000000000000 0

The factors were created so that the factor correlation matrix replicates exactly the matrix of total effects given by (I-B)^-1:

1 0.400000000000000 0.560000000000000 0.384000000000000
0.400000000000000 1 0.400000000000000 0.560000000000000
0.560000000000000 0.400000000000000 1 0.400000000000000
0.384000000000000 0.560000000000000 0.400000000000000 1
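As a quick check of the matrix above, here is a minimal base-R sketch that computes (I-B)^-1 from the baseline coefficients. The object names (`B`, `total_effects`) are only illustrative. Note that (I-B)^-1 itself is lower triangular; its below-diagonal entries are the values that appear, mirrored, in the symmetric correlation matrix.

```r
# Baseline path coefficients: row = outcome, column = predictor
B <- matrix(c(0,   0,   0,   0,
              0.4, 0,   0,   0,
              0.4, 0.4, 0,   0,
              0,   0.4, 0.4, 0),
            nrow = 4, byrow = TRUE)

# Total effects (direct + indirect) implied by a recursive path model
total_effects <- solve(diag(4) - B)

# Below-diagonal entries: 0.4, 0.56, 0.4, 0.384, 0.56, 0.4
round(total_effects, 3)
```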

The lavaan B estimates are exactly the ones that I obtain via ordinary least squares regression:

B (estimates, lavaan)
0 0 0 0
0.399999986341375 0 0 0
0.476173518060852 0.209537078691023 0 0
0 0.476197973557606 0.209508616382042 0
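The lavaan/OLS figures above can be reproduced, up to sampling error, directly from the factor correlation matrix. The sketch below is base R only; the helper `ols_from_cor` is a hypothetical name introduced for illustration, and it simply solves the normal equations for standardized variables.

```r
# Factor correlation matrix as given in the post
R <- matrix(c(1,     0.4,  0.56, 0.384,
              0.4,   1,    0.4,  0.56,
              0.56,  0.4,  1,    0.4,
              0.384, 0.56, 0.4,  1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("F", 1:4), paste0("F", 1:4)))

# Standardized OLS coefficients of outcome y on predictors x,
# obtained from correlations only: b = Rxx^-1 %*% rxy
ols_from_cor <- function(R, y, x) {
  solve(R[x, x, drop = FALSE], R[x, y])
}

b_F3 <- ols_from_cor(R, "F3", c("F1", "F2"))  # about 0.476 and 0.210
b_F4 <- ols_from_cor(R, "F4", c("F2", "F3"))  # about 0.476 and 0.210
round(rbind(b_F3, b_F4), 4)
```

So, given this correlation matrix, OLS cannot return 0.4 for these paths; the values near 0.476 and 0.210 are what the correlations themselves imply.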

However, the correct estimates should have matched the ones that are obtained with a two-stage least squares analysis:

B (estimates, 2SLS)
0 0 0 0
0.399999986341375 0 0 0
0.476173518060852 0.209537078691023 0 0
0 0.451614828197087 0.153640143972920 0

Clearly my analysis via lavaan is not capturing endogeneity. What am I doing wrong? The lavaan code is shown below.

# load lavaan and read data
library(lavaan)
mData <- read.delim("Data_4Factors_B4Endo.txt")

# define analysis model
mModel <- '
F2 ~ F1
F3 ~ F1 + F2
F4 ~ F2 + F3
'
# generate solution
mFit <- sem(mModel, data = mData)
mEst <- standardizedSolution(mFit, type = "std.lv") # store the std estimates
summary(mFit, nd = 4, standardized = TRUE, fit.measures = TRUE, rsquare = TRUE) # show std solution

Thanks in advance,

Ned Kock

Mikko Rönkkö

Nov 14, 2017, 2:33:54 AM
to lav...@googlegroups.com
Hi,

Your model is a normal mediation model and none of the regressors are endogenous with respect to their dependent variable. (The term endogenous variable can have different meanings in this context.)

If I understood your example correctly, you want to model regression of F3 on F1 and F2, using F2 as an instrument for F1. That model would not be identified and cannot be meaningfully estimated with ML and cannot be estimated at all using 2SLS. But more generally, if you want to do instrumental variable models, you need to free the correlations between the error term of the DV and the endogenous explanatory variable. This will not give you 2SLS estimates because lavaan does not do 2SLS, but the ML estimates should be close to 2SLS estimates.
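To illustrate the general recipe described above (free the covariance between the DV's error term and the endogenous regressor), here is a minimal sketch on simulated data. Everything in it is an assumption for illustration and is not the dataset from this thread: `z` is a hypothetical valid instrument, `u` an unobserved confounder, and the true effect of `x` on `y` is set to 0.5.

```r
library(lavaan)

set.seed(1)
n <- 10000
z <- rnorm(n)                    # instrument: affects y only through x
u <- rnorm(n)                    # unobserved confounder of x and y
x <- 0.8 * z + u + rnorm(n)      # endogenous explanatory variable
y <- 0.5 * x + u + rnorm(n)      # true effect of x on y is 0.5
dat <- data.frame(z, x, y)

ivModel <- '
x ~ z       # first-stage relation
y ~ x       # structural equation of interest (z is excluded)
y ~~ x      # free the residual covariance between y and x
'
ivFit <- sem(ivModel, data = dat)

b_iv  <- coef(ivFit)[["y~x"]]                    # should be near 0.5
b_ols <- coef(lm(y ~ x, data = dat))[["x"]]      # biased upward by u
c(iv = b_iv, ols = b_ols)
```

This model is just-identified (one instrument, one endogenous regressor). With two endogenous regressors and a single instrument, as discussed later in the thread, the analogous specification would not be identified.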

I think that you need to clarify a bit more what you want to do.

Mikko

--
You received this message because you are subscribed to the Google Groups "lavaan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lavaan+un...@googlegroups.com.
To post to this group, send email to lav...@googlegroups.com.
Visit this group at https://groups.google.com/group/lavaan.
For more options, visit https://groups.google.com/d/optout.

Ned Kock

Nov 14, 2017, 8:44:09 AM
to lavaan
The model is a mediation model that clearly displays endogeneity with respect to F4. Via 2SLS I can control for that endogeneity. I would like to know how to change the lavaan code so that I can control for endogeneity via lavaan.

Edward Rigdon

Nov 14, 2017, 9:19:23 AM
to lav...@googlegroups.com
Ned--
     May I ask for some clarification? By "endogeneity," do you mean that the residual for F4 is correlated with one or more predictors in the same equation--here, F2 and F3? I did not see that you had included residual covariance in your simulation model. I saw your "B" matrix of regression coefficients, but not the residual covariance matrix.
     Or do you mean something else by "endogeneity?"
--Ed Rigdon


Ned Kock

Nov 14, 2017, 9:38:47 AM
to lavaan
Hi Ed. Yes, that is exactly what I mean. Is there a way I can change the specification of the model or the settings in lavaan to control for this?



Mikko Rönkkö

Nov 15, 2017, 1:52:21 AM
to lav...@googlegroups.com
Hi,

You can add a correlation between an error and a predictor by adding correlations to the model

# define analysis model
mModel <- '  
F2 ~ F1
F3 ~ F1 + F2
F4 ~ F2 + F3
F4 ~~ F2
F4 ~~ F3
'

Note that this model is not identified because you have just one instrument (F1) and two endogenous explanatory variables (F2, F3), and you could not estimate this with 2SLS either. But if you are happy to assume that only one of the two predictors F2 and F3 is endogenous with respect to F4, then it would work.

So this specification would control for endogeneity of F2, assuming that it is the only endogenous explanatory variable and that F1 is a valid instrument for the regression of F4.

mModel <- '  
F2 ~ F1
F3 ~ F1 + F2
F4 ~ F2 + F3
F4 ~~ F2
'

Mikko

Ned Kock

Nov 15, 2017, 8:57:04 AM
to lavaan
Thanks Mikko. Actually, I've tried this before; adding F4 ~~ F2 and F4 ~~ F3 leads to an identification problem, and adding only F4 ~~ F2 to severely biased results for at least one of the regression coefficients.

One possible solution is to create an instrumental variable I4 as indicated below, and then use it in the regression equation for F4. The results are then identical to those of the 2SLS analysis that I conducted.

mModel <- '  
# instrumental variable
I4 =~ F1
# path model
F2 ~ F1
F3 ~ F1 + F2
F4 ~ F2 + F3 + I4
'
In this specific case, this is equivalent to adding F1 as a predictor of F4. However, the link F1 -> F4 does not exist at the population level, and thus I would get a false positive if I had added that link.

I just thought that there was a better way to do this with lavaan. The approach above would not work if the instrumental variable were more complex; e.g., if it aggregated two or more instruments. In econometrics, instrumental variables are typically created as composites.

Ned

Mikko Rönkkö

Nov 15, 2017, 9:04:18 AM
to lav...@googlegroups.com
Hi,

On 15 Nov 2017, at 15.57, Ned Kock <ned...@gmail.com> wrote:

Thanks Mikko. Actually, I've tried this before; adding F4 ~~ F2 and F4 ~~ F3 leads to an identification problem, and adding only F4 ~~ F2 to severely biased results for at least one of the regression coefficients.

One possible solution is to create an instrumental variable I4 as indicated below, and then use it in the regression equation for F4. The results are then identical to those of the 2SLS analysis that I conducted.

mModel <- '  
# instrumental variable
I4 =~ F1

This part just creates a new variable I4 = F1.

# path model
F2 ~ F1
F3 ~ F1 + F2
F4 ~ F2 + F3 + I4

Which makes your model equivalent to this specification:

F2 ~ F1
F3 ~ F1 + F2
F4 ~ F2 + F3 + F1
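The equivalence can be checked against the numbers in the first post. The base-R sketch below (variable names are illustrative) regresses F4 on F1, F2, and F3 using only the population correlation matrix given there. The F2 and F3 coefficients come out at roughly 0.4516 and 0.1537, which is very close to the "2SLS" row reported for F4, suggesting that those figures coincide with plain OLS with F1 added as a predictor.

```r
# Factor correlation matrix from the first post
R <- matrix(c(1,     0.4,  0.56, 0.384,
              0.4,   1,    0.4,  0.56,
              0.56,  0.4,  1,    0.4,
              0.384, 0.56, 0.4,  1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("F", 1:4), paste0("F", 1:4)))

# Standardized OLS coefficients for F4 ~ F1 + F2 + F3
preds <- c("F1", "F2", "F3")
b_F4_full <- solve(R[preds, preds], R[preds, "F4"])
names(b_F4_full) <- preds

# F1 about 0.1173, F2 about 0.4516, F3 about 0.1537
round(b_F4_full, 4)
```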


I think that you need to double check your 2SLS results because the model that you presented cannot be estimated with that technique due to the lack of instruments. (You have not explained your 2SLS specification, but your model suggests that only F1 qualifies as an instrument.) 

Mikko


Ned Kock

Nov 15, 2017, 9:50:39 AM
to lavaan
Mikko: In a 2SLS analysis, in-model endogeneity is controlled for by the creation of instrumental variables (IVs) for certain endogenous variables, and the inclusion of the IVs in regression equations. The classic approach used in econometrics is to create IVs as composites that aggregate instruments. The instruments are exogenous variables that influence the endogenous variables in question only via indirect effects.

Edward Rigdon

Nov 15, 2017, 2:16:06 PM
to lav...@googlegroups.com
Ned--
     If you have multiple indicators, perhaps you would like to explore Ken Bollen's "model-implied instrumental variables SEM," as actuated in the fairly new MIIVsem package in R. This approach uses the multiple indicators in a factor model as instruments, estimating an observed variable model but nevertheless recovering structural model parameters.
--Ed Rigdon


Mikko Rönkkö

Nov 16, 2017, 2:57:01 AM
to lav...@googlegroups.com
Ned,

While your explanation has some correct elements, it is not entirely correct: In 2SLS, you do not include IVs in the regression. Instead, in the first stage you regress all your original explanatory variables on the instruments, and then you use these fitted values as replacements for the original explanatory variables in the second stage regression.

But this post finally clarified what it is that you want to do:
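The two stages described above can be sketched by hand in base R. The data here are simulated purely for illustration (a hypothetical instrument `z`, an unobserved confounder `u`, and a true effect of 0.5); they are not the data from this thread. Note that the standard errors printed by the second `lm()` are not valid 2SLS standard errors, so this is only meant to show where the point estimate comes from.

```r
set.seed(2)
n <- 10000
z <- rnorm(n)                    # instrument
u <- rnorm(n)                    # unobserved confounder
x <- 0.8 * z + u + rnorm(n)      # endogenous explanatory variable
y <- 0.5 * x + u + rnorm(n)      # true effect of x on y is 0.5

# Stage 1: regress the endogenous regressor on the instrument
stage1 <- lm(x ~ z)
x_hat  <- fitted(stage1)

# Stage 2: replace x with its fitted values
stage2 <- lm(y ~ x_hat)

b_2sls <- coef(stage2)[["x_hat"]]   # should be near 0.5
b_ols  <- coef(lm(y ~ x))[["x"]]    # biased because x correlates with u
c(two_stage = b_2sls, ols = b_ols)
```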

Your population model was

F2 ~ F1
F3 ~ F1 + F2
F4 ~ F1 + F2 + F3

And the estimated model was

F2 ~ F1
F3 ~ F1 + F2
F4 ~ F2 + F3

So yes, F2 and F3 are endogenous regressors in that model because they are both correlated with the omitted variable F1. You are right that you can obtain consistent estimates if you include F1 as a predictor of F4, but that has nothing to do with instrumental variables or 2SLS. If we have an omitted variable, then an ideal solution is of course to get data for that variable and then include it in the model. Instrumental variables address the scenario where that is not possible. To qualify as an instrument, a variable must meet the exclusion criterion, which means that it should not be correlated with the error term of the focal regression. In your model F1 fails this criterion because it has a unique effect in the population and is therefore not a valid instrument.

Mikko


Ned Kock

Nov 16, 2017, 7:56:05 AM
to lavaan
Mikko, my population model was:

F2 ~ F1
F3 ~ F1 + F2
F4 ~ F2 + F3

The absence of the link F1 -> F4 at the population level is what gave rise to endogeneity, and the need for a simple 2SLS.

Best, Ned



Mikko Rönkkö

Nov 16, 2017, 8:04:59 AM
to lav...@googlegroups.com
Hi,

On 16 Nov 2017, at 14.56, Ned Kock <ned...@gmail.com> wrote:

Mikko, my population model was:

F2 ~ F1
F3 ~ F1 + F2
F4 ~ F2 + F3

Right. I misread the B matrix in your original post. The effect of F1 on F4 was zero there.



The absence of the link F1 -> F4 at the population level is what gave rise to endogeneity, and the need for a simple 2SLS.

That alone would not produce endogeneity, so I guess we are back at the starting point: it is not clear what (or which) are the endogenous explanatory variables in the model.

But I also guess that given that you got the results that you were after, this problem is now solved.

Mikko



Ned Kock

Nov 16, 2017, 8:06:48 AM
to lavaan
Thanks Ed. 

There are several conceptual similarities among Bollen's MIIV approach, Rosseel's Factor Score Path Analysis, and my own Factor-Based PLS-SEM method. The latter, explained in the report linked below, is the one I mentioned to you in Macau.


The key to estimating factors is, in my view, something akin to the variation sharing technique (see link below). Variants of the Factor-Based PLS-SEM methods have been implemented in WarpPLS (warppls.com) for at least 3 years now.


Ned