SEM and categorical endogenous variables in lavaan

82 views
Skip to first unread message

samfi...@gmail.com

unread,
Jan 18, 2019, 2:39:58 PM1/18/19
to lavaan
List,

As I understand the implementation of the WLSMV estimator in lavaan, one can think of binary and ordinal manifest variables as arising from an underlying normally distributed variable with variance = 1.  When this unobserved variable exceeds a defined threshold(s), the observed variable switches categories from a lower to higher category.  I am interested in fitting the following model.  


model<- 
'theta =~ t*c0 + t*c3 + t*c5 + t*c9 
 c0 ~ 1 
 c3 ~ 1 + r*c0 
 c5 ~ 1 + r*c3
 c9 ~ 1 + r*c5'


where c0 - c9 are binary variables measured over time in children  aged 0,3,5,and 9.   From this model, you can see that the dependence between observations over time arise from an time-stable unobserved component (theta) and temporally lagged effects of variables measured from previous measurement occasions.  


Starting from the first time period,  I generate a normal random variate comprised of two components - an idiosyncratic error and theta.  


theta <- rnorm(n,0,1)

c0_star <- rnorm(n,0,sqrt(.75)) + .5*theta
c0 <- ifelse(c0_star > -.35,1,0)

Note that I choose the path coefficient for theta and the variance for the error component so that they sum to one ( .75 + .5^2) and then I draw a binary random variable based on a threshold of -.35.  


Going forward in time gets a bit more complicated because c3  is formed from three components - idiosyncratic error, theta, and the impact from the previous time period.   Choosing the variance for the error component so that the total variance is 1 requires considering an additional source of variation attributable to c0 and the covariance between c0 and theta. I am embarrassed to reveal how I avoided having to do algebra in order to accomplish this, but in the interest of full disclosure...

x <- seq(.1:1,by=.01)
test <- rep(999,length(x))

for(i in 1:length(x)){
  test[i] <- x[i] + .25 + .3^2 + 2*cov(.5*theta,c0_star*.3)- 1} 
plot(x,test)
cbind(x,test)


c3_star <- rnorm(n,0,sqrt(.515)) + .3*c0_star + .5*theta 
c3 <- ifelse(c3_star > -.35,1,0)
var(c3_star)


Naturally, the magnitude of the idiosyncratic error goes down over time.   I repeat this for the other two time periods.  

c5_star <- rnorm(n,0,sqrt(.46)) + .3*c3_star +  .5*theta 
c5 <- ifelse(c5_star > -.35,1,0)
var(c5_star)

c9_star <- rnorm(n,0,sqrt(.45))  + .3*c5_star + .5*theta
c9 <- ifelse(c9_star > -.35,1,0)
var(c9_star)


In this artificial situation, where I have the underlying normally distributed random variables, I can fit a model directly to these.  Thus, 

dt2 <- as.data.frame(cbind(c0_star,c3_star,c5_star,c9_star))
names(dt2) <- c("c0","c3","c5","c9")

fit2 <-  sem(model,data=dt2,std.lv = TRUE)


The parameter estimates obtained are close to the true values and repeated sampling reveals that the sampling distributions are centered on their true values. I also don't get any warnings about convergence.

parameterEstimates(fit2)


However, when I switch the binary measures,


dt <- as.data.frame(cbind(c0,c3,c5,c9))

names(dt) <- c("c0","c3","c5","c9")

dt$c0 <- ordered(dt$c0, levels = c("0","1"))
dt$c3 <- ordered(dt$c3, levels = c("0","1"))
dt$c5 <- ordered(dt$c5, levels = c("0","1"))
dt$c9 <- ordered(dt$c9, levels = c("0","1"))


The parameter estimates are clearly off (not dramatically) and I get warnings about convergence.  



Is my logic correct?  Should I be able to recover the true values for the parameter estimates (in expectation) using both approaches?  With slightly simpler models I have found this to be the case...


If you read this far, THANK YOU!


Sam







Reply all
Reply to author
Forward
0 new messages