Model specification with control variables and clustering SE


Laura

Jun 29, 2021, 6:12:59 AM6/29/21
to lavaan
I am very new to SEM and have so far only worked with OLS - lm() - in R. I am running into several uncertainties when using SEM in R (lavaan).
  1. Can I simply add the control variables to the regression model, as I would with lm() (see below)? Also, all categorical variables have been encoded as dummy variables; is this correct?
  2. When I add all of the control variables, my CFI drops drastically (to 0.382), but the other fit indices are still good (RMSEA = 0.021, SRMR = 0.01). Can I still use the specified model?
  3. How can I cluster standard errors at, e.g., the city level in lavaan?

I have attached the model. 
model <- '
F =~ Standardizedx1 + Standardizedx2 + Standardizedx3 + Standardizedx4
Y ~ F + Q28_1 + Female + GenderOther + nationality +
    Edu1 + Edu2 + Edu3 + Edu4 + Edu5 + Eud6 + Edu7 +
    Edu1_f + Edu2_f + Edu3_f + Edu4_f + Edu5_f + Eud6_f + Edu7_F +
    Public + Public_IDK + Time1 + Time2 + Notime + Envir1 + Q36_1 +
    CV1 + CV2 + CV3 + CV4 +
    subjects_A + subjects_B + subjects_C + subjects_D +
    subjects_E + subjects_F + subjects_G + subjects_H +
    city1 + city2 + city3 + city4 + city5 + city6 + city7 + city8 +
    city9 + city10 + city11 + city12 + city13 + city14 + city15 +
    city16 + city17
'

All variables except F are control variables in the regression.

Pat Malone

Jun 29, 2021, 8:34:10 AM6/29/21
to lav...@googlegroups.com
Hi, Laura.

1. Looks reasonable.

2. Generally speaking, if you're relying on fit indices, you want all the major ones to be good. What are your chi-square and df? Most often, the pattern you're seeing happens when variables have low correlations in the data. CFI compares the fit of your model to a null model; if the null model (which assumes everything is uncorrelated) doesn't fit too badly, then CFI will be poor.

But you should probe to see where the misfit is coming from. The most likely source in your model is where one or more covariates are more closely related to some of your x variables than to the others. Start by looking at lavResiduals()  and focusing on those relations.
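A minimal sketch of that probe, assuming the model has been fitted with sem() as in the original post (the $cov and $summary components are what lavResiduals() returns in recent lavaan versions; check ?lavResiduals):

```r
library(lavaan)

# Fit the model as posted
fit <- sem(model, data = data, std.lv = TRUE)

# Residual correlations: large entries flag where the misfit is coming from
res <- lavResiduals(fit)
res$cov      # matrix of residual correlations between observed variables
res$summary  # summary statistics (e.g. SRMR) for the residuals
```

Scanning res$cov for the largest absolute entries involving the Standardizedx* indicators should point to the covariates driving the misfit.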

3. cluster="city" (and take out the city dummies)
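As a sketch, using the model string from the original post with the city dummies removed from the Y regression:

```r
library(lavaan)

# Cluster-robust standard errors at the city level; the city dummies
# should be dropped from the model string when using this option
fit <- sem(model, data = data, std.lv = TRUE, cluster = "city")
summary(fit)
```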

Hope this helps,
Pat

--
You received this message because you are subscribed to the Google Groups "lavaan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lavaan+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lavaan/024c1cf2-a659-4e74-a80d-d449682dafb8n%40googlegroups.com.


--
Patrick S. Malone, PhD
Sr Research Statistician, FAR HARBΦR
This message may contain confidential information; if you are not the intended recipient please notify the sender and delete the message.

Pat Malone

Jun 29, 2021, 10:35:05 AM6/29/21
to lav...@googlegroups.com
Hi, Laura.

Please reply to the list so others can chime in.

Interleaved:

On Tue, Jun 29, 2021 at 10:24 AM Laura <l.hag...@gmail.com> wrote:
Dear Pat,

thank you for your quick response! I am still struggling to understand what is happening in my model.
  1. For clustering SEs, I thought the only possible way was to use the lavaan.survey package? And when using the cluster = "city" option in sem(), can I still keep the cities in the regression, since I intend them to act like fixed effects?
lavaan.survey has not been updated since 2016; the cluster= argument in lavaan() is now preferred. Using the city dummies adjusts only the means/intercepts, whereas cluster = "city" also handles the partitioning of variance. But look into resources on centering options for multilevel regression; the same logic carries over to lavaan.
 
  2. Also, when running the lavaan.survey package (clustered_STS <- svydesign(ids = ~city, data = data), then lavaan.survey(results, clustered_STS), with results <- sem(model, data = data, std.lv = TRUE)), the output reports "Robust.cluster.sem" with Information = Expected and Information saturated (h1) model = Structured. With the cluster option (results <- sem(model, data = data, std.lv = TRUE, cluster = "city")) I instead get "Robust.cluster" with Information = Observed and Observed information based on = Hessian. What exactly are those differences?
I've never used lavaan.survey -- can't speak to this.
 
  3. I am seeing odd movements in the CFI. Since I only have one observation in 6 cities, I pooled those into a single "city" so that the variance is not zero. However, doing so decreases the CFI massively. Also, when I drop those 6 observations and the city dummies from the regression, the CFI is low again (df = 233, chisq = 710.441). However, if I individually add the cities with only one individual as dummies to the regression, the CFI is high (0.95, with df = 257, chisq = 736.838, N = 4868). What is happening here?
There's no need to do this. Clusters can be of size 1 (if all clusters are size 1, you'll get exactly the same results as a non-clustered model).
 
  4. Also, for any city combination, I always receive the following warning:

      Warning message:
      In lav_model_vcov(lavmodel = lavmodel, lavsamplestats = lavsamplestats, :
        lavaan WARNING:
          The variance-covariance matrix of the estimated parameters (vcov)
          does not appear to be positive definite! The smallest eigenvalue
          (= -2.960830e-16) is smaller than zero. This may be a symptom that
          the model is not identified.

  I have read that this could be a numerical issue. Can I just ignore this warning?
This question is asked frequently on this list. The general advice is that if the solution is reasonable, you can probably disregard.
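One way to judge whether the warning is purely numerical is to look at the offending eigenvalue directly. A sketch, assuming the fitted object is called fit:

```r
library(lavaan)

# Extract the variance-covariance matrix of the estimated parameters
V <- lavInspect(fit, "vcov")

# An eigenvalue around -3e-16 is at the level of floating-point noise;
# a clearly negative eigenvalue would be a genuine identification concern
min(eigen(V, symmetric = TRUE, only.values = TRUE)$values)
```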

Pat

Laura

Jun 30, 2021, 9:29:53 AM6/30/21
to lavaan
Dear Pat, 

Thank you again for helping me! 
Sorry, I am also new to Google Groups. As soon as I realised that I had only written to you, I was googling how to get to the personal conversation so that I could copy my questions over, and then your message popped in.

I have two other questions that are still unclear about my model/lavaan:
  1. Coming back to the frequently asked question about the variance-covariance matrix not being positive definite: I only seem to get this warning when I use cluster = "city"; when that option is left out, there is no warning. Does this mean that the model is in general identified? Or why does clustering cause this warning? (How can I tell whether a model is identified?)
  2. I used the factor loadings from the estimated model above to create a kind of index (see below). When I use those factor loadings to "manually" create an index for an OLS regression, I get different results than when using SEM. Shouldn't the results from OLS and SEM be the same in my case?
For question 2, I have attached the model:
OLS estimated:
F <- 0.2*Standardizedx1 + 0.16*Standardizedx2 + 0.27*Standardizedx3 + 0.34*Standardizedx4
lm(Y ~ F + Q28_1 + Female + GenderOther + nationality +
     Edu1 + Edu2 + Edu3 + Edu4 + Edu5 + Eud6 + Edu7 +
     Edu1_f + Edu2_f + Edu3_f + Edu4_f + Edu5_f + Eud6_f + Edu7_F +
     Public + Public_IDK + Time1 + Time2 + Notime + Envir1 + Q36_1 +
     CV1 + CV2 + CV3 + CV4 +
     subjects_A + subjects_B + subjects_C + subjects_D +
     subjects_E + subjects_F + subjects_G + subjects_H +
     city1 + city2 + city3 + city4 + city5 + city6 + city7 + city8 +
     city9 + city10 + city11 + city12 + city13 + city14 + city15 +
     city16 + city17, data = data)

Pat Malone

Jun 30, 2021, 10:31:56 AM6/30/21
to lav...@googlegroups.com
Laura,

For point 1, I'm not sure.

For point 2, no, you shouldn't expect the same results. A factor is latent: it is, by definition, something different from what you can compute from the manifest variables. A weighted mean will include (possibly systematic) error, and it does not account for imprecision in the estimates of the factor loadings. Finally, factor loadings aren't weights.

If you want observed variables that best reflect the factor, look into the plausible values literature--I can't recall off hand whether it's lavaan or semTools that can generate these, but one of them can.
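It is semTools that provides this. A sketch of generating plausible values from a fitted lavaan model (argument names may differ across versions; check ?semTools::plausibleValues):

```r
library(lavaan)
library(semTools)

# Fit the measurement model, then draw plausible values for the factor
fit <- sem(model, data = data, std.lv = TRUE)
pv <- plausibleValues(fit, nDraws = 20)

# 'pv' is a list of data sets, each containing one draw of imputed
# factor scores; run the downstream regression on each and pool the
# results as in multiple imputation
```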

Pat

Laura

Jun 30, 2021, 12:55:42 PM6/30/21
to lavaan
Dear Pat,

thank you for supporting me! 
Just one follow-up question on the last part: are you talking about factor scores?

Best,
Laura

Pat Malone

Jun 30, 2021, 1:41:18 PM6/30/21
to lav...@googlegroups.com
More or less. The 50-cent summary of plausible values is doing multiple imputation of factor values.



Patrick S. Malone, PhD
Sr Research Statistician
FAR HARBΦR
+1 803.553.4181 | pat@ | farharbor.com