Too many variables


Samantha Seiter

Oct 29, 2017, 1:23:21 AM
to lavaan

Hello,

 

I am struggling with my data analysis and I would really appreciate insight into where I am going wrong.

 

I administered a 51-item Attitude questionnaire to 481 students.

Attitude is the overarching latent variable/construct.

I have clustered the 51 items into 13 Themes (latent variables/constructs).

These Themes are weakly to moderately correlated (0.3 to 0.6).

These 13 Themes can be further clustered into 4 higher order latent variables.

 

 

#13 Themes

SELF_PROF =~ ITEM6+ITEM17+ITEM22+ITEM32+ITEM41

SS_PROF =~ ITEM15+ITEM18+ITEM28+ITEM50

TIME =~ ITEM3+ITEM8+ITEM34

FOREIGNERS =~ ITEM9+ITEM11+ITEM24+ITEM44

POS_INCR_COURSES =~ ITEM1+ITEM12+ITEM33+ITEM51

NEG_INCR_COURSES =~ ITEM5+ITEM14+ITEM39+ITEM46+ITEM48

MATERIALS =~ ITEM23+ITEM26+ITEM43+ITEM45

DYNAMICS =~ ITEM21+ITEM31+ITEM35+ITEM36+ITEM47

PEDAGOGY =~ ITEM4+ITEM10+ITEM13+ITEM37

ENG_IMPR =~ ITEM2+ITEM16+ITEM42

NECESSITY =~ ITEM20+ITEM38+ITEM40

BEY_JAPAN =~ ITEM7+ITEM19+ITEM30

WORK =~ ITEM25+ITEM27+ITEM29+ITEM49

 

#These Themes are weakly to moderately correlated

TIME~~SELF_PROF

PEDAGOGY~~SELF_PROF

PEDAGOGY~~TIME

POS_INCR_COURSES~~FOREIGNERS

MATERIALS~~FOREIGNERS

ENG_IMPR~~FOREIGNERS

BEY_JAPAN~~FOREIGNERS

POS_INCR_COURSES~~MATERIALS

DYNAMICS~~POS_INCR_COURSES

ENG_IMPR~~POS_INCR_COURSES

BEY_JAPAN~~POS_INCR_COURSES

NEG_INCR_COURSES~~PEDAGOGY

ENG_IMPR~~MATERIALS

 


#4 higher order latent variables

#Easier for Lavaan to handle?

EngProfy =~ SELF_PROF + SS_PROF + ENG_IMPR

ClassEff =~ FOREIGNERS + MATERIALS + DYNAMICS + PEDAGOGY

DirEff =~ POS_INCR_COURSES + NEG_INCR_COURSES + TIME

EngNecJap =~ NECESSITY + BEY_JAPAN + WORK

 

I have tried FIRST computing SELF_PROF <- ITEM6 + ITEM17 + ITEM22 + ITEM32 + ITEM41 outside the model, then putting these 4 higher-order factors into the SEM model – no luck – the model won't fit.

I have tried creating new variables in my dataset (SELF_PROF, SS_TYPE, etc.) as the sum of these items – no luck – the model won't fit.

 

 

I have 4 Dependent/Predictor variables:

GEN, AREA, ACAD_SUBJ, UNI_TYPE

 

EngProfy  ~ GEN + AREA + ACAD_SUBJ + UNI_TYPE

ClassEff  ~ GEN + AREA + ACAD_SUBJ + UNI_TYPE

DirEff    ~ GEN + AREA + ACAD_SUBJ + UNI_TYPE

EngNecJap ~ GEN + AREA + ACAD_SUBJ + UNI_TYPE

 

 

I also have variables that I would like to control for (mediators):

ENG_PROF, ENG_COUNTRY, AGE, MTHS_THRU_ENG, SS_TYPE, EMI_REQ

 

There are interaction effects between some of the DVs and some of the control variables. I've tried coding ATT_DATA as the outcome, then regressing it onto these interacting variables – no luck, the model won't fit.

 

ATT_DATA ~ GEN*AGE

ATT_DATA ~ GEN*ENG_COUNTRY

ATT_DATA ~ AGE*ACAD_SUBJ

ATT_DATA ~ AGE*ENG_PROF

ATT_DATA ~ AGE*MTHS_THRU_ENG

ATT_DATA ~ AREA*UNI_TYPE

ATT_DATA ~ AREA*ENG_PROF

ATT_DATA ~ AREA*SS_TYPE

ATT_DATA ~ AREA*EMI_REQ

ATT_DATA ~ ENG_PROF*MTHS_THRU_ENG

ATT_DATA ~ ENG_PROF*ENG_COUNTRY

ATT_DATA ~ EMI_REQ*ENG_COUNTRY
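For what it's worth, lavaan's model syntax reads `*` as a modifier operator (for labels and fixed values), not as an interaction, so `GEN*AGE` above does not create a product term. A common workaround is to compute the product in the data frame before fitting; a minimal sketch with simulated stand-in data (`d`, `ATT_DATA`, `GEN`, and `AGE` here are placeholders for the real questionnaire data):

```r
library(lavaan)

# Stand-in data: in practice 'd' would be the real questionnaire data frame
set.seed(1)
d <- data.frame(GEN = rbinom(481, 1, 0.5), AGE = rnorm(481, 20, 2))
d$ATT_DATA <- 0.3 * d$GEN + 0.1 * d$AGE + rnorm(481)

# Compute the product term in the data first, because '*' inside
# lavaan model syntax is a modifier, not an interaction
d$GEN_x_AGE <- d$GEN * d$AGE

fit <- sem('ATT_DATA ~ GEN + AGE + GEN_x_AGE', data = d)
```

Mean-centering GEN and AGE before forming the product is often recommended to reduce collinearity between the main effects and the interaction term.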

 

 

 

 

As you can see, I have MANY variables – and I think this is the root cause of my problem – no model is being fit.

Is there another way of coding that I am missing?

I also don't see how I can 'break the model up' and run separate models so that Lavaan can handle it.

 

Any help would be very much appreciated.

 

 

kma...@aol.com

Oct 29, 2017, 10:12:24 PM
to lavaan
Samantha,

This is a very broad question not specific to Lavaan.  It might be better suited for SEMNET.

Bear in mind that clustering items based on similarity does not guarantee that they will fit a linear common factor model.  It is just a hypothesis.  Failure is an option.

I would suggest the following:

Do not compute sum scores in your data set.  Stick to latent variable modeling using only the item scores.

Fit the one-factor model for each first level factor using only the items for that factor.  For factors with more than 3 items, evaluate the fit and parameter estimates.  For factors with 3 items, evaluate the parameter estimates.

Combine just the factors that look plausible based on their one-factor models into a single level CFA with correlated factors.  Evaluate fit and parameter estimates to look for problems.

Fit a series of models with a single 2nd order factor using just the items and first-order factors that correspond to that 2nd order factor.  Assess fit and parameter estimates.

Combine only the models for 2nd order factors that look plausible based on the prior analyses into your full model with multiple 2nd order factors.
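In lavaan syntax, those steps might be sketched like this (a sketch only, assuming the item and factor names from the first post and a hypothetical data frame `d`):

```r
library(lavaan)

# Step 1: one-factor model per first-order factor, items only
fit.self <- cfa('SELF_PROF =~ ITEM6 + ITEM17 + ITEM22 + ITEM32 + ITEM41',
                data = d)
fitMeasures(fit.self, c('cfi', 'rmsea', 'srmr'))  # evaluate fit (>3 items)

# Step 2: combine the plausible factors into one CFA;
# cfa() lets first-order factors covary by default
fit.first <- cfa('
  SELF_PROF =~ ITEM6 + ITEM17 + ITEM22 + ITEM32 + ITEM41
  SS_PROF   =~ ITEM15 + ITEM18 + ITEM28 + ITEM50
  ENG_IMPR  =~ ITEM2 + ITEM16 + ITEM42
', data = d)

# Step 3: one second-order factor at a time, with only its first-order factors
fit.prof <- cfa('
  SELF_PROF =~ ITEM6 + ITEM17 + ITEM22 + ITEM32 + ITEM41
  SS_PROF   =~ ITEM15 + ITEM18 + ITEM28 + ITEM50
  ENG_IMPR  =~ ITEM2 + ITEM16 + ITEM42
  EngProfy  =~ SELF_PROF + SS_PROF + ENG_IMPR
', data = d)
```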

This procedure is as much exploratory as confirmatory.  So, report the multi-step procedure fully and accurately with your results.

Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/

Samantha Seiter

Oct 30, 2017, 7:09:49 AM
to lavaan
Dear Keith,

Thank you for taking the time to reply. Your suggestions are very helpful. I did not know about SEMNET  - I shall post there too.

I guess my main question was - is there a maximum number of variables that we can use with Lavaan?

I have 13 latent variables (made up of the 51 questionnaire items)
4 DVs
and 6 control variables.

I am building up the model slowly - but it is clear that it cannot handle 13 latents with more than 2 DVs.
After over 100 tries at different ways of coding the model, the final error message is:

Warning message:
In lav_object_post_check(object) :
  lavaan WARNING: covariance matrix of latent variables
                is not positive definite;
                use inspect(fit,"cov.lv") to investigate.

 


Thanks again for your help .... I shall keep computing and see if SEMNET has suggestions of possibly another package which can handle all that I want to do.

Many thanks again.

Best wishes,
Samantha.

Edward Rigdon

Oct 30, 2017, 8:22:49 AM
to lav...@googlegroups.com
     The warning (not error) message you are receiving does not indicate that the lavaan package is inadequate. It may simply be that your data are consistent with a factor model where the factor covariance matrix is not positive definite.
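The warning message itself names the tool for investigating this. A short sketch (assuming `fit` is the fitted lavaan object, so not runnable as-is):

```r
library(lavaan)

# Extract the model-implied covariance matrix of the latent variables
lv.cov <- inspect(fit, "cov.lv")

# A symmetric matrix is positive definite iff all its eigenvalues are > 0
eigen(lv.cov, symmetric = TRUE)$values
all(eigen(lv.cov, symmetric = TRUE)$values > 0)
```

Eigenvalues at or below zero point to latent variables that are linearly redundant (or nearly so) with the others.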

--
You received this message because you are subscribed to the Google Groups "lavaan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lavaan+unsubscribe@googlegroups.com.
To post to this group, send email to lav...@googlegroups.com.
Visit this group at https://groups.google.com/group/lavaan.
For more options, visit https://groups.google.com/d/optout.

Samantha Seiter

Oct 30, 2017, 6:03:12 PM
to lavaan
Hi Edward,

Thank you for pointing out it's a Warning vs an Error ... I fit the model, and the model fit was actually not too bad. I'm still fiddling around to get all my variables in ... but I think I'll get there eventually.

I also checked - and my factor covariance matrix is positive definite - so - I think it should be alright.

Thank you again for taking the time to respond. Having someone point out such an obvious over-looked small thing actually really helps when you're deep into analysis.

Much appreciated.

kma...@aol.com

Oct 31, 2017, 10:31:42 AM
to lavaan
Samantha,
  Ed is correct that a non-positive definite latent variable covariance matrix is not fatal.  However, it may be hinting that your model is overly complex and that a simpler model will do.

  Your ratio of observed variables to latent variables is almost 4 (51 items to 13 first-order factors).  So, I do not think that you have too many latents in that respect.

  You have not said anything about sample size.  If you have a small sample, then building such a complex model could be like trying to thread a needle while looking through the bottom of a soda bottle.  There might be too much sampling error to support such a complex model.  So, you could have too many latent variables for your sample size.

  However, I think that the larger issue is this:  With all due respect to Dirk Gently, a holistic approach is not the best strategy for diagnosing a structural equation model.  Try to partition the problem to narrow down the source of the issue.  The steps that I suggested earlier were a strategy toward that end.

  If you are still arriving at a model with a NPD latent variable covariance matrix after going through that process, then I would suggest the following.

  First, back up and focus on the first order CFA model.  Do you get the warning when the first order factors are free to covary without any 2nd order factors?  If not, then you can isolate the issue to the 2nd order portion of the model.  If so, then solving the issue in the 1st order model might also eliminate it from the 2nd order model.

  One strategy might be to search for positive definite submatrices in the latent variable covariance matrix by omitting individual rows and columns.  For example, you can remove the 2nd row and column and test for positive definiteness as follows.  If eliminating a single row/column does not help, try two, and so on.  This might narrow the issue to a subset of latent variables and possibly identify one or more redundant variables.  This is not an issue of more than the analytic approach can handle but rather an issue of having more latent variables than you need to model your specific data.

my.selection <- c(1, 3:13)   # omit the 2nd row and column
# positive definite iff all eigenvalues are > 0 (det > 0 alone is not sufficient)
all(eigen(latent.variable.covariance.matrix[my.selection, my.selection], symmetric = TRUE)$values > 0)

  There is no mechanical procedure for re-specifying a model (or at least no good mechanical procedure).  Once you diagnose the problem, the best solution depends upon the substantive theory with which you are working.  As Ed said, in the end, you might decide to accept the model with the NPD matrix.  Or, you might find a simplified model that fits the data well without the NPD matrix.  For example, it may be that some of your observed variables can be better thought of as outcomes of the latent variables than as indicators of a separate factor.

Samantha Seiter

Oct 31, 2017, 5:02:13 PM
to lav...@googlegroups.com
Dear Keith,

Thank you for your response.

Indeed - I am aware that my model might be too complicated - so I've decided to run 4 models - one for each of the 4 DVs/predictors and their indirect effects with the control variables.

I checked the correlations of the latent variables - and some correlate but only weak to moderately (range from 0.3 to 0.6).

Sample size for this dataset is 481 - that should be alright for 13 latents?

As for the ratio of observed to latent variables - I did separate CFAs on the latents and deleted items as the model fit improved. I'm now down to 41 items. So some latents only contain 2 items - but the model is still converging, so that should be ok? - but the warning is still there.

I got the model to converge for the first DV - which included some indirect effects between DVs and control variables. 


Number of observations                           481

  Estimator                                         ML
  Minimum Function Test Statistic             2462.901
  Degrees of freedom                               942
  P-value (Chi-square)                           0.000


Model fit was not bad: 

RMSEA                                                    0.058
90 Percent Confidence Interval          0.055  0.061
P-value RMSEA <= 0.05                      0.000

SRMR                                           0.074

but CFI was not good:
Comparative Fit Index (CFI)                    0.754
 Tucker-Lewis Index (TLI)                       0.719

I'd post the code here - but it's 2 pages long.

I then couldn't figure out how to include my control/confounding variables in the model.

This post was useful in explaining that:

I basically need to include the control variables as correlations (~~) with the latents.
I tackled the first control - AGE - and ran a MANOVA to see if there was a relationship and with which latents.
Since I cannot do a correlation with AGE and a continuous variable, I assume I put it in as a regression instead?
SELF_PROF~AGE ? I'll give that a go and see what I get.

I'll give it a go putting the control variables AGE, ENGLISH IS REQUIRED, and STUDENT TYPE in as regressions, then ENGLISH PROFICIENCY and MONTHS IN ENGLISH SPEAKING COUNTRY as correlations within the model. Control variables are not always continuous - and I obviously get an error message when I try to correlate Age and Self-Rated English Proficiency.

My question is - 
I've tried running the model using the sem() command or the lavaan() command - but the model does not converge. It only converges with the cfa() command. Why would this be, since the lavaan tutorial says the cfa() and sem() commands are doing pretty much the same thing? It's very odd.

Thanks as always - and apologies that this post is a bit all over the place.
My model is complicated ... but I've got it to converge ... I just need to figure out how to take the control variables into account.

Thank you again for your reply - I do appreciate your time.

Happy Halloween to all,
Samantha.






--
Best wishes,
Samantha.

Terrence Jorgensen

Nov 1, 2017, 4:57:43 AM
to lavaan
I've tried running the model using the sem() command OR and the lavaan() command - but the model does not converge. It only converges with the cfa() command. Why would this be? Since the Lavaan tutorial says the cfa and sem commands are doing pretty much the same thing. 

cfa() is just a wrapper around lavaan() that sets different default options.  If you want to see what the lavaan() call specifies when you fit your cfa() model:

fit <- cfa(...)
fit@call

Terrence D. Jorgensen
Postdoctoral Researcher, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

kma...@aol.com

Nov 1, 2017, 10:12:12 AM
to lavaan
Samantha,
There is a lot going on here.

First, "too complicated" is too coarse a description.  It is important to distinguish different kinds of being too complicated, as I have tried to do in previous posts.

A better way to think about sample size is in relation to the number of parameters that you are trying to estimate.  I am not sure how many observed variables you are modeling; it looks like more than 41.  You can count your parameters as shown below.  There is no fixed rule-of-thumb, but introductory SEM books can provide some rough guidelines (e.g., 1:10, 1:20).  You can compute the number of variances and covariances as (k * (k + 1))/2 where k is the number of observed variables.

lavInspect(my.lavaan.fit.object, what = 'free')
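For concreteness, the moment count from the formula above can be run directly in base R (using k = 41 observed variables, the count mentioned earlier in the thread):

```r
k <- 41                      # number of observed variables
moments <- k * (k + 1) / 2   # unique variances and covariances in the sample matrix
moments
# 861 sample moments available to estimate the model's free parameters against
```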

All of your fit statistics and indices are sending a consistent message:  Your model does not fit.  If the model is misspecified, you probably need to address that before you can make sense of anything else.  Build up the model step-by-step as described in my first post and watch for when the poor fit starts to show up.

Once you include covariates and outcomes, you are out of CFA territory and need to give up the idea of correlations.  Your factors are now endogenous variables that depend upon the covariates and the outcomes depend upon the factors (I am guessing the second order factors?).

I do not think that it is a good idea to model the covariates and outcomes in separate models.  Regression assumes that any omitted variables are exactly uncorrelated with the predictors in the model, in the population.  That may not be plausible.

If your sample size is too small -- and it may be, I am not sure; intuitively it feels to me like more of an N = 1000 model than an N = 500 model, and you are below 500 -- then you might consider the following alternative.  Sum the items to form scale scores in place of the first-order factors.  Model these using the current second-order factors as first-order factors.  This will reduce the number of parameters that you are estimating.

One way to evaluate this is to simulate data from your model and then fit the model to the simulated data.  This will help you understand how the model behaves when the hypothesis is correct.  See the simulateData() function.  The standard errors for the parameters, both using your data and simulated data, may offer the best guide.
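A minimal sketch of that simulation check (the model string and loadings are hypothetical; sample.nobs matches the N = 481 from the thread):

```r
library(lavaan)

# Population model with fixed parameter values to generate data from
pop.model <- '
  SELF_PROF =~ 0.7*ITEM6 + 0.7*ITEM17 + 0.7*ITEM22
'
set.seed(123)
d.sim <- simulateData(pop.model, sample.nobs = 481)

# Refit the analysis model to the simulated data and inspect the SEs
fit.sim <- cfa('SELF_PROF =~ ITEM6 + ITEM17 + ITEM22', data = d.sim)
parameterEstimates(fit.sim)[, c('lhs', 'op', 'rhs', 'est', 'se')]
```

If the standard errors are already large when the model is true by construction, the real data cannot be expected to do better.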