Error: missing values in observed variables


Kaushal Chaudhary

Mar 6, 2014, 4:59:53 PM
to lav...@googlegroups.com
Hello,
 
I am using the following code for a full SEM model.
 
# read the data, treating -999 as missing
dat <- read.csv(file = "DATA.csv", na.strings = "-999")
head(dat)
library(lavaan)

### full SEM model
mymodel <- '
  # latent variables
  SMOKINGSTATUS    =~ SMOKINGDURING_PRIOR_PREG + SMOKING_DURING_PREGNANCY + SMOKING_PRIOR_PREG
  ALCOHOLUSESTATUS =~ ALCOHOLUSEANYTIME + ALCOHOLUSEPRIORPREG + ALCOHOLUSE_INDEXPREG
  DRUGUSESTATUS    =~ DRUGUSEANYTIME + DRUGUSEPREG + DRUGUSEPRIORPREG

  # regression
  TOTAL_BIRTH_WEIGHT_POUNDS ~ SMOKINGSTATUS + ALCOHOLUSESTATUS + DRUGUSESTATUS
'
fit <- sem(model = mymodel, data = dat, missing = "FIML")
summary(fit, fit.measures = TRUE, standardized = TRUE)
 
Some observed variables have three levels (coded 1, 2, 3; for example, ALCOHOLUSEANYTIME).
 
I am getting the following error.
 
 
Error in lav_data_full(data = data, group = group, group.label = group.label,  : 
  lavaan ERROR: missing observed variables in dataset: SMOKINGDURING_PRIOR_PREG SMOKING_DURING_PREGNANCY SMOKING_PRIOR_PREG ALCOHOLUSEANYTIME ALCOHOLUSEPRIORPREG ALCOHOLUSE_INDEXPREG DRUGUSEANYTIME DRUGUSEPREG DRUGUSEPRIORPREG
 
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lavaan_0.5-15

loaded via a namespace (and not attached):
[1] mnormt_1.4-7   pbivnorm_0.5-1 quadprog_1.5-5 stats4_3.0.2   tools_3.0.2   
Thank you for your help.
 
 

yrosseel

Mar 7, 2014, 3:33:47 AM
to lav...@googlegroups.com

> Error in lav_data_full(data = data, group = group, group.label =
> group.label, : lavaan ERROR: missing observed variables in dataset:
> SMOKINGDURING_PRIOR_PREG SMOKING_DURING_PREGNANCY SMOKING_PRIOR_PREG
> ALCOHOLUSEANYTIME ALCOHOLUSEPRIORPREG ALCOHOLUSE_INDEXPREG
> DRUGUSEANYTIME DRUGUSEPREG DRUGUSEPRIORPREG

This simply means that the variables in your syntax can not be found
(exactly) in your dataset.

Look at the variable names in your dataset:

names(dat)

And note that capitalization matters. SMOKING is not the same as smoking.

Yves.
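
One quick way to act on this advice is to compare the names used in the model syntax with names(dat). The sketch below is illustrative only: the toy data frame and its lower-case column name are assumptions made up for the example, not the poster's actual data.

```r
# Hypothetical data frame standing in for the poster's DATA.csv;
# its column names are invented to show a capitalization mismatch.
dat <- data.frame(
  smoking_during_pregnancy  = c(1, 2, 3),
  ALCOHOLUSEANYTIME         = c(1, 1, 2),
  TOTAL_BIRTH_WEIGHT_POUNDS = c(6.5, 7.1, 8.0)
)

# Observed variable names exactly as typed in the model syntax.
model_vars <- c("SMOKING_DURING_PREGNANCY", "ALCOHOLUSEANYTIME",
                "TOTAL_BIRTH_WEIGHT_POUNDS")

# Names used in the model that have no exact match in the data.
not_found <- setdiff(model_vars, names(dat))
print(not_found)

# For each unmatched name, look for a column that differs only by case.
case_match <- names(dat)[match(tolower(not_found), tolower(names(dat)))]
print(data.frame(in_model = not_found, in_data = case_match))
```

Any name listed in not_found has to be corrected either in the model syntax or in the data (for example with names(dat) <- toupper(names(dat)), if the difference is purely one of case).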

Kaushal Chaudhary

Mar 7, 2014, 10:03:59 AM
to lav...@googlegroups.com
Hello Yves,
 
Thank you for your prompt reply. I have now run into another warning message.
 
Warning message:
In lavaan::lavaan(model = mymodel, data = dat, missing = "ML", model.type = "sem",  :
  lavaan WARNING: covariance matrix of latent variables is not positive definite; use inspect(fit,"cov.lv") to investigate.
> inspect(fit,"cov.lv")
                 SMOKIN ALCOHO DRUGUS
SMOKINGSTATUS     0.014              
ALCOHOLUSESTATUS  0.096  0.089       
DRUGUSESTATUS    -0.018 -0.037  0.272
 
Some observed variables have three levels (1, 2, 3; e.g. AlcoholUseAnytime). Do I have to create dummy variables for them, or use the ordered() function? I don't know whether this is causing the warning.
 
Thank you.

Terrence Jorgensen

Mar 9, 2014, 1:03:57 PM
to lav...@googlegroups.com

 
> Warning message:
> In lavaan::lavaan(model = mymodel, data = dat, missing = "ML", model.type = "sem",  :
>   lavaan WARNING: covariance matrix of latent variables is not positive definite; use inspect(fit,"cov.lv") to investigate.
> > inspect(fit,"cov.lv")
>                  SMOKIN ALCOHO DRUGUS
> SMOKINGSTATUS     0.014
> ALCOHOLUSESTATUS  0.096  0.089
> DRUGUSESTATUS    -0.018 -0.037  0.272
>
> Some observed variables have three levels (1, 2, 3; e.g. AlcoholUseAnytime). Do I have to create dummy variables for them, or use the ordered() function? I don't know whether this is causing the warning.
 

The warning simply means that at least one estimate lies outside the range of values that is possible in a true (population) model. Use "cor.lv" rather than "cov.lv" to inspect the latent correlation matrix, and you will see which correlation is beyond +/- 1 (I'm guessing it is the one between ALCOHOLUSESTATUS and SMOKINGSTATUS, given how large their covariance is compared to the variance of SMOKINGSTATUS).

As far as the 3-level observed variables go, you don't need dummy codes. Just specify the names of the ordered categorical variables in your call: sem(..., ordered = c("this.variable", "that.variable")). If you do so, you will not be able to use FIML to account for missing data, so you might consider multiple imputation instead. The sem.mi() function in the semTools package automates the process of imputation, analysis, and pooling of results. If you are on the fence about whether 3-level ordinal variables should be treated as continuous, this article gives some idea of the practical consequences: http://dx.doi.org/10.1037/a0029315

Terry
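
The check suggested above can also be done by hand from the printed covariance matrix. The following is a minimal sketch in base R that re-enters the rounded values from the inspect(fit, "cov.lv") output, so the results are approximate; the object names are chosen only for this illustration.

```r
# Latent covariance matrix as printed by inspect(fit, "cov.lv") above
# (values are rounded to three decimals, so results are approximate).
lv <- c("SMOKINGSTATUS", "ALCOHOLUSESTATUS", "DRUGUSESTATUS")
cov_lv <- matrix(c( 0.014,  0.096, -0.018,
                    0.096,  0.089, -0.037,
                   -0.018, -0.037,  0.272),
                 nrow = 3, byrow = TRUE, dimnames = list(lv, lv))

# Convert covariances to correlations.
cor_lv <- cov2cor(cov_lv)
print(round(cor_lv, 2))

# Flag off-diagonal correlations outside [-1, 1].
out_of_bounds <- abs(cor_lv) > 1 & row(cor_lv) != col(cor_lv)
print(which(out_of_bounds, arr.ind = TRUE))

# A negative eigenvalue confirms the matrix is not positive definite.
print(eigen(cov_lv, symmetric = TRUE, only.values = TRUE)$values)
```

With these rounded inputs, the implied correlation between SMOKINGSTATUS and ALCOHOLUSESTATUS comes out at roughly 2.7, while the other two correlations stay inside the admissible range. On a fitted model, inspect(fit, "cor.lv") gives the correlation matrix directly.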

Kaushal Chaudhary

Mar 17, 2014, 11:06:06 AM
to lav...@googlegroups.com
Hi Terrence,
 
Thank you very much for your input. I fixed the non-positive-definite problem by removing some of the observed variables used to define the latent variables. I now have output from the model. I was wondering how to interpret it, and which parts of the output are important for interpretation in SEM. My model appears to fit well, since the RMSEA is below 0.05. Thank you for your help.
 
lavaan (0.5-15) converged normally after 238 iterations

  Number of observations                           402

  Number of missing patterns                        51

  Estimator                                         ML
  Minimum Function Test Statistic               53.796
  Degrees of freedom                                33
  P-value (Chi-square)                           0.013

Model test baseline model:

  Minimum Function Test Statistic              917.811
  Degrees of freedom                                55
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    0.976
  Tucker-Lewis Index (TLI)                       0.960

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2833.187
  Loglikelihood unrestricted model (H1)      -2806.289

  Number of free parameters                         44
  Akaike (AIC)                                5754.374
  Bayesian (BIC)                              5930.218
  Sample-size adjusted Bayesian (BIC)         5790.602

Root Mean Square Error of Approximation:

  RMSEA                                          0.040
  90 Percent Confidence Interval          0.019  0.058
  P-value RMSEA <= 0.05                          0.807

Standardized Root Mean Square Residual:

  SRMR                                           0.071

Parameter estimates:

  Information                                 Observed
  Standard Errors                             Standard

                   Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all
Latent variables:
  SMOKINGSTATUS =~
    NEW_SMOKING_D     1.000                               0.257    0.642
    Nw_Smkng_P_P_     0.829    0.842    0.984    0.325    0.213    0.283
  ALCOHOLUSESTATUS =~
    AlcohlUsAnytm     1.000                               0.222    0.463
    NEW_ALCOHOLUS     1.213    0.582    2.086    0.037    0.269    0.765
  DRUGUSESTATUS =~
    DrugUseAnytim     1.000                               0.394    0.820
    NEW_DRUGUSEPR     1.161    0.090   12.832    0.000    0.457    0.910
  PRENATALCARESTATUS =~
    PrenatalCare      1.000                               0.039    0.262
    PRENATALCARE_   -17.651    5.218   -3.383    0.001   -0.689   -0.887
    PrntlCr_StrtT    10.916    3.276    3.332    0.001    0.426    0.652

Regressions:
  TOTAL_BIRTH_WEIGHT_POUNDS ~
    ALCOHOLUSESTA    -0.133    0.442   -0.302    0.763   -0.030   -0.022
    DRUGUSESTATUS     0.254    0.248    1.023    0.306    0.100    0.073
    PRENATALCARES     2.320    2.407    0.964    0.335    0.091    0.066
  Gestational_Age_At_Birth ~
    ALCOHOLUSESTA     1.047    0.806    1.299    0.194    0.232    0.102
    DRUGUSESTATUS    -0.987    0.535   -1.847    0.065   -0.389   -0.170
    PRENATALCARES   -20.004    7.689   -2.601    0.009   -0.781   -0.342
  TOTAL_BIRTH_WEIGHT_POUNDS ~
    Gsttnl_Ag_A_B     0.426    0.023   18.264    0.000    0.426    0.710
 
R-Square:

    NEW_SMOKING_DURING_PREGNANCY     0.412
    New_Smoking_Prior_Preg_timing     0.080
    AlcoholUseAnytime     0.215
    NEW_ALCOHOLUSE_INDEXPREG     0.586
    DrugUseAnytime     0.672
    NEW_DRUGUSEPREG     0.828
    PrenatalCare      0.069
    PRENATALCARE_VISIT_CATEGORY     0.788
    PrenatalCare_StartedTrimester     0.425
    TOTAL_BIRTH_WEIGHT_POUNDS     0.491
    Gestational_Age_At_Birth     0.075
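
As a cross-check on how the summary statistics in this output relate to one another, the CFI, TLI, RMSEA, AIC and BIC can be recomputed from the chi-square statistics, the log-likelihood and the sample size. This is only a sketch using standard textbook formulas; in particular, using N rather than N - 1 in the RMSEA denominator is an assumption (both give the same value to three decimals here), and the variable names are made up for the example.

```r
# Values copied from the lavaan output above.
chisq_m <- 53.796;  df_m <- 33    # user model
chisq_b <- 917.811; df_b <- 55    # baseline model
n       <- 402                    # number of observations
loglik  <- -2833.187              # user model (H0)
npar    <- 44                     # free parameters

# Comparative Fit Index
cfi <- 1 - max(chisq_m - df_m, 0) / max(chisq_b - df_b, chisq_m - df_m, 0)
# Tucker-Lewis Index
tli <- (chisq_b / df_b - chisq_m / df_m) / (chisq_b / df_b - 1)
# Root Mean Square Error of Approximation
rmsea <- sqrt(max(chisq_m - df_m, 0) / (df_m * n))
# Information criteria
aic <- -2 * loglik + 2 * npar
bic <- -2 * loglik + npar * log(n)

print(round(c(CFI = cfi, TLI = tli, RMSEA = rmsea), 3))
print(round(c(AIC = aic, BIC = bic), 3))
```

These reproduce the printed values (CFI 0.976, TLI 0.960, RMSEA 0.040, AIC 5754.374, BIC 5930.218). Note that the chi-square test itself is significant (p = 0.013) and the SRMR is 0.071, so the RMSEA alone does not tell the whole story about fit.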

yrosseel

Mar 18, 2014, 6:27:54 AM
to lav...@googlegroups.com
On 03/17/2014 04:06 PM, Kaushal Chaudhary wrote:
> Hi Terrence,
> Thank you very much for your input. I fixed not positive definiteness
> by removing some of the observed variables while creating the latent
> variables. I have output from model. I was wondering how to interpret
> these and which output is important for interpretation in SEM?

I think this is the point where you need to read a good book on SEM.

Yves.

Kaushal Chaudhary

Mar 18, 2014, 11:34:40 AM
to lav...@googlegroups.com
 
Thanks Yves,
 
 I am reading some books that I have. However, I have a question that I am having trouble with.
 
Some categorical variables are used as indicators of my latent variables, and I am fitting an SEM model.
 
 
model <- '
  f1 =~ x1 + x2 + x3
  f2 =~ x4 + x5 + x6

  f1 ~ f2
'
 
x1-x6 are categorical variables.
 
semmodel <- sem(model, data, ordered=c("x1",........"x6"))
 
Would x1-x6 be treated as endogenous variables? If so, should my model be specified as above?
 
 
 
I am also quoting from your tutorial below.
 

10 Using categorical variables

Binary, ordinal and nominal variables are considered categorical (not continuous). It makes a big difference if these categorical variables are exogenous (independent) or endogenous (dependent) in the model.

Exogenous categorical variables

If you have a binary exogenous covariate (say, gender), all you need to do is to recode it as a dummy (0/1) variable. Just like you would do in a classic regression model. If you have an exogenous ordinal variable, you can use a coding scheme reflecting the order (say, 1,2,3,...) and treat it as any other (numeric) covariate. If you have a nominal categorical variable with K > 2 levels, you need to replace it by a set of K - 1 dummy variables, again, just like you would do in classical regression.

Endogenous categorical variables

The lavaan 0.5 series can deal with binary and ordinal (but not nominal) endogenous variables. Only the three-stage WLS approach is currently supported, including some 'robust' variants. To use binary/ordinal data, you have two choices:

1. declare them as 'ordered' (using the ordered function, which is part of base R) in your data.frame before you run the analysis; for example, if you need to declare four variables (say, item1, item2, item3, item4) as ordinal in your data.frame (called Data), you can use something like:

Data[,c("item1","item2","item3","item4")] <- lapply(Data[,c("item1","item2","item3","item4")], ordered)

2. use the ordered argument when using one of the fitting functions (cfa/sem/growth/lavaan), for example, if you have four binary or ordinal variables (say, item1, item2, item3, item4), you can use:

fit <- cfa(myModel, data = myData, ordered=c("item1","item2","item3","item4"))
 
 
Thanks for your help.

yrosseel

Mar 18, 2014, 11:37:58 AM
to lav...@googlegroups.com
On 03/18/2014 04:34 PM, Kaushal Chaudhary wrote:
> Thanks Yves,
> I am reading some books that I have. However, I have a question I am
> having trouble.
> I have some categorical variables used in latent variable and I am
> fitting SEM model.
> model <- ' f1=~ x1+x2+x3
> f2=~ x4+x5+x6
> f1~f2'
> x1-x6 are categorical variables.
> semmodel<-sem(model, data, ordered=c("x1",........"x6"))
> Would x1-x6 be treated as endogenous variables?

Yes.

> If that is
> correct then my model should be above like this?

Yes.

Yves.
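
For readers who want to try the confirmed setup end to end, here is a small self-contained sketch. The data are simulated purely for illustration (sample size, loadings and cut points are arbitrary assumptions, and the helper cut3() is made up for this example); only the model syntax and the use of the ordered argument follow the thread.

```r
library(lavaan)

set.seed(123)
n  <- 600
f2 <- rnorm(n)               # simulated latent predictor
f1 <- 0.5 * f2 + rnorm(n)    # simulated latent outcome

# Cut a continuous score into three ordered categories coded 1, 2, 3.
cut3 <- function(z) as.integer(cut(z, breaks = c(-Inf, -0.5, 0.5, Inf)))

dat <- data.frame(
  x1 = cut3(0.8 * f1 + rnorm(n)), x2 = cut3(0.7 * f1 + rnorm(n)),
  x3 = cut3(0.6 * f1 + rnorm(n)), x4 = cut3(0.8 * f2 + rnorm(n)),
  x5 = cut3(0.7 * f2 + rnorm(n)), x6 = cut3(0.6 * f2 + rnorm(n))
)

model <- '
  f1 =~ x1 + x2 + x3
  f2 =~ x4 + x5 + x6
  f1 ~ f2
'

# Declare all six indicators as ordered (endogenous) categorical variables.
fit <- sem(model, data = dat, ordered = paste0("x", 1:6))
summary(fit, standardized = TRUE)
```

Because the indicators are declared as ordered, the summary should report thresholds for x1 to x6 and an estimator from the WLS family rather than ML, in line with the tutorial passage quoted earlier in the thread.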

Kaushal Chaudhary

Mar 18, 2014, 12:39:36 PM
to lav...@googlegroups.com
 
Thank you very much, Yves.