How to get outliers for a selection of variables used in a SEM

Lies Henderickx

Mar 21, 2018, 5:23:21 AM
to lavaan
I need to find whether there are outliers in my dataset impacting my model fit and parameter estimation results.

I only use a selection of variables of the dataset.

This is my SEM syntax:

model <- '
extsoc =~ JOBMOTIVATIE_extsoc1 + JOBMOTIVATIE_extsoc2 + JOBMOTIVATIE_extsoc3
extmat =~ JOBMOTIVATIE_extmat1 + JOBMOTIVATIE_extmat2 + JOBMOTIVATIE_extmat3
introj =~ JOBMOTIVATIE_introj1 + JOBMOTIVATIE_introj2 + JOBMOTIVATIE_introj3 + JOBMOTIVATIE_introj4
ident =~ JOBMOTIVATIE_ident1 + JOBMOTIVATIE_ident2 + JOBMOTIVATIE_ident3
intrin =~ JOBMOTIVATIE_intrin1 + JOBMOTIVATIE_intrin2 + JOBMOTIVATIE_intrin3

MEAN_Smartconsumer ~ extsoc
MEAN_Smartconsumer ~ extmat
MEAN_Smartconsumer ~ introj
MEAN_Smartconsumer ~ ident
MEAN_Smartconsumer ~ intrin

MEAN_BEINGABLETO ~ extsoc
MEAN_BEINGABLETO ~ extmat
MEAN_BEINGABLETO ~ introj
MEAN_BEINGABLETO ~ ident
MEAN_BEINGABLETO ~ intrin

MEAN_VALUING ~ extsoc
MEAN_VALUING ~ extmat
MEAN_VALUING ~ introj
MEAN_VALUING ~ ident
MEAN_VALUING ~ intrin

MEAN_DOINGRESEARCH ~ extsoc
MEAN_DOINGRESEARCH ~ extmat
MEAN_DOINGRESEARCH ~ introj
MEAN_DOINGRESEARCH ~ ident
MEAN_DOINGRESEARCH ~ intrin


JOBMOTIVATIE_introj1 ~~ JOBMOTIVATIE_introj4
JOBMOTIVATIE_introj3 ~~ JOBMOTIVATIE_introj4
JOBMOTIVATIE_extsoc2 ~~ JOBMOTIVATIE_extsoc3'

fit <- sem(model, data=dat_SEM, group="fitK3.cluster")

I tried to use the faoutlier package, but I keep getting errors.

Syntax:
(FS <- forward.search(dat_SEM$JOBMOTIVATIE_extsoc2,dat_SEM$JOBMOTIVATIE_extsoc2,dat_SEM$JOBMOTIVATIE_extsoc3,
                    dat_SEM$JOBMOTIVATIE_extmat1,dat_SEM$JOBMOTIVATIE_extmat2, dat_SEM$JOBMOTIVATIE_extmat3,
                    dat_SEM$JOBMOTIVATIE_introj1, dat_SEM$JOBMOTIVATIE_introj2, dat_SEM$JOBMOTIVATIE_introj3, dat_SEM$JOBMOTIVATIE_introj4, 
                    dat_SEM$JOBMOTIVATIE_ident1, dat_SEM$JOBMOTIVATIE_ident2,dat_SEM$JOBMOTIVATIE_ident3,
                    dat_SEM$JOBMOTIVATIE_intrin1, dat_SEM$JOBMOTIVATIE_intrin2, dat_SEM$JOBMOTIVATIE_intrin3,
                    dat_SEM$MEAN_BEINGABLETO, dat_SEM$MEAN_DOINGRESEARCH, dat_SEM$MEAN_Smartconsumer, dat_SEM$MEAN_VALUING,
                    model))
Error: 
Error in 1:N : argument of length 0

What can I do to prevent this error?
And how can I extract the outliers?

kma...@aol.com

Mar 25, 2018, 10:01:08 AM
to lavaan
Lies,
Again, this is not a lavaan question so much as a general SEMNET question.  That may be why it has gone unanswered.  Here is the code that I use in my data preparation lecture.  Substitute your own data frame for NSKK.df.  If I were writing the code today, I would probably use the head() function rather than indexing to print out just the first 20 cases.  However, that is more for illustration and to check for errors than it is a functional part of the analysis.



 
# These packages must be installed before loading.
require(psych)
require(moments)
require(robustbase)
require(MVN)
require(mnormt)
require(mi)
require(rela)

(snip)

# Multivariate Normality tests from package MVN
#  Useful to run all three because they are not equally sensitive.
#  (These are the older MVN function names; MVN 5.0 and later
#   fold them into a single mvn() function.)
Mardia.NSKK <- mardiaTest(NSKK.df)
Mardia.NSKK

Henze.Zirkler.NSKK <- hzTest(NSKK.df)
Henze.Zirkler.NSKK

Royston.NSKK <- roystonTest(NSKK.df)
Royston.NSKK
 
# MVN provides univariate tests and plots to follow up
uniPlot(NSKK.df, type='histogram')
uniNorm(NSKK.df, type='SW', desc=TRUE)
 
# Mahalanobis test (for multivariate outliers)
NSKK.chisq <- mahalanobis(
    x = NSKK.df,
    center = colMeans(NSKK.df),
    cov = cov(NSKK.df)
)
# df equals the number of variables (4 for NSKK.df).
NSKK.pvalue <- pchisq(q=NSKK.chisq, df=ncol(NSKK.df), lower.tail=FALSE)
# Case numbers run from 1 to the number of rows (1000 for NSKK.df).
MahalanobisOut <- cbind(seq_len(nrow(NSKK.df)), NSKK.chisq, NSKK.pvalue)

# Only print first 20 cases.
MahalanobisOut[1:20,]

# View outliers only.
MahalanobisOut[MahalanobisOut[,3] < .05,]

 
#plot
par(ask=FALSE, mfrow=c(1,1))
hist(NSKK.chisq)
# Critical value uses df = number of variables (4 for NSKK.df).
Critical.chisq <- qchisq(p=.05, df=ncol(NSKK.df), lower.tail=FALSE)
abline(v=Critical.chisq,col='red')
text(x=Critical.chisq, y=400, labels='p < .05', col='red', pos=4)

 
# Robust Mahalanobis Distance
# Compute robust Mahalanobis distances
robMD <- covMcd(NSKK.df)
plot(NSKK.chisq, robMD$mah, pch=16,
  xlab='Standard Mahalanobis Distance',
  ylab='Robust Mahalanobis Distance')
abline(a=0, b=30, col=gray(.60))

 
# Compare robust and sample means
colMeans(NSKK.df)
robMD$center

# Compare SDs
sqrt(diag(cov(NSKK.df)))
sqrt(diag(robMD$cov))

# Compare correlations
round(cov2cor(cov(NSKK.df)),2)
round(cov2cor(robMD$cov),2)

 
# The centroid and covariance matrix look very similar.
# Nonetheless they make a big difference to the range of
#  Mahalanobis distances.

 
 
 
# Convenient plot & list
quan.NSKK <- mvOutlier(NSKK.df, method='quan')
head(quan.NSKK$outlier, 10)
tail(quan.NSKK$outlier, 10)
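
For the original question, where only some columns of dat_SEM enter the model, the same Mahalanobis screening can be limited to those columns. What follows is only a minimal sketch under several assumptions: the data are simulated as a stand-in for dat_SEM, the variable list is shortened, unused_variable is a hypothetical column that is not in the model, and the grouping variable is ignored. It uses the same .05 cutoff as the code above.

```r
# Simulated stand-in for dat_SEM (hypothetical data; replace with your own).
set.seed(1)
n <- 200
dat_SEM <- data.frame(
  JOBMOTIVATIE_extsoc1 = rnorm(n),
  JOBMOTIVATIE_extsoc2 = rnorm(n),
  JOBMOTIVATIE_extsoc3 = rnorm(n),
  MEAN_VALUING         = rnorm(n),
  unused_variable      = rnorm(n)   # hypothetical column, not in the model
)
# Plant one obvious multivariate outlier so there is something to find.
dat_SEM[1, 1:4] <- c(6, -6, 6, -6)

# Names of the variables used in the model (shortened list for illustration).
model_vars <- c("JOBMOTIVATIE_extsoc1", "JOBMOTIVATIE_extsoc2",
                "JOBMOTIVATIE_extsoc3", "MEAN_VALUING")

# Keep only the model variables and drop cases with missing values on them.
X <- dat_SEM[complete.cases(dat_SEM[, model_vars]), model_vars]

# Squared Mahalanobis distances and chi-square p-values.
# df equals the number of variables screened.
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
p  <- pchisq(d2, df = ncol(X), lower.tail = FALSE)

# Row names identify the cases in the original data frame.
out <- data.frame(case = rownames(X), d2 = d2, p = p)

# Flagged cases, most extreme first.
flagged <- out[out$p < .05, ]
flagged[order(flagged$p), ]
```

With real data, model_vars would list every observed variable named in the model syntax. Because the model is fitted by group, it may also be worth repeating the screening within each level of the grouping variable.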




 Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/


Terrence Jorgensen

Mar 27, 2018, 7:22:10 AM
to lavaan
> this is not a lavaan question so much as a general SEMNET question

Agreed, but here is a software package that provides robust estimation for outliers:


Terrence D. Jorgensen
Postdoctoral Researcher, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
