FIML and auxiliary variables


Kevin Hallgren

Oct 9, 2012, 10:06:38 PM
to lav...@googlegroups.com
Hello,

When using lavaan with FIML estimation and missing data, do the variables that are included in the dataset but not explicitly specified in the model get used as auxiliary variables to help with estimating missing values?  Or do you need to use the auxiliary() function in semTools to include auxiliary variables?

The lavaan manual says about the missing parameter: 'If "direct" or "ml" or "fiml" and the estimator is maximum likelihood, Full Information Maximum Likelihood (FIML) estimation is used using all available data in the data frame.'  So I wasn't clear if they meant they use "all available data" to help give the best estimate for missing values (i.e., they are auxiliary), or if that just meant it doesn't use listwise deletion.

My code is a simple regression model:

auxvars = c("NALTREXO","THERAPY","GENDER","AGE","DEPNDSX","T0_PHD","T4_PHD")   
model1 = 'T4_PHD ~ 1 + THERAPY'
fit.fiml = sem(model1, data=raw.dataset[,auxvars], estimator="ML", missing="FIML")
summary(fit.fiml)

Although only the T4_PHD and THERAPY variables are used in the model, I would like the 5 additional auxiliary variables to help give the most accurate estimate of the model given missing data for T4_PHD.

Thanks!
Kevin


Alex Schoemann

Oct 9, 2012, 11:59:54 PM
to lav...@googlegroups.com
Hi Kevin,

I don't believe that lavaan will automatically include auxiliary variables. However, the auxiliary function in semTools will definitely do the job. The code to use auxiliary in your case should be pretty straightforward:


auxvars = c("NALTREXO","GENDER","AGE","DEPNDSX","T0_PHD")    

model1 = 'T4_PHD ~ 1 + THERAPY'
fit.fiml = sem(model1, data=raw.dataset, estimator="ML", missing="FIML") 
fit.aux <- auxiliary(fit.fiml, aux = auxvars, data=raw.dataset, estimator="ML", missing="FIML")
summary(fit.aux)

There are a couple of situations where the auxiliary function runs into trouble, and a simple regression like this might be one. If you run into trouble, you might also think about using multiple imputation in lieu of FIML and auxiliary variables.

-Alex

yrosseel

Oct 10, 2012, 3:02:27 AM
to lav...@googlegroups.com
On 10/10/2012 04:06 AM, Kevin Hallgren wrote:
> Hello,
>
> When using lavaan with FIML estimation and missing data, do the
> variables that are included in the dataset but not explicitly specified
> in the model get used as auxiliary variables to help with estimating
> missing values? Or do you need to use the auxiliary() function in
> semTools to include auxiliary variables?
>
> The lavaan manual says about the missing parameter: 'If |"direct"| or
> |"ml"| or |"fiml"| and the estimator is maximum likelihood, Full
> Information Maximum Likelihood (FIML) estimation is used using all
> available data in the data frame.' So I wasn't clear if they meant they
> use "all available data" to help give the best estimate for missing
> values (i.e., they are auxiliary), or if that just meant it doesn't use
> listwise deletion.

I agree this is not very clear indeed. But lavaan does NOT use auxiliary
variables. It only uses the (observed) variables that are included in
your model.

However, just to see if it makes a difference, you can easily trick
lavaan into using auxiliary variables: you add them to the model, but you
make sure they have no effect, for example:

model1 = 'T4_PHD ~ 1 + THERAPY + 0*NALTREXO + 0*AGE'

Here, both NALTREXO and AGE will be used as auxiliary variables.
Note, however, that we assume that all these variables are continuous;
we assume multivariate normality (and I'm not sure about THERAPY, but
GENDER should not be included!). Another caveat is that the degrees of
freedom will be off, but you can get them from the original analysis.

The same is true for the auxiliary() in semTools: only use continuous
variables!

> My code is a simple regression model:
>
> auxvars =
> c("NALTREXO","THERAPY","GENDER","AGE","DEPNDSX","T0_PHD","T4_PHD")
> model1 = 'T4_PHD ~ 1 + THERAPY'
> fit.fiml = sem(model1, data=raw.dataset[,auxvars], estimator="ML",
> missing="FIML")
> summary(fit.fiml)
>
> Although only the T4_PHD and THERAPY variables are used in the model, I
> would like the 5 additional auxiliary variables to help give the most
> accurate estimate of the model given missing data for T4_PHD.

In my understanding, auxiliary variables are not used when estimating
the model parameters! They are only used to fit the unrestricted (h1)
model (i.e., the covariance matrix of the incomplete data). The latter is
only needed to compute the model test statistic. Your estimates (and
standard errors) will be fine without them.

Yves.



Kevin Hallgren

Oct 21, 2012, 3:09:22 AM
to lav...@googlegroups.com
Thanks Alex and Yves.  I'm still running into a few problems using MI with semTools instead.

With my syntax:

fit.mi = lavaan.mi(model1, data=raw.dataset[,auxvars], m = 20)

I get the output:

There were 19 warnings (use warnings() to see them)
Warning messages:
1: In lavaan(model = structure(list(id = 1:4, lhs = c("T4_PHD",  ... :
  lavaan WARNING: model has NOT converged!
2: In lavaan(model = structure(list(id = 1:4, lhs = c("T4_PHD",  ... :
  lavaan WARNING: model has NOT converged!
....


So it looks like of the 20 imputations I tried, 19 failed to converge.  This seems odd to me given that I'm only using observed variables for a regression model, which I would think should be easy to fit even if some of the imputed data are off.  The only thing I can think of that may be causing this is that there may not be much information in the auxiliary variables to do a good job at predicting the missing DV values (the highest correlation is 0.25).
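
One rough way to check how much information the auxiliary variables carry is to look at two things: their pairwise correlations with the outcome, and their correlations with an indicator of whether the outcome is missing. The sketch below is only an illustration. It uses a simulated stand-in data frame (`sim`, with made-up values under some of the thread's column names) rather than the real `raw.dataset`, and `aux_summary` is a hypothetical helper, not a lavaan or semTools function.

```r
# Simulated stand-in for raw.dataset (all values are made up for illustration)
set.seed(123)
n <- 200
sim <- data.frame(
  NALTREXO = rbinom(n, 1, 0.5),
  AGE      = round(rnorm(n, 45, 10)),
  T0_PHD   = runif(n, 0, 100)
)
sim$T4_PHD <- 0.3 * sim$T0_PHD + rnorm(n, 0, 20)
# Assumption: the outcome is more likely to be missing when T0_PHD is high
sim$T4_PHD[runif(n) < plogis((sim$T0_PHD - 60) / 15)] <- NA

# Hypothetical helper: for each auxiliary variable, report its correlation
# with the observed outcome and with the outcome's missingness indicator
aux_summary <- function(data, outcome, aux) {
  miss <- as.numeric(is.na(data[[outcome]]))
  data.frame(
    variable       = aux,
    r_with_y       = sapply(aux, function(v)
      cor(data[[v]], data[[outcome]], use = "pairwise.complete.obs")),
    r_with_missing = sapply(aux, function(v) cor(data[[v]], miss)),
    row.names = NULL
  )
}

res <- aux_summary(sim, "T4_PHD", c("NALTREXO", "AGE", "T0_PHD"))
print(res)
```

With the real data, the call would be `aux_summary(raw.dataset, "T4_PHD", ...)` with the auxiliary variable names. Variables that correlate with neither the outcome nor its missingness are unlikely to help much.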

In case it helps, this is what my dataset looks like (about 50% of the NALTREXO values are 1, and the other 50% are 0):

head(raw.dataset[,auxvars])


  NALTREXO GENDER AGE DEPNDSX   T0_PHD    T4_PHD THERAPY
1        1      1  37       0 56.66667  0.000000       0
2        1      0  24       0 36.66667  0.000000       0
3        1      1  42       1 90.00000  3.571429       0
6        1      1  58       1 93.33333 17.857143       0
7        1      1  52       1 53.33333  0.000000       0
8        1      1  46       0 33.33333  0.000000       0



Thanks for any help!

Kevin

yrosseel

Oct 22, 2012, 10:42:14 AM
to lav...@googlegroups.com
On 10/21/2012 09:09 AM, Kevin Hallgren wrote:
> Thanks Alex and Yves. I'm still running into a few problems for using
> MI with semTools instead.
>
> With my syntax:
>
> fit.mi = lavaan.mi(model1, data=raw.dataset[,auxvars], m = 20)
>
> I get the output:
>
> There were 19 warnings (use warnings() to see them)
> Warning messages:
> 1: In lavaan(model = structure(list(id = 1:4, lhs = c("T4_PHD", ... :
> lavaan WARNING: model has NOT converged!
> 2: In lavaan(model = structure(list(id = 1:4, lhs = c("T4_PHD", ... :
> lavaan WARNING: model has NOT converged!
> ....
>
> So it looks like of the 20 imputations I tried, 19 failed to converge.
> This seems odd

This seems odd to me too. After imputation, the data is complete, and
since you are fitting a simple regression model, lavaan should switch to
least-squares, and it should always converge. Is there a way to extract
the imputed data to see what it looks like?

Yves.

Alex Schoemann

Oct 22, 2012, 10:53:49 AM
to lav...@googlegroups.com
I agree with Yves, this is really odd. I'd recommend running the imputations separately in Amelia, then running each analysis separately to investigate what's happening.

The code for this would be:

library(Amelia)
library(lavaan)

raw.imp <- amelia(raw.dataset[,auxvars], m = 20)
# The imputed datasets are stored in the Amelia object as a list called imputations, of length m

# View the imputed datasets
raw.imp$imputations

# Run the model on the first imputed dataset
model1 = 'T4_PHD ~ 1 + THERAPY'
fit = sem(model1, data=raw.imp$imputations[[1]], estimator="ML")
summary(fit)
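
To find out which of the m completed datasets cause trouble, each one can be checked in turn rather than only the first. The following is a minimal sketch under stated assumptions: `imps` stands for a list of completed data frames such as `raw.imp$imputations`, and is replaced here by a simulated stand-in so the block runs on its own. It fits the same regression by ordinary least squares with base R's `lm()` and reports the range of the outcome, since wildly implausible imputed values would point to a problem in the imputation step rather than in the model. `check_imputation` is a hypothetical helper, not part of Amelia, lavaan, or semTools.

```r
# Simulated stand-in for raw.imp$imputations: a list of m completed data frames
set.seed(42)
m <- 5
imps <- lapply(seq_len(m), function(i) {
  THERAPY <- rbinom(100, 1, 0.5)
  data.frame(THERAPY = THERAPY,
             T4_PHD  = 10 + 5 * THERAPY + rnorm(100, 0, 8))
})

# Hypothetical helper: fit the regression by least squares on one completed
# dataset and return the coefficients plus simple sanity checks
check_imputation <- function(d) {
  fit <- lm(T4_PHD ~ THERAPY, data = d)
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    min_y     = min(d$T4_PHD),
    max_y     = max(d$T4_PHD),
    n_missing = sum(!complete.cases(d)))
}

# One row per imputed dataset
diagnostics <- t(sapply(imps, check_imputation))
print(round(diagnostics, 2))
```

With the real data, `imps <- raw.imp$imputations` would replace the simulated list. Rows with extreme `min_y` or `max_y`, or a nonzero `n_missing`, mark the datasets worth inspecting by hand.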