EM limitations and missing data analysis

Christian Connell

Nov 11, 2000, 3:00:00 AM

Is anyone familiar with references that address the limitations of EM (Missing
Data Analysis) with respect to the level of missing data? I was under the
impression that EM should not be used if rates were above 20%, but cannot
find any specific references that address the issue.

Also, out of curiosity -- what approaches do people on the list tend to use
for missing data analysis? I have used EM for lower rates (when
appropriate) and Schafer's NORM multiple imputation program for datasets
with a larger percentage of missing data. My experience with NORM was
positive, and the use of multiple imputation seems more defensible in terms
of handling larger rates of missing data, but the process of going
back-and-forth from SPSS to NORM and then combining results from multiple
imputations of the data was extremely cumbersome (hence my question on
limits to EM).
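
For readers who have not done the combining step by hand: the usual way to
pool results across imputed datasets is Rubin's rules. The sketch below is
only an illustration in Python, not part of NORM or SPSS; the function name
pool_rubin and the numbers are hypothetical, and it pools a single
coefficient only.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the point estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(variances)          # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, math.sqrt(total)

# Hypothetical regression coefficient and standard error from 5 imputations.
coefs = [0.50, 0.54, 0.46, 0.52, 0.48]
ses = [0.10, 0.11, 0.10, 0.12, 0.10]
pooled_est, pooled_se = pool_rubin(coefs, [s ** 2 for s in ses])
print(pooled_est, pooled_se)
```

The pooled standard error is larger than the average of the individual
standard errors because the between-imputation spread is added in; that
extra term is what carries the uncertainty due to the missing values.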

Christian M. Connell, Ph.D.
Postdoctoral Psychology Fellow, The Consultation Center
Yale University School of Medicine
ccon...@theconsultationcenter.org

Rich Ulrich

Nov 30, 2000, 3:00:00 AM

- from a couple of weeks ago -

On Sat, 11 Nov 2000 07:53:36 -0500, "Christian Connell"
<conn...@peoplepc.com> wrote:

> Is anyone familiar with references that address the limitations of EM (Missing
> Data Analysis) with respect to the level of missing data? I was under the
> impression that EM should not be used if rates were above 20%, but cannot
> find any specific references that address the issue.

< snip, Q about other approaches people use. >

I consider it hazardous to replace Missing with some estimate.
The more complicated it is to do and to describe, the more
hazardous it is, just because you can't keep track of odd influences.
I try to find a way around the Missing problem so that I can describe
where my numbers come from, and what biases they might include.
Computing a "composite score" is half a solution - it takes the
problem out of the computer program that wants "complete data".
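
One common reading of a "composite score" is the mean of whichever items a
person did answer, reported as missing when too few items were answered.
The sketch below assumes that reading; the function name composite_score
and the min_answered rule are hypothetical, not something specified above.

```python
from statistics import mean

def composite_score(items, min_answered):
    """Mean of the answered items, or None if too few were answered.

    items: one person's item responses, with None marking a missing answer
    min_answered: smallest number of answered items accepted (an assumed rule)
    """
    answered = [v for v in items if v is not None]
    if len(answered) < min_answered:
        return None  # leave the composite missing rather than guess
    return mean(answered)

# Hypothetical 4-item scale, requiring at least 3 answered items.
print(composite_score([3, None, 4, 5], min_answered=3))
print(composite_score([3, None, None, 5], min_answered=3))
```

The rule is easy to describe in a methods section, which is the point being
made: it is clear where the numbers come from.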

Why estimate any missing? - You have to be careful that you don't skew
what you are attempting to test. Or estimate.

If I need "complete data" so that a particular algorithm (computer
program) will run, I think the limit has to depend on the particular
application. If 20% replacement of missing is okay, why not 30%? I
am not familiar with 20% as a guideline; I don't know what sort of
data that it should apply to. I think I would be wary about *any*
analysis on medical/clinical data that had over 10%, and 5% might
seem high for most.

Using what I know about the statistical technique, I try to consider,
for the data on hand, how robust the analysis will be. Then, it is
fine to go beyond that a-priori judgment, if possible: that is, it is
good to test the robustness by using other analyses that may be less
complete or less powerful or less informative. (Simplified model? a
test on ranked or dichotomized scores?)

What is really a no-no is when your Estimation procedure has "created"
the significant effect by counting the same evidence more than once.
Or, pretending to extra (not in the data) degrees of freedom.
I don't know how easy it is to learn to spot those.

--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

Alan Acock

Nov 30, 2000, 11:23:29 PM

The EM implementation in SPSS assumes missing values are missing at random.
However, if you include appropriate mechanism variables this assumption is
not too serious. A mechanism variable is a variable that is related to
whether the person answers an item or not--it may or may not be related to
the person's score on the item. Fortunately, we know a lot about mechanism
variables and usually have them included in our datasets. Variables such as
gender, race, and age are typical mechanism variables.
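
A quick way to see whether a candidate mechanism variable is related to
nonresponse is to compare the missing rate on an item across its levels.
The sketch below is a minimal Python illustration; the function name
missing_rate_by_group, the field names, and the records are all
hypothetical.

```python
from collections import defaultdict

def missing_rate_by_group(records, group_key, item_key):
    """Share of records with a missing item, within each level of a grouping variable."""
    counts = defaultdict(lambda: [0, 0])  # group -> [number missing, total]
    for rec in records:
        tally = counts[rec[group_key]]
        tally[1] += 1
        if rec.get(item_key) is None:
            tally[0] += 1
    return {group: miss / total for group, (miss, total) in counts.items()}

# Hypothetical survey records; None marks an unanswered item.
records = [
    {"gender": "F", "item1": 4},
    {"gender": "F", "item1": None},
    {"gender": "M", "item1": None},
    {"gender": "M", "item1": None},
    {"gender": "M", "item1": 2},
    {"gender": "F", "item1": 5},
]
rates = missing_rate_by_group(records, "gender", "item1")
print(rates)
```

Clearly different rates across groups suggest the variable belongs in the
imputation model. A formal test would be needed before drawing conclusions
from a real dataset.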

I believe there is a bug with SPSS's implementation of EM. The imputed
covariance matrix is okay, but the raw data it imputes has attenuated
variance (i.e., it is inconsistent with the covariance matrix). If you can
use the covariance matrix in your analysis (regression, SEM, etc.) this is
not a problem. If SPSS has fixed this problem, I would like to be so
informed.
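
One way imputed raw data can end up with attenuated variance is when each
missing value is filled with its predicted (conditional mean) value and no
residual term is added. Whether that is what SPSS does is an assumption
here, not something established above. The simulation below is a sketch in
plain Python with made-up parameters; it is not the SPSS algorithm.

```python
import random
from statistics import mean, variance

random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + random.gauss(0, 0.8) for xi in x]     # true var(y) is about 1
missing = [random.random() < 0.4 for _ in range(n)]   # 40% of y deleted at random

# Fit y on x using the complete cases only.
xo = [xi for xi, gone in zip(x, missing) if not gone]
yo = [yi for yi, gone in zip(y, missing) if not gone]
mx, my = mean(xo), mean(yo)
slope = sum((a - mx) * (b - my) for a, b in zip(xo, yo)) / sum((a - mx) ** 2 for a in xo)
intercept = my - slope * mx
resid_sd = (sum((b - intercept - slope * a) ** 2 for a, b in zip(xo, yo)) / (len(xo) - 2)) ** 0.5

# Fill missing y with the predicted value alone (no residual).
y_mean_fill = [intercept + slope * xi if gone else yi
               for xi, yi, gone in zip(x, y, missing)]
# Fill missing y with the predicted value plus a random residual draw.
y_noise_fill = [intercept + slope * xi + random.gauss(0, resid_sd) if gone else yi
                for xi, yi, gone in zip(x, y, missing)]

var_observed = variance(yo)
var_mean_fill = variance(y_mean_fill)
var_noise_fill = variance(y_noise_fill)
print(var_observed, var_mean_fill, var_noise_fill)
```

In this setup the mean-filled variable has visibly less variance than the
observed cases, while adding a residual draw brings it back to roughly the
observed level. Analyses run on mean-filled raw data would therefore
understate variances and standard errors, which is consistent with the
advice to work from the covariance matrix instead.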

Superior imputation software (free and better) is available at:
http://methcenter.psu.edu/mde.shtml
for NORM and
http://www.jamesarbuckle.com/amos/applications/index.htm
for a beta program.
Alan Acock

--
***************************************************
The Acock's
Alan Acock's Address is ac...@home.com
Toni Acock's is antoni...@home.com
