More variables than cases: EM, MI or how?


Chris McManus

Feb 1, 2018, 11:51:24 AM
to Missing Data
Apologies if this issue has already been raised, but a few searches of the group under what seemed like the obvious terms didn't find anything. Internet searches and textbooks don't seem to offer a clear solution either.

I am carrying out an analysis of data from a number of teaching institutions (~30), and they are almost all of those in the UK.  So increasing the sample size is not really a possibility (and listwise deletion would run the risk of N being so small that nothing would ever be significant). I should also add that there are clear policy implications if it can be determined which features of courses affect outcomes (and there is much hand-waving saying that A and B clearly cause course graduates to be C or D - with little or no actual data, often not even considering that entrants may differ between the courses).

Various descriptive statistics are available from multiple sources for these institutions, but most are not complete. Some are incomplete for structural reasons (courses are new, so there are few data on previous graduates and current students have mostly not graduated, which means most descriptors are within the course itself). Other data are incomplete for survey reasons (a few courses have changed recently, so the data are not appropriate as the new course has not yet settled down), or due to non-response (not all courses have replied to requests for data, etc.).

A practical problem is that the interest in the study is in the effects of input and process variables on a range of outcome variables (careers entered, post-grad performance, or whatever). The measures are themselves ordered in time, and for many it is possible to write down a plausible structural model, and it would be nice to fit that model, emphasising closer rather than more distant relationships. Official statistics generate many such course description statistics nowadays, and it is very easy to end up with more than 30 or so variables. I have cut some of the numbers down by using the first couple of principal components, but that is not always possible. And of course reviewers will always ask that X, Y and Z are also taken into account. So it is easy to have more variables than cases.

Normally, when there are many more cases than variables, I would use MI (and that can be handled in R using mice() and then lavaan() ). However, I have a real concern about using MI (or EM for the correlation matrix) when there are more variables than cases, since both in effect require inverting a covariance matrix, and with more variables than cases that matrix will be singular. Certainly it doesn't make sense to carry out a conventional regression when there are more variables than cases...
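
To make the singularity concern concrete, here is a minimal sketch in plain Python (used purely for illustration; it is not part of the R workflow mentioned above). It assumes simulated, fully observed data with 30 cases and 40 variables, and the helper names `covariance_matrix` and `matrix_rank` are hypothetical. Because centring the data removes one degree of freedom, the sample covariance matrix can have rank at most N - 1, which is less than the number of variables, so it cannot be inverted.

```python
import random

def covariance_matrix(rows):
    """Sample covariance matrix (p x p) from n complete rows of p values."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
            for i in range(p)]

def matrix_rank(m, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]          # work on a copy
    n_rows, n_cols = len(a), len(a[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Largest remaining entry in this column becomes the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < tol:
            continue                   # column depends on earlier ones
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank

# Assumed toy setting: 30 "institutions", 40 simulated variables.
rng = random.Random(1)
n_cases, n_vars = 30, 40
data = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_cases)]

cov = covariance_matrix(data)
rank = matrix_rank(cov)
print("rank of covariance matrix:", rank, "of", n_vars, "variables")
```

The rank falls short of the number of variables however the data are generated, which is why any procedure that relies on inverting the full covariance matrix breaks down without some form of regularisation or variable reduction.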

So what to do?  Of course the matrix inversion problem would not arise if, say, I were to hot-deck or some equivalent (perhaps repeatedly) as all missing values would be filled in, a correlation matrix could be calculated, and a structural model fitted in which only a small proportion of total paths are included in the model so that the degrees of freedom are still OK. That doesn't feel right though.
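
For concreteness, the repeated hot-deck idea could be sketched roughly as follows, again in plain Python for illustration only. Everything here is an assumption: the toy data are invented, the function names `hot_deck_once` and `correlation` are hypothetical, and the donor rule (draw at random from the observed values of the same variable) is the simplest possible one. Note that such unconditional donor draws ignore relationships between variables and so tend to pull correlations towards zero, which may be part of why the approach doesn't feel right.

```python
import random

def hot_deck_once(rows, rng):
    """Fill each None with a random observed value from the same column."""
    p = len(rows[0])
    donors = [[r[j] for r in rows if r[j] is not None] for j in range(p)]
    return [[v if v is not None else rng.choice(donors[j])
             for j, v in enumerate(r)] for r in rows]

def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Invented toy data: 6 courses, 3 variables, None marks a missing value.
toy = [
    [1.0, 2.1, None],
    [2.0, None, 3.2],
    [3.0, 3.9, 4.1],
    [4.0, 5.2, None],
    [None, 6.1, 6.0],
    [6.0, 6.8, 7.1],
]

rng = random.Random(2018)
# Repeat the fill-in 20 times and average the correlation of interest.
completed_sets = [hot_deck_once(toy, rng) for _ in range(20)]
r_01 = [correlation([r[0] for r in d], [r[1] for r in d])
        for d in completed_sets]
mean_r_01 = sum(r_01) / len(r_01)
print("mean correlation between variables 0 and 1:", round(mean_r_01, 3))
```

Averaging over repeated fill-ins gives a point estimate only; it does not by itself provide the between-imputation variance that proper MI would use for standard errors.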

Has anyone any thoughts on how to approach the problem?  Any references to existing literature would be particularly appreciated.

With many thanks

Chris McManus
Professor of Psychology and Medical Education, UCL



