Earlier, I posted a query to the group re the use of a hybrid logistic
regression model. My current message relates to the same project but
raises a different issue (which I appreciate is not independent,
however, from the matter I raised separately). I just want to avoid my
messages becoming too long!
I have an ordinal response variable with cohort sizes of 19, 78, 99, 5
and 2 (in the order weak to severe for the condition represented). At
a talk I attended some time ago, it was pointed out that the first step
in carrying out an ordinal logistic regression analysis is to check
that the cohort sizes for the response variable are roughly equal and
if they are not, the next step is to merge categories to ensure that
they are. With the sort of categories I have, merging would make
little sense from a clinical point of view. I would welcome views
therefore on the appropriateness of assuming an ordinal logistic
regression model with such group sizes.
I also have a purely nominal response variable with cohort sizes of
161, 21 and 21 and again, it would not make clinical sense to merge any
of the response categories. Are the group sizes for this response
variable too unbalanced to assume a multinomial regression model?
Feedback would be much appreciated with respect to the appropriateness of
implementing each of the above models given the group sizes for the
response variables.
Thank you in advance for your kind suggestions.
Best wishes
Margaret
Thank you for this advice. On the basis of the results of tests of
significance which were performed at the univariate level, this
limitation on covariates may be okay as far as the multinomial logistic
regression analysis goes. However, it would appear from what you have
said, that my ordinal logistic regression model would be a non-starter
(as two of my ordinal categories apply to less than 10 people - 5
people and 2 people). Indeed, I am beginning to question whether the
results at the univariate level are worthy of publication in a peer
reviewed journal, given the group sizes (19, 78, 99, 5
and 2) I listed for the ordinal response variable.
Does anyone (including yourself) have any further thoughts on this?
Best wishes
Margaret
> Scott R Millis, PhD, MEd, ABPP (CN & RP)
> Professor & Director of Research
> Department of Physical Medicine & Rehabilitation
> Wayne State University School of Medicine
> 261 Mack Blvd
> Detroit, MI 48201
> Email: smi...@med.wayne.edu
> Tel: 313-993-8085
> Fax: 313-745-9854
>
Indeed, the unbalanced group sizes are a problem. To start, the smallest group size sets the limit for the number of covariates that you can include in model without over-fitting. So, for a group size of 21, you might be able to have 1 or 2 covariates in the model.
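To make that rule of thumb concrete, here is a minimal Python sketch (my own illustration, not from the original posts, using simulated data with a 161/21/21 split and assuming statsmodels and pandas are available; the "roughly 10 subjects per covariate" figure is an assumption in the spirit of Scott's 1-2 covariates for a group of 21):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated nominal outcome with unbalanced groups like 161/21/21
# (illustrative data only, not the poster's).
group_sizes = {"A": 161, "B": 21, "C": 21}
y = pd.Series([g for g, n in group_sizes.items() for _ in range(n)])
X = pd.DataFrame({"age": rng.normal(60, 10, len(y)),
                  "score": rng.normal(0, 1, len(y))})

# Rule of thumb: the smallest outcome category caps the covariate budget
# (taken here as roughly 10 subjects per covariate -- an assumption).
smallest = y.value_counts().min()
print(f"Smallest group: {smallest}; covariate budget: about {smallest // 10}")

# Multinomial logistic regression with a deliberately small covariate set.
codes = y.astype("category").cat.codes.to_numpy()
res = sm.MNLogit(codes, sm.add_constant(X)).fit(disp=False)
print(res.summary())

# For the ordinal outcome one could similarly try (statsmodels >= 0.12):
# from statsmodels.miscmodels.ordinal_model import OrderedModel
# OrderedModel(y_ordinal, X, distr="logit").fit(method="bfgs")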
Thank you for your message. I agree re your point about the need for a
sophisticated screening procedure. In practice, I use a procedure which
I developed through correspondence with the author Hosmer (one of the
co-authors of the widely read book 'Applied Logistic
Regression'). The reference you quote seems interesting. I hope that I
will find some time to check it out before too long.
Best wishes
Margaret
> Margaret,
>
> I agree with you that your ordinal model has group sizes that are too
> small and unbalanced.
>
> On a different note, the univariate "screening" of covariates before
> entering them into the multivariable logistic model (whether binary,
> ordinal, or multinomial) is to be avoided because of what Harrell calls
> the "phantom degrees of freedom" problem. That is, this univariate
> "look" actually spends degrees of freedom---which isn't typically
> accounted for---which then produces models that are overly optimistic
> and tend not to replicate on new samples. Harrell, in his book,
> "Regression modeling strategies," presents a systematic, sophisticated
> approach to model development that integrates new findings into the
> statistical properties of estimators. I highly recommend it. It helps
> you to unlearn many of the bad habits taught to you in grad school
> and/or post-doc fellowship.
>
> Scott Millis
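To see the "phantom degrees of freedom" point in action, here is a small simulation sketch (my own illustration, assuming numpy, scipy and statsmodels; the p < 0.25 screening cut-off is just a common choice): thirty pure-noise covariates are screened univariately, and the survivors are fitted in a "final" logistic model that then looks better than it should.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
n, p = 150, 30
X = rng.normal(size=(n, p))          # 30 covariates, all pure noise
y = rng.integers(0, 2, size=n)       # outcome unrelated to any of them

# Step 1: "screen" covariates by univariate p-value (the step to avoid).
keep = [j for j in range(p)
        if stats.ttest_ind(X[y == 1, j], X[y == 0, j]).pvalue < 0.25]
keep = keep or [0]                   # guard for the unlikely empty case

# Step 2: fit the multivariable model on the survivors only.
res = sm.Logit(y, sm.add_constant(X[:, keep])).fit(disp=False)
print(f"{len(keep)} of {p} noise covariates survived screening")
print(f"Apparent likelihood-ratio p-value: {res.llr_pvalue:.3f}")

# The screening step has already spent degrees of freedom that the final
# model does not account for, so its apparent fit is optimistic and would
# tend not to replicate on a new sample.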
This seems to be a good discussion on modeling. Scott suggests that
"screening" of covariates before entering them into the multivariable
logistic model is to be avoided. What I have been doing is to include
predictors that are significant at the univariate level in the model, as
well as variables which are of clinical significance and which have been
shown in other studies to be predictors ("forcing" them into the model
regardless of univariate significance). Is this a good strategy, or are
there other ways of doing this?

Thanks,
Sripal
That should always be the primary consideration!
> 2. Select variables that have wide score distributions.
> Variables having narrow ranges will have limited variance
> and an attenuated capacity to detect differences or detect
> associations.
That's true enough, but should be glossed in that it's the
precision (or SE) of the estimated coefficient which is improved
by wide scatter, and (relatedly) that it's the variables with the
widest scatter (other things, i.e. coefficients, being equal)
that have the best predictive power. There's a classic instance
of this: In school test scores on different subjects, to the
extent that score on any subject correlates with general ability,
it's usually the score in mathematics that is the best predictor
of overall attainment. This is because the range of maths scores
is usually wider than in other subjects (since, amongst other
things, it's very possible to score 0% or 100% in maths, either
by getting everything wrong or getting everything right; while
examiners rarely have occasion to give 0% or 100% in a language,
or in history, or whatever).
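A minimal sketch of that point about spread and precision (my own illustration, assuming numpy and statsmodels): the same true slope and error SD, but two predictors with different scatter.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, beta, sigma = 200, 0.5, 1.0

for spread in (1.0, 5.0):            # narrow vs wide score distribution
    x = rng.normal(0, spread, n)
    y = beta * x + rng.normal(0, sigma, n)
    res = sm.OLS(y, sm.add_constant(x)).fit()
    print(f"sd(x) = {spread}: slope SE = {res.bse[1]:.3f}")

# Roughly, SE(beta_hat) ~ sigma / (sd(x) * sqrt(n)): the widely scattered
# predictor is estimated more precisely, and so carries more predictive power.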
> 3. Consider eliminating variables that have or will likely have high
> levels of missing data. However, if those variables are important to
> the model, consider using FIML or MI strategies when the proportion of
> missing data exceeds 0.15.
This depends on the extent to which missing values in these variables
are reliably predictable from the non-missing values of other variables,
which in turn depends on how well they are correlated with the
set of other variables.
> 4. If your sample is large enough, consider using variable
> reduction strategies such as principal component analysis---and
> use the PCs in your model in lieu of individual variables.
This (in so far as you adopt the PCs which account for the most
variance) is (2) in another form. But note that PCs, used as
variables, can often be difficult to interpret.
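A sketch of point 4, and of the interpretability caveat (my own illustration, assuming scikit-learn and statsmodels; the six covariates are simulated noisy measures of one underlying dimension):

import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 200
latent = rng.normal(size=n)
# Six correlated covariates: noisy versions of one underlying dimension.
X = np.column_stack([latent + rng.normal(0, 0.5, n) for _ in range(6)])
y = (latent + rng.normal(0, 1, n) > 0).astype(int)

# Replace the six covariates by their first two principal components.
pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
res = sm.Logit(y, sm.add_constant(pcs)).fit(disp=False)
print(res.params)   # coefficients refer to PC1/PC2, not to named variables --
                    # which is exactly the interpretability cost noted above.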
(No comment on the remaining points).
Best wishes to all,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 26-Jun-06 Time: 21:46:35
------------------------------ XFMail ------------------------------
>This depends on the extent to which missing values in these variables
>are reliably predictable from the non-missing values of other variables,
>which in turn depends on how well they are correlated with the
>set of other variables.
Going off at a considerable tangent, this (clearly very true) statement
underlies my deep 'suspicion' about all methods of imputation of missing
values. Such imputation clearly cannot 'add information', and if one has a
situation in which the values of a particular variable are, as Ted puts it,
"reliably predictable" from other variables, I can't help but wonder why
one is using that (seemingly redundant) variable in the first place!
Except as a means of making computation more straightforward, I therefore
am not sure I really understand what is actually achieved by imputation of
missing values in situations such as Ted was alluding to - which probably
means that I am missing something of fundamental importance!
Kind Regards,
John
----------------------------------------------------------------
Dr John Whittington, Voice: +44 (0) 1296 730225
Mediscience Services Fax: +44 (0) 1296 738893
Twyford Manor, Twyford, E-mail: Joh...@mediscience.co.uk
Buckingham MK18 4EL, UK medis...@compuserve.com
----------------------------------------------------------------
_____________________________________________________
Doug Altman
Professor of Statistics in Medicine
Centre for Statistics in Medicine
Wolfson College Annexe
Linton Road
Oxford OX2 6UD
email: doug....@cancer.org.uk
Tel: 01865 284400 (direct line 01865 284401)
Fax: 01865 284424
Web: http://www.csm-oxford.org.uk/
Individuals with missing observations will also have non-missing observations. When you impute a missing value (or >1) for an individual all their non-missing data are reinstated in the data set (compared to a complete case analysis). In one study we did*, on average 3 real values were retrieved for each imputed value. The amount of retrieved data depends of course on the pattern of missing observations, but there will always be a gain. So in some sense information is indeed added when imputation is done. In the study I mentioned only 44% of cases had complete data, so we more than doubled the sample size by imputing missing values. This study helped me to overcome my initial scepticism about imputation and recognise the hidden assumptions of and problems associated with complete case analysis.
It will be rare that one can impute a value without quite a lot of uncertainty - what we want to do is to impute it in an unbiased way (which may be what Ted meant by 'reliable'). Because of the above-mentioned gain of real values too, the standard errors of regression coefficients may shrink when one imputes.
Doug
* Clark TG, Altman DG. Developing a prognostic model in the presence of missing data: an ovarian cancer
case-study. Journal of Clinical Epidemiology 2003;56:28–37.
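Doug's "retrieved values" arithmetic is easy to see on a toy missingness pattern (a sketch of my own, assuming numpy and pandas; the 10% missingness rate is arbitrary):

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("ABCDE"))
df = df.mask(rng.random(df.shape) < 0.10)   # knock out ~10% of cells

# Rows that a complete-case analysis would drop entirely.
incomplete = df[df.isna().any(axis=1)]
n_missing = int(incomplete.isna().sum().sum())
n_observed_lost = int(incomplete.notna().sum().sum())
print(f"{n_observed_lost / n_missing:.1f} observed values discarded "
      "per missing cell under complete-case analysis")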
Of course not! The main (and possibly only) point of imputation
is to put the information that is present in the data, into the
place where the reader/user expects to see it.
> and if one has a situation in which the values
> of a particular variable are, as Ted puts it, "reliably
> predictable" from other variables, I can't help but wonder
> why one is using that (seemingly redundant) variable in the
> first place!
As one example, suppose the variable one is interested in
is a time series. If the series were complete, one would
use standard methods of time-series analysis to identify
time-seriesy things like trends, seasonal effects, periodic
structure (spectrum) and so on. But it has missing values,
which is a bit of a spanner in the works for such methods.
Fortunately, the series comes with a nice set of covariates
from which the missing values can be predicted with reasonable
precision. So they can be used to "fill in" the missing values
(preferably several times over so that the uncertainties in
the imputation are represented), and the standard methods of
time-series analysis can then be used.
> Except as a means of making computation more straightforward,
Well, that's one way of describing it. Provided one has a good
probabilistic model for the data, the statistical model for
the data with missing values is then perfectly definite and
it can be treated by maximum likelihood methods. As far as the
parametric inference is concerned, this extracts the full
available information from the data. But in general it is a
very nasty procedure to implement and work with in such cases.
The advantage of the imputation approach is that, provided
the imputation method is "good" (a highly technical issue),
the result of analysing (by the appropriate complete-data
analysis) a sufficiently many times imputed dataset is
equivalent to the maximum likelihood approach.
> I therefore am not sure I really understand what is actually
> achieved by imputation of missing values in situations such
> as Ted was alluding to - which probably means that I am missing
> something of fundamental importance!
Not really. You can always do it the hard way if you wish!
The one thing to really watch out for with imputed data is
that the unwary may be presented with a data series with
missing values imputed, without the information about the
uncertainty inherent in the imputed values. So the unwary
might compute a mean and SD for the series, and treat this
as a "sample size of N", and so on. It's essential to give
the full story about the variation from imputation to imputation.
You will have gathered that I'm in favour of multiple imputation
methods if imputation is to be used at all. There are several
"single imputation" methods around, most of them much used
(often under inducement from software which implements them),
which do not give you any handle on the uncertainty inherent
in the process.
Yours, elusively,
Ted.
Well, it depends on what you mean by "added". As I wrote
5 minutes ago, what it really does (and what Doug no
doubt really means by "reinstated") is that the information
inherent in the non-missing data about the missing values
is re-expressed in terms of imputed values placed where
the user expects to see them. The actual missing values
(i.e. the values you would have observed if only you had
observed them) are of course not re-instated.
> In the study I
> mentioned only 44% of cases had complete data, so
> we more than doubled the sample size by imputing
> missing values. This study helped me to overcome
> my initial scepticism about imputation and
> recognise the hidden assumptions of and problems
> associated with complete case analysis.
There is an implicit comparison here with one of the
ways of dealing with data where some cases have missing
values, namely the "Complete Cases only" approach: you
simply ignore all cases with missing data. This of course
does lose information, namely all the info in the cases
you ignore. Imputation is one way of putting this information
back, in that it gives a means of handling the cases with
missing data. The fact that its output looks like data
with no missing values is misleading -- but only superficially
so: provided you are aware of what has been done, and view
the imputed data in the right way, you should not be misled.
> It will be rare that one can impute a value
> without quite a lot of uncertainty - what we want
> to do is to impute it in an unbiased way (which
> may be what Ted meant by 'reliable'). Because of
> the above-mentioned gain of real values too, the
> standard errors of regression coefficients may shrink
> when one imputes.
This is because one is taking account of more cases, and
enjoying the information they bring.
My use of "reliable" was perhaps not well chosen. Perhaps
a combination of "realistic" and "with reasonable precision"
would have been better. Lack of bias is desirable, of course,
but one can cope with some degree of bias; and there are
boot-strap type techniques for bias reduction if bias is
a worry.
As to "realistic", this refers to a multitude of phenomena.
You don't want to find negative imputed blood-pressure.
You don't want to find imputations which range well
beyond reasonable values, compared with the values in
the observed cases. Often variables have the feature that,
if positive, they have a reasonable approximation to some
continuous distribution but they may also be categorically
zero (examples might include time spent sitting watching TV
in a day, quantity consumed of junk munchies in plastic
bags in a day, amount of lager drunk in a day, ... ).
You need a method of imputation which is capable of predicting
"zero consumption" for such variables (i.e. of mixed discrete/
continuous type, called "semicontinuous" by Schafer et al.),
otherwise your imputations will not display this essential
feature.
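One way to get imputations that can be exactly zero for such a semicontinuous variable is a two-part model: a logistic part for "zero versus positive" and a regression part for the log of the positive amounts. The sketch below is my own illustration (simulated data, invented variable names, statsmodels assumed); a fully "proper" imputation would also draw the fitted parameters from their posterior, which is omitted here for brevity.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)                               # a covariate
drinks = np.where(rng.random(n) < 0.4, 0.0,          # ~40% exact zeros
                  np.exp(0.5 + 0.8 * x + rng.normal(0, 0.5, n)))
miss = rng.random(n) < 0.2                           # 20% missing at random
obs = ~miss

# Part 1: probability of an exact zero, given x.
zero = (drinks == 0).astype(int)
part1 = sm.Logit(zero[obs], sm.add_constant(x[obs])).fit(disp=False)

# Part 2: log of the positive amounts, given x.
pos = obs & (drinks > 0)
part2 = sm.OLS(np.log(drinks[pos]), sm.add_constant(x[pos])).fit()

# Impute: first draw zero vs positive, then draw a positive amount.
xm = sm.add_constant(x[miss])
draw_zero = rng.random(miss.sum()) < part1.predict(xm)
imputed = np.where(draw_zero, 0.0,
                   np.exp(part2.predict(xm) +
                          rng.normal(0, np.sqrt(part2.scale), miss.sum())))
print(f"{(imputed == 0).mean():.0%} of imputed values are exact zeros")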
Best wishes,
Ted.
>Individuals with missing observations will also have non-missing
>observations. When you impute a missing value (or >1) for an individual
>all their non-missing data are reinstated in the data set ...[snip]...
>This study helped me to overcome my initial scepticism about imputation
>and recognise the hidden assumptions of and problems associated with
>complete case analysis.
Doug, what you say makes complete sense to me, but only in the context of
an assumption that the only alternative to imputation is to discard all the
non-missing data from subjects who have some missing data ('complete case
analysis'). There surely are many situations in which that is not
necessary. Whilst a 'complete set of data' (real and/or imputed) is
probably always needed for analyses based on classical ANOVAR, this is
surely not necessarily the case with other techniques, such as
regression-based ones.
In situations in which there are alternatives to imputation which do not
involve the wastage of non-missing data, I think I'm back to the comments I
made before - that I don't see that imputation 'adds any information', and
am not really sure of what it does achieve. In other words, I suppose that
means that I still have some of the scepticism which you say you once had.
Hi John,
As I tried to state earlier -- imputation does not (and can not)
add information, but it does re-express information about missing
values in a directly accessible form. As an example, I have prepared
the following toy case (possibly too simple to be persuasive):
Variable Y is linearly related to variable X, with error. Here
are the data (pretending to be observations on cases sampled
from a population):
X Y
0.11 0.06
1.37 1.47
1.47 1.61
2.49 2.49
2.53 NA
2.57 2.60
4.93 5.30
8.04 8.11
8.13 7.93
8.88 8.74
8.97 9.27
There is clearly information in this dataset about the missing
value of Y. You can if you wish simply regress Y on X for the
complete cases, and use the result to obtain a predictive
distribution for the missing value of Y, and hence (for instance)
display a "95% prediction interval" for it. Which is OK if the
only thing you're interested in is that particular missing
value (i.e. the value for the case which had X = 2.53).
But suppose you want to estimate the proportion of cases in
the population which have Y > 2.50 (shades of the "dichotomy"
discussion we're having at the moment ... )? Or what are the
mean and SD of the values of Y in the population? Answers to
such questions do not readily emerge from the predictive
distribution combined with the observed values.
But now let's do a multiple imputation (10-fold). Here are
the results.
X Y
0.11 0.06
1.37 1.47
1.47 1.61
2.49 2.49
2.53 2.37 2.98 2.46 2.71 2.45 2.68 2.49 2.74 2.38 2.23
2.57 2.60
4.93 5.30
8.04 8.11
8.13 7.93
8.88 8.74
8.97 9.27
This set of results displays the information about the missing
value of Y in the form of a bunch of values distributed according
to the Bayesian posterior distribution for the missing value.
The above (for example) gives numbers > 2.5, out of 11, in each
imputation as:
6, 7, 6, 7, 6, 7, 6, 7, 6, 6
Mean and SD for each imputation:
Mean SD
4.54 3.40
4.60 3.37
4.55 3.39
4.57 3.38
4.55 3.39
4.57 3.38
4.55 3.39
4.57 3.38
4.54 3.40
4.53 3.41
From these varying results, one for each imputation, there
are further procedures to evaluate a single estimate for the
quantity in question, together with an estimate of its SE
that takes account of the uncertainty due to the missing
value as expressed in the varying values imputed to the
missing value, in addition to the uncertainty inherent in
the original sampling.
No information is added -- but, as you can see from the above,
imputation gives a different way of looking at the information
that is there, in a form that is directly accessible for
further analysis. And this (at least) is what imputation does
achieve!
Best wishes,
Ted.
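For anyone who wants to replay the toy example, here is a sketch of the imputation step (my own illustration, assuming numpy and statsmodels; the draws come from the complete-case regression's predictive distribution, so the imputed values and resulting means will differ a little from those quoted above, and a fully "proper" scheme would also draw the regression parameters themselves):

import numpy as np
import statsmodels.api as sm

X = np.array([0.11, 1.37, 1.47, 2.49, 2.53, 2.57, 4.93, 8.04, 8.13, 8.88, 8.97])
Y = np.array([0.06, 1.47, 1.61, 2.49, np.nan, 2.60, 5.30, 8.11, 7.93, 8.74, 9.27])

obs = ~np.isnan(Y)
cc_fit = sm.OLS(Y[obs], sm.add_constant(X[obs])).fit()   # complete-case fit
rng = np.random.default_rng(6)

means = []
for _ in range(10):                                      # 10-fold imputation
    y_draw = (cc_fit.predict(sm.add_constant(X)[~obs])
              + rng.normal(0, np.sqrt(cc_fit.scale)))
    completed = Y.copy()
    completed[~obs] = y_draw
    means.append(completed.mean())

print(np.round(means, 2))   # the mean of Y, once per completed data set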
>On 27-Jun-06 John Whittington wrote:
>
> > Going off at a considerable tangent, this (clearly very true)
> > statement underlies my deep 'suspicion' about all methods of
> > imputation of missing values. Such imputation clearly cannot
> > 'add information',
>
>Of course not! The main (and possibly only) point of imputation
>is to put the information that is present in the data, into the
>place where the reader/user expects to see it.
Indeed, although, as you are aware, Doug Altman has produced an argument
that the amount of information can be increased by imputation - but only,
as far as I can see, if one regards the only alternative to imputation as
being the discarding of (some or all) non-missing data from subjects who
have some missing data.
>As one example, suppose the variable one is interested in
>is a time series. If the series were complete, one would
>use standard methods of time-series analysis to identify
>time-seriesy things like trends, seasonal effects, periodic
>structure (spectrum) and so on. But it has missing values,
>which is a bit of a spanner in the works for such methods.
>Fortunately, the series comes with a nice set of covariates
>from which the missing values can be predicted with reasonable
>precision. So they can be used to "fill in" the missing values
>(preferably several times over so that the uncertainties in
>the imputation are represented), and the standard methods of
>time-series analysis can then be used.
Sure, but as you go on to partially concede, that really corresponds with
my view when I wrote:
> > Except as a means of making computation more straightforward,
Once upon a time, when analytical techniques were mainly based on
'classical' approaches (e.g. ANOVAR) and computational facilities were very
limited, there was obviously a strong argument for trying to 'make data
sets complete' in order to make analysis even possible. However, advances
in approaches as well as computing resources now very much weaken that
argument.
>The one thing to really watch out for with imputed data is
>that the unwary may be presented with a data series with
>missing values imputed, without the information about the
>uncertainty inherent in the imputed values. So the unwary
>might compute a mean and SD for the series, and treat this
>as a "sample size of N", and so on. It's essential to give
>the full story about the variation from imputation to imputation.
Indeed. However, you are talking mainly about the
variability/uncertainty. What about the actual estimates? I think it
needs a mathematically more clever person than myself to work this out, but
what can one say about the mean of a set of data including imputed values
in comparison with the mean of the non-missing values in that data
set? Can one demonstrate that it is likely to be closer to the population
mean - and, indeed, quantify (presumably probabilistically) the extent of
that 'improvement'? ... all of that, of course, assuming data is missing
for genuinely 'random' reasons.
>But now let's do a multiple imputation (10-fold). Here are
>the results.
> X Y
> 0.11 0.06
> 1.37 1.47
> 1.47 1.61
> 2.49 2.49
> 2.53 2.37 2.98 2.46 2.71 2.45 2.68 2.49 2.74 2.38 2.23
> 2.57 2.60
> 4.93 5.30
> 8.04 8.11
> 8.13 7.93
> 8.88 8.74
> 8.97 9.27
>
>This set of results displays the information about the missing
>value of Y in the form of a bunch of values distributed according
>to the Bayesian posterior distribution for the missing value.
>
>The above (for example) gives numbers > 2.5, out of 11, in each
>imputation as:
>
> 6, 7, 6, 7, 6, 7, 6, 7, 6, 6
Very true, but, at least in your example, the 'gain' from doing that would
seem to be rather limited. If (since your stated interest is in 'counts')
one rounds the mean of your answers to the nearest whole number, then one
would get the same answer (6) as one would get with no imputation (i.e.
with N=10).
However, I do take your general point and can well believe that, in larger
and more complex situations, use of imputation in that sort of way might
result in an appreciable 'gain' in quality of information.
That of course is a numerical coincidence! I did say that the
example was "possibly too simple to be persuasive"! If you want
a more realistic case -- well, I recommend close all hatches,
submerge to periscope depth, deflect hydroplanes downwards,
and crash dive; and then I'll send some down!
The illustrative point was to show that, whatever feature you
would want to extract from a complete dataset, you can extract
by the same means from each imputation; and then the variation
in the result from imputation to imputation is a measure of
the uncertainty inherent in the information and can be combined
with the uncertainty (typically an SE) derived by standard
analysis [ sqrt(p*(1-p)/n) for a proportion ] to obtain an
SE for the combined estimate of the value; and this (in a
"regular" or as the pundits say "proper" problem) will be
close to the SE you would get if you did it by bare-hands
maximum likelihood.
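Those combining rules (usually attributed to Rubin) are short enough to sketch here, using the counts of Y > 2.5 quoted earlier in the thread (numpy assumed; the within-imputation variance is the p(1-p)/n term mentioned above):

import numpy as np

n = 11
counts = np.array([6, 7, 6, 7, 6, 7, 6, 7, 6, 6])   # from the earlier post
p_m = counts / n                                    # estimate per imputation
M = len(p_m)

W = np.mean(p_m * (1 - p_m) / n)     # average within-imputation variance
B = p_m.var(ddof=1)                  # between-imputation variance
p_bar = p_m.mean()                   # pooled estimate
T = W + (1 + 1 / M) * B              # total variance (Rubin's rule)
print(f"pooled proportion = {p_bar:.3f}, SE = {np.sqrt(T):.3f}")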
> However, I do take your general point and can well believe that,
> in larger and more complex situations, use of imputation in that
> sort of way might result in an appreciable 'gain' in quality of
> information.
The type of application in which this is most apparent, which is
also the one most used by MI pundits to promote the method, is in
analysing survey data (in which, also, often data will be recorded
on a large number of variables for each case, so there is a lot
of "concomitant" information sitting there).
The idea here is that the survey data are gathered by a single
agency and compiled into a database. This database will have
many users, to whom the agency grants access. If there were no
missing values, then each user could pump the database through
their own favourite software (SAS, SPSS, ... ) and get estimates
with SEs, crosstabs, etc. for whatever question they wanted to ask.
But when there are missing data (and in many cases the proportion
missing on some variables can be quite high, say 30-40%), there
is a dilemma.
The agency can say to the user "There are missing data here,
so you will need to use ML methods to obtain the full information
from the database." But the user is not up to this -- so would
perhaps have to call in consultants, or persuade a Professor to
set it up as a PhD project, or ...
Or the user can say to the agency "I want to extract the following
information from the database. Please use your expertise to provide
this." But the agency does not have the resources to cope with
this demand from the many users of the database.
Therefore the agency, the database constructor, takes on the role
of imputer, and creates multiple instances of the database completed
by imputation. These are then used by the user, the data analyst,
who analyses them by whatever standard methods they would apply to
any normal complete dataset. There are then relatively straightforward
rules for combining their multiple results.
This of course has the difficulty that imputation requires a
model for the data, which the imputer must choose. This may not
be the same as the model that the user (the analyst) might
have in mind.
(And that is not the only worm crawling in this can).
Best wishes,
Ted.
Essentially yes. Strictly speaking, the only necessary condition
is that the process of deletion (which of course is determined
by the missingness) is such that the expected value of the
responses which would be deleted is the same as the expected
value of those which would not be deleted.
In practice, this boils down to deleting at random.
Best wishes,
Ted.