Earlier, I posted a query to the group re the use of a hybrid logistic
regression model. My current message relates to the same project but
raises a different issue (which I appreciate is not independent,
however, from the matter I raised separately). I just want to avoid my
messages becoming too long!
I have an ordinal response variable with cohort sizes of 19, 78, 99, 5
and 2 (in the order weak to severe for the condition represented). At
a talk I attended some time ago, it was pointed out that the first step
in carrying out an ordinal logistic regression analysis is to check
that the cohort sizes for the response variable are roughly equal and
if they are not, the next step is to merge categories to ensure that
they are. With the sort of categories I have, merging would make
little sense from a clinical point of view. I would welcome views
therefore on the appropriateness of assuming an ordinal logistic
regression model with such group sizes.
I also have a purely nominal response variable with cohort sizes of
161, 21 and 21 and again, it would not make clinical sense to merge any
of the response categories. Are the group sizes for this response
variable too unbalanced to assume a multinomial regression model?
Feedback would be much appreciated with respect to the appropriateness of
implementing each of the above models given the group sizes for the
response variables.
Thank you in advance for your kind suggestions.
Best wishes
Margaret
Thank you for this advice. On the basis of the results of tests of
significance which were performed at the univariate level, this
limitation on covariates may be okay as far as the multinomial logistic
regression analysis goes. However, it would appear from what you have
said, that my ordinal logistic regression model would be a non-starter
(as two of my ordinal categories apply to less than 10 people - 5
people and 2 people). Indeed, I am beginning to question whether the
results at the univariate level are worthy of publication in a peer
reviewed journal, given the group sizes (19, 78, 99, 5
and 2) I listed for the ordinal response variable.
Does anyone (including yourself) have any further thoughts on this?
Best wishes
Margaret
> Scott R Millis, PhD, MEd, ABPP (CN & RP)
> Professor & Director of Research
> Department of Physical Medicine & Rehabilitation
> Wayne State University School of Medicine
> 261 Mack Blvd
> Detroit, MI 48201
> Email: smi...@med.wayne.edu
> Tel: 313-993-8085
> Fax: 313-745-9854
>
Indeed, the unbalanced group sizes are a problem. To start, the smallest group size sets the limit for the number of covariates that you can include in model without over-fitting. So, for a group size of 21, you might be able to have 1 or 2 covariates in the model.
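To make that rule of thumb concrete, here is a minimal Python sketch (my own illustration, not from the original posts, using simulated data with a 161/21/21 split and assuming statsmodels and pandas are available; the "roughly 10 subjects per covariate" figure is an assumption in the spirit of Scott's 1-2 covariates for a group of 21):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated nominal outcome with unbalanced groups like 161/21/21
# (illustrative data only, not the poster's).
group_sizes = {"A": 161, "B": 21, "C": 21}
y = pd.Series([g for g, n in group_sizes.items() for _ in range(n)])
X = pd.DataFrame({"age": rng.normal(60, 10, len(y)),
                  "score": rng.normal(0, 1, len(y))})

# Rule of thumb: the smallest outcome category caps the covariate budget
# (taken here as roughly 10 subjects per covariate -- an assumption).
smallest = y.value_counts().min()
print(f"Smallest group: {smallest}; covariate budget: about {smallest // 10}")

# Multinomial logistic regression with a deliberately small covariate set.
codes = y.astype("category").cat.codes.to_numpy()
res = sm.MNLogit(codes, sm.add_constant(X)).fit(disp=False)
print(res.summary())

# For the ordinal outcome one could similarly try (statsmodels >= 0.12):
# from statsmodels.miscmodels.ordinal_model import OrderedModel
# OrderedModel(y_ordinal, X, distr="logit").fit(method="bfgs")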
Thank you for your message. I agree re your point about the need for a
sophisticated screening procedure. In practice, I use a procedure which
I developed through correspondence with the author Hosmer (one of the
co-authors of the widely read book 'Applied Logistic
Regression'). The reference you quote seems interesting. I hope that I
will find some time to check it out before too long.
Best wishes
Margaret
> Margaret,
>
> I agree with you that your ordinal model has group sizes that are too
> small and unbalanced.
>
> On a different note, the univariate "screening" of covariates before
> entering them into the multivariable logistic model (whether binary,
> ordinal, or multinomial) is to be avoided because of what Harrell calls
> the "phantom degrees of freedom" problem. That is, this univariate
> "look" actually spends degrees of freedom---which isn't typically
> accounted for---which then produces models that are overly optimistic
> and tend not to replicate on new samples. Harrell, in his book,
> "Regression modeling strategies," presents a systematic, sophisticated
> approach to model development that integrates new findings into the
> statistical properties of estimators. I highly recommend it. It helps
> you to unlearn many of the bad habits taught to you in grad school
> and/or post-doc fellowship.
>
> Scott Millis
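To see the "phantom degrees of freedom" point in action, here is a small simulation sketch (my own illustration, assuming numpy, scipy and statsmodels; the p < 0.25 screening cut-off is just a common choice): thirty pure-noise covariates are screened univariately, and the survivors are fitted in a "final" logistic model that then looks better than it should.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
n, p = 150, 30
X = rng.normal(size=(n, p))          # 30 covariates, all pure noise
y = rng.integers(0, 2, size=n)       # outcome unrelated to any of them

# Step 1: "screen" covariates by univariate p-value (the step to avoid).
keep = [j for j in range(p)
        if stats.ttest_ind(X[y == 1, j], X[y == 0, j]).pvalue < 0.25]
keep = keep or [0]                   # guard for the unlikely empty case

# Step 2: fit the multivariable model on the survivors only.
res = sm.Logit(y, sm.add_constant(X[:, keep])).fit(disp=False)
print(f"{len(keep)} of {p} noise covariates survived screening")
print(f"Apparent likelihood-ratio p-value: {res.llr_pvalue:.3f}")

# The screening step has already spent degrees of freedom that the final
# model does not account for, so its apparent fit is optimistic and would
# tend not to replicate on a new sample.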
This seems to be a good discussion on modeling. Scott suggests that
"screening" of covariates before entering them into the multivariable
logistic model is to be avoided. What I have been doing is to include
predictors that are significant at the univariate level in the model, as
well as variables which are of clinical significance and which have been
shown in other studies to be predictors ("forcing" them into the model
regardless of univariate significance). Is this a good strategy, or are
there other ways of doing this?

Thanks,
Sripal
That should always be the primary consideration!
> 2. Select variables that have wide score distributions.
> Variables having narrow ranges will have limited variance
> and an attenuated capacity to detect differences or detect
> associations.
That's true enough, but should be glossed in that it's the
precision (or SE) of the estimated coefficient which is improved
by wide scatter, and (relatedly) that it's the variables with the
widest scatter (other things, i.e. coefficients, being equal)
that have the best predictive power. There's a classic instance
of this: In school test scores on different subjects, to the
extent that score on any subject correlates with general ability,
it's usually the score in mathematics that is the best predictor
of overall attainment. This is because the range of maths scores
is usually wider than in other subjects (since, amongst other
things, it's very possible to score 0% or 100% in maths, either
by getting everything wrong or getting everything right; while
examiners rarely have occasion to give 0% or 100% in a language,
or in history, or whatever).
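A minimal sketch of that point about spread and precision (my own illustration, assuming numpy and statsmodels): the same true slope and error SD, but two predictors with different scatter.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, beta, sigma = 200, 0.5, 1.0

for spread in (1.0, 5.0):            # narrow vs wide score distribution
    x = rng.normal(0, spread, n)
    y = beta * x + rng.normal(0, sigma, n)
    res = sm.OLS(y, sm.add_constant(x)).fit()
    print(f"sd(x) = {spread}: slope SE = {res.bse[1]:.3f}")

# Roughly, SE(beta_hat) ~ sigma / (sd(x) * sqrt(n)): the widely scattered
# predictor is estimated more precisely, and so carries more predictive power.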
> 3. Consider eliminating variables that have or will likely have high
> levels of missing data. However, if those variables are important to
> the model, consider using FIML or MI strategies when the proportion of
> missing data exceeds 0.15.
This depends on the extent to which missing values in these variables
are reliably predictable from the non-missing values of other variables,
which in turn depends on how well they are correlated with the
set of other variables.
> 4. If your sample is large enough, consider using variable
> reduction strategies such as principal component analysis---and
> use the PCs in your model in lieu of individual variables.
This (in so far as you adopt the PCs which account for the most
variance) is (2) in another form. But note that PCs, used as
variables, can often be difficult to interpret.
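A sketch of point 4, and of the interpretability caveat (my own illustration, assuming scikit-learn and statsmodels; the six covariates are simulated noisy measures of one underlying dimension):

import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 200
latent = rng.normal(size=n)
# Six correlated covariates: noisy versions of one underlying dimension.
X = np.column_stack([latent + rng.normal(0, 0.5, n) for _ in range(6)])
y = (latent + rng.normal(0, 1, n) > 0).astype(int)

# Replace the six covariates by their first two principal components.
pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
res = sm.Logit(y, sm.add_constant(pcs)).fit(disp=False)
print(res.params)   # coefficients refer to PC1/PC2, not to named variables --
                    # which is exactly the interpretability cost noted above.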
(No comment on the remaining points).
Best wishes to all,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 26-Jun-06 Time: 21:46:35
------------------------------ XFMail ------------------------------
>This depends on the extent to which missing values in these variables
>are reliably predictable from the non-missing values of other variables,
>which in turn depends on how well they are correlated with the
>set of other variables.
Going off at a considerable tangent, this (clearly very true) statement
underlies my deep 'suspicion' about all methods of imputation of missing
values. Such imputation clearly cannot 'add information', and if one has a
situation in which the values of a particular variable are, as Ted puts it,
"reliably predictable" from other variables, I can't help but wonder why
one is using that (seemingly redundant) variable in the first place!
Except as a means of making computation more straightforward, I therefore
am not sure I really understand what is actually achieved by imputation of
missing values in situations such as Ted was alluding to - which probably
means that I am missing something of fundamental importance!
Kind Regards,
John
----------------------------------------------------------------
Dr John Whittington, Voice: +44 (0) 1296 730225
Mediscience Services Fax: +44 (0) 1296 738893
Twyford Manor, Twyford, E-mail: Joh...@mediscience.co.uk
Buckingham MK18 4EL, UK medis...@compuserve.com
----------------------------------------------------------------
_____________________________________________________
Doug Altman
Professor of Statistics in Medicine
Centre for Statistics in Medicine
Wolfson College Annexe
Linton Road
Oxford OX2 6UD
email: doug....@cancer.org.uk
Tel: 01865 284400 (direct line 01865 284401)
Fax: 01865 284424
Web: http://www.csm-oxford.org.uk/
Individuals with missing observations will also have non-missing observations. When you impute a missing value (or >1) for an individual all their non-missing data are reinstated in the data set (compared to a complete case analysis). In one study we did*, on average 3 real values were retrieved for each imputed value. The amount of retrieved data depends of course on the pattern of missing observations, but there will always be a gain. So in some sense information is indeed added when imputation is done. In the study I mentioned only 44% of cases had complete data, so we more than doubled the sample size by imputing missing values. This study helped me to overcome my initial scepticism about imputation and recognise the hidden assumptions of and problems associated with complete case analysis.
It will be rare that one can impute a value without quite a lot of uncertainty - what we want to do is to impute it in an unbiased way (which may be what Ted meant by 'reliable'). Because of the above-mentioned gain of real values too, the standard errors of regression coefficients may shrink when one imputes.
Doug
* Clark TG, Altman DG. Developing a prognostic model in the presence of missing data: an ovarian cancer
case-study. Journal of Clinical Epidemiology 2003;56:28–37.
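Doug's "retrieved values" arithmetic is easy to see on a toy missingness pattern (a sketch of my own, assuming numpy and pandas; the 10% missingness rate is arbitrary):

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("ABCDE"))
df = df.mask(rng.random(df.shape) < 0.10)   # knock out ~10% of cells

# Rows that a complete-case analysis would drop entirely.
incomplete = df[df.isna().any(axis=1)]
n_missing = int(incomplete.isna().sum().sum())
n_observed_lost = int(incomplete.notna().sum().sum())
print(f"{n_observed_lost / n_missing:.1f} observed values discarded "
      "per missing cell under complete-case analysis")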
Of course not! The main (and possibly only) point of imputation
is to put the information that is present in the data, into the
place where the reader/user expects to see it.
> and if one has a situation in which the values
> of a particular variable are, as Ted puts it, "reliably
> predictable" from other variables, I can't help but wonder
> why one is using that (seemingly redundant) variable in the
> first place!
As one example, suppose the variable one is interested in
is a time series. If the series were complete, one would
use standard methods of time-series analysis to identify
time-seriesy things like trends, seasonal effects, periodic
structure (spectrum) and so on. But it has missing values,
which is a bit of a spanner in the works for such methods.
Fortunately, the series comes with a nice set of covariates
from which the missing values can be predicted with reasonable
precision. So they can be used to "fill in" the missing values
(preferably several times over so that the uncertainties in
the imputation are represented), and the standard methods of
time-series analysis can then be used.
> Except as a means of making computation more straightforward,
Well, that's one way of describing it. Provided one has a good
probabilistic model for the data, the statistical model for
the data with missing values is then perfectly definite and
it can be treated by maximum likelihood methods. As far as the
parametric inference is concerned, this extracts the full
available information from the data. But in general it is a
very nasty procedure to implement and work with in such cases.
The advantage of the imputation approach is that, provided
the imputation method is "good" (a highly technical issue),
the result of analysing (by the appropriate complete-data
analysis) a sufficiently many times imputed dataset is
equivalent to the maximum likelihood approach.
> I therefore am not sure I really understand what is actually
> achieved by imputation of missing values in situations such
> as Ted was alluding to - which probably means that I am missing
> something of fundamental importance!
Not really. You can always do it the hard way if you wish!
The one thing to really watch out for with imputed data is
that the unwary may be presented with a data series with
missing values imputed, without the information about the
uncertainty inherent in the imputed values. So the unwary
might compute a mean and SD for the series, and treat this
as a "sample size of N", and so on. It's essential to give
the full story about the variation from imputation to imputation.
You will have gathered that I'm in favour of multiple imputation
methods if imputation is to be used at all. There are several
"single imputation" methods around, most of them much used
(often under inducement from software which implements them),
which do not give you any handle on the uncertainty inherent
in the process.
Yours, elusively,
Ted.
Well, it depends on what you mean by "added". As I wrote
5 minutes ago, what it really does (and what Doug no
doubt really means by "reinstated") is that the information
inherent in the non-missing data about the missing values
is re-expressed in terms of imputed values placed where
the user expects to see them. The actual missing values
(i.e. the values you would have observed if only you had
observed them) are of course not re-instated.
> In the study I
> mentioned only 44% of cases had complete data, so
> we more than doubled the sample size by imputing
> missing values. This study helped me to overcome
> my initial scepticism about imputation and
> recognise the hidden assumptions of and problems
> associated with complete case analysis.
There is an implicit comparison here with one of the
ways of dealing with data where some cases have missing
values, namely the "Complete Cases only" approach: you
simply ignore all cases with missing data. This of course
does lose information, namely all the info in the cases
you ignore. Imputation is one way of putting this information
back, in that it gives a means of handling the cases with
missing data. The fact that its output looks like data
with no missing values is misleading -- but only superficially
so: provided you are aware of what has been done, and view
the imputed data in the right way, you should not be misled.
> It will be rare that one can impute a value
> without quite a lot of uncertainty - what we want
> to do is to impute it in an unbiased way (which
> may be what Ted meant by 'reliable'). Because of
> the above-mentioned gain of real values too, the
> standard errors of regression coefficients may shrink
> when one imputes.
This is because one is taking account of more cases, and
enjoying the information they bring.
My use of "reliable" was perhaps not well chosen. Perhaps
a combination of "realistic" and "with reasonable precision"
would have been better. Lack of bias is desirable, of course,
but one can cope with some degree of bias; and there are
boot-strap type techniques for bias reduction if bias is
a worry.
As to "realistic", this refers to a multitude of phenomena.
You don't want to find negative imputed blood-pressure.
You don't want to find imputations which range well
beyond reasonable values, compared with the values in
the observed cases. Often variables have the feature that,
if positive, they have a reasonable approximation to some
continuous distribution but they may also be categorically
zero (examples might include time spent sitting watching TV
in a day, quantity consumed of junk munchies in plastic
bags in a day, amount of lager drunk in a day, ... ).
You need a method of imputation which is capable of predicting
"zero consumption" for such variables (i.e. of mixed discrete/
continuous type, called "semicontinuous" by Schafer et al.),
otherwise your imputations will not display this essential
feature.
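One way to get imputations that can be exactly zero for such a semicontinuous variable is a two-part model: a logistic part for "zero versus positive" and a regression part for the log of the positive amounts. The sketch below is my own illustration (simulated data, invented variable names, statsmodels assumed); a fully "proper" imputation would also draw the fitted parameters from their posterior, which is omitted here for brevity.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)                               # a covariate
drinks = np.where(rng.random(n) < 0.4, 0.0,          # ~40% exact zeros
                  np.exp(0.5 + 0.8 * x + rng.normal(0, 0.5, n)))
miss = rng.random(n) < 0.2                           # 20% missing at random
obs = ~miss

# Part 1: probability of an exact zero, given x.
zero = (drinks == 0).astype(int)
part1 = sm.Logit(zero[obs], sm.add_constant(x[obs])).fit(disp=False)

# Part 2: log of the positive amounts, given x.
pos = obs & (drinks > 0)
part2 = sm.OLS(np.log(drinks[pos]), sm.add_constant(x[pos])).fit()

# Impute: first draw zero vs positive, then draw a positive amount.
xm = sm.add_constant(x[miss])
draw_zero = rng.random(miss.sum()) < part1.predict(xm)
imputed = np.where(draw_zero, 0.0,
                   np.exp(part2.predict(xm) +
                          rng.normal(0, np.sqrt(part2.scale), miss.sum())))
print(f"{(imputed == 0).mean():.0%} of imputed values are exact zeros")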
Best wishes,
Ted.
>Individuals with missing observations will also have non-missing
>observations. When you impute a missing value (or >1) for an individual
>all their non-missing data are reinstated in the data set ...[snip]...
>This study helped me to overcome my initial scepticism about imputation
>and recognise the hidden assumptions of and problems associated with
>complete case analysis.
Doug, what you say makes complete sense to me, but only in the context of
an assumption that the only alternative to imputation is to discard all the
non-missing data from subjects who have some missing data ('complete case
analysis'). There surely are many situations in which that is not
necessary. Whilst a 'complete set of data' (real and/or imputed) is
probably always needed for analyses based on classical ANOVAR, this is
surely not necessarily the case with other techniques, such as
regression-based ones.
In situations in which there are alternatives to imputation which do not
involve the wastage of non-missing data, I think I'm back to the comments I
made before - that I don't see that imputation 'adds any information', and
am not really sure of what it does achieve. In other words, I suppose that
means that I still have some of the scepticism which you say you once had.
Hi John,
As I tried to state earlier -- imputation does not (and can not)
add information, but it does re-express information about missing
values in a directly accessible form. As an example, I have prepared
the following toy case (possibly too simple to be persuasive):
Variable Y is linearly related to variable X, with error. Here
are the data (pretending to be observations on cases sampled
from a population):
X Y
0.11 0.06
1.37 1.47
1.47 1.61
2.49 2.49
2.53 NA
2.57 2.60
4.93 5.30
8.04 8.11
8.13 7.93
8.88 8.74
8.97 9.27
There is clearly information in this dataset about the missing
value of Y. You can if you wish simply regress Y on X for the
complete cases, and use the result to obtain a predictive
distribution for the missing value of Y, and hence (for instance)
display a "95% prediction interval" for it. Which is OK if the
only thing you're interested in is that particular missing
value (i.e. the value for the case which had X = 2.53).
But suppose you want to estimate the proportion of cases in
the population which have Y > 2.50 (shades of the "dichotomy"
discussion we're having at the moment ... )? Or what are the
mean and SD of the values of Y in the population? Answers to
such questions do not readily emerge from the predictive
distribution combined with the observed values.
But now let's do a multiple imputation (10-fold). Here are
the results.
X Y
0.11 0.06
1.37 1.47
1.47 1.61
2.49 2.49
2.53 2.37 2.98 2.46 2.71 2.45 2.68 2.49 2.74 2.38 2.23
2.57 2.60
4.93 5.30
8.04 8.11
8.13 7.93
8.88 8.74
8.97 9.27
This set of results displays the information about the missing
value of Y in the form of a bunch of values distributed according
to the Bayesian posterior distribution for the missing value.
The above (for example) gives numbers > 2.5, out of 11, in each
imputation as:
6, 7, 6, 7, 6, 7, 6, 7, 6, 6
Mean and SD for each imputation:
Mean SD
4.54 3.40
4.60 3.37
4.55 3.39
4.57 3.38
4.55 3.39
4.57 3.38
4.55 3.39
4.57 3.38
4.54 3.40
4.53 3.41
From these varying results, one for each imputation, there
are further procedures to evaluate a single estimate for the
quantity in question, together with an estimate of its SE
that takes account of the uncertainty due to the missing
value as expressed in the varying values imputed to the
missing value, in addition to the uncertainty inherent in
the original sampling.
No information is added -- but, as you can see from the above,
imputation gives a different way of looking at the information
that is there, in a form that is directly accessible for
further analysis. And this (at least) is what imputation does
achieve!
Best wishes,
Ted.
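For anyone who wants to replay the toy example, here is a sketch of the imputation step (my own illustration, assuming numpy and statsmodels; the draws come from the complete-case regression's predictive distribution, so the imputed values and resulting means will differ a little from those quoted above, and a fully "proper" scheme would also draw the regression parameters themselves):

import numpy as np
import statsmodels.api as sm

X = np.array([0.11, 1.37, 1.47, 2.49, 2.53, 2.57, 4.93, 8.04, 8.13, 8.88, 8.97])
Y = np.array([0.06, 1.47, 1.61, 2.49, np.nan, 2.60, 5.30, 8.11, 7.93, 8.74, 9.27])

obs = ~np.isnan(Y)
cc_fit = sm.OLS(Y[obs], sm.add_constant(X[obs])).fit()   # complete-case fit
rng = np.random.default_rng(6)

means = []
for _ in range(10):                                      # 10-fold imputation
    y_draw = (cc_fit.predict(sm.add_constant(X)[~obs])
              + rng.normal(0, np.sqrt(cc_fit.scale)))
    completed = Y.copy()
    completed[~obs] = y_draw
    means.append(completed.mean())

print(np.round(means, 2))   # the mean of Y, once per completed data set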
>On 27-Jun-06 John Whittington wrote:
>
> > Going off at a considerable tangent, this (clearly very true)
> > statement underlies my deep 'suspicion' about all methods of
> > imputation of missing values. Such imputation clearly cannot
> > 'add information',
>
>Of course not! The main (and possibly only) point of imputation
>is to put the information that is present in the data, into the
>place where the reader/user expects to see it.
Indeed, although, as you are aware, Doug Altman has produced an argument
that the amount of information can be increased by imputation - but only,
as far as I can see, if one regards the only alternative to imputation as
being the discarding of (some or all) non-missing data from subjects who
have some missing data.
>As one example, suppose the variable one is interested in
>is a time series. If the series were complete, one would
>use standard methods of time-series analysis to identify
>time-seriesy things like trends, seasonal effects, periodic
>structure (spectrum) and so on. But it has missing values,
>which is a bit of a spanner in the works for such methods.
>Fortunately, the series comes with a nice set of covariates
>from which the missing values can be predicted with reasonable
>precision. So they can be used to "fill in" the missing values
>(preferably several times over so that the uncertainties in
>the imputation are represented), and the standard methods of
>time-series analysis can then be used.
Sure, but as you go on to partially concede, that really corresponds with
my view when I wrote:
> > Except as a means of making computation more straightforward,
Once upon a time, when analytical techniques were mainly based on
'classical' approaches (e.g. ANOVAR) and computational facilities were very
limited, there was obviously a strong argument for trying to 'make data
sets complete' in order to make analysis even possible. However, advances
in approaches as well as computing resources now very much weaken that
argument.
>The one thing to really watch out for with imputed data is
>that the unwary may be presented with a data series with
>missing values imputed, without the information about the
>uncertainty inherent in the imputed values. So the unwary
>might compute a mean and SD for the series, and treat this
>as a "sample size of N", and so on. It's essential to give
>the full story about the variation from imputation to imputation.
Indeed. However, you are talking mainly about the
variability/uncertainty. What about the actual estimates? I think it
needs a mathematically more clever person than myself to work this out, but
what can one say about the mean of a set of data including imputed values
in comparison with the mean of the non-missing values in that data
set? Can one demonstrate that it is likely to be closer to the population
mean - and, indeed, quantify (presumably probabilistically) the extent of
that 'improvement'? ... all of that, of course, assuming data is missing
for genuinely 'random' reasons.
>But now let's do a multiple imputation (10-fold). Here are
>the results.
> X Y
> 0.11 0.06
> 1.37 1.47
> 1.47 1.61
> 2.49 2.49
> 2.53 2.37 2.98 2.46 2.71 2.45 2.68 2.49 2.74 2.38 2.23
> 2.57 2.60
> 4.93 5.30
> 8.04 8.11
> 8.13 7.93
> 8.88 8.74
> 8.97 9.27
>
>This set of results displays the information about the missing
>value of Y in the form of a bunch of values distributed according
>to the Bayesian posterior distribution for the missing value.
>
>The above (for example) gives numbers > 2.5, out of 11, in each
>imputation as:
>
> 6, 7, 6, 7, 6, 7, 6, 7, 6, 6
Very true, but, at least in your example, the 'gain' from doing that would
seem to be rather limited. If (since your stated interest is in 'counts')
one rounds the mean of your answers to the nearest whole number, then one
would get the same answer (6) as one would get with no imputation (i.e.
with N=10).
However, I do take your general point and can well believe that, in larger
and more complex situations, use of imputation in that sort of way might
result in an appreciable 'gain' in quality of information.
That of course is a numerical coincidence! I did say that the
example was "possibly too simple to be persuasive"! If you want
a more realistic case -- well, I recommend close all hatches,
submerge to periscope depth, deflect hydroplanes downwards,
and crash dive; and then I'll send some down!
The illustrative point was to show that, whatever feature you
would want to extract from a complete dataset, you can extract
by the same means from each imputation; and then the variation
in the result from imputation to imputation is a measure of
the uncertainty inherent in the information and can be combined
with the uncertainty (typically an SE) derived by standard
analysis [ sqrt(p*(1-p)/n) for a proportion ] to obtain an
SE for the combined estimate of the value; and this (in a
"regular" or as the pundits say "proper" problem) will be
close to the SE you would get if you did it by bare-hands
maximum likelihood.
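Those combining rules (usually attributed to Rubin) are short enough to sketch here, using the counts of Y > 2.5 quoted earlier in the thread (numpy assumed; the within-imputation variance is the p(1-p)/n term mentioned above):

import numpy as np

n = 11
counts = np.array([6, 7, 6, 7, 6, 7, 6, 7, 6, 6])   # from the earlier post
p_m = counts / n                                    # estimate per imputation
M = len(p_m)

W = np.mean(p_m * (1 - p_m) / n)     # average within-imputation variance
B = p_m.var(ddof=1)                  # between-imputation variance
p_bar = p_m.mean()                   # pooled estimate
T = W + (1 + 1 / M) * B              # total variance (Rubin's rule)
print(f"pooled proportion = {p_bar:.3f}, SE = {np.sqrt(T):.3f}")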
> However, I do take your general point and can well believe that,
> in larger and more complex situations, use of imputation in that
> sort of way might result in an appreciable 'gain' in quality of
> information.
The type of application in which this is most apparent, which is
also the one most used by MI pundits to promote the method, is in
analysing survey data (in which, also, often data will be recorded
on a large number of variables for each case, so there is a lot
of "concomitant" information sitting there).
The idea here is that the survey data are gathered by a single
agency and compiled into a database. This database will have
many users, to whom the agency grants access. If there were no
missing values, then each user could pump the database through
their own favourite software (SAS, SPSS, ... ) and get estimates
with SEs, crosstabs, etc. for whatever question they wanted to ask.
But when there are missing data (and in many cases the proportion
missing on some variables can be quite high, say 30-40%), there
is a dilemma.
The agency can say to the user "There are missing data here,
so you will need to use ML methods to obtain the full information
from the database." But the user is not up to this -- so would
perhaps have to call in consultants, or persuade a Professor to
set it up as a PhD project, or ...
Or the user can say to the agency "I want to extract the following
information from the database. Please use your expertise to provide
this." But the agency does not have the resources to cope with
this demand from the many users of the database.
Therefore the agency, the database constructor, takes on the role
of imputer, and creates multiple instances of the database completed
by imputation. These are then used by the user, the data analyst,
who analyses them by whatever standard methods they would apply to
any normal complete dataset. There are then relatively straightforward
rules for combining their multiple results.
This of course has the difficulty that imputation requires a
model for the data, which the imputer must choose. This may not
be the same as the model that the user (the analyst) might
have in mind.
(And that is not the only worm crawling in this can).
Best wishes,
Ted.
Essentially yes. Strictly speaking, the only necessary condition
is that the process of deletion (which of course is determined
by the missingness) is such that the expected value of the
responses which would be deleted is the same as the expected
value of those which would not be deleted.
In practice, this boils down to deleting at random.
Best wishes,
Ted.