Logistic Regression - Extrapolation Issue


Ryan

unread,
Jul 24, 2010, 4:42:15 PM
to MedStats
Hi,

Suppose we have the following binary logistic regression equation:

logit(y) = b0
+ b1*group
+ b2*x
+ b3*group*x

where group is a binary variable (coded 0/1) and x is a continuous
variable. Further suppose the range of x values is very different
between groups. Let's say x values range from 2.4 to 25.2 for group=0
and 22.3 to 79.6 for group=1. When comparing groups in such a design,
one usually sets x to several values along the range of x to get a
sense of how the odds ratios change. For example, one might consider
setting x to the 25th, 50th, and 75th percentiles of x. I wonder,
however, to avoid extrapolation, if one should only set x to values
within the shared range. Using the example ranges I just provided, the
shared range would be 22.3 to 25.2. This could be problematic since
the values within the shared range are not terribly representative of
either group.

I've encountered this type of situation a few times in my work, and
I'm wondering if anyone else has as well. If yes, how did you deal
with it? General thoughts and references would be appreciated. I can
provide a specific example if that will help.

Thanks,

Ryan

Ray Koopman

unread,
Jul 25, 2010, 11:31:31 AM
to MedStats
b1 + b3*x ± 1.96*sqrt[var(b1) + 2*x*cov(b1,b3) + x^2*var(b3)] is an
approximate 95% CI for the log of the odds ratio. When the x-ranges
overlap as little as those you mention, I would expect the CI to be
fairly wide. However, even if the sample size is large enough to make
the CI narrow, there is still the question of whether the model is
close enough to the true regression to permit such extrapolations to
be taken seriously.
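As a rough illustration, Ray's interval can be computed directly from the fitted coefficients and their covariance. The numbers below are made up purely for illustration; they are not fitted values from any real model.

```python
import math

# Hypothetical coefficient estimates and covariance entries
# (placeholders, not output from a real fit).
b1, b3 = 0.80, 0.05              # group and group*x interaction coefficients
var_b1, var_b3 = 0.09, 0.0004    # variances of b1 and b3
cov_b1_b3 = -0.004               # covariance of b1 and b3

def log_or_ci(x, z=1.96):
    """Approximate 95% CI for the log odds ratio (group=1 vs 0) at covariate value x."""
    est = b1 + b3 * x
    se = math.sqrt(var_b1 + 2 * x * cov_b1_b3 + x**2 * var_b3)
    return est - z * se, est + z * se

lo, hi = log_or_ci(23.0)         # a value inside the shared x-range
```

Note that the standard error, and hence the width of the interval, grows as x moves away from the bulk of the data, which is Ray's point about extrapolated comparisons.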

Ryan

unread,
Jul 25, 2010, 12:39:15 PM
to MedStats
Ray,

Thank you for responding. Here's a simple example:

y = car accident (yes/no)
group = gender (male/female)
x = time driving on highway  <- exposure-type covariate

What if time on the highway for females ranges from 2.4 to 25.2 hours
and the range for males is 22.3 to 79.6 hours? If I start with the
full model:

logit(y) = b0
+ b1*group
+ b2*x
+ b3*group*x

and end up with a significant b3, I am now faced with the problem of
figuring out what to set time to when comparing males and females.
What would you suggest I do?

Ryan
> be taken seriously.

BXC (Bendix Carstensen)

unread,
Jul 25, 2010, 1:46:17 PM
to meds...@googlegroups.com
The model fitted is simply a logistic regression on x separately in each group. You would get the same fit if you fitted the model to each of the two groups separately.

So if the range of x is different in the two groups, report the ODDS or the PROBABILITY of the outcome as a function of x, using only the range of x where data are available in each group. That will give two curves, each covering a different range.
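A small sketch of this suggestion, evaluating each group's predicted probability curve only over that group's own observed x-range (the coefficients below are invented placeholders, not fitted values):

```python
import math

# Hypothetical coefficients from the interaction model
# logit(y) = b0 + b1*group + b2*x + b3*group*x (illustration only).
b0, b1, b2, b3 = -3.0, 1.2, 0.08, 0.03

def prob(group, x):
    """Predicted probability of the outcome at a given group and x."""
    logit = b0 + b1 * group + b2 * x + b3 * group * x
    return 1.0 / (1.0 + math.exp(-logit))

# Evaluate each curve over its own group's range, using the example
# ranges from this thread; no point is extrapolated.
curve0 = [(x, prob(0, x)) for x in (2.4, 13.8, 25.2)]     # group = 0
curve1 = [(x, prob(1, x)) for x in (22.3, 51.0, 79.6)]    # group = 1
```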

If you want odds ratios, you will need to decide on a reference point, for example somewhere in the intersection of the x-ranges for the two groups.

Best regards
Bendix Carstensen

> --
> To post a new thread to MedStats, send email to
> MedS...@googlegroups.com .
> MedStats' home page is http://groups.google.com/group/MedStats .
> Rules: http://groups.google.com/group/MedStats/web/medstats-rules
>

Ted Harding

unread,
Jul 25, 2010, 5:30:37 PM
to meds...@googlegroups.com
On 25-Jul-10 16:39:15, Ryan wrote:
> Ray,
> Thank you for responding. Here's a simple example:
>
> y = car accident (yes/no)
> group = gender (male/female)
> x = time driving on highway  <- exposure-type covariate
>
> What if time on the highway for females ranges from 2.4 to 25.2 hours
> and the range for males is 22.3 to 79.6 hours? If I start with the
> full model:
>
> logit(y) = b0
> + b1*group
> + b2*x
> + b3*group*x
>
> and end up with a significant b3, I am now faced with the problem of
> figuring out what to set time to when comparing males and females.
> What would you suggest I do?
>
> Ryan
> [...]

I don't think that is how you should be looking at it.
Suppose we code Males=1, Females=0. Then, denoting logit(y) by L,
the equation for Males is:

M: L = b0 + b1 + b2*x + b3*x
= (b0 + b1) + (b2 + b3)*x

and for females it is:

F: L = b0 + b2*x

so:

b1 is the difference (Int.M - Int.F) (intercepts) and

b3 is the difference (Slope.M - Slope.F)

The fact that you get a "significant b3" means that the slopes
are different. You can compare Males & Females at any time you
like, but the comparison will vary according to the time at which
you choose to make the comparison, because of the difference in
slopes. This would be the case even if the ranges of Male & Female
data were identical.
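The algebra above is easy to verify numerically: the full interaction model reproduces the two separate lines exactly. Coefficient values here are arbitrary placeholders.

```python
# Arbitrary placeholder coefficients, for checking the algebra only.
b0, b1, b2, b3 = -2.5, 0.7, 0.06, 0.04

def L_full(group, x):
    """Logit from the full interaction model."""
    return b0 + b1 * group + b2 * x + b3 * group * x

def L_male(x):
    """Male line: intercept b0+b1, slope b2+b3."""
    return (b0 + b1) + (b2 + b3) * x

def L_female(x):
    """Female line: intercept b0, slope b2."""
    return b0 + b2 * x
```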

If you were to make your "comparison time" depend on where the
two time-ranges overlap, then that would restrict the possible
comparisons (all different) which you could make.

Ray's reply (snipped) showed how to approach a confidence interval
for the difference in log(Odds) between M and F at any given time.
Given your data (whatever they are) this will be narrower on some
range of time and wider elsewhere, but it will still be a valid
inference whatever the time. Your Male prediction will be less
accurate for small values of x, your Female prediction less
accurate for large values of x, in each case because it is made
outside the range of the data.

As Bendix pointed out in an earlier reply, you are in effect
estimating separate regressions for Males & Females because
you have used a full interaction model. With different slopes
(say b3 > 0 so Male accident rate increases with x faster than
Female accident rate), males will have increasingly greater
accident rates than females as x increases. Surely this needs
to be exhibited in your comparison.

Therefore you not only can, but should, show the comparison
for several time-points x, since this is the only way to show
how the comparison between them varies according to x. The
fact that the precision will be variable too is something you
will have to live with!

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 25-Jul-10 Time: 22:30:34
------------------------------ XFMail ------------------------------

Ryan

unread,
Jul 25, 2010, 6:31:56 PM
to MedStats
Ted,

Thank you for the detailed response. Your points are well taken--I
agree that I should show how the predicted odds ratios change at
different x-values. I appreciate that a significant interaction term
suggests that the effect of x is different between groups. I also
recognize (and have observed in previous work) that precision in
estimates will be compromised as I make comparisons outside the shared
range of x-values. My concern, however, is not so much with precision
in estimates when I get outside the shared range for males or females
(although that's always a concern!), but with extrapolation. That is,
if I never even once observed a female with a driving time beyond 25.2
hours, would it be acceptable for me to predict the odds (or
probability) of an accident at 30, 40, or even 50 hours for females?
This seems dangerous to me, but perhaps I'm being overly conservative.
Your reply suggests that setting x to a value outside the shared range
would be acceptable. I have to admit that this suggestion is quite
different from what I've been taught.

Thanks again for responding. I will continue to think about your
recommendation.

Best wishes,

Ryan


Ryan

unread,
Jul 25, 2010, 7:52:57 PM
to MedStats
Dear Bendix,

Thanks for your help. I see your point that the equation I presented
is fitting a regression line for x in each group. Using the equation I
presented previously, the formula for the predicted OR would be:

OR = exp(b1 + b3*x)

Needless to say, one would expect the predicted OR to change as
different x-values are plugged into this formula. I do wonder if there
are any alternatives to adjusting for an exposure variable (e.g.,
time) within the logistic regression framework.
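Evaluating that OR formula at a few x-values in the shared range (22.3 to 25.2 in the driving example) is a one-liner; the coefficients below are hypothetical.

```python
import math

# Hypothetical coefficients (not from a real fit).
b1, b3 = 0.80, 0.05

def odds_ratio(x):
    """Predicted OR (group=1 vs group=0) at covariate value x."""
    return math.exp(b1 + b3 * x)

# OR at a few points within the shared x-range from the example.
ors = {x: odds_ratio(x) for x in (22.3, 23.7, 25.2)}
```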

Thanks,

Ryan

Kornbrot, Diana

unread,
Jul 26, 2010, 2:17:27 AM
to meds...@googlegroups.com
This is an example of the inescapable fact that main effects are ALWAYS meaningless when interactions are present.
To make a prediction one needs to know BOTH the group membership and the x-value.
Hence, as Bendix states, the most informative way of reporting the results is to give the equation, i.e. slope and intercept, for each group. It is sensible to state that the logOR increases by so much for each unit increase in x for group 1, and by a different amount for group 2.
One can give the main effect at a SPECIFIED x-value, plus the interaction, plus the averaged rate of change with x, but this is much less easy to interpret.
Two linear equations are simple and easy to interpret.
Best
Diana



Professor Diana Kornbrot
email: d.e.ko...@herts.ac.uk
web: http://web.me.com/kornbrot/KornbrotHome.html
Centre for Lifespan & Chronic Illness Research, CLiCIR
School of Psychology
University of Hertfordshire

Steve Simon, P.Mean Consulting

unread,
Jul 26, 2010, 10:28:46 AM
to meds...@googlegroups.com
Ryan wrote:

> That is, if I never even once observed a female with a driving time
> beyond 25.2 hours, would it be acceptable for me to predict the odds
> (or probability) of an accident at 30, 40, or even 50 hours for
> females? This seems dangerous to me, but perhaps I'm being overly
> conservative.

No, you're not being overly conservative. Here's a joke that I quoted in
the January 2009 issue of my newsletter:
* http://www.pmean.com/news/2009-01.html#11

This story is found in the R.A. Fisher Hall (joke #24) of Gary
Ramseyer's Internet Gallery of Statistics Jokes,
* http://my.ilstu.edu/~gcramsey/Fisher.html
and it shows the true meaning of the term "dangerous extrapolation." I
use this joke at the start of my class on regression analysis.

Two statisticians were traveling in an airplane from LA to New York.
About an hour into the flight, the pilot announced that they had lost an
engine, but not to worry, there were three left. However, instead of 5
hours it would take 7 hours to get to New York. A little later, he
announced that a second engine failed, and they still had two left, but
it would take 10 hours to get to New York. Somewhat later, the pilot
again came on the intercom and announced that a third engine had died.
Never fear, he announced, because the plane could fly on a single
engine. However, it would now take 18 hours to get to New York. At this
point, one statistician turned to the other and said, "Gee, I hope we
don't lose that last engine, or we'll be up here forever!"

Steve Simon, Standard Disclaimer
Sign up for The Monthly Mean, the newsletter that
dares to call itself "average" at www.pmean.com/news
"Data entry and data management issues with examples
in IBM SPSS," Tuesday, August 24, 11am-noon CDT.
Free webinar. Details at www.pmean.com/webinars

Bruce Weaver

unread,
Jul 26, 2010, 10:48:15 AM
to MedStats
Hi Ryan. I think some respondents failed to understand that you
understand very well the nature of the model you are fitting
(including how to compare the two groups at several values of X to
illustrate the nature of the interaction), and that your question is
about the legitimacy of extrapolating well outside of the observed
range of X-values. Like you, I have been taught that this is a
somewhat dangerous game to play, because one cannot be certain that
the observed trend continues into the unobserved range of X values.

One solution that has been suggested is to go ahead and extrapolate.
As you know, you'll get pretty wide confidence intervals (on the
difference between the groups) if you do that. And you'll probably
also want to do some hand-waving in the discussion, saying that this
assumes that the trends for both groups continue as they are into the
unobserved ranges, etc.

But here's another option to consider: Center on different values for
the two groups (e.g., on the group means). See the example near the
bottom of the page given below. The example uses linear regression,
but the same approach could be used for logistic regression.

http://www.gseis.ucla.edu/courses/ed230bc1/notes4/center.html

Most of your readers will probably not be familiar with this approach,
so you'll have to take care to explain very clearly what the results
mean.
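A minimal sketch of the group-specific centering idea, with invented data: subtract each group's own mean of x before fitting, so each group's coefficients are anchored at a typical value for that group rather than at a shared (possibly extrapolated) point.

```python
# Invented (group, x) pairs spanning the example ranges from the thread.
data = [
    (0, 2.4), (0, 13.8), (0, 25.2),     # group 0 (e.g., females)
    (1, 22.3), (1, 51.0), (1, 79.6),    # group 1 (e.g., males)
]

# Compute each group's own mean of x.
group_means = {}
for g in (0, 1):
    xs = [x for grp, x in data if grp == g]
    group_means[g] = sum(xs) / len(xs)

# Center x on the group-specific mean; the centered x would replace x
# (and enter the group*x interaction) in the logistic regression.
centered = [(g, x - group_means[g]) for g, x in data]
```

With this coding, the group coefficient compares the groups at their respective group means, so no comparison is made at an x-value unobserved for either group; the price, as noted above, is that the coefficients need careful explanation.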

HTH.

Cheers,
Bruce
--
Bruce Weaver
bwe...@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/Home
"When all else fails, RTFM."

Ted Harding

unread,
Jul 26, 2010, 5:20:09 PM
to meds...@googlegroups.com

Well, if we're on jokes about that kind of statistician, here's one
from the 1960s (yes, there were bomb scares on aircraft even then).

A businessman needed to make a flight from London to New York,
but was scared he might find himself on a plane with a bomb on it,
brought on board by some passenger. So he consulted a statistician
friend for advice on how to reduce the risk.

The statistician asked him: "What's the chance an aircraft will
have a bomb brought onto it?"

The businessman replied "I'm told it's about 1 in 500, but
that's still too high for me."

The statistician then said: "Ah, that's OK then. All you need
to do is take a bomb on board yourself, so long as you don't
detonate it. If it's 1 in 500 that there's one bomb on board,
then it's 1 in 250,000 that there would be two, so you'd be
really safe."

(Mind you, he may not have been "that kind of statistician" at
all, but rather the kind skilled at giving reassuring advice
in terms that would be believed by his client).

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 26-Jul-10 Time: 22:20:06
------------------------------ XFMail ------------------------------


Belinda Dawson

unread,
Aug 7, 2010, 6:58:51 PM
to meds...@googlegroups.com
Hi Bruce,
 
Please help, thank you!
The following link appears to be broken:
http://www.gseis.ucla.edu/courses/ed230bc1/notes4/center.html

 

Bruce Weaver

unread,
Aug 9, 2010, 10:06:12 AM
to MedStats
On Aug 7, 6:58 pm, Belinda Dawson <belindadaws...@hotmail.com> wrote:
> Hi Bruce,
>
>    Please help, thank you!
>    The following link is not good:
>  http://www.gseis.ucla.edu/courses/ed230bc1/notes4/center.html
>

It looks like the website has been re-organized. Try here:

http://web.archive.org/web/20080309021357/http://www.gseis.ucla.edu/courses/ed230bc1/notes4/center.html

Belinda Dawson

unread,
Aug 9, 2010, 12:35:05 PM
to meds...@googlegroups.com
Bruce,
   I got it. Thank you very much!
Belinda
 

Bjoern

unread,
Aug 10, 2010, 2:28:44 AM
to MedStats
There is a 2001 paper by Rubin (Using propensity scores to help design
observational studies: application to the tobacco litigation. Health
Services & Outcomes Research Methodology 2: 169-188) that could be
useful here. In the paper Rubin was amongst other things looking at
how "different" two groups had to be for standard regression
adjustments to no longer be able to reliably adjust for fully observed
covariates (i.e. even without there being unobserved covariates).

I am not sure the article would necessarily help you with answering
the questions you seem to want to get from the data, but it may help
you with assessing whether to trust the logistic regression analysis
with respect to saying anything about group=0/1.

Best Regards,
Björn