Some people may find it useful to read a summary of the mathematical
relationships between the RR = p2/p1, the OR = (p2/(1-p2))/(p1/(1-p1)),
and the difference = p2-p1 involving two proportions p1 and p2.
A little while ago I prepared an outline of the relationships between
these quantities, and in view of this correspondence I have uploaded
it onto my little website at
http://www.zen89632.zen.co.uk/R/TwoProportions/lambda_delta.pdf
in case anyone is interested. If anyone has comments or criticisms,
I would be grateful to hear of them.
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 17-Dec-09 Time: 09:53:02
------------------------------ XFMail ------------------------------
--
To post a new thread to MedStats, send email to MedS...@googlegroups.com .
MedStats' home page is http://groups.google.com/group/MedStats .
Rules: http://groups.google.com/group/MedStats/web/medstats-rules
_____________________________________________________
Doug Altman
Professor of Statistics in Medicine
Centre for Statistics in Medicine
University of Oxford
Wolfson College Annexe
Linton Road
Oxford OX2 6UD
email: doug....@csm.ox.ac.uk
Tel: 01865 284400 (direct line
01865 284401)
Fax: 01865 284424
www:
http://www.csm-oxford.org.uk/
EQUATOR Network - resources for reporting research
www:
http://www.equator-network.org/
To slightly disagree with Arin, the odds ratio is a great measure for
prospective cohort studies.
A major advantage of the odds ratio is that it does not impose any
restrictions on p1 and p2, whereas, for example, a risk ratio of 2 can
only apply to p1 <= 1/2. So models based on odds ratios will not
require 'mathematical' interactions (interactions that don't make
sense based on subject matter knowledge) just to keep probabilities
between 0 and 1.
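
The restriction described above can be checked numerically. What follows is only a minimal Python sketch, not something from the thread: the function names are made up for illustration, and it simply computes p2 from p1 under a fixed risk ratio and under a fixed odds ratio, using the definitions RR = p2/p1 and OR = (p2/(1-p2))/(p1/(1-p1)).

```python
def p2_from_rr(p1, rr):
    """p2 implied by a fixed risk ratio rr = p2/p1 (the result may exceed 1)."""
    return rr * p1


def p2_from_or(p1, odds_ratio):
    """p2 implied by a fixed odds ratio (p2/(1-p2)) / (p1/(1-p1))."""
    odds2 = odds_ratio * p1 / (1.0 - p1)
    return odds2 / (1.0 + odds2)


if __name__ == "__main__":
    # Illustrative grid of baseline proportions; the values are arbitrary.
    for p1 in (0.1, 0.3, 0.5, 0.7, 0.9):
        via_rr = p2_from_rr(p1, 2.0)
        via_or = p2_from_or(p1, 2.0)
        flag = "" if via_rr <= 1.0 else "  <- not a probability"
        print(f"p1={p1:.1f}  RR=2: p2={via_rr:.2f}{flag}  OR=2: p2={via_or:.3f}")
```

For any p1 strictly between 0 and 1 the odds-ratio version returns a value strictly between 0 and 1, whereas the risk-ratio version leaves that range as soon as p1 exceeds 1/RR.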
Frank
Thanks Frank! That induced me to have another look at it, to see
how great it is -- and my eye promptly spotted an error! Namely,
Page 1 left-hand side, paragraph "For Constant RR", I wrote
"so 0 <= p1 <= 1 - 1/lambda", which (if you think and/or look at
the diagram) should be "so 0 <= p1 <= 1/lambda".
I have corrected this, and the new version is available as before:
http://www.zen89632.zen.co.uk/R/TwoProportions/lambda_delta.pdf
I plead initial laziness, in that I think I must have copied down
the preceding "For Constant Difference" and then edited the bits
that needed changing (but not all of them ... ). Be that as it may,
it links nicely into Frank's comment below:
> To slightly disagree with Arin, the odds ratio is a great measure
> for prospective cohort studies.
>
> A major advantage of the odds ratio is that it does not impose any
> restrictions on p1 and p2, whereas, for example, a risk ratio of 2
> can only apply to p1 <= 1/2. So models based on odds ratios will
> not require 'mathematical' interactions (interactions that don't
> make sense based on subject matter knowledge) just to keep
> probabilities between 0 and 1.
> Frank
Indeed. And this raises the perennial issue of the tensions between
[A] Mathematically smooth/tractable models
[B] Interpretation of models in terms of mechanisms
[C] Expressing the results of model fitting in terms people can grasp
with minimal risk of confusion or misinterpretation.
For a binary outcome, logistic regression models a linear predictor
for the log-Odds, so an increment in a covariate X emerges as a
proportional increment in log-Odds, i.e. as an Odds-Ratio. With
this linear predictor, the sufficient statistics are sum(Yi) and
sum(Yi*Xi) (outcome Yi = 0/1), and all fits smoothly into the
classical Fisherian theory of inference and information.
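
The sufficiency claim can be illustrated with a small Python sketch. The data below are invented purely for illustration: two different outcome vectors are chosen so that they share the same sum(Yi) and sum(Yi*Xi), and the log-likelihood of the logistic model is then the same for both, whatever alpha and beta are.

```python
import math


def loglik(alpha, beta, xs, ys):
    """Bernoulli log-likelihood for logit P(Y=1|x) = alpha + beta*x."""
    return sum(
        y * (alpha + beta * x) - math.log1p(math.exp(alpha + beta * x))
        for x, y in zip(xs, ys)
    )


def sufficient_stats(xs, ys):
    """Return the pair (sum(Yi), sum(Yi*Xi))."""
    return sum(ys), sum(x * y for x, y in zip(xs, ys))


# Hypothetical data: same covariate values, two different outcome patterns.
xs = [0, 1, 2, 3]
ys_a = [1, 0, 0, 1]   # sum(Y) = 2, sum(Y*X) = 0 + 3 = 3
ys_b = [0, 1, 1, 0]   # sum(Y) = 2, sum(Y*X) = 1 + 2 = 3
```

Since the log-likelihood is alpha*sum(Yi) + beta*sum(Yi*Xi) minus a term that depends only on the Xi, calling loglik with ys_a or ys_b gives the same value for any (alpha, beta).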
But what does it represent? One interpretation of a model for
the probability of Y=1 for a binary outcome:
Prob(Y=1|X=x) = F(x; alpha, beta)
is that there is an underlying latent variable U, not directly
observed, such that each potential subject has a value of U,
distributed over subjects with distribution function F as above,
which can be interpreted as a "tolerance" towards "stimulus" X:
if, given X=x, a subject has U=u, then Y=1 if u<x (insufficient
tolerance) -- subject (say) dies; while if u>x (sufficient
tolerance) then Y=0 (subject survives).
So, if you use a logistic regression, you are implicitly acting
as though there is some U which has the Logistic distribution:
Prob(U <= u) = F(u; alpha, beta) = exp(L)/(1+exp(L))
L = alpha + beta*u
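
The latent "tolerance" reading can be simulated directly. This is a rough Python sketch under stated assumptions: beta > 0, tolerances drawn by inverting F, and arbitrary illustrative values for alpha, beta, the sample size and the seed. The function names are hypothetical.

```python
import math
import random


def logistic_cdf(u, alpha, beta):
    """F(u; alpha, beta) = exp(L) / (1 + exp(L)) with L = alpha + beta*u."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * u)))


def draw_tolerance(rng, alpha, beta):
    """Draw U with distribution function F by inverting F (assumes beta > 0)."""
    p = rng.random()
    while p == 0.0:          # avoid log(0); random() never returns 1.0
        p = rng.random()
    return (math.log(p / (1.0 - p)) - alpha) / beta


def simulate_response_rate(x, alpha, beta, n=20000, seed=1):
    """Fraction of n simulated subjects whose tolerance U is below x (Y = 1)."""
    rng = random.Random(seed)
    responders = sum(draw_tolerance(rng, alpha, beta) < x for _ in range(n))
    return responders / n
```

For any stimulus level x, simulate_response_rate(x, alpha, beta) should come out close to logistic_cdf(x, alpha, beta), which is the sense in which fitting a logistic regression amounts to assuming logistically distributed tolerances.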
But where, in the world of adopting probability distributions
to model naturally occurring variables, would you spontaneously
adopt a logistic distribution as a natural representation?
(Apart, of course, from when you are quietly coerced into it
by "spontaneously" adopting logistic regression, perhaps not
being aware of what it implies under its skin).
If you approach the question from this point of view, you might
more spontaneously adopt a Normal distribution for the underlying
latent variable. And then, of course, you would be led to use
a Probit model where F is now the distribution function for the
Normal distribution, not the Logistic.
And the Probit model was the first on the scene (Bliss, 1935;
further developed by Fisher, in part with Bliss). Interestingly,
however, Fisher (the discoverer of the concept of Sufficiency)
did not pursue the question of sufficiency in the Probit model
(which does not lead to interesting sufficient statistics).
The Logistic model seems to have been raised into prominence by Berkson
in the 1950s, and somewhere along that line the simple sufficient
statistics emerged. Also, the natural interpretation of the
coefficients as proportionality constants for changes in log-OR
emerged (especially in connection with contingency tables).
However, an increment in log-Odds is not simple to understand
or explain. How should I understand some exposure which "doubles
my Odds of death" (an Odds-Ratio of 2)? I need a baseline odds (from which the
baseline risk can be calculated, of course) so then I can get
the odds if exposed (and then could get the exposed risk). But
it is the risk which is interesting, not the odds (unless someone
is taking bets on the outcome).
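
The chain of conversions just described can be written out as a worked equation. The symbols p_0 (baseline risk) and p_1 (exposed risk), and the numerical example with a baseline risk of 0.10, are illustrative assumptions and not taken from the thread:

```latex
\mathrm{odds}_0 = \frac{p_0}{1-p_0}, \qquad
\mathrm{odds}_1 = \mathrm{OR}\times\mathrm{odds}_0, \qquad
p_1 = \frac{\mathrm{odds}_1}{1+\mathrm{odds}_1}.

% Example: p_0 = 0.10 and OR = 2
\mathrm{odds}_0 = \frac{0.10}{0.90} = \frac{1}{9}, \qquad
\mathrm{odds}_1 = \frac{2}{9}, \qquad
p_1 = \frac{2/9}{11/9} = \frac{2}{11} \approx 0.18.
```

So at this baseline, doubling the odds takes the risk from 10% to about 18%, not to 20%.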
It is a bit easier to understand something which would double my
Risk (a Risk-Ratio of 2). But I still need the baseline. If, in round numbers,
a 50km car journey normally gives me a 1/10000 chance of becoming
a casualty, and (in the current wintry conditions here), that
risk is doubled, then it becomes 1/5000. That is the sort of
risk one normally treats lightly (though not too often).
But if I had a choice between (say) undergoing some procedure
which gave me a 50% chance of death, or not undergoing it which
would almost certainly result in death, then the choice between
the Risk (50%) and twice the Risk (100%) becomes starker.
However, often some result is expressed in the media as "X doubles
the risk of cancer", which tends to provoke a sensation of understanding
in the people who are given this message. Presumably they refer
"double the risk" to some intuitive notion of "the risk", which
implies that there is some populist Bayesian prior out there.
Few are they who ask "What, from 1 in 1,000,000 to 2 in 1,000,000?
Who cares?" (Provided 1 in a million is realistic, of course ... ).
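
The point about baselines is just arithmetic, and can be sketched in a few lines of Python. The helper name below is made up for illustration; the three baseline risks are the ones used in the examples above (1 in 10,000, 50%, and 1 in 1,000,000).

```python
def doubled_risk_summary(baseline_risk, risk_ratio=2.0):
    """Return (exposed risk, absolute increase) for a given baseline risk."""
    exposed = baseline_risk * risk_ratio
    if exposed > 1.0:
        raise ValueError("this risk ratio is impossible at this baseline risk")
    return exposed, exposed - baseline_risk


if __name__ == "__main__":
    for baseline in (1 / 10000, 0.5, 1 / 1000000):
        exposed, increase = doubled_risk_summary(baseline)
        print(f"baseline={baseline:.6g}  doubled={exposed:.6g}  increase={increase:.6g}")
```

The same "doubling" corresponds to an absolute increase of one in a million in one case and of 50 percentage points in another, which is why the relative figure alone says little.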
The nice thing about the Logistic model is that the coefficients
which flow out of it relate directly to a risk-related entity,
namely the log-Odds. Setting up a model in terms of Risk is not
so straightforward. Using a Poisson GLM with a log link will do it,
but you are only safe from unrealistic results if you apply it
to fairly rare outcomes. Otherwise you can predict Risk > 1 (or,
if you model the Risk on an identity link instead, negative Risk),
just as, in Frank's comment, you can't have RR>2 if p1 > 0.5.
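
A small Python sketch shows how a log-link model for the Risk can leave the unit interval. With a log link the fitted Risk is exp(linear predictor), which is always positive but has no upper bound. The coefficients below are invented for illustration (a Risk-Ratio of 2 per unit of x, with a rare and a common baseline); nothing here is fitted to data.

```python
import math


def log_link_risk(alpha, beta, x):
    """Fitted risk under a log link: log(risk) = alpha + beta*x."""
    return math.exp(alpha + beta * x)


# Hypothetical coefficients: RR = 2 per unit increase in x.
BETA = math.log(2.0)
ALPHA_RARE = math.log(0.001)    # baseline risk 0.1%
ALPHA_COMMON = math.log(0.2)    # baseline risk 20%

if __name__ == "__main__":
    for x in range(4):
        rare = log_link_risk(ALPHA_RARE, BETA, x)
        common = log_link_risk(ALPHA_COMMON, BETA, x)
        print(f"x={x}  rare-outcome risk={rare:.4f}  common-outcome risk={common:.2f}")
```

With the rare baseline the fitted risks stay far below 1 over this range of x; with the common baseline they pass 1 by x = 3.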
But what is it that flows out of the Probit model? Nothing obviously
natural whatever.
Nevertheless, there is an interesting relationship between the
Logistic model and the Normal distribution. Imagine a population
consisting of two Groups, one labelled Y=0 and the other labelled Y=1.
In each, a variable X has a Normal distribution; the variance is
the same in both Groups, they differ only in their means.
Group 1 constitutes a proportion p of the population, Group 2
constitutes a proportion 1-p.
Now choose a member at random from the entire population, and observe
the value of X. Then the probability that that individual has Y=1
is given by the Logistic model. So you can at one and the same time
"spontaneously" adopt a Normal distribution for a naturally occurring
variable, and a Logistic model for the outcome Y.
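
This relationship can be checked numerically with Bayes' rule. The Python sketch below makes some labelled assumptions: p is taken to be the proportion of the population with Y=1, mu0 and mu1 are the means of X in the Y=0 and Y=1 Groups, sigma is the common standard deviation, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) variable at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_y1(x, p, mu0, mu1, sigma):
    """P(Y=1 | X=x) by Bayes' rule; p is the assumed proportion with Y=1."""
    num = p * normal_pdf(x, mu1, sigma)
    return num / (num + (1.0 - p) * normal_pdf(x, mu0, sigma))


def logistic_form(x, p, mu0, mu1, sigma):
    """The same probability written as a logistic function of x."""
    beta = (mu1 - mu0) / sigma**2
    alpha = math.log(p / (1.0 - p)) - (mu1**2 - mu0**2) / (2.0 * sigma**2)
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
```

The two functions agree to rounding error for any x, because with equal variances the log of the ratio of the two Normal densities is linear in x. With unequal variances a quadratic term in x would remain.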
But, of course, when one looks at typical epidemiological data,
and segregates them into a Y=1 Group and a Y=0 Group, you are
very unlikely (as it appears in practice) to be dealing with a
situation where the distribution of X in each group would appear
to be Normal. So maybe the above relationship is not good comfort
for those who want to feel happy with both the "smoothness" of
the Logistic model and the "naturalness" of the Normal distribution.
So, all that being said, just *what* is the interpretation, as a
mechanism, of the Logistic distribution implied by the Logistic
regression model? Or is it just a tractable approximation to the
Normal distribution (and hence the Logistic model to the Probit
model)? Not forgetting, that in the Bioassay context where these
things were first developing, the Y=1 response rate was typically
fairly well clear of P=0 and fairly well clear of P=1 (e.g. over
a range 0.1 < P < 0.9); in such a range, there is not a lot of
difference between the Logistic model for Prob(Y=1|X=x) and the
Probit model. It is only when you get out into the P<0.05 (or
P>0.95) tail that they begin to differ markedly. But this sort
of prevalence is common in epidemiological studies -- so then
the question "Is the Logistic model adequately accurate for the
true mechanism?" becomes more pressing.
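
The size of the gap between the two models can be sketched in Python. This rests on an assumption that is not in the thread: the two curves are lined up by dividing the logit by 1.702, a commonly used matching constant. Other ways of matching them would give somewhat different numbers.

```python
import math

SCALE = 1.702  # assumed logit-to-probit matching constant, not from the thread


def normal_cdf(z):
    """Standard Normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_match(p_logistic):
    """Probit probability at the point where the logistic model gives p_logistic."""
    z = math.log(p_logistic / (1.0 - p_logistic))
    return normal_cdf(z / SCALE)


if __name__ == "__main__":
    for p in (0.5, 0.3, 0.1, 0.05, 0.01, 0.001):
        q = probit_match(p)
        print(f"logistic P={p:<6}  probit P={q:.5f}  ratio={p / q:.2f}")
```

Under this matching the two probabilities differ by less than 0.01 in absolute terms across 0.1 < P < 0.9, while in the far tail the logistic probability is several times the probit one, so the disagreement there is large in relative terms.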
Not that any of this is intended to be a definitive resolution
of the issues [A], [B] and [C]. It is just a case of observing
them float in turn to the surface, as you turn the problem round.
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 18-Dec-09 Time: 17:47:09
------------------------------ XFMail ------------------------------
One could argue that the logistic model is more tied to the normal
distribution than is the probit model, because Bayes' rule gives you
the logistic model if you start with multivariate normality for X.
You're right that the probit model coefficients are very hard to
interpret.
I would argue that logistic models are very interpretable without
envisioning latent variables. I think that various plots of predicted
probabilities and nomograms to obtain risk differences for a given
covariate setting are some of the best ways to go.
Cheers
Frank