encouraged to drop collinear variables. What am I missing? I have
mostly handled my analysis by dropping one variable wherever I see evidence
of collinearity (VIFs, condition index, etc.).
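For concreteness, diagnostics of that sort can be run in R along the
following lines (a minimal sketch: mydata, y and x1..x3 are placeholder
names, and the car package is assumed for vif()):

library(car)                    # assumed available; provides vif()
fit <- lm(y ~ x1 + x2 + x3, data = mydata)  # mydata is a placeholder
vif(fit)                        # variance inflation factor per predictor
kappa(model.matrix(fit))        # condition number of the design matrix,
                                # a close relative of the condition index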
Thanks!
mcap
"Co-linearity" means that the predicting variables are closely related
to one another and so many combinations of coefficients will predict the
outcome variable with similar precision. Hence we cannot estimate the
coefficients.
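A minimal R illustration of this (an invented example, not from the post):
with two identical predictors, only the sum of the two coefficients is
determined, so lm() cannot estimate them separately.

x1 <- c(1, 2, 3, 4, 5)
x2 <- x1                          # perfectly collinear with x1
y  <- c(2.1, 3.9, 6.2, 7.8, 10.1)
coef(lm(y ~ x1 + x2))             # x2 is NA: any split of the slope between
                                  # x1 and x2 would fit equally well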
Martin
--
***************************************************
J. Martin Bland
Prof. of Health Statistics
Dept. of Health Sciences
Seebohm Rowntree Building Area 2
University of York
Heslington
York YO10 5DD
Email: mb...@york.ac.uk
Phone: 01904 321334
Fax: 01904 321382
Web site: http://www-users.york.ac.uk/~mb55/
***************************************************
Where do you draw the line between the two? How can you tell the
difference in some cases? With some of my variables, like age and
experience, it is clearly a case of collinearity. But in other cases it
isn't so clear.
I have tasks that I am measuring. Certain tasks are very likely to be
performed together in certain work situations. However, they are
theoretically distinct from each other. Is that confounding, even if
the collinearity diagnostics are positive? Or perhaps the definitions
of the two are being blurred by respondents (they shouldn't be - we
were careful). Is there a way to distinguish?
mcap
That's a good summary! I would add to it that there is a confusing
double use of "confounding" in Statistics, in that in Experimental
Design there is also the terminology of one factor (or interaction)
being confounded with another (totally or partially). In this sense,
total confounding is basically collinearity -- if you include two
factors, each confounded with the other, in the model you have too
many parameters, so exactly the same effect can be predicted using
different combinations of their coefficients.
Whether total or partial, however, this is not "confounding" in
the epidemiological sense explained by Martin above (though it is
possible to have data where there is confounding in both senses).
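A small R sketch of total confounding in this design sense (hypothetical
factors, not from the post): two factors with identical groupings, so one
of them is aliased in the fit.

shift <- factor(c("day", "day", "night", "night"))
crew  <- factor(c("A", "A", "B", "B"))  # same grouping as shift
y <- c(10, 12, 20, 22)
coef(lm(y ~ shift + crew))  # crewB is NA: its effect cannot be
                            # separated from shiftnight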
Ted.
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
You can have both, if you are really unlucky.
Martin
--
>I don't think it is an either-or thing. Confounding relates to the
>relationship of the predictor to the outcome and to another predictor,
>collinearity to the relationship of the predictors to one another.
>You can have both, if you are really unlucky.
Martin, from the way you've written that, it sounds like an inevitability
(at least in one direction), not something that requires any bad luck.
You say that confounding relates to the "relationship of the predictor ....
to another predictor" and that collinearity relates to the "relationship of
the predictors to one another".
Certainly, if there are only two predictors, doesn't confounding (per your
definition) therefore automatically imply collinearity (per your definition)?
... or am I missing something?
Kind Regards
John
----------------------------------------------------------------
Dr John Whittington, Voice: +44 (0) 1296 730225
Mediscience Services Fax: +44 (0) 1296 738893
Twyford Manor, Twyford, E-mail: Joh...@mediscience.co.uk
Buckingham MK18 4EL, UK
----------------------------------------------------------------
On a slightly different note, confounding does not have to be a bad
thing. Sometimes in experimental design, we purposely confound certain
effects in order to reduce the number of cells in a complex design and
thereby reduce the number of subjects needed.
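For example, in a half-fraction of a 2^3 factorial the level of C can be
set from those of A and B, accepting that C is completely confounded with
the A:B interaction (an illustrative sketch with generic factor names, not
from the post):

A <- c(-1,  1, -1,  1)   # coded levels in a 2^(3-1) design, I = ABC
B <- c(-1, -1,  1,  1)
C <- A * B               # C deliberately set equal to the A:B contrast
cor(C, A * B)            # exactly 1: C and A:B are completely confounded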
Paul R. Swank, Ph.D.
Professor, Developmental Pediatrics
Director of Research,
University of Texas Health Science Center at Houston
At 10:41 01/12/06 -0600, Swank, Paul R wrote:
>Confounding refers only to the overlap between the predictor and the
>outcome being in common with another predictor. Thus, the correlation
>between predictors can be quite modest and still lead to confounding.
Agreed.
>Collinearity usually refers to high correlations between predictors.
In terms of 'usual parlance' and/or if you mean 'problematical
collinearity', I'd have to agree - but that is going beyond the
'definition' cited by Martin.
>Thus, if two predictors are highly correlated then it is quite likely
>that the relation of one to some outcome will be confounded with the
>other.
'Quite likely'? Isn't it inevitable?
>However, the relation of a predictor to an outcome can be
>confounded with another predictor and still not be a problem as far as
>collinearity is concerned.
As above, agreed, providing one is defining collinearity as meaning high
correlation (or, perhaps, 'problematical collinearity').
Kind Regards,
I think it is certainly possible for two variables to be highly
correlated and yet for the portion of one that is related to the outcome
not to be the part that is related to the other predictor. Possible, but
not very likely!
Paul R. Swank, Ph.D.
Professor, Developmental Pediatrics
Director of Research,
Children's Learning Institute
University of Texas Health Science Center at Houston
-----Original Message-----
From: MedS...@googlegroups.com On Behalf Of John Whittington
Sent: Friday, December 01, 2006 11:05 AM
To: MedS...@googlegroups.com
Subject: {MEDSTATS} Re: Collinearity = confounding?
>Thanks all!! So, next question: can you have two predictors that are
>highly correlated but theoretically distinct, with not much chance
>that one is part of another or mistaken for another, etc.?
I would have said that this is a pretty common situation - the most common
reason being that the two predictors (although clearly distinct) are
themselves manifestations or consequences of some other common factor (even
if that common factor is something pretty vague, like 'lifestyle', 'wealth'
or whatever). Certainly in the past, and probably still now, 'heavy
drinking' and 'smoking' would probably be an example of such a 'pair' - as
would 'poor housing' and 'poor diet', etc.
>Agreed that it is a semantic issue. Unfortunately, that's how we get
>into lots of trouble sometimes in statistics, when these words take on
>imprecise meaning or multiple meanings.
Indeed. I agree totally with that.
>I think it is certainly possible for two variables to be highly
>correlated and yet for the portion of one that is related to the outcome
>not to be the part that is related to the other predictor. Possible, but
>not very likely!
True, in theory. However, if two predictor variables were 'highly'
correlated, that would not leave much of a 'part' of either of them to be
independently predictive of the outcome. Hence, if one has two predictors
which are not only 'highly correlated' with one another but ALSO 'highly
predictive of' the outcome, then one surely has serious 'confounding'?
Peter Lane
Research Statistics Unit, GlaxoSmithKline
On Dec 1, 12:05 pm, John Whittington <Joh...@mediscience.co.uk> wrote:
> I think that this discussion is getting rather over-semantic, but ....
>
> At 10:41 01/12/06 -0600, Swank, Paul R wrote:
> >Collinearity usually refers to high correlations between predictors.
> In terms of 'usual parlance' and/or if you mean 'problematical
> collinearity', I'd have to agree - but that is going beyond the
> 'definition' cited by Martin.
I think one has to be careful. Absence of high
correlations does not mean the absence of potential
trouble. Here is an extreme example (the data were
originally posted by Jerry Dallal in sci.stat.math).
Hopefully this formats okay when posted (it doesn't
look promising as I write this in Google Groups).
> egdata
   x1 x2  x3  y
1  18 88 106 13
2  72 45 117 43
3  36 63  99 50
4  75 26 101 77
5  22 83 105 23
6  99 71 170 68
7  69 53 122  6
8   6 49  55 51
9  86 99 185 37
10 85 64 149 10
11 87  7  94 32
12 93 32 125 69
13 44 88 132  4
14 34 34  68 13
15 84 28 112 18
> cor(egdata[,-4])
           x1         x2        x3
x1  1.0000000 -0.3067903 0.6573389
x2 -0.3067903  1.0000000 0.5155893
x3  0.6573389  0.5155893 1.0000000
> test.lm <- lm(y ~ x1 + x2 + x3, data = egdata)
> summary(test.lm)

Call:
lm(formula = y ~ x1 + x2 + x3, data = egdata)

Residuals:
     Min       1Q   Median       3Q      Max
-30.1912 -20.8404   0.9215  23.0400  34.5238

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  34.2500    23.4074   1.463    0.169
x1            0.1768     0.2254   0.784    0.448
x2           -0.1935     0.2563  -0.755    0.465
x3                NA         NA      NA       NA

Residual standard error: 24.73 on 12 degrees of freedom
Multiple R-Squared: 0.1247, Adjusted R-squared: -0.02123
F-statistic: 0.8545 on 2 and 12 DF, p-value: 0.4498
Obviously, x3 = x1 + x2, which is exact linear dependence,
yet none of the pairwise correlations is particularly high.
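The exact dependence is easy to confirm directly (a quick check, not part
of the original post):

with(egdata, all(x3 == x1 + x2))   # TRUE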
--
Kevin E. Thorpe
Assistant Professor, Department of Public Health Sciences
Faculty of Medicine, University of Toronto
This is a hot-button topic over in the sci.stat.* groups (strangely
so). One point that is made in those threads is that the purpose of
the model is important: prediction vs. estimation. If the purpose is
prediction (e.g. some financial outcome like stock price), then
collinearity, unless it is perfect collinearity and blows up your
model, is not a problem. If the purpose is estimation (e.g. the search
for risk factors of a disease), then it may be a problem.
Bill Howells, MS, data analyst
12 MODEL y
13 FIT x1,x2,x3
Message: term x3 cannot be included in the model because it is aliased
with terms already in the model.
(x3) = (x1) + (x2)
The linear dependence, or "aliasing", is easily detected and diagnosed
in the model-fitting algorithm, so any package should be able to
provide this information to let you know what is going on. It is really
much the same as collinearity, with one variable being highly (here
totally) correlated with a linear combination of the others; but, as has
been pointed out, it is harder to detect by a simple correlation approach.
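One simple check along these lines (an illustrative sketch in R, using
Kevin's egdata from earlier in the thread) is to regress each explanatory
variable on the others; an R-squared of 1, or very near 1, flags the
dependence that the pairwise correlations miss:

summary(lm(x3 ~ x1 + x2, data = egdata))$r.squared   # exactly 1 here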
There are two types of aliasing: "intrinsic aliasing" is when the
dependence is due to the model fitted, regardless of the actual data
(one level of a categorical variable is always aliased if you include
an overall mean or intercept term), whereas "extrinsic aliasing" is as
in the above example, when it is due to the actual observations of the
explanatories (though there may be some intrinsic mechanism behind
this, of course).
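The intercept case is easy to see in R (a small sketch, not from the
post): the full set of dummy columns for a factor sums to the intercept
column whatever the data are.

f <- factor(c("a", "a", "b", "b", "c", "c"))
D <- model.matrix(~ f - 1)   # one 0/1 dummy column per level
all(rowSums(D) == 1)         # TRUE: the dummies sum to the intercept column,
                             # so one level is intrinsically aliased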
>Message: term x3 cannot be included in the model because it is aliased
>with terms already in the model.
>(x3) = (x1) + (x2)
>
>The linear dependence, or "aliasing", is easily detected and diagnosed
>in the model-fitting algorithm, so any package should be able to
>provide this information to let you know what is going on. It is really
>much the same as collinearity, with one variable being highly (here
>totally) correlated with a linear combination of the others; but, as has
>been pointed out, it is harder to detect by a simple correlation approach.
Peter, does GenStat ever produce a similar message when the correlation
between one variable and a linear combination of other variables is NOT
'total' - and, if so, do you know what criterion it uses for deciding how
high such a correlation has to be in order to constitute 'aliasing'?
The reason I ask obviously relates to what you go on to say:
>There are two types of aliasing: "intrinsic aliasing" is when the
>dependence is due to the model fitted, regardless of the actual data
>(one level of a categorical variable is always aliased if you include
>an overall mean or intercept term), whereas "extrinsic aliasing" is as
>in the above example, when it is due to the actual observations of the
>explanatories (though there may be some intrinsic mechanism behind
>this, of course).
If, as I rather suspect, such an error message only arises with perfect
correlation (between a variable and a linear combination of other
variables), then it could obviously detect 'intrinsic aliasing' but rarely,
if ever, would it detect real-world 'extrinsic aliasing'.
mcap
> I think one has to be careful. Absence of high
> correlations does not mean the absence of potential
> trouble. Here is an extreme example (the data were
> originally posted by Jerry Dallal in sci.stat.math).
> Hopefully this formats okay when posted (it doesn't
> look promising as I write this in google groups).
---- snip the example ----
And there's a flip side to Kevin's point: The presence of a high
correlation between two explanatory variables does not necessarily
indicate a problem. E.g., in a model with two predictors where X2 =
X1-squared, there will be a high correlation between X1 and X2. But
there's nothing at all wrong with the model (assuming it fits the data
well).
--
Bruce Weaver
bwe...@lakeheadu.ca
www.angelfire.com/wv/bwhomedir
That's certainly true if X1 is positive. But if X1 can take both
positive and negative values then it can be false. For example,
with X2 = X1^2:
X1 = {-2,-1,0,1,2}
X2 = {4,1,0,1,4}
corr(X1,X2) = 0.
> But there's nothing at all wrong with the model (assuming it
> fits the data well).
Again, that can depend. If X1 has fairly large values but
varies over a fairly small range, then X2 = X1^2 can be
a very close approximation to a linear function of X1, so
the collinearity problem can arise again and there will indeed
be "something wrong with the model". For example (ages of
patients no longer young):
X1 = {56,58,60,62,64}
X2 = {3136,3364,3600,3844,4096}
You can plot X2 vs X1 and barely see any departure from
a straight line.
Comparison of X2 with the best-fitting linear function of X1
(X2.fitted - X2 in parentheses):
  X2   X2.fitted
 3136     3128   (-8)
 3364     3368   (+4)
 3600     3608   (+8)
 3844     3848   (+4)
 4096     4088   (-8)
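A quick R check of these numbers (a verification added here, not part of
the original post):

X1 <- c(56, 58, 60, 62, 64)
X2 <- X1^2
cor(X1, X2)                  # about 0.9998: almost perfectly collinear
round(fitted(lm(X2 ~ X1)))   # 3128 3368 3608 3848 4088, as in the table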
So general statements about this sort of thing can often be
refuted by particular data sets. There is no substitute for
looking properly at the data!
Best wishes,
Ted.
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
GenStat produces aliasing messages when it detects singularity in the
matrix it inverts to get the solution. It uses a tolerance set by
default to eps*1e7, where eps is the smallest number for which 1+eps is
recognized as different from 1 (eps=1.1e-16 on my pc); if the diagonal
value of the matrix for the term to be estimated has reduced from its
original value (before fitting any terms) to a fraction smaller than
the tolerance, then it is considered aliased. This means that the
collinearity has to be pretty near exact; for example, the variates
a={0, 100, 100} and b={0, 100, 100.01} are not aliased, but they are if
b={0, 100, 100.001}. Before that point, GenStat fits the term, but
reports "near aliasing" with a message like
Message: the variance of some parameter estimates is seriously
inflated, due to near collinearity or aliasing between the following
parameters, listed with their variance inflation factors.
a 133346664.09
b 133346664.09
This is triggered if the inflation factor is greater than 100, as with
b={0, 100, 112} but not with b={0, 100, 113}.
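That threshold can be reproduced approximately in R (a sketch using the
textbook definition VIF = 1/(1 - R^2) from regressing one variate on the
other, not GenStat's internal code):

a <- c(0, 100, 100)
b <- c(0, 100, 112)
r2 <- summary(lm(b ~ a))$r.squared
1 / (1 - r2)   # about 105, just over the trigger of 100;
               # with b = c(0, 100, 113) it drops to about 90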
>GenStat does detect real-world extrinsic aliasing, by checking variance
>inflation factors: it produces a message for this long before it
>detects "collinearity". [snip]
>
>GenStat produces aliasing messages when it detects singularity in the
>matrix it inverts to get the solution. It uses a tolerance set by
>default to eps*1e7, where eps is the smallest number for which 1+eps is
>recognized as different from 1 (eps=1.1e-16 on my pc); if the diagonal
>value of the matrix for the term to be estimated has reduced from its
>original value (before fitting any terms) to a fraction smaller than
>the tolerance, then it is considered aliased. This means that the
>collinearity has to be pretty near exact;
As you will realise from my previous message, that's what I rather suspected.
>...Before that point, GenStat fits the term, but
>reports "near aliasing" with a message like
>
>Message: the variance of some parameter estimates is seriously
>inflated, due to near collinearity or aliasing between the following
>parameters, listed with their variance inflation factors.
>a 133346664.09
>b 133346664.09
>
>This is triggered if the inflation factor is greater than 100, as with
>b={0, 100, 112} but not with b={0, 100, 113}.
That's interesting. However, in terms of this discussion, when the message
says "..near collinearity or aliasing..." does this imply that it sees
those two things as different, or is it indicating that it is using the two
terms synonymously?
>In the GenStat text, collinearity and aliasing are intended as synonyms
>(I wrote the original version of this text some years ago when I worked
>at Rothamsted). I am not aware of any difference in meaning, though I
>haven't heard people talk of extrinsic or intrinsic collinearity: they
>describe the same type of property of the design matrix, and do not
>involve the response variable in the model.
Thanks for clarifying. I'm trying to get to the bottom of how people are
using these various words.
If anyone out there feels that there is some difference between the
meanings of 'collinearity' and 'aliasing', please speak up!
As for me, I have to say that, as I implied in my earlier posts, I have
always taken 'aliasing' to imply perfect collinearity (i.e. total
redundancy of one of the variables), with anything less than that just
being called collinearity.
Peter
Peter Lane, Research Statistics Unit, GlaxoSmithKline
>I agree that "aliased" by itself should mean exact linear dependence,
>but I am used to seeing "partially aliased" -- perhaps as another way
>of saying "partially confounded", and so getting into the area already
>discussed that involves the response variable as well. Collinearity to
>me also means exact linear dependence, and I would expect to see "near
>collinearity" as a description of anything less.
Thanks. That all sounds very reasonable.