encouraged to drop collinear variables. What am I missing? I have
mostly handled my analysis by dropping one variable wherever I see evidence
of collinearity (VIFs, condition index, etc.).
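For concreteness, diagnostics of that sort can be run in R along the
following lines (a minimal sketch: mydata, y and x1..x3 are placeholder
names, and the car package is assumed for vif()):

library(car)                    # assumed available; provides vif()
fit <- lm(y ~ x1 + x2 + x3, data = mydata)  # mydata is a placeholder
vif(fit)                        # variance inflation factor per predictor
kappa(model.matrix(fit))        # condition number of the design matrix,
                                # a close relative of the condition index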
Thanks!
mcap
"Co-linearity" means that the predicting variables are closely related
to one another and so many combinations of coefficients will predict the
outcome variable with similar precision. Hence we cannot estimate the
coefficients.
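A minimal R illustration of this (an invented example, not from the post):
with two identical predictors, only the sum of the two coefficients is
determined, so lm() cannot estimate them separately.

x1 <- c(1, 2, 3, 4, 5)
x2 <- x1                          # perfectly collinear with x1
y  <- c(2.1, 3.9, 6.2, 7.8, 10.1)
coef(lm(y ~ x1 + x2))             # x2 is NA: any split of the slope between
                                  # x1 and x2 would fit equally well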
Martin
--
***************************************************
J. Martin Bland
Prof. of Health Statistics
Dept. of Health Sciences
Seebohm Rowntree Building Area 2
University of York
Heslington
York YO10 5DD
Email: mb...@york.ac.uk
Phone: 01904 321334
Fax: 01904 321382
Web site: http://www-users.york.ac.uk/~mb55/
***************************************************
Where do you draw the line between the two? How can you tell the
difference in some cases? With some of my variables, like age and
experience, it is clearly a case of collinearity. But in other cases it
isn't so clear.
I have tasks that I am measuring. Certain tasks are very likely to be
performed together in certain work situations. However, they are
theoretically distinct from each other. Is that confounding, even if
the collinearity diagnostics are positive? Or perhaps the definitions
of the two are being blurred by respondents (they shouldn't be - we
were careful). Is there a way to distinguish?
mcap
That's a good summary! I would add to it that there is a confusing
double use of "confounding" in Statistics, in that in Experimental
Design there is also the terminology of one factor (or interaction)
being confounded with another (totally or partially). In this sense,
total confounding is basically collinearity -- if you include two
factors, each confounded with the other, in the model you have too
many parameters, so exactly the same effect can be predicted using
different combinations of their coefficients.
Whether total or partial, however, this is not "confounding" in
the epidemiological sense explained by Martin above (though it is
possible to have data where there is confounding in both senses).
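A small R sketch of total confounding in this design sense (hypothetical
factors, not from the post): two factors with identical groupings, so one
of them is aliased in the fit.

shift <- factor(c("day", "day", "night", "night"))
crew  <- factor(c("A", "A", "B", "B"))  # same grouping as shift
y <- c(10, 12, 20, 22)
coef(lm(y ~ shift + crew))  # crewB is NA: its effect cannot be
                            # separated from shiftnight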
Ted.
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
You can have both, if you are really unlucky.
Martin
--
>I don't think it is an either-or thing. Confounding relates to the
>relationship of the predictor to the outcome and to another predictor,
>collinearity to the relationship of the predictors to one another.
>You can have both, if you are really unlucky.
Martin, from the way you've written that, it sounds like an inevitability
(at least in one direction), not something that requires any bad luck.
You say that confounding relates to the "relationship of the predictor ....
to another predictor" and that collinearity relates to the "relationship of
the predictors to one another".
Certainly, if there are only two predictors, doesn't confounding (per your
definition) therefore automatically imply collinearity (per your definition)?
... or am I missing something?
Kind Regards
John
----------------------------------------------------------------
Dr John Whittington, Voice: +44 (0) 1296 730225
Mediscience Services Fax: +44 (0) 1296 738893
Twyford Manor, Twyford, E-mail: Joh...@mediscience.co.uk
Buckingham MK18 4EL, UK
----------------------------------------------------------------
On a slightly different note, confounding does not have to be a bad
thing. Sometimes in experimental design, we purposely confound certain
effects in order to reduce the number of cells in a complex design and
thereby reduce the number of subjects needed.
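For example, in a half-fraction of a 2^3 factorial the level of C can be
set from those of A and B, accepting that C is completely confounded with
the A:B interaction (an illustrative sketch with generic factor names, not
from the post):

A <- c(-1,  1, -1,  1)   # coded levels in a 2^(3-1) design, I = ABC
B <- c(-1, -1,  1,  1)
C <- A * B               # C deliberately set equal to the A:B contrast
cor(C, A * B)            # exactly 1: C and A:B are completely confounded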
Paul R. Swank, Ph.D.
Professor, Developmental Pediatrics
Director of Research,
University of Texas Health Science Center at Houston
At 10:41 01/12/06 -0600, Swank, Paul R wrote:
>Confounding refers only to the overlap between the predictor and the
>outcome being in common with another predictor. Thus, the correlation
>between predictors can be quite modest and still lead to confounding.
Agreed.
>Collinearity usually refers to high correlations between predictors.
In terms of 'usual parlance' and/or if you mean 'problematical
collinearity', I'd have to agree - but that is going beyond the
'definition' cited by Martin.
>Thus, if two predictors are highly correlated then it is quite likely
>that the relation of one to some outcome will be confounded with the
>other.
'Quite likely'? Isn't it inevitable?
>However, the relation of a predictor to an outcome can be
>confounded with another predictor and still not be a problem as far as
>collinearity is concerned.
As above, agreed, providing one is defining collinearity as meaning high
correlation (or, perhaps, 'problematical collinearity').
Kind Regards,
I think it is certainly possible for two variables to be highly
correlated and yet for the portion of one that is related to the outcome
not to be the part that is related to the other predictor. Possible, but
not very likely!
Paul R. Swank, Ph.D.
Professor, Developmental Pediatrics
Director of Research,
Children's Learning Institute
University of Texas Health Science Center at Houston
-----Original Message-----
From: MedS...@googlegroups.com On Behalf Of John Whittington
Sent: Friday, December 01, 2006 11:05 AM
To: MedS...@googlegroups.com
Subject: {MEDSTATS} Re: Collinearity = confounding?
>Thanks all!! So, next question: can you have two predictors that are
>highly correlated but theoretically distinct, with not much chance
>that one is part of another or mistaken for another, etc.?
I would have said that this is a pretty common situation - the most common
reason being that the two predictors (although clearly distinct) are
themselves manifestations or consequences of some other common factor (even
if that common factor is something pretty vague, like 'lifestyle', 'wealth'
or whatever). Certainly in the past, and probably still now, 'heavy
drinking' and 'smoking' would probably be an example of such a 'pair' - as
would 'poor housing' and 'poor diet', etc.
>Agreed that it is a semantic issue. Unfortunately, that's how we get
>into lots of trouble sometimes in statistics, when these words take on
>imprecise meaning or multiple meanings.
Indeed. I agree totally with that.
>I think it is certainly possible for two variables to be highly
>correlated and yet for the portion of one that is related to the outcome
>not to be the part that is related to the other predictor. Possible, but
>not very likely!
True, in theory. However, if two predictor variables were 'highly'
correlated, that would not leave much of a 'part' of either of them to be
independently predictive of the outcome. Hence, if one has two predictors
which are not only 'highly correlated' with one another but ALSO 'highly
predictive of' the outcome, then one surely has serious 'confounding'?
Peter Lane
Research Statistics Unit, GlaxoSmithKline
On Dec 1, 12:05 pm, John Whittington <Joh...@mediscience.co.uk> wrote:
> I think that this discussion is getting rather over-semantic, but ....
>
> At 10:41 01/12/06 -0600, Swank, Paul R wrote:
> >Collinearity usually refers to high correlations between predictors.
> In terms of 'usual parlance' and/or if you mean 'problematical
> collinearity', I'd have to agree - but that is going beyond the
> 'definition' cited by Martin.
I think one has to be careful. Absence of high
correlations does not mean the absence of potential
trouble. Here is an extreme example (the data were
originally posted by Jerry Dallal in sci.stat.math).
Hopefully this formats okay when posted (it doesn't
look promising as I write this in Google Groups).
> egdata
   x1 x2  x3  y
1  18 88 106 13
2  72 45 117 43
3  36 63  99 50
4  75 26 101 77
5  22 83 105 23
6  99 71 170 68
7  69 53 122  6
8   6 49  55 51
9  86 99 185 37
10 85 64 149 10
11 87  7  94 32
12 93 32 125 69
13 44 88 132  4
14 34 34  68 13
15 84 28 112 18
> cor(egdata[,-4])
           x1         x2        x3
x1  1.0000000 -0.3067903 0.6573389
x2 -0.3067903  1.0000000 0.5155893
x3  0.6573389  0.5155893 1.0000000
> test.lm <- lm(y ~ x1 + x2 + x3, data = egdata)
> summary(test.lm)

Call:
lm(formula = y ~ x1 + x2 + x3, data = egdata)

Residuals:
     Min       1Q   Median       3Q      Max
-30.1912 -20.8404   0.9215  23.0400  34.5238

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  34.2500    23.4074   1.463    0.169
x1            0.1768     0.2254   0.784    0.448
x2           -0.1935     0.2563  -0.755    0.465
x3                NA         NA      NA       NA

Residual standard error: 24.73 on 12 degrees of freedom
Multiple R-Squared: 0.1247, Adjusted R-squared: -0.02123
F-statistic: 0.8545 on 2 and 12 DF, p-value: 0.4498
Obviously, x3 = x1 + x2, which is exact linear dependence,
yet none of the pairwise correlations is particularly high.
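The exact dependence is easy to confirm directly (a quick check, not part
of the original post):

with(egdata, all(x3 == x1 + x2))   # TRUE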
--
Kevin E. Thorpe
Assistant Professor, Department of Public Health Sciences
Faculty of Medicine, University of Toronto
This is a hot-button topic over in the sci.stat.* groups (strangely
so). One point that is made in those threads is that the purpose of
the model is important: prediction vs. estimation. If the purpose is
prediction (e.g. some financial outcome like stock price), then
collinearity, unless it is perfect collinearity and blows up your
model, is not a problem. If the purpose is estimation (e.g. the search
for risk factors of a disease), then it may be a problem.
Bill Howells, MS, data analyst
12 MODEL y
13 FIT x1,x2,x3
Message: term x3 cannot be included in the model because it is aliased
with terms already in the model.
(x3) = (x1) + (x2)
The linear dependence, or "aliasing", is easily detected and diagnosed
in the model-fitting algorithm, so any package should be able to
provide this information to let you know what is going on. It is really
much the same as collinearity, with one variable being highly (here
totally) correlated with a linear combination of the others; but, as has
been pointed out, it is harder to detect by a simple correlation approach.
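One simple check along these lines (an illustrative sketch in R, using
Kevin's egdata from earlier in the thread) is to regress each explanatory
variable on the others; an R-squared of 1, or very near 1, flags the
dependence that the pairwise correlations miss:

summary(lm(x3 ~ x1 + x2, data = egdata))$r.squared   # exactly 1 here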
There are two types of aliasing: "intrinsic aliasing" is when the
dependence is due to the model fitted, regardless of the actual data
(one level of a categorical variable is always aliased if you include
an overall mean or intercept term), whereas "extrinsic aliasing" is as
in the above example, when it is due to the actual observations of the
explanatories (though there may be some intrinsic mechanism behind
this, of course).
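The intercept case is easy to see in R (a small sketch, not from the
post): the full set of dummy columns for a factor sums to the intercept
column whatever the data are.

f <- factor(c("a", "a", "b", "b", "c", "c"))
D <- model.matrix(~ f - 1)   # one 0/1 dummy column per level
all(rowSums(D) == 1)         # TRUE: the dummies sum to the intercept column,
                             # so one level is intrinsically aliased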
>Message: term x3 cannot be included in the model because it is aliased
>with terms already in the model.
>(x3) = (x1) + (x2)
>
>The linear dependence, or "aliasing", is easily detected and diagnosed
>in the model-fitting algorithm, so any package should be able to
>provide this information to let you know what is going on. It is really
>much the same as collinearity, with one variable being highly (here
>totally) correlated with a linear combination of the others; but, as has
>been pointed out, it is harder to detect by a simple correlation approach.
Peter, does GenStat ever produce a similar message when the correlation
between one variable and a linear combination of other variables is NOT
'total' - and, if so, do you know what criterion it uses for deciding how
high such a correlation has to be in order to constitute 'aliasing'?
The reason I ask obviously relates to what you go on to say:
>There are two types of aliasing: "intrinsic aliasing" is when the
>dependence is due to the model fitted, regardless of the actual data
>(one level of a categorical variable is always aliased if you include
>an overall mean or intercept term), whereas "extrinsic aliasing" is as
>in the above example, when it is due to the actual observations of the
>explanatories (though there may be some intrinsic mechanism behind
>this, of course).
If, as I rather suspect, such an error message only arises with perfect
correlation (between a variable and a linear combination of other
variables), then it could obviously detect 'intrinsic aliasing' but rarely,
if ever, would it detect real-world 'extrinsic aliasing'.
mcap
> I think one has to be careful. Absence of high
> correlations does not mean the absence of potential
> trouble. Here is an extreme example (the data were
> originally posted by Jerry Dallal in sci.stat.math).
> Hopefully this formats okay when posted (it doesn't
> look promising as I write this in google groups).
---- snip the example ----
And there's a flip side to Kevin's point: The presence of a high
correlation between two explanatory variables does not necessarily
indicate a problem. E.g., in a model with two predictors where X2 =
X1-squared, there will be a high correlation between X1 and X2. But
there's nothing at all wrong with the model (assuming it fits the data
well).
--
Bruce Weaver
bwe...@lakeheadu.ca
www.angelfire.com/wv/bwhomedir
That's certainly true if X1 is positive. But if X1 can take both
positive and negative values then it can be false. For example,
with X2 = X1^2:
X1 = {-2,-1,0,1,2}
X2 = {4,1,0,1,4}
corr(X1,X2) = 0.
> But there's nothing at all wrong with the model (assuming it
> fits the data well).
Again, that can depend. If X1 has fairly large values but
varies over a fairly small range, then X2 = X1^2 can be
a very close approximation to a linear function of X1, so
the collinearity problem can arise again and there will indeed
be "something wrong with the model". For example (ages of
patients no longer young):
X1 = {56,58,60,62,64}
X2 = {3136,3364,3600,3844,4096}
You can plot X2 vs X1 and barely see any departure from
a straight line.
Comparison of X2 with the best-fitting linear function of X1
(X2.fitted - X2 in parentheses):
  X2   X2.fitted
 3136     3128   (-8)
 3364     3368   (+4)
 3600     3608   (+8)
 3844     3848   (+4)
 4096     4088   (-8)
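A quick R check of these numbers (a verification added here, not part of
the original post):

X1 <- c(56, 58, 60, 62, 64)
X2 <- X1^2
cor(X1, X2)                  # about 0.9998: almost perfectly collinear
round(fitted(lm(X2 ~ X1)))   # 3128 3368 3608 3848 4088, as in the table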
So general statements about this sort of thing can often be
refuted by particular data sets. There is no substitute for
looking properly at the data!
Best wishes,
Ted.
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
GenStat produces aliasing messages when it detects singularity in the
matrix it inverts to get the solution. It uses a tolerance set by
default to eps*1e7, where eps is the smallest number for which 1+eps is
recognized as different from 1 (eps=1.1e-16 on my pc); if the diagonal
value of the matrix for the term to be estimated has reduced from its
original value (before fitting any terms) to a fraction smaller than
the tolerance, then it is considered aliased. This means that the
collinearity has to be pretty near exact; for example, the variates
a={0, 100, 100} and b={0, 100, 100.01} are not aliased, but they are if
b={0, 100, 100.001}. Before that point, GenStat fits the term, but
reports "near aliasing" with a message like
Message: the variance of some parameter estimates is seriously
inflated, due to near collinearity or aliasing between the following
parameters, listed with their variance inflation factors.
a 133346664.09
b 133346664.09
This is triggered if the inflation factor is greater than 100, as with
b={0, 100, 112} but not with b={0, 100, 113}.
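That threshold can be reproduced approximately in R (a sketch using the
textbook definition VIF = 1/(1 - R^2) from regressing one variate on the
other, not GenStat's internal code):

a <- c(0, 100, 100)
b <- c(0, 100, 112)
r2 <- summary(lm(b ~ a))$r.squared
1 / (1 - r2)   # about 105, just over the trigger of 100;
               # with b = c(0, 100, 113) it drops to about 90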
>GenStat does detect real-world extrinsic aliasing, by checking variance
>inflation factors: it produces a message for this long before it
>detects "collinearity". [snip]
>
>GenStat produces aliasing messages when it detects singularity in the
>matrix it inverts to get the solution. It uses a tolerance set by
>default to eps*1e7, where eps is the smallest number for which 1+eps is
>recognized as different from 1 (eps=1.1e-16 on my pc); if the diagonal
>value of the matrix for the term to be estimated has reduced from its
>original value (before fitting any terms) to a fraction smaller than
>the tolerance, then it is considered aliased. This means that the
>collinearity has to be pretty near exact;
As you will realise from my previous message, that's what I rather suspected.
>...Before that point, GenStat fits the term, but
>reports "near aliasing" with a message like
>
>Message: the variance of some parameter estimates is seriously
>inflated, due to near collinearity or aliasing between the following
>parameters, listed with their variance inflation factors.
>a 133346664.09
>b 133346664.09
>
>This is triggered if the inflation factor is greater than 100, as with
>b={0, 100, 112} but not with b={0, 100, 113}.
That's interesting. However, in terms of this discussion, when the message
says "..near collinearity or aliasing..." does this imply that it sees
those two things as different, or is it indicating that it is using the two
terms synonymously?
>In the GenStat text, collinearity and aliasing are intended as synonyms
>(I wrote the original version of this text some years ago when I worked
>at Rothamsted). I am not aware of any difference in meaning, though I
>haven't heard people talk of extrinsic or intrinsic collinearity: they
>describe the same type of property of the design matrix, and do not
>involve the response variable in the model.
Thanks for clarifying. I'm trying to get to the bottom of how people are
using these various words.
If anyone out there feels that there is some difference between the
meanings of 'collinearity' and 'aliasing', please speak up!
As for me, I have to say that, as I implied in my earlier posts, I have
always taken 'aliasing' to imply perfect collinearity (i.e. total
redundancy of one of the variables), with anything less than that just
being called collinearity.
Peter
Peter Lane, Research Statistics Unit, GlaxoSmithKline
>I agree that "aliased" by itself should mean exact linear dependence,
>but I am used to seeing "partially aliased" -- perhaps as another way
>of saying "partially confounded", and so getting into the area already
>discussed that involves the response variable as well. Collinearity to
>me also means exact linear dependence, and I would expect to see "near
>collinearity" as a description of anything less.
Thanks. That all sounds very reasonable.