
linear regression and multicollinearity


hbe...@gmail.com

Nov 5, 2007, 11:25:12 PM
Hi,

I have 2 questions:

1)
I know that multicollinearity may cause some problems, but does it always?
Suppose I have predictors X1 and X2 and a response variable Y with the
following data:

X1  X2   Y
----------
 1   2   3
 2   4   6
 3   6   9
 4   8  12

Since X2 = 2*X1, X1 and X2 are perfectly collinear.

When I fit a least squares regression Y = b0 + b1*X1 + b2*X2,
I expect Y = X1 + X2 (b0 = 0 and b1 = b2 = 1) as the unbiased,
minimum-variance estimate. But with a software package (R, to be exact)
I am told that the system is singular.

That makes sense if I form the X matrix

> X
     [,1] [,2] [,3]
[1,]    1    1    2
[2,]    1    2    4
[3,]    1    3    6
[4,]    1    4    8

and R then tries to invert X'X (in R notation, t(X) %*% X), which is not
invertible, so I get an error.

Of course this is not a real-world problem, but is this an error? Is it
common in packages other than R?

Thanks!
hb

hbe...@gmail.com

Nov 5, 2007, 11:57:59 PM
I'm sorry, R's linear regression (the lm function) works OK. The only
problem is with the solve function for matrix equations; it seems to use
singular value decomposition, and it fails when the matrix A has a
linearly dependent column (or row).
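
The difference between the two functions can be reproduced with a minimal sketch in base R. It assumes the data from the first message; the names x1, x2, XtX and inv_attempt are illustrative. As far as base R is concerned, solve() relies on a LAPACK LU factorization rather than a singular value decomposition, while lm() uses a pivoted QR decomposition and reports the redundant coefficient as NA.

```r
# Data from the first message: x2 is exactly 2 * x1
x1 <- 1:4
x2 <- 2 * x1
y  <- c(3, 6, 9, 12)

X   <- cbind(1, x1, x2)   # design matrix with an intercept column
XtX <- crossprod(X)       # same as t(X) %*% X

# solve() refuses to invert the singular matrix;
# capture the error instead of stopping the script
inv_attempt <- tryCatch(solve(XtX), error = function(e) e)
print(inherits(inv_attempt, "error"))

# lm() detects the rank deficiency and sets the aliased coefficient to NA
fit <- lm(y ~ x1 + x2)
print(coef(fit))
```

Here lm() keeps x1, which comes first in the formula, and drops x2; swapping the order of the terms would drop x1 instead.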

David Winsemius

Nov 6, 2007, 12:43:17 AM
hbe...@gmail.com wrote in
news:1194325079....@50g2000hsm.googlegroups.com:

> I'm sorry, R's linear regression (the lm function) works OK. The only
> problem is with the solve function for matrix equations; it seems to use
> singular value decomposition, and it fails when the matrix A has a
> linearly dependent column (or row).

As it should fail. It will fail in more instances than when you simply
have two rows or columns that are multiples. Form two columns of
random numbers, then form a third column that is the sum of the first two.
X'X will also be rank deficient in that case.

Because of columns 2 and 3.
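
A small sketch of the random-column case, assuming base R and an arbitrary seed; the names a, b, s and M are illustrative only:

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
a <- rnorm(10)
b <- rnorm(10)
M <- cbind(a, b, s = a + b)        # third column is the sum of the first two

rank_M   <- qr(M)$rank             # numerical rank of M
rank_MtM <- qr(crossprod(M))$rank  # numerical rank of M'M

print(c(rank_M = rank_M, rank_MtM = rank_MtM))
```

Both ranks come out smaller than the number of columns, even though no single column is a multiple of another.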

>> Of course this is not a real-world problem, but is this an error? Is it
>> common in packages other than R?

Not an error. Your X'X matrix to be inverted should have used X[,1:2].
The dependent variable column, X[,3], is not in the hat matrix. That is
why the regression works and your inversion did not.

R> X
  V1 V2 V3
1  1  2  3
2  1  4  6
3  1  6  9
4  1  8 12

R> Y <- t(X[,1:2]) %*% X[,1:2]

R> Z <- matinv(Y)
R> Z
      V1    V2
V1  1.50 -0.25
V2 -0.25  0.05
attr(,"rank")
[1] 2
attr(,"swept")
[1] TRUE TRUE

--
David Winsemius

hbe...@gmail.com

Nov 6, 2007, 4:23:45 PM
On Nov 6, 2:43 am, David Winsemius <doe_s...@comcast.n0T> wrote:

X[,3] is not the dependent variable.

X1 X2 Y
-----------
1 2 3
2 4 6
3 6 9
4 8 12

X =
1 1 2
1 2 4
1 3 6
1 4 8

(the first column multiplies the intercept b0)

Y =
3
6
9
12


But maybe I understand your point: instead of inverting X'X (which is
impossible), does R invert a matrix W'W, where W contains the linearly
independent columns of X, or does it use some other technique to
determine the regression coefficients B1, ..., Bp?

X <- matrix(c(1,1,1,1, 1,2,3,4, 2,4,6,8), nrow = 4)
I3 <- diag(3)
Y <- t(X) %*% X
## impossible: Y is singular, so solve() stops with an error
try(solve(Y, I3))

I2 <- diag(2)
Y <- t(X[, 1:2]) %*% X[, 1:2]
## possible: the first two columns of X are linearly independent
solve(Y, I2)


I'm trying to understand the causes and consequences of
multicollinearity deeply; feel free to correct me if I'm confused.
Multicollinearity is equivalent to an ill-conditioned X'X; in the worst
case X'X is not invertible.
Because we use (X'X)^{-1} in the coefficient estimators and X'X is
ill-conditioned, the results are highly sensitive (that is, they have a
large variance) to small changes in the predictor data. Is this the
reason why multicollinearity may cause a big problem?
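
The sensitivity described above can be illustrated with a minimal simulation in base R. Everything here is an assumption made for illustration: the seed, the sample size, the noise levels and the "true" coefficients b1 = b2 = 1 are not taken from the thread.

```r
set.seed(42)                          # arbitrary seed for reproducibility
n  <- 50
x1 <- rnorm(n)
x2 <- 2 * x1 + rnorm(n, sd = 1e-3)    # almost, but not exactly, 2 * x1
y  <- x1 + x2 + rnorm(n, sd = 0.1)    # assumed true model: b1 = b2 = 1

fit_a <- lm(y ~ x1 + x2)

# Refit after a tiny change to one predictor (noise with sd 0.001)
x2_b  <- x2 + rnorm(n, sd = 1e-3)
fit_b <- lm(y ~ x1 + x2_b)

# Condition number of the design matrix and standard error of b2
cond_X <- kappa(cbind(1, x1, x2), exact = TRUE)
se_b2  <- summary(fit_a)$coefficients["x2", "Std. Error"]

print(coef(fit_a))
print(coef(fit_b))
print(c(condition_number = cond_X, std_error_b2 = se_b2))
```

Comparing the two coefficient vectors against the reported standard error gives a feel for how little information the data carry about b1 and b2 separately, even though their combined effect is estimated well.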

I'm now reading http://en.wikipedia.org/wiki/Linear_least_squares,
which says that when X'X is not invertible, other methods based on
QR decomposition or singular value decomposition should be used.
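
As a rough sketch of the QR route in base R, using the X matrix and Y vector from this thread (the names qr_X and fitted_y are illustrative): qr() works on X directly, so X'X never has to be formed or inverted, and it reports the numerical rank.

```r
X <- cbind(1, 1:4, 2 * (1:4))   # intercept, X1, X2 = 2 * X1
y <- c(3, 6, 9, 12)

qr_X <- qr(X)                   # pivoted QR decomposition of X
print(qr_X$rank)                # numerical rank of X
print(qr_X$pivot)               # column order used by the decomposition

# Fitted values are well defined even though the coefficients are not unique
fitted_y <- qr.fitted(qr_X, y)
print(fitted_y)
```

The fitted values reproduce Y because Y lies in the column space of X; only the split of the effect between X1 and X2 is undetermined.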

I very much appreciate your help! I'm trying to understand the
essence of the subject deeply: why, in which cases, and how much
multicollinearity may or may not cause a problem (I think this is more
important than remembering rules, and some texts give only rules without
deeper explanation). So I'm trying to gather and unify concepts from:
http://en.wikipedia.org/wiki/Linear_least_squares
http://en.wikipedia.org/wiki/Linear_regression

>
> R> X
> V1 V2 V3
> 1 1 2 3
> 2 1 4 6
> 3 1 6 9
> 4 1 8 12
>
> R> Y <- t(X[,1:2]) %*% X[,1:2]
>
> R> Z <- matinv(Y)
> R> Z
> V1 V2
> V1 1.50 -0.25
> V2 -0.25 0.05
> attr(,"rank")
> [1] 2
> attr(,"swept")
> [1] TRUE TRUE
>
> --
> David Winsemius

Thanks David for your response!

David Winsemius

Nov 7, 2007, 12:26:46 AM
Dear DP;

You _should_ get an error when you try to invert a singular matrix. R is
behaving correctly.

When I run your data through the function lm(), the output contains
the line:
"Coefficients: (1 not defined because of singularities)".
Again R is giving appropriate warnings.

> str(x.df)
'data.frame':   4 obs. of  4 variables:
 $ V1: num  1 1 1 1
 $ V2: num  2 4 6 8
 $ V3: num  3 6 9 12
 $ y : num  3 6 9 12
> xdf.mdl<-lm(y ~ V2+V3,data=x.df)
> summary(xdf.mdl)

Call:
lm(formula = y ~ V2 + V3, data = x.df)

Residuals:
         1          2          3          4
 3.680e-16 -6.134e-16  1.227e-16  1.227e-16

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error   t value Pr(>|t|)
(Intercept) 1.110e-15  6.374e-16 1.742e+00    0.224
V2          1.500e+00  1.164e-16 1.289e+16   <2e-16 ***
V3                 NA         NA        NA       NA

I do not understand what difficulties you are having, because you have
not shown the output that you feel is incorrect. It appears you may need
to further study the meanings of "singular", "eigenvalue", and
"condition number".

--
David Winsemius

David Jones

Nov 7, 2007, 5:05:43 AM
hbe...@gmail.com wrote:
>
>
> I very much appreciate your help! I'm trying to understand the
> essence of the subject deeply: why, in which cases, and how much
> multicollinearity may or may not cause a problem (I think this is more
> important than remembering rules, and some texts give only rules without
> deeper explanation). So I'm trying to gather and unify concepts from:
> http://en.wikipedia.org/wiki/Linear_least_squares
> http://en.wikipedia.org/wiki/Linear_regression
>
>

To "understand deeply" you should also consider the approach via the "generalized inverse" or "pseudoinverse" of a matrix: see
http://en.wikipedia.org/wiki/Generalized_inverse and, in particular, the bit under "applications" on the page
http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse

However, you may do better by finding a textbook at an appropriate level than by relying on Wikipedia, as that material looks as if it is written at a highly technical level. You might try something like "Matrix Algebra from a Statistician's Perspective" by D A Harville (Springer, 1997).
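
For a concrete feel of the pseudoinverse approach, here is a minimal sketch that builds the Moore-Penrose pseudoinverse from svd() in base R and applies it to the X and Y of this thread. The tolerance used to decide which singular values count as zero is an assumption (a relative cutoff of sqrt(.Machine$double.eps)), and the names are illustrative.

```r
X <- cbind(1, 1:4, 2 * (1:4))   # intercept, X1, X2 = 2 * X1
y <- c(3, 6, 9, 12)

s    <- svd(X)
tol  <- sqrt(.Machine$double.eps) * s$d[1]   # assumed cutoff for "zero" singular values
keep <- s$d > tol

# Pseudoinverse: V * diag(1/d) * U', restricted to the nonzero singular values
X_pinv <- s$v[, keep, drop = FALSE] %*%
  (t(s$u[, keep, drop = FALSE]) / s$d[keep])

# Minimum-norm least squares solution
b_min_norm <- drop(X_pinv %*% y)
print(b_min_norm)               # approximately (0, 0.6, 1.2)
```

Among the infinitely many coefficient vectors that fit these data exactly, the pseudoinverse picks the one with the smallest Euclidean length.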

David Jones

Jack Tomsky

Nov 7, 2007, 12:54:57 PM

The least-squares solutions in your example are all of the form, b0=0, b1=c, and b2=(3-c)/2, for any specified c.

One way of getting around the multicollinearity is to reduce the problem to one of estimating two (rather than three) parameters. Restate the model as

Z = b0 + b2*X2,

where Z = Y-c*X1.

Then Zi = (3-c)*i and X2i = 2i.

X'X is a 2 by 2 nonsingular matrix whose first row is (N, Sum(2i)) and whose second row is (Sum(2i), Sum((2i)^2)).

Specifically, the first row is (N, N(N+1)) and the second row is (N(N+1), 2N(N+1)(2N+1)/3).

X'Z is a 2 by 1 vector whose elements are (3-c)N(N+1)/2 and (3-c)N(N+1)(2N+1)/3.

After multiplying (X'X)^(-1) by X'Z, you end up with b0 = 0 and b2 = (3-c)/2.
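
The algebra can be checked numerically with a short base R sketch for N = 4. The choice c = 1 is arbitrary and only for illustration, and the names c_val, XtX and XtZ are not from the post.

```r
N     <- 4                    # number of observations, as in the thread's data
c_val <- 1                    # the arbitrary constant c; any value works
i     <- seq_len(N)

X <- cbind(1, 2 * i)          # columns: intercept and X2 = 2i
Z <- 3 * i - c_val * i        # Z = Y - c * X1 = (3 - c) * i

XtX <- crossprod(X)           # 2 x 2 and nonsingular
XtZ <- crossprod(X, Z)
b   <- drop(solve(XtX, XtZ))  # (b0, b2)

print(XtX)                    # compare with N, N(N+1), 2N(N+1)(2N+1)/3
print(b)                      # approximately 0 and (3 - c) / 2
```

With c = 1 this gives b2 = 1, which corresponds to the solution Y = X1 + X2 expected in the first message.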

Jack

hbe...@gmail.com

Nov 11, 2007, 1:55:16 PM
Thanks for the answer!

I agree with you and with R; in my second message I tried to explain
that R gives me the right results.

My difficulty is understanding how multicollinearity affects the
regression analysis, and how it relates to computational problems
(such as an ill-conditioned X'X matrix) and statistical problems (such
as a "small" change in the predictor data leading to very different
results in the model).
I know the definitions of eigenvalue, singular matrix, and condition
number, but I'm trying to understand their implications for statistics.
Again, thanks for the answer!

hbe...@gmail.com

Nov 11, 2007, 1:57:43 PM
I'll look for "Matrix Algebra from a Statistician's Perspective",
thanks!
As for your Wikipedia links, I'm looking more for data examples
related to the multicollinearity problem than for deeper theory.
Thanks for the answer!

On Nov 7, 07:05, "David Jones" <dajx...@ceh.ac.uk> wrote:
> hbe...@gmail.com wrote:
>
> > I very much appreciate your help! I'm trying to understand the
> > essence of the subject deeply: why, in which cases, and how much
> > multicollinearity may or may not cause a problem (I think this is more
> > important than remembering rules, and some texts give only rules without
> > deeper explanation). So I'm trying to gather and unify concepts from:
> > http://en.wikipedia.org/wiki/Linear_least_squares
> > http://en.wikipedia.org/wiki/Linear_regression
>
> To "understand deeply" you should also consider the approach via the
> "generalized inverse" or "pseudoinverse" of a matrix: see
> http://en.wikipedia.org/wiki/Generalized_inverse and, in particular, the
> bit under "applications" on the page
> http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse

hbe...@gmail.com

Nov 11, 2007, 2:08:10 PM

Thanks, Jack!
I know some ways of getting around the multicollinearity problem
(such as eliminating variables, or computing principal components of
the predictor variables and using the rotated basis). I'm trying to
understand how variations in the data affect the quality of the results.
With (1) orthogonal variables we have no multicollinearity problem,
and with (2) a nearly perfect linear dependence among the predictor
variables (like X4 approx= 2*X1 + 3*X2 - X3) we have a strong
multicollinearity problem, and a small change in the predictor data may
cause big changes in the model (is this the worst problem with
multicollinear predictors?). Between the extreme cases (1) and (2), I'm
trying to visualize in some way how the predictor data affect the model.
Again, thanks for the answer!
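
One way to see the range between the two extremes is a small deterministic sketch in base R for the simplest case of two standardized predictors with correlation r. The grid of r values is an assumption chosen for illustration; the variance inflation factor 1/(1 - r^2) and the condition number of the 2 x 2 correlation matrix are standard quantities.

```r
r <- c(0, 0.5, 0.9, 0.99, 0.999)   # assumed correlations between the two predictors

# Variance inflation factor of either predictor in a two-predictor model
vif <- 1 / (1 - r^2)

# Condition number of the 2 x 2 predictor correlation matrix for each r
cond <- sapply(r, function(rho) {
  R <- matrix(c(1, rho, rho, 1), nrow = 2)
  kappa(R, exact = TRUE)
})

print(data.frame(r = r, vif = vif, condition_number = cond))
```

At r = 0 both measures equal 1 (the orthogonal case), and both grow without bound as r approaches 1 (the exactly collinear case), so the table shows how quickly coefficient variances inflate in between.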

David Winsemius

Nov 11, 2007, 5:17:05 PM

> I agree with you and with R; in my second message I tried to explain
> that R gives me the right results.
>
> My difficulty is understanding how multicollinearity affects the
> regression analysis, and how it relates to computational problems
> (such as an ill-conditioned X'X matrix) and statistical problems (such
> as a "small" change in the predictor data leading to very different
> results in the model).
> I know the definitions of eigenvalue, singular matrix, and condition
> number, but I'm trying to understand their implications for statistics.


Short answer, following Myers, "Classical and Modern Regression with
Applications": the variance of predictions is proportional to
x_i' * inv(X'X) * x_i. Some of the diagonal elements of inv(X'X) will be
large when multicollinearity exists. "Variance inflation factors" (VIFs)
are the diagonal elements of the inverse of the correlation matrix of
the predictors.

Perhaps reading one of these items (found with a search on VIF and
"condition number") will help:
<http://www.nd.edu/~rwilliam/stats2/l11.pdf>
<http://www.masil.org/documents/multicollinearity.pdf>

Adding CRAN to that search strategy to get r-specific hits produced:
<http://www.sci.usq.edu.au/courses/STA3301/StudyBook.pdf>
..see pages 3.26-3.33

And:
<http://www-personal.umich.edu/~jwbowers/CLASSES/PS532f07/HANDOUTS/handout7.pdf>
..see pages 2 and 14 and any pages in between that catch your eye.

The condition number is obtained in R with:
kappa(<matrix or model object>)
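
These diagnostics can be sketched in base R. The simulated predictors and all names below are assumptions for illustration, not data from the thread. The VIFs are computed two ways: as the diagonal of the inverse predictor correlation matrix, and as 1/(1 - R_j^2) from regressing each predictor on the others.

```r
set.seed(7)                               # arbitrary seed for reproducibility
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.05)       # nearly the sum of x1 and x2
P  <- cbind(x1, x2, x3)

# VIFs as the diagonal of the inverse correlation matrix
vif_inv <- diag(solve(cor(P)))

# VIFs from the R-squared of each predictor regressed on the others
vif_r2 <- sapply(seq_len(ncol(P)), function(j) {
  r2 <- summary(lm(P[, j] ~ P[, -j]))$r.squared
  1 / (1 - r2)
})

# Condition number of the design matrix, including the intercept column
cond_X <- kappa(cbind(1, P), exact = TRUE)

print(rbind(vif_inv = vif_inv, vif_r2 = vif_r2))
print(cond_X)
```

The two VIF calculations agree up to rounding, and all three predictors show strong inflation because each is nearly a linear combination of the other two.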

--
David Winsemius

Richard Ulrich

Nov 11, 2007, 10:47:59 PM
On Sun, 11 Nov 2007 11:08:10 -0800, hbe...@gmail.com wrote:

> On Nov 7, 14:54, Jack Tomsky <jtom...@ix.netcom.com> wrote:

[snip, previous]


>
> Thanks, Jack!
> I know some ways of getting around the multicollinearity problem
> (such as eliminating variables, or computing principal components of
> the predictor variables and using the rotated basis). I'm trying to
> understand how variations in the data affect the quality of the results.
> With (1) orthogonal variables we have no multicollinearity problem,

right.

> and with (2) a nearly perfect linear dependence among the predictor
> variables (like X4 approx= 2*X1 + 3*X2 - X3) we have a strong
> multicollinearity problem, and a small change in the predictor data may
> cause big changes in the model (is this the worst problem with
> multicollinear predictors?);

Is change of coefficients seen as a problem by you?

If two models give (almost) exactly the same predictions,
then it is fair, by *most* standards, to say that they are the
same model. Or, "The same model can be described in
more than one way."

The "problem" when two models exist with different coefficients
depends on some other criterion. Does one replicate
or cross-validate better than the other? Either of two solutions
may work as well if the non-independence is mechanical (using B,
C, and B/C). Having suppressor variables that are incidental seems
to be a good clue that one particular equation will not be robust.


"Sense" is another sort of criterion. If you have highly correlated
variables, it can be plain silly to pretend that you have coefficients
that are worth "interpreting" for their nominal values. I like to
combine the variables where it makes sense to combine them,
leaving pretty good independence, so that I *can* make sense
of coefficients. But you cannot start out by "making sense" when
you read a set of correlated partial regression coefficients, unless
you are already aware of which outcomes are effectively the same.
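
The point that one model can be described by different coefficients is easy to see with the exactly collinear data from the start of this thread (X2 = 2*X1, Y = 3*X1). In this base R sketch, coef_for is a hypothetical helper that returns the coefficient vector (b0, b1, b2) = (0, c, (3 - c)/2) for a chosen c; the values of c are arbitrary.

```r
X <- cbind(1, 1:4, 2 * (1:4))   # intercept, X1, X2 = 2 * X1
y <- c(3, 6, 9, 12)

# Hypothetical helper: one least squares solution for each choice of c
coef_for <- function(c_val) c(0, c_val, (3 - c_val) / 2)

c_values <- c(0, 1, 3)          # arbitrary choices of c
coefs <- sapply(c_values, coef_for)
preds <- sapply(c_values, function(c_val) drop(X %*% coef_for(c_val)))

print(coefs)                    # one column of coefficients per value of c
print(preds)                    # every column reproduces y
```

The coefficient columns look very different, yet the prediction columns are identical, so the nominal values of b1 and b2 carry no meaning of their own here.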

> between the extreme cases (1) and (2), I'm trying to
> visualize in some way how the predictor data affect the model.
> Again, thanks for the answer!


--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

hbe...@gmail.com

Nov 15, 2007, 4:23:25 PM
On Nov 11, 7:17 pm, David Winsemius <doe_s...@comcast.n0T> wrote:

> hbe...@gmail.com wrote in news:1194807316.4...@d55g2000hsg.googlegroups.com:
>
> > I agree with you and with R; in my second message I tried to explain
> > that R gives me the right results.
>
> > My difficulty is understanding how multicollinearity affects the
> > regression analysis, and how it relates to computational problems
> > (such as an ill-conditioned X'X matrix) and statistical problems (such
> > as a "small" change in the predictor data leading to very different
> > results in the model).
> > I know the definitions of eigenvalue, singular matrix, and condition
> > number, but I'm trying to understand their implications for statistics.
>
> Short answer, following Myers, "Classical and Modern Regression with
> Applications": the variance of predictions is proportional to
> x_i' * inv(X'X) * x_i. Some of the diagonal elements of inv(X'X) will be
> large when multicollinearity exists. "Variance inflation factors" (VIFs)
> are the diagonal elements of the inverse of the correlation matrix of
> the predictors.
>
> Perhaps reading one of these items (found with a search on VIF and
> "condition number") will help:
> <http://www.nd.edu/~rwilliam/stats2/l11.pdf>
> <http://www.masil.org/documents/multicollinearity.pdf>
>
> Adding CRAN to that search strategy to get r-specific hits produced:
> <http://www.sci.usq.edu.au/courses/STA3301/StudyBook.pdf>
> ..see pages 3.26-3.33
>
> And:
> <http://www-personal.umich.edu/~jwbowers/CLASSES/PS532f07/HANDOUTS/han...>

> ..see pages 2 and 14 and any pages in between that catch your eye.
>
> The condition number is obtained in R with:
> kappa(<matrix or model object>)
>
> --
> David Winsemius

Thanks, David!
I'll read your links; that is what I was looking for.

hbe...@gmail.com

Nov 15, 2007, 4:28:56 PM
On Nov 12, 12:47 am, Richard Ulrich <Rich.Ulr...@comcast.net> wrote:
> On Sun, 11 Nov 2007 11:08:10 -0800, hbe...@gmail.com wrote:
> > On Nov 7, 14:54, Jack Tomsky <jtom...@ix.netcom.com> wrote:
> [snip, previous]
>
> > Thanks, Jack!
> > I know some ways of getting around the multicollinearity problem
> > (such as eliminating variables, or computing principal components of
> > the predictor variables and using the rotated basis). I'm trying to
> > understand how variations in the data affect the quality of the results.
> > With (1) orthogonal variables we have no multicollinearity problem,
>
> right.
>
> > and with (2) a nearly perfect linear dependence among the predictor
> > variables (like X4 approx= 2*X1 + 3*X2 - X3) we have a strong
> > multicollinearity problem, and a small change in the predictor data may
> > cause big changes in the model (is this the worst problem with
> > multicollinear predictors?);
>
> Is change of coefficients seen as a problem by you?

That is a good question!

> If two models give (almost) exactly the same predictions,
> then it is fair, by *most* standards, to say that they are the
> same model. Or, "The same model can be described in
> more than one way."
>
> The "problem" when two models exist with different coefficients
> depends on some other criterion. Does one replicate
> or cross-validate better than the other? Either of two solutions
> may work as well if the non-independence is mechanical (using B,
> C, and B/C). Having suppressor variables that are incidental seems
> to be a good clue that one particular equation will not be robust.
>
> "Sense" is another sort of criterion. If you have highly correlated
> variables, it can be plain silly to pretend that you have coefficients
> that are worth "interpreting" for their nominal values. I like to
> combine the variables where it makes sense to combine them,
> leaving pretty good independence, so that I *can* make sense
> of coefficients. But you cannot start out by "making sense" when
> you read a set of correlated partial regression coefficients, unless
> you are already aware of which outcomes are effectively the same.
>
> > between the extreme cases (1) and (2), I'm trying to
> > visualize in some way how the predictor data affect the model.
> > Again, thanks for the answer!
>
> --
> Rich Ulrich, wpi...@pitt.edu
> http://www.pitt.edu/~wpilib/index.html

I agree; some decisions about removing or keeping variables depend on
sense and on the problem domain.

Thanks, Rich!
