There has been a lot of discussion of multicollinearity recently and
in the past, and I would like to offer MY understanding of the
issue. I don't claim that the following is the ONLY explanation or
that it works on ALL occasions. I might make a few mistakes because
what follows is from memory. Any comments are welcome, but please
don't blame me.
The MLR (multiple linear regression) model is, in matrix form,
Y=XB+E
The least squares estimator of B is
B_hat=inv(X'X)X'Y
Two measures, one from statistics and one from numerical analysis,
are widely used to detect multicollinearity or ill-conditioning: the
variance inflation factor (VIF) and the condition number (the ratio of
the largest eigenvalue of X'X to the smallest). Ideally, both numbers
should be one. Some authors suggest treating a VIF larger than 10 as
a practical warning sign. However, I've encountered several problem
cases even when the VIF was much smaller than 10.
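As a rough illustration of the two diagnostics, here is a minimal Python sketch for the special case of exactly two standardized predictors with correlation r. In that case the correlation matrix is [[1, r], [r, 1]], its eigenvalues are 1 + r and 1 - r, and VIF = 1/(1 - r^2). The function names are made up for this example.

```python
def vif_two_predictors(r):
    """VIF of either predictor when the two predictors have correlation r."""
    return 1.0 / (1.0 - r * r)


def condition_number_two_predictors(r):
    """Largest over smallest eigenvalue of the 2x2 correlation matrix [[1, r], [r, 1]]."""
    return (1.0 + abs(r)) / (1.0 - abs(r))


# Print r, VIF and eigenvalue ratio for a few illustrative correlations.
for r in (0.0, 0.5, 0.9, 0.99):
    print(r, round(vif_two_predictors(r), 2), round(condition_number_two_predictors(r), 2))
```

With r = 0.9 the VIF is only about 5.3, well under 10, while the eigenvalue ratio is already 19, so the two rules of thumb need not agree.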
In a previous posting I suggested that my favorite approach to
handling multicollinearity is to use PCA and PCR, because PCA is the
same as the eigenvalue/eigenvector decomposition (EVD) of the
covariance matrix (X'X, up to scaling, for centered X), and
eigenvalues and eigenvectors are core concepts in many disciplines.
So, what are eigenvalues and eigenvectors?
Clarification of terminology: PCA is mathematically the same as the
SVD of X and the EVD of X'X. PCA is also called the Karhunen-Loeve
transform (KLT) in electrical engineering, and the Hotelling
transform. Principal components are also called modes in modal
analysis (for structural analysis of bridges and buildings).
SVD of X (n x p):
X=USV'
The SVD decomposes the data matrix X into three matrices. The first
matrix (U) shows the similarity of the "rows". The second matrix (S)
shows the "degree" of similarity, expressed by the singular values,
whose number is related to the number of columns (variables). S is
diagonal (and therefore symmetric when it is square, as in the
economy-size SVD). The third matrix (V) shows the similarity of the
"variables". The columns of U are the eigenvectors of XX'. The
columns of V are the eigenvectors of X'X. By the way, rank(X)
=rank(X') =rank(X'X) =rank(XX'), and S'=S because S is symmetric.
EVD is related to SVD since EVD of X'X can be achieved by SVD:
X'X
= (USV')'(USV')
=VS'U'USV'
=VDV'
where U'U = V'V=I (identity matrix) and S'S =D.
The columns of U and V are orthonormal. D is diagonal, and its
diagonal elements are the eigenvalues of X'X. Thus, the squares of
the elements of S are the eigenvalues.
If the columns of X are standardized so that X'X/(n-1) is the
correlation matrix, then the sum of its eigenvalues equals the number
of variables (trace(X'X)=trace(D)). For example, if X has 12
variables then the eigenvalues sum to 12. Thus, if the eigenvalue of
the first PC is, let's say, 6, then 50 % of the total variance is
accounted for by the first PC.
So, what are eigenvalues and eigenvectors?
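PCA is then:
X=USV'+E
=TV'+E
= PC scores*PC loadings + Errors, where T=US.
PCR then uses T (the scores) instead of X. Therefore
B_hat=inverse(X'X)X'Y is replaced by inverse(T'T)T'Y, the coefficients
on the scores; multiplying them by the retained columns of V maps
them back to the original variables.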
I would welcome any comments on the following, which was posted in
the past (especially the part about taking partial derivatives):
### start #####
As an analogy for least squares estimation, let Y=10X1+5X2. The
partial derivative with respect to X1 (dY/dX1) is 10 and with respect
to X2 (dY/dX2) is 5. Thus Y is more sensitive to X1 than to X2 because
these derivatives are the slopes, right?....... Not quite right,
because taking the partial derivative with respect to X1 already
assumes that X2 is held CONSTANT. If X1 and X2 are collinear, there is
no way to hold the other variable constant; thus the least squares
estimator automatically assumes that the input variables are
orthogonal.
Multicollinearity is a ubiquitous problem affecting the computation
of the inverse matrix. Artificial neural networks are not free from
multicollinearity either, since the chain rule of partial derivatives
is used to derive the weights, where each partial derivative assumes
that the other variables are held constant, and with observational
data there is no way to hold the other variables constant.
Therefore any method that uses partial derivatives suffers from
multicollinearity (e.g., the steepest descent method). Thus
sensitivity analysis of dY/dXi is often impossible (especially with
observational data rather than data from designed experiments).
### end ###
Why ?
The OLS (ordinary least squares) estimator is obtained by taking
partial derivatives:
Errors (E) = square(y-y_hat) = square(y-B1X1-B2X2), where y_hat =
B1X1+B2X2.
dE/dB1=??
dE/dB2=??
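For reference, the two derivatives left open above can be written out as follows, assuming E is the sum over observations i of the squared residuals (an assumption about what "square" means here):

```latex
E = \sum_i \left( y_i - B_1 x_{i1} - B_2 x_{i2} \right)^2

\frac{\partial E}{\partial B_1} = -2 \sum_i x_{i1} \left( y_i - B_1 x_{i1} - B_2 x_{i2} \right)

\frac{\partial E}{\partial B_2} = -2 \sum_i x_{i2} \left( y_i - B_1 x_{i1} - B_2 x_{i2} \right)
```

Setting both derivatives to zero gives the normal equations X'XB = X'Y. Note that the derivatives are taken with respect to B1 and B2, with the observed X values treated as fixed numbers.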
The Bs in MLR are called "partial" regression coefficients because,
for example, B1 is the slope with respect to X1 when X2 is held
constant. However, if X1 and X2 are highly correlated, there is no
way of holding X2 constant. Therefore a solution of OLS is possible
from a mathematical point of view; however, if the X variables are
highly correlated, the marginal contribution of X1 or X2 can NOT be
analyzed. This is where the interpretations of B are often not
warranted. (Every day people ask me, "What is the most important
variable for predicting ###, and what is the sensitivity of that
variable?", even though the data are observational....... Long story
short, I gave up..... and use PCR and PLS.)
Hope this helps.
Sangdon Lee.
You have covered much theoretical ground surrounding the B_hat
above relative to multicollinearity -- which is the term used to mean
the ill-conditioning or near-singularity of X'X.
The accommodation of multicollinearity (or the removal of the
ill effects caused by it) is the REMOVAL of one or more redundant
variables in the X matrix that are nearly linearly dependent.
ALL of the problems and ill effects that are attributed to
multicollinearity are self-inflicted, by those who insist on keeping
ALL of the redundant variables. In those cases, even if the numerical
results are correct, the exactly correct estimates of B_hat may be
statistically wrong, as shown by Rubin, Beaton, and Barone in their
1976 JASA paper.
The simple CORRECT approach to multicollinearity problems
encountered in multiple regression is: REMOVE one or more
of the statistically redundant variables. Doing anything else is
fooling oneself into thinking that if one squeezes turnips hard
enough, blood will come out of them.
For the remainder of Sangdon's post, I'll address only those points
that are EITHER mis-directed or incorrect.
> I had suggested my favorites/approaches in handling multicollinearity
> is to use PCA and PCR in previous posting because PCA is the same as
> eigenvalue/vector decomposition (EVD) of the covariance matrix (X'X)
> and eigenvalue and eigen vectors are the core concept in many
> disciplines. So, what is eigenvalue and eigen vector?
Hadi and Ling cautioned why PCR should NOT be used.
http://www.amstat.org/publications/tas/index.cfm?fuseaction=hadi1998
PCA decomposes the X'X matrix into Principal Components (linear
combinations of the X's) that are orthogonal and decreasing in
variance. (The 1st PC has the largest variance; the 2nd PC the second
largest, orthogonal to the 1st; etc.) It does NOT consider the role of
Y in the multiple regression of Y on the X's. There are always the
same number of PCs as there are original X's.
PCR (Principal Components Regression) regresses Y on the PCs.
If Y is regressed on ALL of the PCs, then one gets exactly the same
result back as the original full regression.
So, PCR generally regresses Y on the PCs that have the largest
eigenvalues (the PCs with the largest variances) and drops the PCs
with the smallest eigenvalues.
That's exactly what's WRONG with PCR, because the discarded
PCs may account for most of the fit in Y while the kept PCs
still contain all of the original X's. That was one of the points in
the Hadi-Ling paper.
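A small constructed example may make the point concrete. This is only a sketch with made-up numbers, not data from the Hadi-Ling paper: two centered predictors share a large common component, and y depends only on their small difference. The first PC then carries almost all of the variance in X but none of the fit, while the PC that PCR would discard carries all of it.

```python
import math

s = [-2.0, -1.0, 0.0, 1.0, 2.0]      # shared component (large variance)
d = [0.1, -0.2, 0.2, -0.2, 0.1]      # small difference component, orthogonal to s
x1 = [a + b for a, b in zip(s, d)]
x2 = [a - b for a, b in zip(s, d)]
y = [a - b for a, b in zip(x1, x2)]  # y depends only on x1 - x2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# x1 and x2 have equal variance here, so the two PCs are
# (x1 + x2)/sqrt(2) and (x1 - x2)/sqrt(2).
pc1 = [(a + b) / math.sqrt(2) for a, b in zip(x1, x2)]
pc2 = [(a - b) / math.sqrt(2) for a, b in zip(x1, x2)]


def r_squared(pc, response):
    """R^2 of a no-intercept simple regression (all variables are centered)."""
    slope = dot(pc, response) / dot(pc, pc)
    sse = sum((r - slope * p) ** 2 for r, p in zip(response, pc))
    return 1.0 - sse / dot(response, response)


# Share of the total variance of X carried by the first PC.
share_pc1 = dot(pc1, pc1) / (dot(pc1, pc1) + dot(pc2, pc2))
print(round(share_pc1, 3), round(r_squared(pc1, y), 3), round(r_squared(pc2, y), 3))
```

Whether real data behave this way depends on the problem; the sketch only shows that variance in X and usefulness for predicting Y are separate questions.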
> I would like to welcome any comments on the following that was posted
> in the past (especially taking partial derevatives):
>
> ### start #####
> As an analogy for least square estimation, let's Y=10X1+5X2. The
> partial derevative of X1 (dY/dX1) is 10 and X2 (dY/dX2) is 5.
Why not just call them the coefficient of X1 and X2 respectively?
>Thus X1
> is more sensitive than X2 because the they represent the slope,
> right?....... Not quite right because, taking the partial derivative of
> X1 already assumes that the variable X2 are held CONSTANT. If X1 and
> X2 are collinear, there is no way to hold other variable held constant,
> thus the least square estimator automatically assume that the input
> variables are orthogonal.
The coefficient of Xi is (up to a scale factor) the partial
CORRELATION between Y and Xi, given all the other Xs in the multiple
regression equation. The notion of "holding something constant" in the
interpretation of these partial correlations is fallacious, as pointed
out by Mosteller and Tukey in their book on regression.
Multicollinearity or not, speaking about "holding some variable
constant" is only in the vocabulary of those unfamiliar with the
notion of "partial correlation", in which NOTHING is, or can be, held
constant.
> Multicollinearity is an ubiquitous problem affecting the computation of
> inverse matrix.
True, only to a numerical analyst. The trivial ubiquitous solution
for a statistician is to drop the redundant variables. Even if the
inverse of X'X is computed exactly (as in Longley's data), the
statistical ill-effects do not go away, as shown by Rubin, Beaton,
and Barone.
> Bs in MLR are called "partial" regression coefficients because, for
> example, B1 is the slope of X1 when X2 is held constant.
This is NOT correct, as mentioned above.
The simple regression coefficient of X is r(x,y) * s(y)/s(x),
where r is the simple correlation and s denotes a standard
deviation.
In multiple regression, the regression coefficient of Xi, for any i,
is the PARTIAL CORRELATION between Xi and Y, given all the
other Xj (j not equal to i) in the equation, times the same quotient
of partial standard deviations. Statistical interpretation of any
partial correlation does not depend on the notion of a slope or the
mistaken notion of keeping something constant.
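The relationship just described can be checked numerically. The sketch below uses made-up data and hypothetical helper names: it fits y on x1 and x2, then recovers the coefficient of x1 from the residuals of y and of x1 after each has been regressed on x2, written both as a residual-on-residual slope and as a partial correlation times a ratio of residual standard deviations.

```python
def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residuals(target, predictor):
    """Residuals of a simple regression of target on predictor (both centered)."""
    slope = dot(predictor, target) / dot(predictor, predictor)
    return [t - slope * p for t, p in zip(target, predictor)]


# Made-up data, centered so that no intercept term is needed.
x1 = center([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = center([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = center([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Multiple regression of y on x1 and x2: 2x2 normal equations by Cramer's rule.
s11, s12, s22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
s1y, s2y = dot(x1, y), dot(x2, y)
det = s11 * s22 - s12 * s12
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# Remove x2 from both y and x1, then regress residual on residual.
ry, rx1 = residuals(y, x2), residuals(x1, x2)
b1_partial = dot(rx1, ry) / dot(rx1, rx1)

# The same number as (partial correlation) * (ratio of residual std. deviations);
# the common n-1 divisors cancel, so sums of squares are used directly.
r_partial = dot(rx1, ry) / (dot(rx1, rx1) * dot(ry, ry)) ** 0.5
b1_from_r = r_partial * (dot(ry, ry) / dot(rx1, rx1)) ** 0.5
print(b1, b1_partial, b1_from_r)
```

All three numbers agree, which is the sense in which a multiple regression coefficient is a simple regression coefficient between two sets of residuals.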
In 2005, I made a post on the point that "Simple, Partial, and
Multiple Correlations" are all SIMPLE correlations, if you know
which simple correlation each one corresponds to.
> However, if
> X1 and X2 are highly correlated, there is no way of holding X2
> constant. Therefore solution of OLS is possible from mathematical view
> point, however if X variables are highly correlated, the marginal
> contribution of X1 or X2 can NOT be analyzed.
Get rid of that notion of holding something constant. The entire
paragraph above is both mathematically and statistically wrong.
>This is where the
> interpretations of B are often not warrented.
The interpretation of Bi, the multiple regression coefficient of Xi,
is ALWAYS warranted, and ALWAYS means the same thing: the effect
of Xi on Y, in the presence of all the other Xjs in the equation.
Most users of multiple regression MISINTERPRET a multiple
regression coefficient as if it were a simple regression coefficient.
That's where the "expected sign" fallacy comes from.
>(I have to face everyday
> that what people ask me is "What is the most important variable to
> predict ### and what is the sensitivity of the variable? Even though
> the data are observational data.......
Then you need to learn the meaning of the multiple regression
coefficients FIRST, yourself, and then try to explain to them
that ALL of the variables in a multiple regression (and their
coefficients) are inter-related (in the partial correlation sense),
and one cannot isolate a single variable and say that it is the
"most important", because that is ill-defined in any sense of the term.
> long story short, I gave
> up.....and use PCR and PLS)
Short and decisive verdict: You jumped from the frying pan
into the fire.
-- Bob.
<< The coefficient of Xi, is the partial CORRELATION between Y and Xi,
given all the other Xs in the multiple regression equation. The notion
of "holding something constant" in the interpretation of these partial
correlations is fallacious, as pointed out by Mosteller and Tukey in
their book on regression. Multicollinearity or not, speaking about
"holding some variable constant" is only in the vocabulary of those
unfamiliar with the notion of "partial correlation" that NOTHING is,
or can be held, constant. >>
Reef Fish is likely referring to
Frederick Mosteller and John W. Tukey, "Data Analysis and Regression, a
second course in statistics", Addison-Wesley, 1977.
The book is bright green, as if in contrast to the dull-covered stat
books of that era.
Chapter 13 of this book, "Woes of Regression Coefficients" is
particularly good reading. Section F considers the issue of "Sometimes
x's can be 'Held Constant' "
> http://www.amstat.org/publications/tas/index.cfm?fuseaction=hadi1998
> the Hadi-Ling paper.
I'm certainly not going to comment on all the above!
There's a small and possibly instructive or illustrative data set
that I made use of when I was writing some regression software for a
Commodore PET microcomputer in the 1970's. It is called (by me anyway)
"The Hald data". Four "independent" variables, one dependent, thirteen
data rows. It is, roughly speaking, compositional data, though the
totals are always less than, though close to, 100. After all this
time I can't find the original reference, I'm afraid, though if I
tried /really/ hard I might eventually.
It provides a nice example - at least I thought so at the time - for
regression diagnostics, such as VIFs, the (equivalent) multiple
correlation coefficients, principal components, and eigenvalues. I
needed, and still do, numerical examples to help me get things straight,
and this filled the bill.
I could post the numbers if anyone is interested I suppose, or would
they be copyright :-(( ?
Robin
Since they are used for examples in MATLAB regression documentation
and are available to users with the single command "load hald", I doubt
if they are copyright protected.
Hope this helps.
Greg
> > The simple CORRECT approach to multicollinearity problems
> > Encountered in multiple regression is:REMOVE one or more
> > of the statistically redundunt variables. Doing anything else is
> > fooling oneself into thinking if one squeezes turnips hard enough,
> > blood will come out of them.
< snipped entire long post except the above >
> I'm certainly not going to comment on all the above!
>
> There's small and possibly instructive or illustrative data set that I
> made use of when I was writing some regression software for a Commodore
> PET microcomputer in the 1970's. It is called (by me anyway) "The Hald
> data". Four "independent" variables, one dependent, thirteen data rows.
That's the famous Hald data, in Draper and Smith's book to illustrate
the regression on all subsets (only 15 of them) by showing the details
of all those regressions.
> It provides a nice example
Not so much for multicollinearity (because it wasn't all that severe),
but a nice example to illustrate the "elbow rule" as well as what one
can do to summarize the results of ALL POSSIBLE REGRESSIONS, as well
as using some measures such as Mallows' Cp, Hocking's Jp and other
measures for choosing the "best" subset size and model to use.
Weisberg's book has an example on all subsets with Hald's data, or
some well-known data set comparable to it. I recall that when I was
giving an invited talk at the Applied Stat Department (where Cook and
Weisberg were) of the U of Minn, I used Speakeasy to illustrate the
POWER of that language by duplicating everything in Weisberg's
example, with an all-possible-regressions program (with Weisberg's
output, the Cp and Jp measures, and the plot), all in less than half a
page of code in the basic language of Speakeasy.
That I can easily show in a SHORT post. :-) When I get home, because
I don't have Speakeasy on my travelling PC in Manhattan NY now.
-- Bob.
(huge snip)
> I'm certainly not going to comment on all the above!
What's the point in quoting it all, then? Puzzled.
Andy
--
spargeatbtinternetdotcom
FWIW(NVM),IMO(NVH)...
However, it is doubtful that any regression software actually
calculates inv(X'X) in order to calculate Bhat. In MATLAB,
for example, the four most common ways to estimate Bhat
are
Bhat1 = X\Y % [n p ] = size(X), [n m] = size(Y)
Bhat2 = pinv(X)*Y
and
Bhat3 = regress(y,[ones(n,1) X])
Bhat4 = stepwisefit(X,y)
for m =1, Y =y.
Bhat1 is the solution in the least squares sense. The effective
rank of X, rankX, is determined from the QR decomposition of X
with pivoting. A solution Bhat1 is computed which has at most
rankX nonzero components per column. If rankX < p this will
usually not be the same solution as pinv(X)*y.
Bhat2 is the solution in the least squares sense that minimizes
norm(Bhat2). pinv(X) is the Moore-Penrose pseudoinverse of
X.
Bhat3 is obtained using the QR decomposition of X (X=Q*R),
via
Bhat3 = R\(Q'*y)
When m = 1, Bhat3 should be the same as Bhat1. However,
REGRESS provides summary regression statistics.
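To show what solving through QR looks like without ever forming inv(X'X), here is a minimal Python sketch for a design matrix with just two columns. It uses classical Gram-Schmidt for readability, which is an assumption made for this example; library routines normally use Householder reflections, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def qr_two_columns(c1, c2):
    """Gram-Schmidt QR of the matrix [c1 c2] (columns assumed independent)."""
    r11 = dot(c1, c1) ** 0.5
    q1 = [a / r11 for a in c1]
    r12 = dot(q1, c2)
    w = [b - r12 * a for a, b in zip(q1, c2)]   # part of c2 orthogonal to q1
    r22 = dot(w, w) ** 0.5
    q2 = [a / r22 for a in w]
    return (q1, q2), (r11, r12, r22)


def least_squares_qr(c1, c2, y):
    """Solve R*b = Q'*y by back substitution; X'X is never inverted."""
    (q1, q2), (r11, r12, r22) = qr_two_columns(c1, c2)
    z1, z2 = dot(q1, y), dot(q2, y)
    b2 = z2 / r22
    b1 = (z1 - r12 * b2) / r11
    return b1, b2


ones = [1.0, 1.0, 1.0, 1.0]      # intercept column
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]         # constructed so that y = 1 + 2x exactly
print(least_squares_qr(ones, x, y))
```

If the second column were nearly a multiple of the first, r22 would be close to zero and the division by it would amplify rounding error, which is where ill-conditioning shows up in this formulation.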
Bhat4 is the result of a stepwise search for significant predictors.
The search can begin with any number of predictors and can be
constrained to keep any subset of them. Arbitrary values of
p-to-add and p-to-remove can be specified. Otherwise, default
values of 0.05 and 0.1 are used. The most common use
of STEPWISEFIT is for forward and backward searches with
no subset of predictors required to be used. Typically, different
search parameters yield different predictor selections.
STEPWISEFIT also provides summary regression statistics.
The rank and conditioning of X should be determined before
trying to estimate Bhat. In MATLAB it is as simple as
rankX = rank(X)
condX = cond(X)
MATLAB HELP yields the following insights:
EPS Floating point relative accuracy. EPS returns the distance
from 1.0 to the next largest floating point number
(EPS = 2.2204e-016 on my machine).
SVD Singular value decomposition. [U,S,V] = SVD(X)
produces S, a diagonal matrix of nonnegative singular
values in decreasing order, with the same dimension as X
and unitary matrices U, V so that X = U*S*V'. S = SVD(X)
only returns a vector containing the singular values.
[U,S,V] = SVD(X,0) produces the "economy size"
decomposition. If X is n-by-p with n > p, then only the
first p columns of U are computed and S is p-by-p.
Since X' = V*S'*U', X'*X*V = V*Lp and X*X'*U = U*Ln
where Lp = S'*S and Ln = S*S' are symmetric diagonal
matrices containing the squares of the singular values.
Therefore, V and U are eigenvector matrices of X'*X
and X*X', respectively.
NORM Matrix or vector norm. NORM(X) is the largest singular
value of X, MAX(SVD(X)).
RANK Matrix rank. RANK(X) provides an estimate of the number
of linearly independent rows or columns of the matrix X.
RANK(X,tol) is the number of singular values of X that
are larger than tol. RANK(X) uses the default
tol0 = MAX(SIZE(X)) * NORM(X) * EPS.
COND Condition number with respect to inversion. COND(X)
returns the ratio of the largest singular value of X to
the smallest. Large condition numbers indicate a nearly
singular matrix.
PINV Pseudoinverse. P = PINV(X) produces a matrix P of the
same dimensions as X' so that X*P*X = X, P*X*P = P
and X*P and P*X are Hermitian. The computation is
based on SVD(X) and any singular values less than a
tolerance are treated as zero. The default tolerance is
tol0.
> Two measures are widely used to detect multicollinearity or
> ill-conditioning from statistics and numerical analysis: variation
> inflation factor (VIF) and condition number (the ratio of the largest
> eigenvalue to the smallest eigenvalue). Both numbers should be one.
cond(X'*X) = 1 is highly unusual (e.g., it happens when X'*X is
unitary) and is certainly not necessary for linear independence.
Necessary and sufficient conditions for linear dependence are
rankX < min(n,p) or, equivalently, condX = inf.
A sufficient condition for multicollinearity (near or exact linear
dependence) is
condX > threshold
However, I don't recall the theoretical value for the threshold. It
is probably proportional to an inverse power of eps. I usually
assume that the number of predictors should probably be
reduced when condX > 100. However, in some cases I have
used a threshold of 200 or more.
I usually find it convenient to use standardized variables
[Xn meanX stdX yn meany stdy] = prestd(X,y);
and concentrate on the transformed equation
yn = Xn*Bn.
When p is not large, I use STEPWISEFIT in the forward
and/or backward modes to help select a satisfactory subset
of predictors. When p is large I use a PCA transformation to
reduce the dimensionality of the predictor space before
using STEPWISEFIT.
Both of these techniques are suboptimal. It would probably be
better to use partial-least-squares-regression (PLSR).
In the case of small p I usually examine the all-variable
correlation coefficient matrix
Czz = corrcoef([X y]') = corrcoef([Xn yn]')
to try to understand the relationships among the selected
predictors, the eliminated predictors and the response.
I also use scatter plots to help understand the relationships.
> Some authors suggest VIF larger than 10 for practice. However, I've
> encountered several cases even if VIF is much smaller than 10.
I have not encountered VIF in the statistical summaries of REGRESS
and STEPWISEFIT.
> I had suggested my favorites/approaches in handling multicollinearity
> is to use PCA and PCR in previous posting because PCA is the same as
> eigenvalue/vector decomposition (EVD) of the covariance matrix (X'X)
> and eigenvalue and eigen vectors are the core concept in many
> disciplines. So, what is eigenvalue and eigenvector?
I find it convenient to use standardized predictors Xn, and the
spectral (i.e., eigen) analysis of the sample correlation coefficient
matrix Cxx = Xn'*Xn/(n-1):
Cxx = Czz(1:p,1:p);
if p <= n
[V L] = eig(Cxx); % Cxx*V = V*L;
else
[V L] = eigs(Cxx,n);
end
traceL = trace(L) % traceL = trace(Cxx)
[diagL order] = sort(diag(L),'descend') % eigenvalues (nonincreasing order)
V = V(:,order); % eigenvectors reordered to match diagL
Since sum(diagL) = traceL, I determine Nvar, the maximum number
of principal components to consider by specifying a minimum percent
of traceL, say 99.9%, so that
sum(diagL(1:Nvar-1)) < 0.999*traceL <= sum(diagL(1:Nvar))
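The selection rule in the inequality above can be sketched in a few lines of Python. The function name and the example eigenvalues are hypothetical; the function returns the smallest number of leading eigenvalues whose sum reaches the requested fraction of the trace.

```python
def choose_nvar(eigenvalues, fraction=0.999):
    """Smallest k such that the k largest eigenvalues sum to >= fraction * total."""
    ordered = sorted(eigenvalues, reverse=True)   # nonincreasing order
    target = fraction * sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= target:
            return k
    return len(ordered)


# Hypothetical eigenvalues of a 4-variable correlation matrix (they sum to 4).
print(choose_nvar([2.5, 1.2, 0.299, 0.001]))
```

Sorting inside the function guards against eigenvalue routines that return values in ascending order.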
The resulting matrix of predictors are the PCs
Xnt = V(:,1:Nvar)' * Xn;
Another standardization results in
Bntnhat = Xntn\yn % or one of the other 3 solutions
ynhat = Xntn * Bntnhat;
yhat = meany + ynhat .*stdy % (.* is element-by-element multiplication)
etc.
As Reef Fish has pointed out, the elements of V(:,Nvar+1:p)
contain the coefficients of linear dependencies.
> Clarification on terminologies: PCA is mathematically the same as SVD
> of X and EVD of X'X. PCA is also called KLT transformation
> (Kanuhen-Loeve Transformation) in electrical engineering and
> Hotelling's transformation. Principal components are also called modes
> in modal analysis (for structural analysis of bridges and buildings).
>
> SVD of X (n x p):
>
> X=USV'
>
> What SVD does is that data matrix X is decomposed into three matrices.
> The first matrix (U) shows the similarity of "rows". The second matrix
> (S) shows the "degree" of similarity and and the similarity is
> expressed by singular values which is related to the number of columns
> (variables). S is diagonal and symmetric.
Then this is the MATLAB economy transformation where U only contains
p columns.
> The third matrix (V) shows
> the similarity of "variables".
Similarity is a very poor choice of terminology. It has a precise
meaning w.r.t matrix transformations. It does not seem to apply
here.
Please define what you mean by the term.
> U is the eigen vectors of XX'. V is the
> eigen vectors of X'X. By the way, rank(X) =rank(X') =rank(X'X)
> =rank(XX') and S'=S because S is symetric.
>
> EVD is related to SVD since EVD of X'X can be achieved by SVD:
> X'X
> = (USV')'(USV')
> =VS'U'USV'
> =VDV'
> where U'U = V'V=I (identity matrix) and S'S =D.
>
> The columns in U and V are orthnomal or orthogonal. D is symetric
> and the elements in D are called eigenvalues. Thus, the squares of
> the elements in S are the same as eigenvalues. The sum of eigenvalues
> are the same as the number of columns (thus variables).
>
> If X'X is the same as correlation matrix by standardizing the columns,
Cxx = X'*X/(n-1)
> then the sum of eigenvalues are equal to the number of variables
> (trace(X'X)=trace(D)). For example, if X has 12 variables then the sum
> of eigenvalues are 12. Thus, if the eigenvalues of the first PC is,
> let's say, 6, then 50 % of total variance is accounted by the first PC.
> So, what is eigenvalue and eigen vector?
>
> PCA is then;
> X=USV'+E
> =TV'+E
> = PC scores*PC loadings + Errors, where T=US.
>
> PCR is then use T (scores) instead of X. Therefore,
> B_hat=inverse(X'X)X'Y=inverse(T'T)T'Y
However, X illconditioned ==> T is illconditioned unless you eliminate
the eigenvector columns corresponding to "small" eigenvalues!
> I would like to welcome any comments on the following that was posted
> in the past (especially taking partial derevatives):
>
> ### start #####
> As an analogy for least square estimation, let's Y=10X1+5X2. The
> partial derevative of X1 (dY/dX1) is 10 and X2 (dY/dX2) is 5. Thus X1
> is more sensitive than X2 because the they represent the slope,
> right?....... Not quite right because, taking the partial derivative of
> X1 already assumes that the variable X2 are held CONSTANT. If X1 and
> X2 are collinear, there is no way to hold other variable held constant,
> thus the least square estimator automatically assume that the input
> variables are orthogonal.
No. The least square estimator does not assume that the input
variables are orthogonal. It is difficult to correct your logic because
it makes no sense.
> Multicollinearity is an ubiquitous problem affecting the computation of
> inverse matrix.
Moot point since the inverse should never be computed to estimate
Bhat.
However, agreed that multicollinearity affects the computation of Bhat.
> Artificial neural networks is not free from the
> multicollinearity since the chain rule of partial derivatives is used
> in order to derive the weights, where the partial derevative assumes
> that other variables are hold constant and there is no way from
> observational data that the other variables are hold constant.
Although it is highly recommended that multicollinearity be removed
before training an ANN, it is not necessary. The analogous
Linear Regression scenario is characterized by the "solution"
Bhat = pinv(X)*y
which always exists even when X'X is singular.
I think you can convince yourself by using an iterative approach
to minimizing the objective function (y-Xb)'*(y-Xb) when X is
rank deficient and X'X has no inverse.
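Such an iterative experiment might look like the following Python sketch, with made-up data in which the second column of X duplicates the first, so X'X is singular. Plain gradient descent is used, and the step size and zero starting point are arbitrary choices for this example.

```python
x1 = [1.0, 2.0, 3.0, 4.0]
X = [[a, a] for a in x1]          # second column duplicates the first: rank 1
y = [3.0 * a for a in x1]


def gradient(b):
    """Gradient of (y - X*b)'*(y - X*b) with respect to b."""
    resid = [yi - (row[0] * b[0] + row[1] * b[1]) for row, yi in zip(X, y)]
    return [-2.0 * sum(row[j] * r for row, r in zip(X, resid)) for j in range(2)]


b = [0.0, 0.0]                    # starting point (an arbitrary choice)
step = 0.005                      # small enough for convergence on this data
for _ in range(200):
    g = gradient(b)
    b = [b[j] - step * g[j] for j in range(2)]
print(b)
```

The iteration converges even though X'X has no inverse. Starting from zero, the iterates stay in the row space of X, so the limit is the minimum-norm least squares solution, the same one pinv(X)*y gives; a different starting point would converge to a different solution with the same fitted values.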
> Therefore any methods that utilize partial derevative suffer
> multicollinearity (i.e., steepest descent method). Thus sensitivity
> analysis of dY/dXi is often impossible (especially using observational
> data not from design of experiments).
> ### end ###
I don't follow. If there are linear dependencies they have to be
identified before any input variable sensitivity analysis can make
sense. Take
y =b0 + b1*x1+b2*x2
subject to
0 = c0+c1*x1+c2*x2
Therefore
dx2/dx1 = -c1/c2 and dx1/dx2 = -c2/c1
Consequently
dy/dx1 = b1 + b2*dx2/dx1 = b1 - (c1/c2)*b2
dy/dx2 = b2 + b1*dx1/dx2 = b2 - (c2/c1)*b1
> Why ?
> The OLS (ordinary least square) estomator is (from taking partial
> derevatives) :
>
> Errors (E) =square(y-y_hat)=square(y-B1X1-B2X2) where y_hat =
> B1X1+B2X2.
> dE/dB1=??
> dE/dB2=??
>
> Bs in MLR are called "partial" regression coefficients because, for
> example, B1 is the slope of X1 when X2 is held constant.
Wha? Do you mean it is the slope of y when X2 is held constant?
> However, if
> X1 and X2 are highly correlated, there is no way of holding X2
> constant. Therefore solution of OLS is possible from mathematical view
> point,
This makes no sense. The partial derivatives in the minimization
problem are w.r.t. the Bis, not the Xis.
>however if X variables are highly correlated, the marginal
> contribution of X1 or X2 can NOT be analyzed. This is where the
> interpretations of B are often not warrented. (I have to face everyday
> that what people ask me is "What is the most important variable to
> predict ### and what is the sensitivity of the variable? Even though
> the data are observational data....... long story short, I gave
> up.....and use PCR and PLS)
I find it enlightening to plot bi (standardized variables) and Cyxi vs
i. If the xi are uncorrelated they will be equal. However, that is
seldom the case and the curves can be significantly different,
even if X is not ill-conditioned.
As I showed above, the sensitivity to any variable can be calculated
provided you take into account any linear dependencies. A good
way to take into account linear dependencies is to remove redundant
original variables or transform and remove insignificant principal
components.
For example,
y = b0 + b1*x1 +b2*x2
= b0 + b1*x1+b2*(-c0-c1*x1)/c2
= (b0-c0*b2/c2) + (b1-c1*b2/c2)*x1
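The substitution can be checked numerically. In this Python sketch the coefficient values are hypothetical; it moves x1 along the constraint 0 = c0 + c1*x1 + c2*x2 and compares a finite-difference slope of y with the slope b1 - c1*b2/c2 from the last line above.

```python
b0, b1, b2 = 1.0, 10.0, 5.0       # hypothetical regression coefficients
c0, c1, c2 = 2.0, 3.0, -4.0       # hypothetical constraint 0 = c0 + c1*x1 + c2*x2


def x2_from_x1(x1):
    """Solve the linear constraint for x2."""
    return -(c0 + c1 * x1) / c2


def y_along_constraint(x1):
    """y = b0 + b1*x1 + b2*x2 with x2 forced to satisfy the constraint."""
    return b0 + b1 * x1 + b2 * x2_from_x1(x1)


h = 1e-6
numeric_slope = (y_along_constraint(1.0 + h) - y_along_constraint(1.0 - h)) / (2 * h)
analytic_slope = b1 - c1 * b2 / c2
print(numeric_slope, analytic_slope)
```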
Hope this helps.
Greg
The data can be found in:
http://www.ndsu.nodak.edu/qsar_soc/resource/datasets/hald.htm
> > It provides a nice example
>
> Not so much for multicollinearity (because it wasn't all that severe),
> but a nice example to illustrate the "elbow rule" as well as what
> one can do to summarize the results of ALL POSSIBLE REGRESSIONS,
> Weisberg's book has an example on all subsets with Hald's data, or
> some well-known data set comparable to it. I recall when I was giving
> an invited talk at the Applied Stat Department (where Cook and Weisberg
> were) of the U of Minn, I used Speakeasy to illustrate the POWER of
> that language by duplicating everything in Weisberg's example, with an
> all-possible regresssion program (with Weisberg's output, the Cp, Jp,
> measure, and the plot) all in less than half a page of code of the
> basic language of Speakeasy.
The Example from Weisberg 1980, p. 200 was Moore's (1975) data
from an oxygen uptake experiment, rather than the Hald data.
> That I can easily show in a SHORT post. :-) When I get home,
> because I don't have Speakeasy on my travelling PC in Manhattan NY now.
I am home now. This is the Speakeasy program for ALL SUBSETS,
slightly modified to plot SSE vs p instead of Cp-p vs p as in
Weisberg:
LISTING OF PROGRAM ALLSUBS
PROGRAM
ASK("MATRIX X =","X=");ASK("DEPENDENT VARIABLE = ","Y=")
SWEEP1(X,Y,A);NX=NOELS(X(1,));IND=1;FOR I=2,NX;IND=IND,I,IND;NEXT I
B=A;FOR I=1,NX;SWEEP(B,I,ANS);B=ANS;NEXT I;SSEF=B(NX+1,NX+1);FREE B
N=NOELS(Y);SSTOT=(N-1)*VARIANCE(Y)
ID=-(INTS(NX).EQ.INTS(NX));P=NX+1;CP=P;RSQ=1-SSEF/SSTOT;SSE=SSEF
FOR L=1,NOELS(IND);ID(IND(L))=-ID(IND(L));IN=ID.EQ.1;VAR=LOC(IN*INTS(I))
MODEL=1+NOELS(VAR);SWEEP(A,IND(L),ANS);A=ANS;SSE1=A(NX+1,NX+1)
RSQ1=1-SSE1/SSTOT;RSQ=RSQ,RSQ1;P=P,MODEL;SSE=SSE,SSE1
CP1=SSE1*(N-NX-1)/SSEF+2*MODEL-N;CP=CP,CP1
MODEL=MODEL,CP1,RSQ1,SSE1,VAR;MODEL;NEXT L
TABULATE P,CP,RSQ,SSE;NICEGRAPH(SSE:P)
END
"Nicegraph", like "Tabulate", are basic commands in Speakeasy that
automatically provide the appropriate format for tabulation and
ASCII plot of the arguments.
This is what the Allsubs output for the Hald data looks like:
:_allsubs
EXECUTION STARTED
MATRIX X = hx
DEPENDENT VARIABLE = hy
MODEL (A 5 Component Array)
2 202.55 .53395 1265.7 1
MODEL (A 6 Component Array)
3 2.6782 .97868 57.904 1 2
MODEL (A 5 Component Array)
2 142.49 .66627 906.34 2
MODEL (A 6 Component Array)
3 62.438 .84703 415.44 2 3
MODEL (A 7 Component Array)
4 3.0413 .98228 48.111 1 2 3
Allsubs computes all possible subsets with the least amount of
computation, using a Hamiltonian path. The tabulation above, for
each subset, is: p (number of parameters, including the
constant); Mallows' Cp; R-square; SSE; X's in model.
The order of the subsets is the Hamiltonian path order of
(X1), (X1, X2), (X2), (X2, X3), (X1, X2, X3) (the above), etc.
MODEL (A 6 Component Array)
3 198.09 .54817 1227.1 1 3
MODEL (A 5 Component Array)
2 315.15 .28587 1939.4 3
MODEL (A 6 Component Array)
3 22.373 .93529 175.74 3 4
MODEL (A 7 Component Array)
4 3.4968 .98128 50.836 1 3 4
MODEL (A 8 Component Array)
5 5 .98238 47.864 1 2 3 4
MODEL (A 7 Component Array)
4 7.3375 .97282 73.815 2 3 4
MODEL (A 6 Component Array)
3 138.23 .68006 868.88 2 4
MODEL (A 7 Component Array)
4 3.0182 .98234 47.973 1 2 4
MODEL (A 6 Component Array)
3 5.4959 .97247 74.762 1 4
MODEL (A 5 Component Array)
2 138.73 .67454 883.87 4
P CP RSQ SSE
* ******** ****** ********
5 5 .98238 47.864
2 202.55 .53395 1265.7
3 2.6782 .97868 57.904
2 142.49 .66627 906.34
3 62.438 .84703 415.44
4 3.0413 .98228 48.111
3 198.09 .54817 1227.1
2 315.15 .28587 1939.4
3 22.373 .93529 175.74
4 3.4968 .98128 50.836
5 5 .98238 47.864
4 7.3375 .97282 73.815
3 138.23 .68006 868.88
4 3.0182 .98234 47.973
3 5.4959 .97247 74.762
2 138.73 .67454 883.87
The tabulation above is in the same order as that of the
Hamiltonian Path. In particular, the two best models with
3 parameters are (x1, x2), with SSE = 57.9, and (x1, x4),
with SSE = 74.76.
[ASCII plot from NICEGRAPH: SSE (vertical axis, 0 to 1800) versus P (horizontal axis, 1 to 6)]
MANUAL MODE
When the plot is properly aligned, the vertical column of *'s above
p = 2 shows the SSEs of the 4 simple regressions of Y on a single X.
Over p = 3 are the models with combinations of two X's. The
*'s are the SSEs corresponding to:
p Cp R-Square SSE
3 2.6782 .97868 57.904
3 62.438 .84703 415.44
3 198.09 .54817 1227.1
3 22.373 .93529 175.74
3 138.23 .68006 868.88
3 5.4959 .97247 74.762
but because of the plot scale, the 57.9 and 74.8 appear as the same
"*".
One can easily see that the "Elbow Rule" calls for the use of
either the (X1, X2) or the (X1, X4) model, over all possible subsets.
There is clear multicollinearity among the X's, enough to make the full
model SSE
5 5 .98238 47.864 1 2 3 4
virtually indistinguishable from that of the models with 3 X's:
4 3.0413 .98228 48.111 1 2 3
4 3.4968 .98128 50.836 1 3 4
4 7.3375 .97282 73.815 2 3 4
4 3.0182 .98234 47.973 1 2 4
with the MSE of the full model, 47.864/(13-5) = 5.983,
greater than the MSE of the best of the models with 3 X's,
47.973/(13-4) = 5.3303,
while the multicollinearity among the X's is not severe enough
to have caused any computational problem in any of the models!
But the all-possible-subsets considerations strongly suggest that
any model with more than TWO of the independent variables
is statistically highly redundant.
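The Cp and MSE figures quoted above can be rechecked with a few lines of Python. This is only a sketch: it reuses the Cp expression from the program listing and the SSE values from the tabulation, and the function names are made up for illustration.

```python
N = 13             # observations in the Hald data
NX = 4             # number of candidate X's
SSE_FULL = 47.864  # SSE of the full model (all four X's), from the tabulation


def mallows_cp(sse, p):
    """Cp as in the listing: SSE*(N-NX-1)/SSEF + 2*p - N, with p counting the constant."""
    return sse * (N - NX - 1) / SSE_FULL + 2 * p - N


def mse(sse, p):
    """Residual mean square: SSE divided by its degrees of freedom."""
    return sse / (N - p)


if __name__ == "__main__":
    # (X's in model, p, SSE) copied from the tabulation
    models = [((1, 2), 3, 57.904), ((1, 4), 3, 74.762),
              ((1, 2, 4), 4, 47.973), ((1, 2, 3, 4), 5, 47.864)]
    for xs, p, sse in models:
        print(xs, "Cp = %.4f" % mallows_cp(sse, p), "MSE = %.4f" % mse(sse, p))
```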
-- Bob.
Bob wrote:
>>ALL of the problems and ill-effects that are attributed to
>>multicollinearity
>>are self-inflicted, by those who insist on keeping ALL of the redundant
>>variables. In those cases, even if the numerical results are correct,
>>the exactly correct estimates of B_hat may be statistically wrong, as
>>shown by Rubin, Beaton, and Barone in their 1976 JASA paper.
I read the Beaton, Rubin and Barone (1976) paper. They showed that
even computers using "....40 decimal digits of accuracy produce a very
poor estimate of regression coefficients due to multicollinearity".
Their conclusions are questioned by Dent and Cavander. Dent and
Cavander concluded that "It is suggested that this paper is
noninformative, misleading, and possibly irrelevant". Dent and
Cavander suggested and used SVD for multicollinearity, which is another
name for PCA and thus PCR. IMO, the Beaton paper is not a good
reference to support your statement.
Warren T. Dent and David C. Cavander, "More on computational accuracy
in regression", J. of the American Statistical Association, Sep. 1977,
v.72, no. 359, pp. 598-602.
>>The simple CORRECT approach to multicollinearity problems
>>encountered in multiple regression is: REMOVE one or more
>>of the statistically redundant variables. Doing anything else is
>>fooling oneself into thinking if one squeezes turnips hard enough,
>>blood will come out of them.
>>Hadi and Ling cautioned why PCR should NOT be used.
>>http://www.amstat.org/publications/tas/index.cfm?fuseaction=hadi1998
>>PCR (Principal Components Regression) regresses Y on the PCs.
>>If Y is regressed on ALL of the PCs, then it gets exactly the same
>>result back as the original full regression.
>>So, PCR generally regresses Y on the PCs that have the largest
>>eigenvalues (the PCs with the largest variances) and drops the PCs
>>with the smallest eigenvalues.
>>That's exactly what's WRONG with PCR, because the discarded
>>PCs may account for most of the fit in Y while the kept PCs
>>still contain all of the original X's. That was one of the points in
>>the Hadi-Ling paper.
I read your paper (Ali S. Hadi and Robert F. Ling) also. The following
is the abstract from your paper.
" Many textbooks on regression analysis include the methodology of
principal components regression (PCR) as a way of treating
multicollinearity problems. Although we have not encountered any strong
justification of the methodology, we have encountered, through carrying
out the methodology in well-known data sets with severe
multicollinearity, serious actual and potential pitfalls in the
methodology. ....We also illustrate .....it is possible for the PCR to
fail miserably in the sense that when the response variable is
regressed on all of the p principal components (PCs), the first (p-1)
PCs contribute nothing toward the reduction of the residual sum of
squares, yet the last PC alone (the one that is always discarded
according to PCR methodology) contributes everything. ...."
I believe this statement goes beyond the reasonable and logical
conclusions from the examples you used in the paper. It is well known
that PCR fails if the last PCs (which are usually discarded) contribute
to explaining Y. However, it is also well known that PCR is very useful
if the first PCs contribute to explaining Y. Actually, Jackson's book
(you used Jackson's book as a reference to support your statement)
devoted a chapter to explaining the benefits of PCR, whereas only two
paragraphs were used to explain the shortcomings of the PCR approach.
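Both situations are easy to reproduce on toy data. The following Python sketch shows the unfavourable one, using made-up numbers and a hypothetical response built from the difference of two highly correlated predictors. For two standardized, positively correlated variables the PCs are the normalized sum (large variance) and difference (small variance), so here the discarded PC carries all of the fit.

```python
import math


def mean(v):
    return sum(v) / len(v)


def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)


def standardize(v):
    """Center and scale to unit sample variance."""
    m, s = mean(v), math.sqrt(variance(v))
    return [(a - m) / s for a in v]


def r_squared(y, z):
    """R-square of the simple regression of y on one regressor z."""
    my, mz = mean(y), mean(z)
    syz = sum((a - my) * (b - mz) for a, b in zip(y, z))
    syy = sum((a - my) ** 2 for a in y)
    szz = sum((b - mz) ** 2 for b in z)
    return syz ** 2 / (syy * szz)


# Made-up, highly correlated predictors.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0, -1.5, 1.5, 0.5]
x2 = [-1.9, -1.1, 0.1, 0.9, 2.1, -1.4, 1.4, 0.4]
z1, z2 = standardize(x1), standardize(x2)

# PCs of the 2x2 correlation matrix: normalized sum and difference.
pc1 = [(a + b) / math.sqrt(2) for a, b in zip(z1, z2)]
pc2 = [(a - b) / math.sqrt(2) for a, b in zip(z1, z2)]

# Hypothetical response driven entirely by the difference of the predictors.
y = [a - b for a, b in zip(z1, z2)]

if __name__ == "__main__":
    print("var(PC1) = %.4f, var(PC2) = %.4f" % (variance(pc1), variance(pc2)))
    print("R^2 of y on PC1 = %.6f" % r_squared(y, pc1))
    print("R^2 of y on PC2 = %.6f" % r_squared(y, pc2))
```

Replacing y with the sum of z1 and z2 gives the favourable case, in which the first PC alone explains Y.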
In some other threads, you stated that PCA, PCR, PLS (partial least
squares), and ICA (independent component analysis) are useless. Just check
the "Chemometrics" and "signal processing" disciplines to see tons of
papers that use these methods to handle multicollinearity and for other
purposes.
Hope this helps.
Sangdon Lee
Dear Greg
The above is a nice summary of the regression functions in Matlab, and
you provided very good practical approaches in your response, which I
agree with and also apply myself. I do not have much time to
respond to your response in detail, but I want to respond to some of your
comments.
Even though the backslash (\) and 'pinv' use QR decompositions, they
are very sensitive to small changes in input data, however, SVD is not
(thus PCA and PCR). I have not used the 'regress' command because I don't
have the licence for that tool, but I believe 'regress' is also
sensitive to small changes.
I provide three examples below to show the unstable beta coefficients
from backslash and pinv. I used data from the book by Neter, Kutner,
Nachtsheim, and Wasserman ("Applied Linear Statistical Models"). The data
are standardized, and I used backslash and pinv for each case.
Case A: Data are standardized and their values have FOUR decimal
places.
Case B: Only the first sample is rounded to THREE decimal places. The
remaining samples are not changed at all.
Case C: The first sample is deleted.
In summary, the beta coefficients differ at the second decimal
place.
% case A: base case
% Bhat1a =
% 0.0000
% 4.2635
% -2.9285
% -1.5613
%
% Case B: the first sample is rounded to three decimal places.
% Bhat1b =
% 0.0000
% 4.2695
% -2.9340
% -1.5636
% Case C: the first sample is deleted
% Bhat1c =
% 0.0439
% 6.3492
% -4.8613
% -2.3337
Hope this helps.
Sangdon Lee
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% data from Neter's book to demonstrate multicollinearity
%%%% Case A %%%
% Standardized X with four decimal places
Xsa=[1.0000 -1.1556 -1.5417 0.4058
1.0000 -0.1204 -0.2617 0.1590
1.0000 1.0740 0.1395 2.5719
1.0000 0.8948 0.5979 0.9542
1.0000 -1.2353 -1.7136 0.8993
1.0000 0.0587 0.5215 -1.0748
1.0000 1.2134 1.4003 -0.0055
1.0000 0.5166 0.1777 0.8171
1.0000 -0.6380 -0.2426 -1.2119
1.0000 0.0388 0.4451 -0.7732
1.0000 1.1536 1.0373 0.6526
1.0000 1.0143 1.0564 0.1865
1.0000 -1.3149 -0.8921 -1.2667
1.0000 -1.1158 -1.3315 0.2687
1.0000 -2.1311 -1.6181 -1.7329
1.0000 0.8351 0.6170 0.6800
1.0000 0.4768 0.7890 -0.5264
1.0000 0.9745 1.4194 -0.8280
1.0000 -0.5186 -0.5674 -0.1426
1.0000 -0.0209 -0.0325 -0.0329];
Ysa=[-1.62450 0.51017 -0.29278 -0.01860 -1.42866 0.29474 1.35228 ...
1.01935 0.21640 -0.17528 1.01935 1.37187 -1.66367 -0.46904 -1.44824 ...
0.72559 0.47100 1.01935 -1.05656 0.17724]';
[n p]=size(Xsa);
% back slash command
Bhat1a=Xsa\Ysa
% pinv command
Bhat2a=pinv(Xsa)*Ysa
%%%% Case B %%%%%
% Only the first sample is rounded to 3 decimal places.
Xsb=[1.0000 -1.156 -1.542 0.406
1.0000 -0.1204 -0.2617 0.1590
1.0000 1.0740 0.1395 2.5719
1.0000 0.8948 0.5979 0.9542
1.0000 -1.2353 -1.7136 0.8993
1.0000 0.0587 0.5215 -1.0748
1.0000 1.2134 1.4003 -0.0055
1.0000 0.5166 0.1777 0.8171
1.0000 -0.6380 -0.2426 -1.2119
1.0000 0.0388 0.4451 -0.7732
1.0000 1.1536 1.0373 0.6526
1.0000 1.0143 1.0564 0.1865
1.0000 -1.3149 -0.8921 -1.2667
1.0000 -1.1158 -1.3315 0.2687
1.0000 -2.1311 -1.6181 -1.7329
1.0000 0.8351 0.6170 0.6800
1.0000 0.4768 0.7890 -0.5264
1.0000 0.9745 1.4194 -0.8280
1.0000 -0.5186 -0.5674 -0.1426
1.0000 -0.0209 -0.0325 -0.0329];
Ysb=[-1.625 0.51017 -0.29278 -0.01860 -1.42866 0.29474 1.35228 1.01935 ...
0.21640 -0.17528 1.01935 1.37187 -1.66367 -0.46904 -1.44824 0.72559 ...
0.47100 1.01935 -1.05656 0.17724]';
% back slash command
Bhat1b=Xsb\Ysb
% pinv command
Bhat2b=pinv(Xsb)*Ysb
%%%% Case C %%%%%
% Only the first sample is deleted
Xsc=[1.0000 -0.1204 -0.2617 0.1590
1.0000 1.0740 0.1395 2.5719
1.0000 0.8948 0.5979 0.9542
1.0000 -1.2353 -1.7136 0.8993
1.0000 0.0587 0.5215 -1.0748
1.0000 1.2134 1.4003 -0.0055
1.0000 0.5166 0.1777 0.8171
1.0000 -0.6380 -0.2426 -1.2119
1.0000 0.0388 0.4451 -0.7732
1.0000 1.1536 1.0373 0.6526
1.0000 1.0143 1.0564 0.1865
1.0000 -1.3149 -0.8921 -1.2667
1.0000 -1.1158 -1.3315 0.2687
1.0000 -2.1311 -1.6181 -1.7329
1.0000 0.8351 0.6170 0.6800
1.0000 0.4768 0.7890 -0.5264
1.0000 0.9745 1.4194 -0.8280
1.0000 -0.5186 -0.5674 -0.1426
1.0000 -0.0209 -0.0325 -0.0329];
Ysc=[0.51017 -0.29278 -0.01860 -1.42866 0.29474 1.35228 1.01935 0.21640 ...
-0.17528 1.01935 1.37187 -1.66367 -0.46904 -1.44824 0.72559 0.47100 ...
1.01935 -1.05656 0.17724]';
% back slash command
Bhat1c=Xsc\Ysc
% pinv command
Bhat2c=pinv(Xsc)*Ysc
% Results
% Even though only the first sample is slightly changed
% (rounded to 3 decimal places),
% the Bhats are different (from the 2nd digit)
% case A: base case
% Bhat1a =
% 0.0000
% 4.2635
% -2.9285
% -1.5613
%
% Case B: the first sample is rounded to three decimal places.
% Bhat1b =
% 0.0000
% 4.2695
% -2.9340
% -1.5636
% Case C: the first sample is deleted
% Bhat1c =
% 0.0439
% 6.3492
% -4.8613
% -2.3337
Nothing I've read or seen since 1976 has changed my understanding and
succinct summary above. Every counter-argument I have seen only
serves to bolster the truth of the statement above.
With that preface, I'll address only Sangdon's faux pas in HIS
understanding of both the Rubin, Beaton, and Barone paper of 1976
and my paragraph above about it.
> I read the Beaton, Rubin and Barone (1976) paper. They showed that
> even computers using "....40 decimal digits of accuracy produce a very
> poor estimate of regression coefficients due to multicollinearity".
But that is NOT the important conclusion of that paper. The EXACT
numerical solution for beta-hat is known, by calculator (given in 1967
by Longley), and it doesn't require 40 decimal digits.
RBB's paper was about the exact numerical values being "statistically
incorrect". I don't believe you actually read the paper -- or if you
did, understood what it was saying! They perturbed the data BEYOND the
digit reported, at random, 1000 times and computed 1000 sets of
numerically correct solutions (hence REMOVED ALL multicollinearity
effect of the NUMERICAL ACCURACY type). The distribution of the 1000
solutions suggested that the "exact solution" is statistically highly
improbable to be "correct".
The ill effect of multicollinearity in that problem was the sensitivity
of the NUMERICAL solution to small perturbation of the data.
The KEY element of the RBB paper was that they perturbed EACH
data value that is KNOWN not to be exact, such as the sizes of
populations given to the nearest 1000. But they perturbed each
such number BEYOND the visible digit, to be consistent with the
reported figure, and still found the "exact solution" statistically
wrong.
That is, instead of using 1000, they would randomly generate
numbers of the form 1000 +- u, with u a uniform random number between
0 and .5, so that the resulting number is always rounded or
truncated to 1000.00 exactly.
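A rough Python sketch of this kind of perturbation study follows. It is not the RBB procedure itself: the data are made up, only the two predictors are perturbed (an assumption of this sketch), and the regression is a two-predictor OLS fit solved from the centered normal equations.

```python
import random


def fit_two_predictors(x1, x2, y):
    """OLS slopes (intercept included) via the centered 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Made-up data, reported as whole numbers; X2 is close to 2*X1.
X1 = [10, 11, 12, 13, 14, 15, 16, 17]
X2 = [20, 22, 25, 26, 28, 30, 31, 34]
Y = [31, 33, 38, 39, 43, 45, 48, 51]


def perturb(values, rng):
    """Add noise beyond the reported digit; each value still rounds to the reported figure."""
    return [v + rng.uniform(-0.499, 0.499) for v in values]


def perturbation_study(n_rep=1000, seed=0):
    """Refit the model on n_rep perturbed copies of the predictors."""
    rng = random.Random(seed)
    return [fit_two_predictors(perturb(X1, rng), perturb(X2, rng), Y)
            for _ in range(n_rep)]


if __name__ == "__main__":
    b1, b2 = fit_two_predictors(X1, X2, Y)
    b1s = sorted(f[0] for f in perturbation_study())
    print("fit to reported data: b1 = %.3f, b2 = %.3f" % (b1, b2))
    print("b1 over perturbed fits: min %.3f, median %.3f, max %.3f"
          % (b1s[0], b1s[len(b1s) // 2], b1s[-1]))
```

Every perturbed data set is consistent with the reported figures, so the spread of the refitted coefficients shows how far the data as reported pin down the betas.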
The NUMERICAL ACCURACY problem is ... T R I V I A L !!!
It's the numerically EXACT betas that are statistically wrong that
is the lesson and result of that paper!
> Their conclusions are questioned by Dent and Cavander. Dent and
> Cavander concluded that "It is suggested that this paper is
> noninformative, misleading, and possibly irrelevant". Dent and
> Cavander suggested and used SVD for multicollinearity,
I have not read the Dent and Cavander paper. But I don't need to,
given your 4 lines above. The use of SVD is a computational matter;
it is well known to give better NUMERICAL ACCURACY than many
other matrix inversion methods.
But the NUMERICAL ACCURACY of the Longley solution is a NON-problem.
SVD can't do any better than the desk-calculator exact
solution or any digital computer solution with ZERO error in the
estimated betas!
They apparently were discussing the NUMERICAL accuracy of the
Longley solution while MISSING the entire point of the RBB paper:
that numerically correct solutions (in the presence of a high
degree of multicollinearity) may be the WRONG statistical solution.
The preceding paragraph IS the lesson from the RBB paper of 1976.
Nothing you've said in your post lessens the truth of that finding
in the slightest way.
> which is another name for PCA and thus PCR.
This is a Sangdon Lee misunderstanding of PCR at best, and a
conceptual ERROR that PCA and PCR are synonymous.
PCA is Principal Components Analysis, the orthogonal decomposition
of the X'X matrix into eigenvalues and eigenvectors.
As I had mentioned, it always yields the SAME number of PCs as the
original number of independent variables. Thus, if Y is regressed on
ALL of the PCs (no one on earth ever does) in PCR, then one gets
the identical solutions, and if they are exact to begin with, as in the
Longley case, absolutely nothing is gained by doing the computation
any other way. Exact is exact!
So, there is absolutely NO advantage in using SVD if the fitted beta
can be computed EXACTLY by any other means.
In practice, the MISGUIDED who use PCR in the multicollinearity
situation ALWAYS keep some of the Principal Components (each
contains the same number of original X's), while DISCARDING
one or many of the other principal components.
Since the PCs are by design/definition orthogonal, the
multicollinearity indeed disappears completely because the PCs
will have mutually zero correlations among them.
But in so doing, the PCR users are only fooling THEMSELVES.
They insisted on keeping ALL of the original (hence redundant)
independent variables, while not having any justification for
discarding the other PCs.
That was exactly the issue addressed by the Hadi and Ling paper
of 1998. PCR has everything to lose, and nothing to gain.
> IMO, the Beaton's paper is not a good
> reference to support your statement .
The RBB paper is the ONLY paper in the literature that articulates,
and articulates WELL, the difference between NUMERICAL accuracy
and STATISTICAL accuracy (not in terms of the usual standard err.
but in terms of severe BIAS to the extent of being statistically
impossible for the numerically correct solution in the Longley case).
>
> Warren T. Dent and David C. Cavander, "More on computational accuracy
> in regression" J. of the American Statistical Association, Sep. 1977,
> v.72, no. 359, pp. 598~602.
Note the keyword there, "computational accuracy".
Read it CAREFULLY, in the light of what I said above, and had said
previously, that the NUMERICALLY correct solution may be the
wrong one, for the reasons RBB gave.
> >>The simple CORRECT approach to multicollinearity problems
> >>encountered in multiple regression is: REMOVE one or more
> >>of the statistically redundant variables. Doing anything else is
> >>fooling oneself into thinking if one squeezes turnips hard enough,
> >>blood will come out of them.
I stand by THAT statement, which is a mere re-statement of the first
four lines Sangdon quoted at the beginning.
> In some other threads, you stated that PCA, PCR, PLS (partial least
> square), ICA (independent component analysis) are useless.
I stand by THAT statement also, on the condition that a researcher
is attempting to do a Multiple Regression Problem -- which is self-
contained, with its own remedies for the ills of multicollinearity --
which is one of the EASIEST and the most COMMON SENSE
accommodation -- namely, discarding/removing something that is
known to be statistically REDUNDANT!
> Just check
> "Chemometrics" and "signal processing" disciplines to see tons of
> papers use these methods to handle multicollinearity and other
> purposes.
That is Argumentum ad Ignorantiam, Argumentum ad Populum, and
Argumentum ad Verecundiam, all wrapped into one Great FALLACY!
That merely confirms the extent of statistical abuse by people in
the social and economic sciences, and in other NON-statistical
areas in which Statistics is widely abused!
> Hope this helps.
>
> Sangdon Lee
Yes, it helped to EXPOSE the folly of those who think there is
something useful or magic about the use of PCR, which is one of the
WORST abuses of statistics outside of the proper area of statistics.
Now that SVD is agreed by all to be numerically more stable than many
other matrix inversion and decomposition methods (I knew it 40 years
ago), here's something you can do: a little
literature search to find ANY paper that uses PCR (in statistics
or other journals) which actually shows the COVARIANCE
or CORRELATION matrix of the data in the published paper.
(Most papers won't show it).
Then use your SVD to find the eigenvalues and eigenvectors
of the matrix. On more than one occasion, when the matrix was
actually shown in the published paper, I had found the matrix
to contain NEGATIVE eigenvalues!
<To the average reader: A sample covariance matrix ALWAYS
has non-negative eigenvalues.>
That means these PCR authors were completely oblivious to
their "garbage" (not having enough significant figures in the
matrix or data), used the nonsensical data in their PCR
analysis, and did not realize that it was just another case of
"Garbage IN, Garbage OUT".
Others are doing as badly -- but are simply oblivious to the quackery
in the methods of PCR and related methods.
Hope this helps in your bloody-turnip hunting. :-)
-- Bob.
1. QR is stable if the condition number is sufficiently small.
Its stability is much better than that of LU (e.g., Gaussian Elimination).
Given standardized variables, one way to reduce the condition
number is to reduce the number of multicollinear variables via
pruning or merging. Another way is to reduce the effect of
predictor correlations by using a regularized form Xa = Q*(R+a*I).
2. The stability of pinv(X,c)*y = pinv(Xc,0)*y (i.e., truncated
pseudoinversion) depends on the specified tolerance, c. pinv is
defined via SVD; it is one of the most stable solutions and is
essentially PCR. However, a more stable solution is obtained
via regularized pseudoinversion Xd = U*(S+d*I)*V'.
The success of these modified solutions depends on
1. Detecting linear dependence and multicollinearity using rankX and
condX.
2. Obtaining good values for a, c, or d. It is obvious that they
should depend on the machine epsilon and the maximum and minimum
singular values of X.
A search of numerical analysis texts should yield details.
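For the detection step, a minimal Python sketch for the simplest case of two standardized predictors (no constant column) is given below. It relies on the closed-form eigenvalues 1+|r| and 1-|r| of the 2x2 correlation matrix; the function names and data are made up for illustration, and this is not a substitute for rank and cond on a general X.

```python
import math


def correlation(a, b):
    """Simple (Pearson) correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def collinearity_diagnostics(x1, x2):
    """Condition numbers and VIF for two standardized predictors."""
    r = correlation(x1, x2)
    lam_max, lam_min = 1 + abs(r), 1 - abs(r)  # eigenvalues of [[1, r], [r, 1]]
    if lam_min == 0:
        raise ValueError("exact linear dependence: correlation matrix is singular")
    return {
        "r": r,
        "cond_corr": lam_max / lam_min,          # eigenvalue ratio of the correlation matrix
        "cond_X": math.sqrt(lam_max / lam_min),  # singular-value ratio of standardized X
        "vif": 1.0 / (1.0 - r * r),              # variance inflation factor
    }


if __name__ == "__main__":
    print(collinearity_diagnostics([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

The eigenvalue ratio is the square of the singular-value ratio, which is why working with X'X is more delicate than working with X.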
> I provide three examples below to show the unstable beta coefficients
> from backslash and pinv.
That is precisely why you must calculate rankX and condX first. If
they indicate linear dependence or multicollinearity, you have to
modify your approach.
Not enough. What are rankX, condX, se, RMSE and R^2?
Hope this helps.
Greg
-----SNIPPED EXAMPLE.
Greg
The above statement ("holding something constant.... is fallacious") is
quite perplexing to me. I haven't read the book yet (Frederick
Mosteller and John W. Tukey, "Data Analysis and Regression: A Second
Course in Statistics", Addison-Wesley, 1977). I quote several
paragraphs from Neter et al. (Neter, Kutner, Nachtsheim, Wasserman,
Applied Linear Statistical Models, 4th edition, The McGraw-Hill
Companies, Inc., 1996). Similar paragraphs can be found in various
books.
"....The parameter B1 and B2 are called partial regression coefficients
because they reflect the partial effect of one predictor variable when
the other predictor variable is included in the model and is held
constant. ...." (Neter et al., p. 219)
".... The parameter Bk indicates the change in the mean response E{Y}
with a unit increase in the predictor variable Xk, when all other
predictor variables in the regression model are held constant ...."
(Neter et al., p. 220)
"The common interpretation of a regression coefficient as measuring
the change in the expected value of the response variable when the
given predictor variable is increased by one unit while all other
variables are held constant ...." (Neter et al., p. 290)
What did I miss? If someone could elaborate on the above statement, I
would really appreciate it.
Thanks.
Sangdon Lee
When I mentioned the following, I made a mistake.
>>Even though the backslash (\) and 'pinv' use QR decompositions, they
>>are very sensitive to small changes in input data, however, SVD is not
>>(thus PCA and PCR).
All the decomposition methods (including SVD, EVD) are sensitive.
However, SVD is much more stable than other decomposition methods. I
quote a paragraph from Moler et. al.
".... Trying to find the inverse of a nearly singular matrix is an
inherently sensitive problem. ..... No algorithm working with finite
precision arithmetic can be expected to obtain a computed inverse that
is not contaminated by large errors. (pp. 4)."
Cleve Moler and Charles Van Loan, "Nineteen dubious ways to compute the
exponential of a matrix, twenty-five years later", SIAM Review, 2003,
45(1), pp. 3-49.
The reason that I prefer PCA and PCR is that when a small change (e.g.,
deletion of a sample) occurs in X, the beta coefficients from PCR
don't change much, whereas those from MLR using backslash or pinv change
a lot (assuming that the first few PCs are associated with Y; compare
case A versus C in my previous posting). Also, by plotting the first few
principal component scores, the correlation structure among variables and
samples can be explored.
The following is where I need clarification. When I learned
derivatives and integration as an undergraduate, taking the partial
derivative with respect to x meant treating the terms without x as
"constant".
For example, if F(x,y)=2x^2+4xy+y^2, then
dF/dx=4x+4y.
dF/dy=4x+2y.
By the same analogy, the OLS estimators are found by taking the partial
derivatives of the squared residual (E) with respect to B1 and B2.
E =(y-y_hat)^2=(y-B1*X1-B2*X2)^2, where y_hat = B1*X1+B2*X2.
Taking dE/dB1 treats the terms without B1 as
constant. The same goes for dE/dB2. Right? Taking partial
derivatives is mathematically feasible to derive the beta coefficients.
However, if X1 and X2 are highly collinear (especially in
observational data), it is impossible to hold the other variables
constant. Did I miss something here?
I would appreciate any comments.
Thanks.
Sangdon Lee
I see why you and others are confused about this matter of holding
other variables constant in the Neter et al context, which is
ENTIRELY different from the interpretation of the PARTIAL
correlation coefficient and its dependence on the COVARIATION
between Y and each X, in the presence of the COVARIATION
between that Y and X and all the other X's.
I've used several different editions of the Neter et al book, since
its first edition. The only thing I found objectionable in their
book is making it a separate chapter and calling it "Polynomial
Regression", when that is just like any other multiple regression
in a Linear Model. Calling it Polynomial regression promotes
misuse and misunderstanding of the meaning of Linear Models!
I've known John Neter and Mike Kutner well enough to know that they
are above the usual misinterpretation of the regression
coefficient by others who thought the idea of a PARTIAL
correlation is holding something constant! I think THAT's the
crux of the present issue -- in the cited passage, they are
talking about something entirely different from the partial
correlation idea, as I'll explain.
The passages of Neter et al cited were discussing the
MEAN change, or the change in E(Y), per unit change in
the i-th predictor variable. As such, they were not
talking about the difference in STATISTICAL interpretation
between a SIMPLE and a PARTIAL correlation at all, nor did
they have anything to do with OLS or the method of estimation
of the coefficients!
Bk would have the SAME interpretation as the MEAN change
in Y if it consisted of SIMPLE correlation info, as in the
orthogonal X case (in which the "expected sign" notion
would be valid), or of PARTIAL correlation info, as in the
common OLS case of correlated X's (in which the "expected
sign" becomes a fallacy when there is no explanation of WHY).
In fact, Neter et al's unit change
interpretation remains the same if Bk were estimated
by robust regression methods, which do NOT
depend on either the simple OR the partial
correlation. And FINALLY, taking all the formal
estimation methods aside, Bk could have been estimated
by EYE, and all of the cited statements would still
apply, in terms of unit change in mean Y when the
other variables are held constant.
In ALL of those cases, the interpretation of Bk, in the
per unit change interpretation, must presume all other
variables don't change, hence "held constant" is used in
that sense.
But in the case of EXPECTED SIGN, and especially in the
OLS context, it makes a vast difference whether the Bk
is simple correlation or partial correlation based, and
there is NOTHING that is held constant or can be held
constant in the COVARIATIONAL changes.
That is in fact Mosteller and Tukey's embedded idea:
if the X's are mutually correlated, then the covariation between
Y and each particular Xi depends on the covariation of both
Y and Xi with all the other Xj's -- because nothing is held
constant and nothing can be held constant in the covariation!
> What did I miss? If someone could elaborate the above statement, I
> really appreciate.
I think what you (and most likely the other "held constant"
mis-users) do is misapply the "held constant" idea in "per unit
change in the MEAN Y" to the distinction between simple and
partial correlations.
The covariation aspects of a partial correlation can be better seen
by this simple relation between PARTIAL and SIMPLE correlations
in the case of only two X's in the model.
r(y X2 | X1) = [ r(y,2) - r(y,1)*r(2,1) ] / sqrt[ (1 - r(y,1)**2) * (1 - r(2,1)**2) ]
Now you can clearly see, that the partial correlation depends on
the correlation between Y and X1, r(y,1); Y and X2, r(y,2);
and between X1 and X2, r(1,2).
Notice NOTHING is, or can be, held constant in that covariational
relation.
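The two-X relation is easy to evaluate numerically. The Python sketch below uses made-up data and hypothetical function names: it computes r(y X2 | X1) from the three simple correlations, and, as a check on the arithmetic, also from the correlation between the residuals of Y on X1 and of X2 on X1, which gives the same number.

```python
import math


def mean(v):
    return sum(v) / len(v)


def corr(a, b):
    """Simple (Pearson) correlation."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def partial_corr_formula(y, x2, x1):
    """r(y X2 | X1) from the three simple correlations."""
    ry1, ry2, r21 = corr(y, x1), corr(y, x2), corr(x2, x1)
    return (ry2 - ry1 * r21) / math.sqrt((1 - ry1 ** 2) * (1 - r21 ** 2))


def residuals(v, x):
    """Residuals from the simple regression (with intercept) of v on x."""
    mv, mx = mean(v), mean(x)
    slope = (sum((a - mx) * (b - mv) for a, b in zip(x, v))
             / sum((a - mx) ** 2 for a in x))
    return [b - mv - slope * (a - mx) for a, b in zip(x, v)]


def partial_corr_residuals(y, x2, x1):
    """Same quantity, as the correlation between two sets of residuals."""
    return corr(residuals(y, x1), residuals(x2, x1))


# Made-up data for illustration only.
X1 = [1, 2, 3, 4, 5, 6]
X2 = [2, 1, 4, 3, 6, 5]
Y = [3, 2, 6, 7, 9, 12]

if __name__ == "__main__":
    print("simple  r(y,2)       = %.4f" % corr(Y, X2))
    print("partial r(y X2 | X1) = %.4f" % partial_corr_formula(Y, X2, X1))
```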
When there are THREE independent variables, the explicit formula
for the 2nd order partial r( Y X3 | X1, X2) already becomes
unwieldy in terms of simple correlations -- but the messy
formulas only point to the fact that ALL of the simple correlations
are at work, to comprise the covariation between Y and any given X.
In any event, the only terminology that seems odd is their use
of "partial regression coefficients" to denote multiple regression
coefficients. I suspect that's introduced by the new co-author.
It is always understood that OLS multiple regression coefficients
CONTAIN partial correlation information, without calling them
"partial regression coefficients".
Since I don't have any edition of the Neter et al book, I cannot check
whether there was indeed a change of the term in their new edition(s).
-- Bob.
Again, pinv uses SVD.
> All the decomposition methods (including SVD, EVD) are sensitive.
> However, SVD is much more stable than other decomposition methods.
SVD deals with X. EVD deals with X'*X, whose condition number is the
square of that of X. Therefore EVD is inherently less stable.
The stability of both is enhanced because the transformation matrices
U,V have orthogonal columns and the unstable components are concentrated
along the lower diagonal of D.
> quote a paragraph from Moler et. al.
>
> ".... Trying to find the inverse of a nearly singular matrix is an
> inherently sensitive problem. ..... No algorithm working with finite
> precision arithmetic can be expected to obtain a computed inverse that
> is not contaminated by large errors. (pp. 4)."
> Cleve Moler, and Charles Van Loan, Ninteen dubious ways to compute the
> exponential of a matrix, twenty-five years later, SIAM Review, 2003.
> 45, 1, pp.3-49.
However, the original problem isn't to find the inverse of a matrix. It
is to find the least-squares solution of the linear equation X*b = y.
> The reason that I prefer PCA and PCR is that when a small change (i.e.,
> deletion of a sample) occurs in the X, the beta coefficients from PCR
> don't change much whereas the MLR using backslash or pinv change a lot
> (assuming that the first few PC are associated with Y.
Sometimes that assumption is invalid. Especially in classification
scenarios where directions of class separation can be orthogonal
to directions of maximum scatter/spread/variance.
> Compare case A
> versus C in my previous posting). Also, by plotting the first few
> principal component scores, correlation structure among variables and
> samples can be explored.
I plot
1. response-predictor simple correlation coefficients vs predictor
   index.
2. response-predictor partial correlation coefficients vs predictor
   index (p small-to-moderate).
3. response-PC correlation coefficients vs PC index (p moderate to
   large).
However, I have never plotted PC scores vs predictor index.
I don't see how variable correlation structure can be easily deduced
from them. Please explain.
> The following is where I need clarification. When I learned
> derivatives and integration in undergraduate, taking partial derivative
> of X assumes that the terms without X are considered as "constant".
> For example, F(x,y)=2x^2+4xy+y^2, then
> dF/dx=4x+4y.
> dF/dy=4x+2y.
>
> By the same analogy, the OLS estimators are to find partial derivative
> for b1 and b2 for the residual (E).
> E =(y-y_hat)^2=(y-B1*X1-B2*X2)^2, where y_hat = B1*X1+B2*X2.
> Taking dE/dB1 assumes that the terms without B1 are considered as
> constant. The same for taking dE/dB2. Right? Taking partial
> derivative is mathematically feasible to derive the beta coefficients.
> However, if X1 and X2 are highly collinear (especially from
> observational data), it is impossible to hold other variables as
> constant. Did I miss something here?
Yes. When there is linear dependence, it is no longer a simple
minimization problem. It is a constrained minimization problem.
For example,
minimize E = ( y - b1*x1 - b2*x2 - b3*x3 )^2
subject to the linear constraint
F = a1*x1 + a2*x2 + a3*x3 = 0.
Solving the latter for x3 (x3 = -(a1*x1 + a2*x2)/a3) and substituting yields
minimize E = ( y - (b1-b3*a1/a3)*x1 - (b2-b3*a2/a3)*x2 )^2
which is equivalent to removing x3 from the regression and
solving
minimize E = ( y - c1*x1 - c2*x2 )^2
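A small Python sketch of this reduction, with made-up numbers and hypothetical function names: when x3 satisfies a1*x1 + a2*x2 + a3*x3 = 0 exactly, any (b1, b2, b3) gives the same fitted values as the two-variable model with c1 = b1 - b3*a1/a3 and c2 = b2 - b3*a2/a3, so the three-variable coefficients are not uniquely determined.

```python
# Made-up predictors; x3 is constructed to satisfy the constraint exactly.
A1, A2, A3 = 2.0, -1.0, -1.0  # a1*x1 + a2*x2 + a3*x3 = 0
X1 = [1.0, 2.0, 3.0, 4.0, 5.0]
X2 = [2.0, 0.0, 1.0, 3.0, 1.0]
X3 = [-(A1 * u + A2 * v) / A3 for u, v in zip(X1, X2)]


def predict3(b1, b2, b3):
    """Fitted values of the three-variable model."""
    return [b1 * u + b2 * v + b3 * w for u, v, w in zip(X1, X2, X3)]


def predict2(c1, c2):
    """Fitted values of the reduced two-variable model."""
    return [c1 * u + c2 * v for u, v in zip(X1, X2)]


def reduce_coefficients(b1, b2, b3):
    """Map (b1, b2, b3) to the equivalent (c1, c2) after substituting for x3."""
    return b1 - b3 * A1 / A3, b2 - b3 * A2 / A3


if __name__ == "__main__":
    b = (1.5, -0.5, 2.0)
    c = reduce_coefficients(*b)
    print("three-variable fit:", predict3(*b))
    print("reduced fit:       ", predict2(*c))
    # A different b with the same (c1, c2) gives identical fitted values.
    print("alternative b:     ", predict3(c[0], c[1], 0.0))
```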
Hope this helps.
Greg
Correction:
The plot is made regardless of the size of p. However, choosing
original predictors over PCs for the regression model is only
done when p is small to moderate.
> 3. response-PC correlation coefficients vs PC index (p moderate to
> large).
>
> However, I have never plotted PC scores vs predictor index.
i.e. in a regression setting to try to deduce correlation structure.
> I don't see how variable correlation structure can be easily deduced
> from them. Please explain.
I do see that the plots may be useful, given the dominant PCs, in
trying to determine the dominant original predictors. Further work
toward that identification could be zeroing PC scores whose squares
contribute insignificantly to the eigenvector's unit length. What is
left may be helpful, but I still don't see how correlation structure is
going to be deduced.
Here is an excerpt from the Mosteller & Tukey chapter on regression
coefficients (pp. 318-319) that may help.
----- start of excerpt ----
13F. Sometimes x's can be "Held Constant"
We have been careful to point out--using x and t = x^2--that it does not
generally make sense to try to interpret the coefficients of x(i) in
terms of what "would happen if the other x's were held constant". In
this section, we try to go ahead a little, sounding a few of the most
necessary warnings.
Polynomial fits. When it comes to fitting polynomials, whether as simple as
b(1)x + b(2)x^2
or as complex as
b(1)x + b(2)x^2 + b(3)x^3 + b(4)x^4 + b(5)x^5 ,
it rarely pays to try to interpret coefficients. Pictures of the
fits--or of the difference in two fits to two sets of data--can be very
helpful, but the coefficients themselves are rarely worth a hard look.
Unrelated x's. If the x's are not closely related, either functionally
or statistically, we may be able to get away with interpreting b(i) as
the "effect of x(i) changing while the other x's keep their same
values." If we want to tap expert judgment about the value of b(i),
some set of words like those in quotes may be the best we can use.
----- end of excerpt ----
To elaborate on the point about polynomial fits, if x(2) = x(1)^2, it is
impossible to hold one of the x's constant while increasing the other by
one unit. So the interpretation of coefficients described under
"Unrelated x's" above cannot apply across the board for all models, at
least not without some modification.
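To put numbers on that, here is a small sketch with made-up coefficients (b1 and b2 below are assumptions, not estimates from any data). For the fit y_hat = b1*x + b2*x^2, a one-unit increase in x changes the fitted value by b1 + b2*(2x + 1), which depends on where x starts and equals b1 only when b2 is zero.

```python
# Hypothetical coefficients for the fit y_hat = b1*x + b2*x**2.
b1, b2 = 3.0, 0.5


def y_hat(x):
    """Fitted value of the quadratic at x."""
    return b1 * x + b2 * x ** 2


def unit_change(x):
    """Change in the fitted value when x goes from x to x + 1."""
    return y_hat(x + 1.0) - y_hat(x)


for x in (0.0, 1.0, 4.0):
    # Second column: observed change. Third column: b1 + b2*(2x + 1).
    print(x, unit_change(x), b1 + b2 * (2.0 * x + 1.0))
```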
Here's a question for RF Bob, or anyone else who cares to comment. Is
this modification of the usual statement any more acceptable to you?
"The coefficient b(i) gives the change in the fitted value of Y that
accompanies a one-unit increase in x(i) while all other x's, except
those that are functionally related to x(i), are held constant."
Cheers,
Bruce
--
Bruce Weaver
bwe...@lakeheadu.ca
www.angelfire.com/wv/bwhomedir
> Here's a question for RF Bob, or anyone else who cares to comment. Is
> this modification of the usual statement any more acceptable to you?
>
> "The coefficient b(i) gives the change in the fitted value of Y that
> accompanies a one-unit increase in x(i) while all other x's, except
> those that are functionally related to x(i), are held constant."
>
Not true, because if those that are functionally related to x(i) (such
as x(i)^2) vary, the change in the fitted Y for a unit change in x(i)
will be different from the coefficient of x(i).
It is true, however, (using your phrasing as closely as possible) that
the coefficient b(i) gives the change in the fitted value of Y that
accompanies a one-unit increase in x(i) while all other x's are held
constant in those cases where it is feasible to vary x(i) alone.
(There's also a potential problem with the word "increase", which should
not be interpreted longitudinally when the data come from a
cross-sectional study, but that's yet another issue.)
Thanks Jerry, for stepping in while I was busy entertaining myself
elsewhere.
I agree with Jerry's answer, and even IF one made enough
amendments to the above three-line statement, in the form
of footnotes, exceptions, etc., what you stated is LIKELY to
worsen the confusion rather than remove it. Your statement
is ALREADY too complicated and confusing to
be useful in any real sense of applied multiple regression.
Given what had been discussed in these groups, I think the best
thing to do is to think of the PARTIAL correlation aspects of the
coefficient, as the relation between X and Y, in the PRESENCE
of all other variables in the model.
The rest will take care of themselves, whether the X's are
orthogonal, correlated, or multilinearly related.
Just say "NO" to holding anything constant. :-)
The problem is self-solving. Thanks to Sangdon's quotes from
Neter, Wasserman, Kutner, and Nachtsheim, one can see what
they talk about there is NOT the proper interpretation of the
meaning of the partial correlation coefficient in a regression
fitted model <as distinct and separate from the meaning of a
simple correlation>.
Neter et al were discussing the meaning of the coefficient
itself, in terms of "PER UNIT CHANGE", and their interpretation
is true, with or without data, whether the variables are orthogonal,
and doesn't even depend on whether the estimation CONSISTS
of the use of simple correlation OR partial correlation, as
in OLS (Ordinary Least Squares) in the usual multiple
regression problems.
In that respect, if one simply NEVER uses the terminology of
"holding something constant" with respect to the PARTIAL
correlation aspects of the problem, then all of the common
misinterpretations in the "expected signs" and attempts to
attach specific meaning to ONE X, while neglecting the
effects of other X's in the model that contributed to THAT
estimated coefficient, and any other kinds of related misuse,
will vanish, by themselves!
-- Bob.
Right. Good catch Jerry.
To re-iterate what Mosteller & Tukey said, "it rarely pays to try to
interpret coefficients [for a polynomial fit]. Pictures of the fits--or
of the difference in two fits to two sets of data--can be very helpful,
but the coefficients themselves are rarely worth a hard look."
>
> It is true, however, (using your phrasing as closely as possible) that
> the coefficient b(i) gives the change in the fitted value of Y that
> accompanies a one-unit increase in x(i) while all other x's are held
> constant in those cases where it is feasible to vary x(i) alone.
Yes, that works better. The wording near the end is somewhat suggestive
of experimental manipulation of x(i), though. To avoid that, I might
change it to "where it is feasible for x(i) to vary alone".
>
> (There's also a potential problem with the word "increase", which should
> not be interpreted longitudinally when the data come from a
> cross-sectional study, but that's yet another issue.)
I think of it as the difference between the fitted values of Y for two
cases (or subjects) that are identical on all of the x's except for
x(i), where they differ by one unit. But only for those situations
where it is feasible for x(i) to vary alone, of course!
On 1 May 2006 15:41:29 -0700, "Reef Fish"
<Large_Nass...@Yahoo.com> wrote:
[snip]
>
> I've used several different editions of the Neter et al book, since
> its first edition. The only thing I found objectionable in their
> book is making it a separate chapter and called it "Polynomial
> Regression", when that is just like any other multiple regression
> in a Linear Model. Calling it Polynomial regression promotes
> misuse and misunderstanding of the meaning of Linear Models!
There are different ways of organizing the topics, and none
of them are perfect.
For those of us who see Polynomial regression as a
distinctly special case in several regards, Bob's preference
for blurring the distinctions seems like an alternate way
to '[promote] misuse and misunderstanding.'
[snip, many lines]
> In any event, the only terminology that seems odd is their use
> of "partial regression coefficients" to denote multiple correlation
> coefficients. I suspect that's introduced by the new co-author.
> It is always understood that OLS multiple regression correlations
> CONTAIN partial correlation information, without calling it a
> "partial regression coefficient".
>
> Since I don't have any edition of the Neter et al book, I cannot check
> whether there was indeed a change of the term in their new edition(s).
I've long been comfortable with the term, "partial regression
coefficients." I'm a little surprised that Bob does not like it,
because it seems to me to emphasize his own point --
the single coefficients cannot (easily) be interpreted alone.
That point is absolute for Polynomial regression; that
is one of the special circumstances of polynomial regression.
I've never studied from the Neter book. I imagine I picked
up the phrase in some course, 30 years ago, and I've felt
that it was apt. And I've used it when I start to explain why
the regression coefficients do not reflect what someone
naively 'expected'.
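A short sketch of why the word "partial" fits, using made-up numbers (the data, and the choice of two predictors, are assumptions for illustration). The coefficient of x1 in the two-predictor regression equals the slope obtained after the linear effect of x2 has been removed from both y and x1. With correlated predictors it can differ a great deal from the simple regression slope of y on x1 alone.

```python
def mean(v):
    return sum(v) / len(v)


def slope(x, y):
    """Least-squares slope of y on x (intercept included)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals from the simple regression of y on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical correlated predictors; y is built without noise so that
# the two-predictor coefficient of x1 is 2 by construction.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]
y = [2.0 * u - 1.0 * v + 0.5 for u, v in zip(x1, x2)]

simple_b1 = slope(x1, y)                                  # ignores x2
partial_b1 = slope(residuals(x2, x1), residuals(x2, y))   # adjusts for x2
print("simple slope:", simple_b1)
print("partial coefficient:", partial_b1)
```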
--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html
My last two lines above were manifested in what I observed in
sci.stat.math, where the term LINEAR MODEL, which has a
standard meaning in Statistics to mean a model linear in the
PARAMETERS of the model, is confused by many readers to
mean a linear FUNCTIONAL model.
A polynomial model is a LINEAR model (in the parameters),
but a NONLINEAR FUNCTION MODEL in the X's.
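As a rough illustration of that distinction, the sketch below fits a quadratic by ordinary least squares. The data are made up and the helper names (solve, ols) are hypothetical. The columns of the design matrix are nonlinear functions of x, but the fitted value is still X b, so the estimate comes from the usual normal equations (X'X) b = X'y.

```python
def solve(a, b):
    """Solve the square system a * beta = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(design, y):
    """Least-squares estimate from the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical noise-free data from y = 1 + 2x - 0.5x^2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * xi - 0.5 * xi ** 2 for xi in x]

# Columns 1, x, x^2: nonlinear in x, linear in the coefficients.
design = [[1.0, xi, xi ** 2] for xi in x]
b_hat = ols(design, y)
print(b_hat)
```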
That is where a common confusion sets in, so much so that
Richard Ulrich was the only person who was wrong on
EVERY answer he submitted on an elementary quiz of
stated models as to whether they are LINEAR models or not.
If the FUNCTIONAL terms in the regression models are
trigonometric functions, there is absolutely no need to call
it a trigonometric model.
It is just another case of a LINEAR model.
That is my objection. The same confusion inducing chapter
title is found in many elementary textbooks. At least most of
those textbooks are not advanced enough to discuss the notion
of LINEAR models.
The book by Neter et al is VERY explicit in its presentation of
LINEAR models in the correct sense. It uses LINEAR models
to set up experimental designs which are clearly nonlinear in
the X's. It talks about fitting nonlinear surfaces in X which are
clearly nonlinear functional models, yet LINEAR models within
the context of regression analysis.
>
> There are different ways of organizing the topics, and none
> of them are perfect.
>
> For those of us who see Polynomial regression as a
> distinctly special case in several regards, Bob's preference
> for blurring the distinctions seems like an alternate way
> to '[promote] misuse and misunderstanding.'
It's ironic that the person in these groups MOST CONFUSED by
the notion of a LINEAR regression model and a linear FUNCTIONAL
model should be the first to speak up about my objection to the
use of the term "Polynomial models" -- and that my objection
was that it confuses people like Richard Ulrich!!!
Ulrich says he sees a Polynomial regression as a distinctly
special case in several regards without mentioning ANY of
what he considered "special regards". His follow up sentence
is especially ironic to call it
RU> an alternate way to '[promote] misuse and misunderstanding'
coming from someone who HAD misused and misunderstood what
a LINEAR model was, and seems still confused about it.
>
> [snip, many lines]
>
> > In any event, the only terminology that seems odd is their use
> > of "partial regression coefficients" to denote multiple correlation
> > coefficients. I suspect that's introduced by the new co-author.
> > It is always understood that OLS multiple regression correlations
> > CONTAIN partial correlation information, without calling it a
> > "partial regression coefficient".
>
> I've long been comfortable with the term, "partial regression
> coefficients." I'm a little surprised that Bob does not like it,
> because it seems to me to emphasize his own point --
> the single coefficients cannot (easily) be interpreted alone.
>
> That point is absolute for Polynomial regression; that
> is one of the special circumstances of polynomial regression.
REALLY now? Your misunderstanding about partial
correlations and linear models is much deeper than I had
thought.
Why is the fact that "the single coefficients cannot (easily)
be interpreted alone" any different in Polynomial regression
(as a special case of a linear model) than in any other linear
regression model that has correlated X's?
> I've never studied from the Neter book.
I suspect you have not studied from ANY book written by competent
statisticians. Else, as in the Neter et al book, ANYONE who had
studied it with any modicum of success would have had a clear
understanding about LINEAR models, and about the proper
notion about partial correlations.
Furthermore, such students would NEVER have made all those
errors about multiple regression that Richard Ulrich had made
(and was pointed out by me), all in the archives of this newsgroup.
-- Bob.
P.S.
After I have finished this reply to Ulrich, I found a webpage
which showed the OUTLINE of Neter et al's 4th Edition, chapter
by chapter.
I think Neter et al must have heard my objection to the use of
"Polynomial Regression" in what I recall to be Chapter 9 in the
early editions. :-) It's now GONE.
Furthermore, the book is FULL of chapter titles, and the BOOK title
as well, with the keyword "Linear Models". I'll make a separate
post about the current content and how it IMPROVED upon the
earlier editions, such as the removal of the misleading chapter
and title of "Polynomial Regression".
There is NO LONGER a separate discussion of Polynomial
Regression in Neter et al's book. If I had known that before
I started this reply, I could have just pointed the current chapter
titles to Ulrich.
Literally speaking, Bob lies. There was no quiz with submissions.
Bob invented "questions" and inferred "answers" from posts.
Bob has difficulty in reading my posts, and then is apparently
incapable of accepting my correction of his own mis-reading
of my content and intent. I see that he has the same trouble
with other people, too, so I figure this is a sound observation.
[snip, a bunch]
RF >
> There is NO LONGER a separate discussion of Polynomial
> Regression in Neter et al's book. If I had known that before
> I started this reply, I could have just pointed the current chapter
> titles to Ulrich.
If Bob did not post his stream-of-consciousness reaction to
whatever he reads, line by line, it should cut down his volume
in a useful way. I used to think that delay would improve his
reading comprehension, but there is little evidence that time helps.
Literally speaking, you may call it just a solicited poll with my
9 specific questions on the identification of Linear models vs
nonlinear. Several readers submitted their responses to those
specific questions, including Jerry Dallel. The answers were
NOT inferred. They were directly from the archives. I don't
want to waste time to dig it up.
Richard Ulrich was the ONLY ONE in sci.stat.math who MISSED
all of the questions he answered, including calling this
Y = bX
an example of a NONLINEAR regression because the constant
a was missing. LOL. Only one in the WORLD who would call
that a nonlinear regression. Richard Ulrich.
> Bob has difficulty in reading my posts, and then is apparently
> incapable of accepting my correction of his own mis-reading
> of my content and intent.
This line is getting VERY, VERY old, Richard Ulrich.
> [snip, a bunch]
> RF >
> > There is NO LONGER a separate discussion of Polynomial
> > Regression in Neter et al's book. If I had known that before
> > I started this reply, I could have just pointed the current chapter
> > titles to Ulrich.
-- Bob.
You, however, have yet to back up your claim that the model
E(y) = b x1 + b^2 x2
is a linear model, i.e. "...where the term LINEAR MODEL, which has a
standard meaning in Statistics to mean a model linear in the
PARAMETERS of the model ...".
>
>> Bob has difficulty in reading my posts, and then is apparently
>> incapable of accepting my correction of his own mis-reading
>> of my content and intent.
>
> This line is getting VERY, VERY old, Richard Ulrich.
>
Indeed. Perhaps if you didn't bring this subject up, neither would Rich.
Bob
--
Bob O'Hara
Department of Mathematics and Statistics
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FIN-00014 University of Helsinki
Finland
Telephone: +358-9-191 51479
Mobile: +358 50 599 0540
Fax: +358-9-191 51400
WWW: http://www.RNI.Helsinki.FI/~boh/
Journal of Negative Results - EEB: www.jnr-eeb.org
Show us how he admitted it. Of course he was wrong! He
mouth danced around it just like he did all the rest.
>
> You, however, have yet to back up you claim that the model
>
> E(y) = b x1 + b^2 x2
That's a Linear model with a constraint. Dead Horse.
One that Bob O'Hara failed miserably not only on the above
but several other examples.
Leave the Dead Horse alone or go back and review the dead
thread.
>
> is a linear model, i.e. "...where the term LINEAR MODEL, which has a
> standard meaning in Statistics to mean a model linear in the
> PARAMETERS of the model ...".
You mean you STILL failed to see that the model is a LINEAR
COMBINATION of the parameters? (By a mere notational substitution
of b1 and b2 and then putting a constraint on b2?)
That's the same mistake Ulrich made by declaring the model with
constraint a = 0 a (his) NONLINEAR model, in Y = bX. LOL
>
> >
> >> Bob has difficulty in reading my posts, and then is apparently
> >> incapable of accepting my correction of his own mis-reading
> >> of my content and intent.
> >
> > This line is getting VERY, VERY old, Richard Ulrich.
> >
> Indeed. Perhaps if you didn't bring this subject up, neither would Rich.
Nor would Bob O'Hara. And you are both STILL WRONG about
the same items in which you failed.
Consistent, I say.
-- Bob.
Ok, let's make this simple. If E(y) = b x1 + b^2 x2 is linear in the
parameters, then if we differentiate w.r.t. the parameters, we should
get a constant, i.e. something that doesn't depend on the parameters. True?
So, if I do this, I get
d E(y)/db = x1 + 2b x2
which still depends on b. Where have I gone wrong?
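The derivative test can also be checked numerically. A minimal sketch, assuming arbitrary predictor values x1 and x2 and a plain central-difference approximation (the function names are hypothetical): for the one-parameter form b*x1 + b^2*x2 the slope with respect to b changes with b, while for the two-parameter form b1*x1 + b2*x2 the slope with respect to b1 is x1 at every value of b1.

```python
x1, x2 = 2.0, 3.0  # hypothetical predictor values for one observation


def constrained(b):
    """E(y) = b*x1 + b^2*x2: a single parameter b entering twice."""
    return b * x1 + b ** 2 * x2


def unconstrained(b1, b2):
    """E(y) = b1*x1 + b2*x2: two free parameters."""
    return b1 * x1 + b2 * x2


def d_db(f, b, h=1e-3):
    """Central-difference approximation to df/db."""
    return (f(b + h) - f(b - h)) / (2.0 * h)


for b in (0.0, 1.0, 2.0):
    slope_c = d_db(constrained, b)                      # tracks x1 + 2*b*x2
    slope_u = d_db(lambda t: unconstrained(t, 5.0), b)  # stays at x1
    print(b, round(slope_c, 6), round(slope_u, 6))
```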
> That the same mistake Ulrich make by declaring the model with
> constrain a = 0 is a (his) NONLINEAR model, in Y = bX. LOL
>
>>>>Bob has difficulty in reading my posts, and then is apparently
>>>>incapable of accepting my correction of his own mis-reading
>>>>of my content and intent.
>>>
>>>This line is getting VERY, VERY old, Richard Ulrich.
>>>
>>
>>Indeed. Perhaps if you didn't bring this subject up, neither would Rich.
>
>
> Nor would Bob O'Hara.
Indeed. Get the hint. Please.
Bob
--
Bob O'Hara
That's at least a NEW one that shows the depth of Bob O'Hara's
ignorance about Linear Regression Models.
In Y = sin(x),
where is an unknown parameter b to be estimated in a Regression?
> --
> Bob O'Hara
>
> Dept. of Mathematics and Statistics
> FIN-00014 University of Helsinki
> Finland
Bob O'Hara, when it comes to Statistics, you are a disgrace
to the Department to which you attach your name, a disgrace
to the University of Helsinki, and a disgrace to your country
Finland, which ranks only behind Hong Kong, as the Top
country in the training of 15 year olds, in a recent international
study and report.
-- Bob.
But if you don't like that, try Y = sin(b1 x).
And also please get back to me on the rest of my message: I genuinely
want to know where my logic is going wrong. Or if it isn't, then how
you can claim that the model is linear.
I'll ignore the abuse.
Bob
--
Bob O'Hara
Bob O'Hara calls Y = sin (x) a linear model because it is easy to
fit even though there is NO PARAMETER
>
> But if you don't like that, try Y = sin(b1 x).
Bob O'Hara, you should heed the Dennis First Law of Hole:
"When you find yourself in a Hole, you should STOP digging."
You have just proven, beyond any shadow of a doubt, of your
ignorance about Linear Regression Models.
The model Y = sin (b1 x), where b1 is a parameter, is,
by any definition in any book of statistics, NOT a Linear Regression
Model!
Have you tried to write that as Y = X b where b is your b1?
> And also please get back to me on the rest of my message:
No need. It's a Dead Horse. Go back to school and learn
about Statistical Linear Models. You had never learned it in the
first place; and when taught in sci.stat.math, you STILL couldn't
learn, even though almost everyone else had learned.
> I genuinely
> want to know where my logic is going wrong. Or if it isn't, then how
> you can claim that the model is linear.
The logic is that you failed to heed the definition of a Linear Model.
You failed to understand the meaning of a Linear Combination of
the parameters. That's just a start.
> I'll ignore the abuse.
> --
> Bob O'Hara
> Department of Mathematics and Statistics
> FIN-00014 University of Helsinki
> Finland
What abuse? I did not say it the first 10 times you made such
blunders. I did not say it the first 50 times you made such
blunders. I finally said it to you because you are STILL ARGUING,
and ARGUING, on the simplest of a mathematical definition of
a statistical term of what constitutes a Linear Regression Model.
I therefore will repeat to you:
Bob O'Hara, when it comes to Statistics, you are a disgrace
to the Department to which you attach your name, a disgrace
to the University of Helsinki, and a disgrace to your country
Finland, which ranks only behind Hong Kong, as the Top
country in the training of 15 year olds, in a recent international
study and report.
Furthermore, I must reluctantly add you to the list of those in
these groups to whom I have said that their FREE lessons
are over, because I had already expended at least 10 times
the amount of time I normally would go over the same material,
to ANY student I have ever taught, in the very same topic of
Linear Regression Models.
Buy yourself the book by Neter, Kutner, Nachtsheim, and Wasserman,
4th edition, which I reviewed here on their Chapter topics,
relative to the Polynomial Regression comment being a LINEAR
model, that put you into the same class now, with:
Richard Ulrich, Greg Heath, with Anon Bob O'Hara being the
newly elected "Frequent Abuser of Statistics" in sci.stat.* groups.
If it's any consolation to you, none of you has quite made it
yet to the same class of Luis A. Afonso.
Good Luck (you'll need it), and goodbye.
-- Bob.
> Have you tried to write that as Y = X b where b is your b1?
>
Yes, you can expand it out as a polynomial. Do you STILL fail to see
that the model is a LINEAR COMBINATIONS of the parameters? (By a mere
notational substitution of b1, b2,..., bi and then put a constrain on
the bi's?).
I fail to see the difference in the logic, and you haven't given me any
explanation.
>
>> And also please get back to me on the rest of my message:
>
> No need. It's a Dead Horse. Go back to school and learn
> about Statistical Linear Models. You had never learned it in the
> first place; and when taught in sci.stat.math, you STILL couldn't
> learn, even though almost everyone else had learned.
>
>> I genuinely
>> want to know where my logic is going wrong. Or if it isn't, then how
>> you can claim that the model is linear.
>
> The logic is that you failed to heed the definition of a Linear Model.
> You failed to understand the meaning of a Linear Combination of
> the parameters. That's just a start.
>
PLEASE SHOW ME! You have consistently refused to deal with this issue,
which should be very simple.
How is E(Y) = b x1 + b^2 x2 a linear combination of the parameter b?
If you can show this, then I'll be happy. If you can't, then it looks
like the failure of understanding is yours, and you're either being
extremely stupid, or very dis-honest.
Bob
--
Bob O'Hara
However, if you think about the phrase "linear in the parameters", you can
see that BL is right. Given the equation:
E(Y) = b x1 + b^2 x2, it is linear in the parameters. There is no
requirement on how that parameter is expressed (power, square root, log,
exponential, etc.). It just requires a linear form.
The equation Y=sin(X) has no parameters to fit. Although the function (sin)
can be expressed as a polynomial, there are no polynomial parameters to fit.
They are all implicit in the reduction to an infinite series (as a
polynomial)..
Again, the fitted coefficients in a polynomial (such as the NIST Filip data
set of 11 polynomial coefficients) can be obtained by linear regression. The
literature reports that Stata and JMP 4.0.5 can't solve this fit problem. If
however the 10 power terms in X are centered and treated as a linear
multivariate problem, Excel 2000 will give accurate parameter values, with
the lowest LRE accuracy measure of coefficient values being 10.5. This
comes from my URL on the faults and errors in Excel.
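The effect of centering can be seen on a toy case. This sketch does not reproduce the NIST Filip fit or any Excel result. It uses a made-up predictor and only the first two powers, and it assumes "centering" means subtracting the mean of X before forming the powers. The raw x and x^2 are almost perfectly correlated, while the centered versions are not, which is the kind of conditioning improvement described above.

```python
def mean(v):
    return sum(v) / len(v)


def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


x = [float(i) for i in range(10, 21)]  # hypothetical predictor, far from zero
x_mean = mean(x)
xc = [xi - x_mean for xi in x]         # centered copy

raw = corr(x, [xi ** 2 for xi in x])
centered = corr(xc, [xi ** 2 for xi in xc])
print("corr(x, x^2) raw:", raw)
print("corr(x, x^2) centered:", centered)
```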
DAH
But more importantly, my model, E(Y) = b(x1 + b x2) is not of this form:
you can't write an f(b) that makes it linear in b.
Draper & Smith discuss non-linear models in their book on regression,
and point out that some non-linear models can be written in a linear
form, and call them "intrinsically linear". The point being (I assume)
that it makes it easier to develop algorithms that can search for
optimal solutions, as the parameter space of the intrinsically linear
model is a sub-space of the parameter space of a linear model.
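One common textbook illustration of a model that can be rewritten in linear form is y = a*exp(b*x). This example is my own choice and is not taken from Draper & Smith, and the data and parameter values are made up. Taking logs gives log(y) = log(a) + b*x, which is linear in the parameters (log a, b) and can be fitted with a straight-line regression.

```python
import math


def simple_ols(x, y):
    """Intercept and slope of the least-squares line of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((u - mx) * (v - my) for u, v in zip(x, y))
             / sum((u - mx) ** 2 for u in x))
    return my - slope * mx, slope


# Hypothetical noise-free data from y = a * exp(b * x), nonlinear in b.
a_true, b_true = 2.0, 0.3
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [a_true * math.exp(b_true * xi) for xi in x]

# log(y) = log(a) + b*x is linear in the parameters (log a, b).
intercept, slope = simple_ols(x, [math.log(yi) for yi in y])
a_hat, b_hat = math.exp(intercept), slope
print(a_hat, b_hat)
```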
> The equation Y=sin(X) has no parameters to fit. Although the function (sin)
> can be expressed as a polynomial, there are no polynomial parameters to fit.
As Reef Fish pointed out. That's why I then changed it to E(Y)
= sin(bX).
> They are all implicit in the reduction to an infinite series (as a
> polynomial)..
>
> Again, the fitted coefficients in a polynomial (such as the NIST Filip data
> set of 11 polynomial coefficients) can be obtained by linear regression. The
> literature reports that Stata and JMP 4.0.5 can't solve this fit problem. If
> however the 10 power terms in X are centered and treated as a linear
> multivariate problem, Excel 2000 will give accurate parameter values, with
> the lowest LRE accuracy measure of coefficient values being 10.5. This
> comes from my URL on the faults and errors in Excel.
>
OK, but you're going to have problems fitting an infinite number of terms!
If you're going to claim that E(Y) = sin (bX) is a linear model, then
you're claiming that any function that can be written as a polynomial is
a linear model. In which case most examples of non-linear models would
be described as linear.
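To make the series argument concrete, here is a small sketch (the values of b and x are arbitrary assumptions, and the function names are hypothetical). Expanding sin(b*x) gives coefficients b, -b^3/3!, b^5/5!, and so on for x, x^3, x^5. Every coefficient is a fixed function of the single parameter b, so they are not free parameters that can be estimated independently.

```python
import math


def taylor_coefficients(b, terms):
    """Coefficients of x, x^3, x^5, ... in the Taylor series of sin(b*x)."""
    return [(-1) ** k * b ** (2 * k + 1) / math.factorial(2 * k + 1)
            for k in range(terms)]


def truncated_sin(b, x, terms):
    """Evaluate the truncated series at x."""
    coefs = taylor_coefficients(b, terms)
    return sum(c * x ** (2 * k + 1) for k, c in enumerate(coefs))


b, x = 0.8, 1.5  # hypothetical parameter and predictor value
coefs = taylor_coefficients(b, 4)
print(coefs)
print(truncated_sin(b, x, 10), math.sin(b * x))
```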
Bob
--
Bob O'Hara
David Heiser,
You should have simply given your paragraphs below, without the above
gratuitously derogatory statements about my "method in the classroom"
when you haven't seen the number of "lectures" I had given Bob O'Hara
FREE and he ended where he is today -- just look at his follow-up to
your post!
Any undergrad student who took my Applied Linear Models course from
the textbooks by Neter, Wasserman, and Kutner, would have mastered
the concept of LINEAR MODELS in two lectures, while Bob O'Hara and
Richard Ulrich, after the equivalent of about 20 or more lectures,
repeating the SAME substance, are still completely oblivious to the
concept and methods.
You are ABSOLUTELY CORRECT about how simple the concepts are
and your explanation of why sin(x) is NOT a regression or linear model,
and Bob O'Hara's same ridiculous response to YOUR response says
it all.
If you have the patience to re-explain to Bob O'Hara the same idea a
dozen times, Bob O'Hara will be exactly where he is today, if not
worse.
His argument about sin(x) being a linear model and SIN(b1X) being a
linear model ARE in fact much worse than his previous misconceptions.
-- Reef Fish Bob.
Sorry folks. That posting name was used, and meant to be used ONLY
in the newsgroup rec.travel.cruises. I'll have to check the posting
name more carefully next time I post.
The message to David Heiser stands of course.
I was using _reductio ad absurdum_ to argue that Reef Fish's argument
that bx1+b^2x2 is a linear model isn't valid. Of course, my argument
may be wrong, in which case I hope someone will explain why it's wrong.
Bob
--
Bob O'Hara
Y = sin(x),
Anon Bob> How does that stop it being a linear model (or not)?
Anon Bob> It's just one that's very easy to fit!
RF> Bob O'Hara calls Y = sin (x) a linear model because it is
RF> easy to fit even though there is NO PARAMETER
Anon Bob retorts,
Anon Bob> But if you don't like that, try Y = sin(b1 x).
David Heiser had explained to Bob O'Hara as follows,
DH> The equation Y=sin(X) has no parameters to fit. Although
DH> the function (sin) can be expressed as a polynomial, there
DH> are no polynomial parameters to fit. They are all implicit
DH> in the reduction to an infinite series (as a polynomial)..
Now Anon Bob O'Hara tried to obfuscate his TOTAL confusion
about Linear Models by citing a series of posts unrelated to his
NEW ERRORS, followed by his irrelevant excuses about his
latest absurd statements about linear models!
And David Heiser had the audacity to make snide remarks and
invalid inference about my teaching methods, when he knew
NOTHING about my teaching methods, nor had he seen the
persistent errors made by Bob O'Hara since the Linear Models
thread started in 2005, resolved and finished!
And Bob O'Hara is STILL making the same errors about Linear
Models, and making NEW errors because of his lack of understanding
about linear models!
-- Reef Fish Bob.