Patrick van Lonkhuijzen <p.van.lo...@research.kpn.com> wrote:
> the literature I read on cluster analysis suggests a problem with cluster
> analysis and multicollinearity. there was suggested that it could be found
> by looking at the correlations. now with the cluster analysiss I'm trying I
> come with the following problems the correlations are immens or almost zero.
Please explain what you mean by "immens" here.
> but when the correlations are zero the interpretability gives a problem.
I don't see it as a problem if these correlations are equal to or
nearly equal to 0. In fact, in some ways, that is better.
> how
> bad is it to ignore milticollinearity and where can I find some literature
> on milticollinearity?
I'm not an expert in cluster analysis, but here are some thoughts
(maybe someone with more expertise can give a better answer).
There are at least two potential problems with multi-collinearity in
cluster analysis. The first occurs if you try to construct a matrix
of Euclidean distances (or some related metric) between each pair of
cases. In this case, if you use a simple Euclidean distance formula
(without taking into account correlations between measures), then you
will give too much weight to variables that are highly correlated.
For example, suppose that, among 20 different person variables, you include
"height," "arm length," and "leg length." These variables are highly
correlated and carry redundant information. In a sense, all three
variables reflect the same underlying trait. If you calculate a
simple Euclidean distance--e.g.,
    distance(i,j)^2 = SUM_k [x(ik) - x(jk)]^2

where i and j are two subjects, and x(ik), x(jk) are the scores of
subjects i and j on variable k,
then this underlying trait will count three times towards the
definition of distance, and contribute three times as much as it
should to the cluster analysis solution. If your correlations
are not very high, then I wouldn't worry too much about this
problem.
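To make the triple-counting concrete, here is a small sketch with made-up
numbers (the variable names and values are hypothetical, just to illustrate
the arithmetic): two subjects differ by one unit on an underlying "size"
trait, which shows up identically in three correlated measures, and by one
unit on a single unrelated measure.

```python
import numpy as np

# Hypothetical scores on 4 variables:
# height, arm length, leg length (all driven by "size"), plus one
# independent variable.
subject_i = np.array([1.0, 1.0, 1.0, 0.0])
subject_j = np.array([0.0, 0.0, 0.0, 1.0])

diff = subject_i - subject_j
sq_contrib = diff ** 2
print(sq_contrib)  # [1. 1. 1. 1.]

# The shared "size" trait accounts for three of the four squared terms,
# so it carries three times the weight of the independent variable in
# the squared Euclidean distance.
size_weight = sq_contrib[:3].sum()   # 3.0
other_weight = sq_contrib[3]         # 1.0
distance_sq = sq_contrib.sum()       # 4.0
```

Both subjects differ by the same single "unit" on each underlying trait, yet
the redundant measurements make the size trait dominate the distance.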
The second problem with multi-collinearity is simply one of
computation. There are many types of clustering algorithms available.
I would not be surprised if some are unable to deal with highly
multicollinear data--for example, an algorithm may need to invert the
within-group correlation matrix, which cannot be done if one variable
is a linear combination of others. Again, I wouldn't necessarily
worry about this as long as the computer program didn't produce any
error messages or strange results.
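The inversion problem can be checked directly. The following sketch (with
simulated data, using NumPy rather than any particular clustering package)
constructs one variable as an exact linear combination of two others; the
resulting correlation matrix is singular, so its determinant is
(numerically) zero and it cannot be inverted.

```python
import numpy as np

# Simulated data: x3 is an exact linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = 2 * x1 + 3 * x2
data = np.column_stack([x1, x2, x3])

corr = np.corrcoef(data, rowvar=False)
det = np.linalg.det(corr)

# A singular matrix has determinant 0; any algorithm that tries to
# invert this correlation matrix will fail (or produce garbage).
print(abs(det))  # effectively zero, up to floating-point rounding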
A simple expedient to avoid potential problems is to perform a factor
analysis or principal components analysis of your data, and to do the
cluster analysis on the factors or components instead of the raw data.
Alternatively, I believe there are some types of cluster analysis that,
in effect, do this for you, by performing a factor analysis at each
iteration of the clustering process.
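As a sketch of the "PCA first, then cluster" idea (again with simulated
data, using plain NumPy rather than a specific package): three of the four
variables below share an underlying trait, but the principal-component
scores are mutually uncorrelated, so Euclidean distances computed on them
no longer triple-count that trait.

```python
import numpy as np

# Simulated data: 3 variables driven by one trait, plus 1 independent one.
rng = np.random.default_rng(1)
trait = rng.normal(size=(100, 1))
raw = np.hstack([
    trait + 0.1 * rng.normal(size=(100, 1)),
    trait + 0.1 * rng.normal(size=(100, 1)),
    trait + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# Standardize, then compute principal components via SVD.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)
scores = z @ vt.T  # component scores; cluster on these instead of raw data

# The component scores are uncorrelated with each other.
comp_corr = np.corrcoef(scores, rowvar=False)
off_diag = comp_corr - np.diag(np.diag(comp_corr))
print(np.abs(off_diag).max())  # effectively 0
```

The cluster analysis would then be run on `scores` (or on the first few
columns of it), where each component contributes its information only once.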
The SAS manual for PROC CLUSTER contains a lot of information, and
you might check it for a description of the various algorithms
available.
John
--
John Uebersax
jsueb...@yahoo.com