Mahalanobis distance in Cluster analysis?

I.M. Boerefijn

unread,

Jul 7, 2005, 12:37:09 PM7/7/05

to

Hello,

because some of my data suffers from multicollinearity, I would like to use
the Mahalanobis distance (D²) for my cluster analysis. But I can't find this
tool in SPSS (version 12.01). Does anybody has a solution for this problem?

Thanks in advance!

Irene

Amulet

unread,

Jul 7, 2005, 2:36:22 PM7/7/05

to

it should be under regression, use analyze, regression, and check for
all the available option

I.M. Boerefijn

unread,

Jul 8, 2005, 4:53:22 AM7/8/05

to

Thank you for your quick response Amulet,

but I don't really understand how this technique used under Regression
analysis will help me with my (hierachical) Cluster analysis.
By the way the correllation I spoke of, has a coefficient of 0,86. This
involves 4 of my 23 variables (so 2 pairs both have coefficients of 0,86).
Is there another way to deal with this high correllation when using cluster
analysis?

"Amulet" <amul...@hotmail.com> schreef in bericht
news:1120761382.4...@g47g2000cwa.googlegroups.com...

Gottfried Helms

unread,

Jul 8, 2005, 9:16:49 AM7/8/05

to

Am 07.07.05 18:37 schrieb I.M. Boerefijn:

If I recall things right, mahalanobis distacnes are just the
distances between the orthogonal components instead of the
correlated variables.
To get orthogonal components from correlated variables you
can use factor analysis (PC-extraction, all factors) and save the
factor-scores as new variables.
Now you perform your cluster-analysis on that factor-scores.

Gottfried Helms

I.M. Boerefijn

unread,

Jul 8, 2005, 11:01:40 AM7/8/05

to

Gottfried thanks for your response,

you mean I have to extract the same number of factors as the number of
variables I have?
I take it that principal component analysis and an varimax rotation must be
used here?

I compared the factor sultions given by the original variables and the
factors: they lead to quite different results.

"Gottfried Helms" <he...@uni-kassel.de> schreef in bericht
news:dalugp$ojj$05$1...@news.t-online.com...

Gottfried Helms

unread,

Jul 8, 2005, 12:02:38 PM7/8/05

to

Am 08.07.05 17:01 schrieb I.M. Boerefijn:

> Gottfried thanks for your response,
>
> you mean I have to extract the same number of factors as the number of
> variables I have?
> I take it that principal component analysis and an varimax rotation must be
> used here?

Yes, but I think a rotation of the "components" doesn't affect the distances
measured between "cases" (if taken over *all components*). The rotation
could only be useful to have better "interpretable" components as
representants for your bundles of correlated variables - but if you
don't use those components later you win nothing by any rotation.

>
> I compared the factor sultions given by the original variables and the
> factors: they lead to quite different results.
>

Yes, because without computing the mahalanobis-(orthogonal
components-) distances the variables with high correlation
cumulate their weight; in the extreme, if you would just
generate copies of some variables (they would be perfectly
correlated), you simply multiplied their weight in the
computation of the distance-values.

Gottfried Helms

Richard Wright

unread,

Jul 8, 2005, 4:06:03 PM7/8/05

to

As I understand it, using the scores from an ALL factor PC-extraction
will produce a cluster-analysis that is identical to using the
standard scores with their multicollinearity. So why use that approach
in a cluster analysis? Of course it is a different matter with
discriminant analysis, where multicollinearity can stuff up the
discriminant analysis and using scores of all PCs is a way around that
problem.

Gottfried Helms

unread,

Jul 9, 2005, 8:09:02 AM7/9/05

to

Am 08.07.05 22:06 schrieb Richard Wright:

>
>
> As I understand it, using the scores from an ALL factor PC-extraction
> will produce a cluster-analysis that is identical to using the
> standard scores with their multicollinearity. So why use that approach
> in a cluster analysis? Of course it is a different matter with
> discriminant analysis, where multicollinearity can stuff up the
> discriminant analysis and using scores of all PCs is a way around that
> problem.
>

Well, I just checked in BORTZ (Statistik für Sozialwissenschaftler),
p.552 (5'th ed )
He writes, that you use all factors for mahalanobis. I'll check
it later with experimental data.

He also mentions a procedure to compute factorscores
based on a PCA-with-varimax-rotation, using only relevant
factors, as being meaningful. But that's not mahalanobis
(following his explanations).

Regards -

Gottfried Helms

I.M. Boerefijn

unread,

Jul 9, 2005, 8:27:57 AM7/9/05

to

Now I'm starting to get a bit confused about how to cope with the
multicollinearity conducting cluster analysis. Richard: you say the ALL
factor PC-extraction is not a good solution?
Is it then perhaps the best answer to conduct a 'regular' Factor analysis.
And use the extraced factors (in my case it gives 7 factors; extracted from
the initial 23 variables) for the Cluster analysis? I read somewhere that
scientists don't agree about whether using 'regular' factor analysis is
appropriate or not when conducter cluster analysis...

I also have to conduct a discriminant analys, using the same variables. So
if I get it right, the method proposed by Gottfried would be for sure a GOOD
solution for that purpose?

Thanks for all the response so far,

Irene

"Richard Wright" <richwri...@tig.com.au> schreef in bericht
news:nnmtc1l80lkaguapj...@4ax.com...

I.M. Boerefijn

unread,

Jul 9, 2005, 8:34:30 AM7/9/05

to

Gottfried, I'm sorry: I wrote my last message without having seen your last
message today!

"He also mentions a procedure to compute factorscores
based on a PCA-with-varimax-rotation, using only relevant
factors, as being meaningful. But that's not mahalanobis
(following his explanations)."

So this is basically the same thing I mentioned a minute ago.

"I.M. Boerefijn" <i.m.bo...@home.nl> schreef in bericht
news:daofsj$cic$1...@news3.zwoll1.ov.home.nl...

Richard Wright

unread,

Jul 9, 2005, 4:52:47 PM7/9/05

to

Was Bortz perhaps writing in the context of discriminant analysis?

Richard Wright

unread,

Jul 9, 2005, 5:00:49 PM7/9/05

to

On Sat, 9 Jul 2005 14:27:57 +0200, "I.M. Boerefijn"
<i.m.bo...@home.nl> wrote:

>Now I'm starting to get a bit confused about how to cope with the
>multicollinearity conducting cluster analysis. Richard: you say the ALL
>factor PC-extraction is not a good solution?

No, I am saying that I think it is no solution, not that it is a bad
solution. This is because the results of using PCA scores from all PCs
will give you the same result as a cluster analysis done on the
original z-scores.

Since my first reply I have refreshed my mind on discriminant
analysis. Experimentally the same results are achieved by the two
methods - on the original values and on the scores of all PCs.

Of course in cluster analysis you may want to get rid of the 'noise'
in the minor PCs and therefore select only the scores from the first
few PCs. But this is not an issue of multicollinearity.

Richard Ulrich

unread,

Jul 9, 2005, 5:31:22 PM7/9/05

to

On Sun, 10 Jul 2005 07:00:49 +1000, Richard Wright
<richwri...@tig.com.au> wrote:

> On Sat, 9 Jul 2005 14:27:57 +0200, "I.M. Boerefijn"
> <i.m.bo...@home.nl> wrote:
>
> >Now I'm starting to get a bit confused about how to cope with the
> >multicollinearity conducting cluster analysis. Richard: you say the ALL
> >factor PC-extraction is not a good solution?
>
> No, I am saying that I think it is no solution, not that it is a bad
> solution. This is because the results of using PCA scores from all PCs
> will give you the same result as a cluster analysis done on the
> original z-scores.

I see several problems here. Is there one specific type of
clustering being considered? I don't believe they will
all act the same.

PCA scores allow identification of a "multivariate-only"-
outlier as an outlier; Euclidean distance will not. So, I don't
think the analyses *should* be the same. That's just the
clearest practical case. Changing the "distance" metric
surely ought to affect the eventual clustering, by theory.

[snip, rest]

--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

Richard Wright

unread,

Jul 9, 2005, 11:04:51 PM7/9/05

to

On Sat, 09 Jul 2005 17:31:22 -0400, Richard Ulrich
<Rich....@comcast.net> wrote:

<earlier stuff snipped>

>>No, I am saying that I think it is no solution, not that it is a bad
>>solution. This is because the results of using PCA scores from all PCs
>>will give you the same result as a cluster analysis done on the
>>original z-scores.

>I see several problems here. Is there one specific type of
>clustering being considered? I don't believe they will
>all act the same.

I was assuming good old-fashioned 'all other things being equal'.

Obviously different methods of cluster analysis won't all act the
same, but I was assuming that the OP would use the same method of
cluster analysis in both cases.

>
>PCA scores allow identification of a "multivariate-only"-
>outlier as an outlier; Euclidean distance will not. So, I don't
>think the analyses *should* be the same. That's just the
>clearest practical case. Changing the "distance" metric
>surely ought to affect the eventual clustering, by theory.

Again, I was assuming that the OP would use Euclidean distance in both
cases.

To return to the OP's original problem, I replied that there was no
sense in using the scores of all PCs to get rid of multicollinearity
before doing a cluster analysis. My point was that I thought the
scores of all PCs should give a clustering result that is the same as
a cluster analysis on the standard scores of the original data.

I have tested this supposition on experimental data.

The data is at

http://homepages.ihug.com.au/~richwrig/testing.xls

and the resulting dendrograms using both approaches at

http://homepages.ihug.com.au/~richwrig/StdScr&PCs.gif

I am a lurker on the SPSS group, and not an SPSS user. So I can't give
the SPSS pathways for obtaining the above. However to keep all other
things equal for comparing the results:

(1) The cluster analysis of the original data is done on standard
scores (z-scores).

(2) The PCA is done using the correlation matrix.

(3) The output of PCA scores is unstandardized.

(4) The PCA scores remain unstandardized in the subsequent cluster
analysis.

To sum up. The two dendrograms are identical. So I conclude that there
is no point in the OP using all PC scores to remove multicollinearity.

Gottfried Helms

unread,

Jul 10, 2005, 4:36:19 AM7/10/05

to

Am 10.07.05 05:04 schrieb Richard Wright:

>
> The data is at
>
> http://homepages.ihug.com.au/~richwrig/testing.xls
>
> and the resulting dendrograms using both approaches at
>
> http://homepages.ihug.com.au/~richwrig/StdScr&PCs.gif
>
> I am a lurker on the SPSS group, and not an SPSS user. So I can't give
> the SPSS pathways for obtaining the above. However to keep all other
> things equal for comparing the results:
>
> (1) The cluster analysis of the original data is done on standard
> scores (z-scores).
>
> (2) The PCA is done using the correlation matrix.
>
> (3) The output of PCA scores is unstandardized.
>
> (4) The PCA scores remain unstandardized in the subsequent cluster
> analysis.
>
> To sum up. The two dendrograms are identical. So I conclude that there
> is no point in the OP using all PC scores to remove multicollinearity.

Well, I did the same test with SPSS and came to different results.
I created 5 variables z1 to z5 with the following correlations (N=50):

Korrelation nach Pearson
z1 z2 z3 z4 z5
-------------------------------------------------------
z1 1 .945 .896 .250 -.203
z2 .945 1 .954 .223 -.173
z3 .896 .954 1 .249 -.234
z4 .250 .223 .249 1 .029
z5 -.203 -.173 -.234 .029 1
-------------------------------------------------------

Then I computed factorscores with PCA (PC1 to PC5) without
rotation, and factorscores with Varimax-rotation (VM1 to VM5).

Now I clustered for Z1 to Z5, storing the clusternumber (generating 10 clusters)
in CLU_Z, clustered for PC1 to PC5 storing CLU_PC, and clustered for
VM1 to VM5 storing the clusternumbers into CLU_VM.

The dendrograms are already slightly different; I could provide the
graphic if desired. To have things in a short form I just computed
the differences of clusternumbers of the three clustering-types.

clu_z = cluster-number for cases after clustering by original z-variables
clu_pc = cluster-number for cases after clustering by pc-scores (unrotated)
clu_vm = cluster-number for cases after clustering by pc-scores (varimax rotated)

Frequencies of cluster-number differences:

difference clu_z - clu_z - clu_pc -
in cl-nmbr clu_pc clu_vm clu_vm
freq freq freq
---------------------------------------------------------
-9.00 1 1 0
-7.00 1 1 0
-6.00 1 1 0
-5.00 1 1 0
-3.00 2 2 0
-2.00 1 1 0
-1.00 2 2 0
0.00 33 33 50
1.00 4 4 0
2.00 1 1 0
3.00 2 2 0
7.00 1 1 0

Only the clustering between PC and varimax-scores is identical.

Another aspect, which I came across just now is then, what the
meaning of the clustering by factor-scores is. Since clustering
is not scale-invariant (at least for seuclid-method), there is
an impact in that all orthogonal component-scores have the
same stddev - regardless whether they represent the principal
component or just residual-components. Thus using PCA-scores
gives the residual-components a weight, which seems not reasonable
in a first view, and possibly a more sensible way to use that
scores should possibly be to weight them by the eigenvalues
of each component. (The clusternumber is again different for
such a set of factorscores, as I just checked by simulation).
I've no opinion on all this by the moment...

Gottfried Helms

unread,

Jul 10, 2005, 5:04:14 AM7/10/05

to

Richard -

Am 09.07.05 22:52 schrieb Richard Wright:

> Was Bortz perhaps writing in the context of discriminant analysis?
>

No. It is in the chapter "cluster-analyse".
But I'm a bit surprised now, that he didn't write about the
conceptual impacts of that method; after dealing a bit with that
thing (see my other posting) I find it important to mention those
impacts.
The only comment is
"with mahalanobis distances one gets an euclidean measure
of distance, which is corrected (*1) for correlations between
the items"

This short comment seems far too sparse to me and possibly
even leads into a wrong direction, which I may had adapted
up to this current discussion.

Gottfried Helms

(1) "bereinigt" in german; perhaps a better translation "purgified" (?)

Richard Wright

unread,

Jul 10, 2005, 5:16:47 AM7/10/05

to

You are absolutely right about the weighting. That is why I used the
unstandardized scores output by StatistiXL. These are proportional to
the eigenvalues, as you suggest. In other words their variance
decreases as the eigenvalues decrease, so the PCs with smaller
eigenvalues contribute less information to the clustering. This is
surely a desirable property.

Then you have to be certain that the cluster analysis routine does not
surreptitiously destroy this desirable weighting by standardizing the
unstandardized PC scores! Does SPSS give you an option here?

See my conditions (3) and (4) above. If these conditions are not met
then the equivalance of the standardized data and unstandardized PC
scores is destroyed and dendrograms will not be the same.

Richard Wright

Richard Ulrich

unread,

Jul 10, 2005, 7:57:07 PM7/10/05

to

On Sun, 10 Jul 2005 19:16:47 +1000, Richard Wright
<richwri...@tig.com.au> wrote:

[snip, various comments]
RW > >>

> >> (1) The cluster analysis of the original data is done on standard
> >> scores (z-scores).
> >>
> >> (2) The PCA is done using the correlation matrix.
> >>
> >> (3) The output of PCA scores is unstandardized.
> >>
> >> (4) The PCA scores remain unstandardized in the subsequent cluster
> >> analysis.
> >>

[snip, counter-example by GH.]

>
> You are absolutely right about the weighting. That is why I used the
> unstandardized scores output by StatistiXL. These are proportional to
> the eigenvalues, as you suggest. In other words their variance
> decreases as the eigenvalues decrease, so the PCs with smaller
> eigenvalues contribute less information to the clustering. This is
> surely a desirable property.

If the lengths aren't standardized to the eigenvalues, it won't
be the Mahanolobis distance that is represented.

>
> Then you have to be certain that the cluster analysis routine does not
> surreptitiously destroy this desirable weighting by standardizing the
> unstandardized PC scores! Does SPSS give you an option here?
>
> See my conditions (3) and (4) above. If these conditions are not met
> then the equivalance of the standardized data and unstandardized PC
> scores is destroyed and dendrograms will not be the same.
>

I'm assuming that your example was too small, or
your correlations were not large enough to differentiate
the Euclidean distances from the M. distances.

Can you look at your distance (between cases) matrix?

Gottfried Helms

unread,

Jul 11, 2005, 1:31:18 AM7/11/05

to

Hi Richard Wright/Richard Ulrich -

Am 11.07.05 01:57 schrieb Richard Ulrich:

> On Sun, 10 Jul 2005 19:16:47 +1000, Richard Wright
> <richwri...@tig.com.au> wrote:
>
> [snip, various comments]
> RW > >>
>
>>>>(1) The cluster analysis of the original data is done on standard
>>>>scores (z-scores).
>>>>
>>>>(2) The PCA is done using the correlation matrix.
>>>>
>>>>(3) The output of PCA scores is unstandardized.
>>>>
>>>>(4) The PCA scores remain unstandardized in the subsequent cluster
>>>>analysis.
>>>>
>
> [snip, counter-example by GH.]

Well, that counter-example unfortunately has not exactly been counter...
I erred in using the eigenvalues as multiplicator for the standardized
component-scores while it should have been the sqrt of eigenvalues...
silly.
*With* the correct multiplication I confirm Richard Wrights
result; and I should have seen it earlier: with that un-standardization
the factor scores are just a rotation of the original variables data,
and that rotation does not affect the distance. The latter was seen
analoguously in the comparision of PC- and Varimax-rotated scores (which,
however, were scaled wrongly, but to the same amount).
With the so-corrected scaling of componentsscores I got identity between
the clustering of Z-variables-distances and mahalanobis-distances.

>
>>You are absolutely right about the weighting. That is why I used the
>>unstandardized scores output by StatistiXL. These are proportional to
>>the eigenvalues, as you suggest. In other words their variance
>>decreases as the eigenvalues decrease, so the PCs with smaller
>>eigenvalues contribute less information to the clustering. This is
>>surely a desirable property.
>
>
> If the lengths aren't standardized to the eigenvalues, it won't
> be the Mahanolobis distance that is represented.

That makes it a bit irritating, that in the resources I have/found this
aspect is not stated explicitely. For instance my cite of J.BORTZ
suggests in that view of things, that the author may have had an only
vague idea, why he had added a paragraph about mahalanobis-distances -
which, *if* they are computed correctly (unstandardized) are useless
to modify the results of the clustering.

>
>
>>Then you have to be certain that the cluster analysis routine does not
>>surreptitiously destroy this desirable weighting by standardizing the
>>unstandardized PC scores! Does SPSS give you an option here?
>>
>>See my conditions (3) and (4) above. If these conditions are not met
>>then the equivalance of the standardized data and unstandardized PC
>>scores is destroyed and dendrograms will not be the same.
>>
>
>
> I'm assuming that your example was too small, or
> your correlations were not large enough to differentiate
> the Euclidean distances from the M. distances.
>
> Can you look at your distance (between cases) matrix?
>

In my simulation-data the distance matrices come out to be
identical now (the same with a set of 150 random-cases for the
same configuration).

Gottfried Helms

I.M. Boerefijn

unread,

Jul 11, 2005, 2:59:13 AM7/11/05

to

I see my question lead to quite a discussion here!

At this point, what do you all think the best solution to problem is?

Thanks!

"Gottfried Helms" <he...@uni-kassel.de> schreef in bericht

news:dat0c4$ad6$03$1...@news.t-online.com...

Gottfried Helms

unread,

Jul 11, 2005, 6:17:20 AM7/11/05

to

Am 11.07.05 08:59 schrieb I.M. Boerefijn:

> I see my question lead to quite a discussion here!
>
> At this point, what do you all think the best solution to problem is?
>
> Thanks!
>
> "Gottfried Helms" <he...@uni-kassel.de> schreef in bericht
> news:dat0c4$ad6$03$1...@news.t-online.com...
>

Irene -

after that information about correct computation of mahalanobis
distances I have to re-sort my thoughts before I'm able to give
advice. If that information is correct, from which follows,
that mahalanobis-distances do not improve original distances
it could only be meaningful to use the first relevant
component-scores, I think. But that's a different idea from
that simple "mahalanobis distances correct for correlations",
as I had it in mind for many years after reading over some
statistic handbooks. So I 've to pass the talk to other expertise
currently, I'm afraid...

(can't really) hope, (that) this helps ;-(

Regards-

Gottfried Helms

Richard Ulrich

unread,

Jul 11, 2005, 1:35:04 PM7/11/05

to

Oops! I'm sorry. I got it backwards.

You are correct in noting that the rotated structure, preserving
the original magnitudes, should give the same distances as the
raw scores

The key insight is that the M. distance *will* be different
from the Euclidean; if it is not, then something is wrong.

The PC scores are uncorrelated. For the M. distances,
*each* direction of divergence matters the same. Being 3
SD from the centroid counts the same, whether it is in the
direction of the major axis or any minor axis. -- Thus, the
PC scores should be standardized (same variance) so that
each counts the same, and that is what gets you the M.
distance. That is opposite what I said earlier.

That also seems to imply that you need the whole set of
component scores, rather than just using the 'important' ones.

[snip, detail]

Hope I've got it right this time.

Gottfried Helms

unread,

Jul 11, 2005, 3:04:03 PM7/11/05

to Rich....@comcast.net

Am 11.07.05 19:35 schrieb Richard Ulrich:

>
> Hope I've got it right this time.
>

poor Irene... ;-)

Two definitions now...

In BORTZ it is mentioned, that the empirical
variable-matrices are multiplied with the inverse
of its co*variance*-matrix as the base-definition
of Mahalanobis; that in fact indicates the use
of standardized scores. But I admit, I'd feel
relieved, if I found some final reference, which
I could apply immediately.

So there seems to remain a little job of research -

regards -

Gottfried Helms

Richard Wright

unread,

Jul 12, 2005, 5:38:13 AM7/12/05

to

My original point was that using the scores of ALL principal
components will not deal with your perceived problem with
multicollinearity and cluster analysis - I say 'perceived' because I
am not sure it is a problem.

Perhaps some of the more mathematically inclined can help here. My
hunch is that doing a cluster analysis on the scores from only the
major principal components will get rid of the variance associated
with multicollinearity.

On Mon, 11 Jul 2005 08:59:13 +0200, "I.M. Boerefijn"
<i.m.bo...@home.nl> wrote:

>I see my question lead to quite a discussion here!
>
>At this point, what do you all think the best solution to problem is?
>
>Thanks!

Art Kendall

unread,

Jul 12, 2005, 9:50:08 AM7/12/05

to

First, what are you trying to accomplish via clustering? do you need a
hierarchical tree? Do you just want it to approximate a number of
clusters and then see what stopping rule to apply.

What is the nature of your data? What is a case? Are the variables
items intended to form scales?

How important is interpretability?

Art
A...@DrKendall.org
Social Research Consultants
University Park, MD USA
(301) 864-5570

I.M. Boerefijn

unread,

Jul 12, 2005, 10:37:59 AM7/12/05

to

Hi Art,

here's some extra info:

The nature of the data: there are 23 importance statements meaured on a 7
point rating scale. People are given these statements while they have to
imagine that they would want to buy a floor for their living room. For
example:
"How important do you consider the aspect of price is?" 1= not important at
all - and so on till: - 7= very important
"How important do you consider the aspect of ease of cleaning is" 1= not
important at all - and so on till: - 7= very important

On this data I want to conduct a Cluster analysis; to identify the segments
among the respondents. Then for each of the segments, I look at their image
rating scores (another part of my research) and choose the target segment.

Hope I gave you the info you needed!

Irene

"Art Kendall" <Arthur....@verizon.net> schreef in bericht
news:42D3CAA7...@verizon.net...