Kmeans with Bhattacharyya distance

Marco Luppi

Mar 31, 2008, 6:33:01 AM

I need help with k-means clustering: I have to use the
Bhattacharyya distance instead of Euclidean, cosine, city
block, or Hamming, which are the distances commonly used
in MATLAB.

But I don't understand how I can modify the file
kmeans.m to do this, so that I can choose Bhattacharyya as the
distance parameter: kmeans(matrix, cluster, 'dist', 'bhatt');

Thanks

Peter Perkins

Mar 31, 2008, 11:43:15 AM

Marco Luppi wrote:
> I need help with k-means clustering: I have to use the
> Bhattacharyya distance instead of Euclidean, cosine, city
> block, or Hamming, which are the distances commonly used
> in MATLAB.

There's a reason for that: K-means clustering requires not only a
distance metric, but also a way to compute the centroid of a cluster.
That is, the criterion that is minimized in K-means is the sum of
point-to-centroid distances, summed over all clusters. Thus, it is
natural to want the centroid to be the point that minimizes the sum of
point-to-centroid distances within a cluster. The arithmetic mean does
that for (squared) Euclidean distance, but there are only a few distances
for which the centroid is easily computable. Even for something as simple
as (unsquared) Euclidean distance, it is _not_ easily computable.
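
A minimal sketch of the centroid point, assuming made-up 2-D data and using base MATLAB's fminsearch as a generic numerical minimizer (the variable names are illustrative only). It compares the arithmetic mean with the numerical minimizers of the squared and unsquared criteria:

```matlab
% Sketch: the arithmetic mean minimizes the sum of SQUARED Euclidean
% distances, but generally not the sum of unsquared ones.
X = [0 0; 1 0; 0 1; 10 10];          % made-up points, one per row
n = size(X, 1);
m = mean(X, 1);                      % candidate centroid: arithmetic mean

% Sum of squared / unsquared point-to-centroid distances for a centroid c
sumSq  = @(c) sum(sum((X - repmat(c, n, 1)).^2, 2));
sumAbs = @(c) sum(sqrt(sum((X - repmat(c, n, 1)).^2, 2)));

% Minimize both criteria numerically, starting from the mean
cSq  = fminsearch(sumSq,  m);        % should stay (almost) at the mean
cAbs = fminsearch(sumAbs, m);        % should move away from the mean

disp(m); disp(cSq); disp(cAbs);
```

For the unsquared criterion there is no closed-form centroid in general, so an iterative search like this would have to run inside every K-means update step.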

> But I don't understand how I can modify the file
> kmeans.m to do this, so that I can choose Bhattacharyya as the
> distance parameter: kmeans(matrix, cluster, 'dist', 'bhatt');

Wikipedia has this to say:

"The Bhattacharyya coefficient is a divergence-type measure; it can be
seen as the scalar product of the two vectors (one for p and one for q)
having as components the square root of the probability of the points x
\in X. It thereby lends itself to a geometric interpretation: the
Bhattacharyya coefficient is the cosine of the angle enclosed between
these two vectors."

That would seem to imply that you want to use the cosine distance on the
sqrt of your data, I think.
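
A rough sketch of that suggestion, assuming each row of the data is a discrete probability distribution (nonnegative, summing to 1) and that the Statistics Toolbox kmeans is available. The matrix P and the value of k are made up for illustration. Note that this clusters on 1 minus the Bhattacharyya coefficient, which is a monotone function of the Bhattacharyya distance -log(BC) but not the same criterion:

```matlab
% Sketch: k-means with cosine distance on the square roots of the data.
% P and k are hypothetical illustration values.
P = [0.70 0.20 0.10;
     0.65 0.25 0.10;
     0.10 0.20 0.70;
     0.05 0.25 0.70];
k = 2;

S = sqrt(P);                 % each row of S has unit Euclidean length

% Bhattacharyya coefficient between rows i and j is the dot product
% of the corresponding rows of S
BC = S * S';

% The cosine distance in kmeans is 1 - cos(angle), i.e. 1 - BC here
idx = kmeans(S, k, 'Distance', 'cosine', 'Replicates', 5);
disp(idx');
```

No change to kmeans.m is needed for this approach, since the transformation is applied to the data before the call.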

Or, use hierarchical clustering.
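
A sketch of the hierarchical route, assuming the Statistics Toolbox functions pdist, linkage, and cluster, with the same kind of made-up data (rows are discrete distributions). The bhatt handle is a hypothetical helper, not a built-in distance; it follows the custom-distance convention of pdist, taking one row and a block of rows and returning one distance per row of the block:

```matlab
% Sketch: hierarchical clustering with a custom Bhattacharyya distance.
% P is a hypothetical data matrix; each row sums to 1.
P = [0.70 0.20 0.10;
     0.65 0.25 0.10;
     0.10 0.20 0.70;
     0.05 0.25 0.70];

% Bhattacharyya distance -log(sum(sqrt(p.*q))) between the row XI and
% every row of XJ; min(1, ...) guards against round-off above 1
bhatt = @(XI, XJ) -log(min(1, sqrt(XJ) * sqrt(XI)'));

D   = pdist(P, bhatt);               % pairwise distances as a vector
Z   = linkage(D, 'average');         % average-linkage tree
idx = cluster(Z, 'maxclust', 2);     % cut the tree into 2 clusters
disp(idx');
```

Because hierarchical clustering only needs pairwise distances, no centroid has to be computed for the Bhattacharyya distance.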

Hope this helps.

- Peter Perkins
The MathWorks, Inc.
