k-means and data

MURAT AY

Jan 16, 2011, 9:46:04 AM

Hello all,

My problem is:

For example, in the first phase I run the k-means code and it gives me 3 clusters. When I then enter a new data set into the same system, how can I assign the new data to the previous clusters (the ones from the first run)? Any code or ideas?

Thanks,

Murat

Jan 16, 2011, 10:42:04 AM

"Murat" wrote in message <igv0bc$cjk$1...@fred.mathworks.com>...

Any idea?

Think blue, count two.

Jan 16, 2011, 11:46:55 AM

I don't know if I understand the question. My best guess is that after
finding the initial cluster centroids, you want to check the distance of
each new point to the cluster centroids, and label that point according
to which of the centroids it is closest to. If so, then that should be
relatively straightforward, and is perhaps even vectorizable (depending
upon the distance measure you choose.)

Murat

Jan 16, 2011, 12:22:05 PM

"Think blue, count two." <robe...@hushmail.com> wrote in message <3iFYo.60983$lL1....@newsfe21.iad>...

Yes, you're right!
But how can I calculate the distances for the new data set?
For example, the cluster centers from the first run (clusters 1, 2 and 3 are known) should stay constant, and the distances should be recalculated for the data coming into the new system, right?
And how can that be done?

Think blue, count two.

Jan 16, 2011, 12:36:08 PM

On 16/01/11 11:22 AM, Murat wrote:

> Yes, you're right! But, how can i calculate the distances according to
> new data set? For example, the first cluster centers (for cluster 1
> cluster 2 and cluster 3 are known) should be constant and distances
> should be recalculated for data coming to new system, right?
> And, how is it possible?

Well that's going to depend upon the distance measure you want to use.

Let M be a matrix with one observation per row (the features run across
each row). Let C1 be a row vector of centroid coordinates.

Here's a simple implementation for squared Euclidean distance (the square
root is not needed just to find the nearest centroid).

s1 = zeros(size(M,1),1);                 % one distance per row of M
for K = 1:size(M,1)
    s1(K) = sum( (M(K,:) - C1) .^ 2 );   % squared distance from row K to C1
end

Repeat for the other centroids, and then compare the s*(K) distances
(s1(K), s2(K), ...) to determine the closest centroid.

This code makes no attempt to be the most efficient code possible for
the situation: get your basic code working first and only worry about
optimizing it if the simple clear code turns out to be unacceptably slow.
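
As an illustration of "repeat for the other centroids", the loop can be wrapped in a second loop over the centroids. This is only a sketch: the matrix C (one centroid per row) and the sample values in M and C are made up for the example and are not taken from the thread.

```matlab
% Hypothetical new data: 4 points (rows) with 2 features (columns).
M = [0 0; 1 1; 9 9; 5 4];
% Hypothetical centroids from the first k-means run, one centroid per row.
C = [0 0; 5 5; 10 10];

nPoints    = size(M, 1);
nCentroids = size(C, 1);
S = zeros(nPoints, nCentroids);   % S(K,J) = squared distance, point K to centroid J

for J = 1:nCentroids
    for K = 1:nPoints
        S(K, J) = sum( (M(K,:) - C(J,:)) .^ 2 );
    end
end

% For each row, the column holding the smallest distance is the cluster label.
[minDist, label] = min(S, [], 2);
```

With these sample values, label comes out as [1; 1; 3; 2]. The centroids in C are never updated, so the clusters from the first phase stay fixed.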

Murat

Jan 16, 2011, 1:48:04 PM

I loaded the data like that:

M=dt(:,[1 6])'; % row data M=[2X85]
C1=result.cluster.v(1,:); % constant, from the first-phase k-means; X and Y as a row vector, as you said: C1=[1X2]

s1 = zeros(size(M,1),1);
for K = 1:size(M,1)
s1(K) = sum((M(K,:)-C1).^ 2 );
end

It doesn't work.
Maybe there is a problem in "s1 = zeros(size(M,1),1);"?

Think blue, count two.

Jan 16, 2011, 6:04:53 PM

Your M does not conform to the specification I indicated, that it be
an array of *rows* of features. Your M is *columns* of features.

Just remove the ' after the dt(:,[1 6]) to get an M that is appropriate
for the code.
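
One way to catch this kind of orientation problem early is to compare the number of columns in M with the length of the centroid before running the loop. The sketch below uses made-up stand-ins for dt and C1 (the real ones live in the poster's workspace), so the values are assumptions:

```matlab
% Hypothetical stand-in for the poster's data: 85 observations, 6 columns.
dt = rand(85, 6);
% Hypothetical stand-in for the first centroid: a 1-by-2 row vector.
C1 = [0.5 0.5];

M = dt(:, [1 6]);     % no transpose: 85-by-2, one observation per row

% Each row of M must have as many features as the centroid has coordinates.
assert(size(M, 2) == numel(C1), 'M must have one column per centroid coordinate');

s1 = zeros(size(M,1), 1);
for K = 1:size(M,1)
    s1(K) = sum( (M(K,:) - C1) .^ 2 );   % squared distance from row K to C1
end
```

With the transposed 2-by-85 M, M(K,:) is 1-by-85 and cannot be subtracted from the 1-by-2 C1, which is consistent with the failure reported.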

Peter Perkins

Jan 17, 2011, 12:00:26 PM

On 1/16/2011 12:22 PM, Murat wrote:
> Yes, you're right! But, how can i calculate the distances according to
> new data set? For example, the first cluster centers (for cluster 1
> cluster 2 and cluster 3 are known) should be constant and distances
> should be recalculated for data coming to new system, right?
> And, how is it possible?

If you have KMEANS, then you must have PDIST2. KMEANS returns the
cluster centroids, and PDIST2 computes distances between two sets of
points. Use the second output of MIN on the distance matrix, and you're
done.
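
A minimal sketch of that suggestion, assuming the Statistics Toolbox is installed. The data in X and Xnew are made up for illustration, and the 'Replicates' option is an addition here, used only to make a poor random start less likely:

```matlab
% Hypothetical first-phase data: three well-separated groups of 2-D points.
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10; 20 0; 20 1; 21 0];

% First phase: cluster the original data and keep the centroids C (3-by-2).
[idx, C] = kmeans(X, 3, 'Replicates', 5);

% Second phase: hypothetical new data, one observation per row.
Xnew = [0.5 0.5; 10.2 10.1; 19.8 0.3];

% D(i,j) is the Euclidean distance from new point i to centroid j.
D = pdist2(Xnew, C);

% The second output of MIN, taken along each row, is the nearest centroid.
[dmin, newIdx] = min(D, [], 2);
```

Here newIdx(i) is the row of C closest to Xnew(i,:). The cluster numbers assigned by kmeans are arbitrary, so the labels are only meaningful relative to the C returned by that same run.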