
Purity for cluster in matlab


ali

Sep 19, 2009, 5:59:05 AM
Hi again,
Does anybody have MATLAB code to compute cluster purity, please?
I don't know how many messages I have written about this.

Jos

Sep 19, 2009, 12:31:03 PM
"ali " <rebel...@gmail.com> wrote in message <h92a19$dhg$1...@fred.mathworks.com>...

Again, Ali, what do you mean by purity? Try to define it using pencil and paper first.

Jos

ali

Sep 24, 2009, 7:24:03 AM
I mean cluster purity.
For example, say the clusters look like this for 2 clusters:

A = [1; 1; 2; 2]

and the resulting clustering is

B = [1; 1; 2; 2]

Then the result is correct, so the purity is 1.

If the result is like this instead

B = [2; 1; 2; 1]

then the purity is 0.5, because only two points are right.

OK?

ImageAnalyst

Sep 24, 2009, 8:46:48 AM

---------------------------------------------------------------------
Still don't understand, but if you have some value for the cluster
that you consider "right" then why don't you just calculate the RMS
difference between the actual cluster data points and the "right"
value?

Jan Simon

Sep 25, 2009, 4:09:02 AM
Dear ali!

> For example, say the clusters look like this for 2 clusters:
>
> A = [1; 1; 2; 2]
>
> and the resulting clustering is
>
> B = [1; 1; 2; 2]
>
> Then the result is correct, so the purity is 1.
>
> If the result is like this instead
>
> B = [2; 1; 2; 1]
>
> then the purity is 0.5, because only two points are right.

It is not really clear to me what "purity" means in general. Following your example, this could be a solution:
P = sum(A == B) / numel(A)
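
Checked against the two examples quoted above, this formula gives the values ali expects. A minimal sketch (the names B1 and B2 are only for illustration):

```matlab
A  = [1; 1; 2; 2];   % reference labels
B1 = [1; 1; 2; 2];   % result identical to A
B2 = [2; 1; 2; 1];   % result that agrees with A in two positions

P1 = sum(A == B1) / numel(A)   % 1
P2 = sum(A == B2) / numel(A)   % 0.5
```

Note that this compares labels position by position, so it assumes that cluster k in A and cluster k in B are meant to be the same cluster.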

Good luck, Jan

ali

Sep 27, 2009, 8:16:01 AM
Thanks a lot.
It works correctly in a case like this:
a = [1; 1; 2; 2], b = [1; 1; 2; 2]
sum(a==b) is 4.
But if
a = [1; 1; 2; 2], b = [2; 2; 1; 1]
sum(a==b) is 0.
This result should also be 4, because the clustering is the same; only the label numbers are swapped.
How can I solve this?

ImageAnalyst

Sep 27, 2009, 11:20:00 AM

--------------------------------------------------------------------
Well, you might think you could sort them and compare - will that
work? Of course it won't work if you think about it. It won't work
for situations where the number of elements is the same, much less for
situations where a and b have different numbers of elements. So then,
how about if you count every time an element occurs? OK, now you're on
to something. What does that? The hist() function does that! It will
even handle situations where the number of elements is different!
Try this demo code:
clc;
close all;
clear all;
workspace;

% Case 1: identical labels, so the counts per label match (mismatches = 0).
a = [1 1 2 2]
b = [1 1 2 2]
allBins = union(a,b)
binEdges = min(allBins) : max(allBins)
countsa = histc(a, binEdges)
countsb = histc(b, binEdges)
mismatches = sum(abs(countsa-countsb))

% Case 2: labels swapped; the counts per label still match (mismatches = 0).
a = [1 1 2 2]
b = [2 2 1 1]
allBins = union(a,b)
binEdges = min(allBins) : max(allBins)
countsa = histc(a, binEdges)
countsb = histc(b, binEdges)
mismatches = sum(abs(countsa-countsb))

% Case 3: b has one extra element with a new label (mismatches = 1).
a = [1 1 2 2]
b = [2 2 1 1 3]
allBins = union(a,b)
binEdges = min(allBins) : max(allBins)
countsa = histc(a, binEdges)
countsb = histc(b, binEdges)
mismatches = sum(abs(countsa-countsb))

Or alternatively, how about
purity = sum(a) - sum(b)

Your definition of purity is evolving, or getting revealed, so
slowly. If you can explain it, then you can code it. We don't want
to keep guessing.

ali

Sep 29, 2009, 10:52:02 AM
Thanks so much, but some of the code doesn't work correctly.

I want to explain PURITY AGAIN.

For example, the cluster result for labeled data with 6 attributes is

[1; 2; 3; 1; 2; 3]

but the exact clusters must be

[1; 1; 2; 2; 3; 3]

so purity should be 0.333, because only two points are right. Do you understand now? If you don't, please send me your private e-mail address and I will ask more there.
It is so important for me.

Tom Lane

Sep 30, 2009, 1:09:02 PM
> For example, the cluster result for labeled data with 6 attributes is
> [1; 2; 3; 1; 2; 3]
> but the exact clusters must be
> [1; 1; 2; 2; 3; 3]
> so purity should be 0.333

Ali, if you have the Statistics Toolbox you can use crosstab to produce a
matrix showing the number of points with each exact/assigned combination.
Then you could compute the proportion of elements on the diagonal to get
your purity measure.
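
For the six-label example quoted above, the diagonal proportion can be sketched as follows. accumarray is used here only as a toolbox-free stand-in for crosstab, on the assumption that the labels are positive integers; the variable names are illustrative.

```matlab
exact    = [1; 1; 2; 2; 3; 3];
assigned = [1; 2; 3; 1; 2; 3];

% M(i,j) = number of points with exact label i and assigned label j
M = accumarray([exact assigned], 1);

purity = sum(diag(M)) / sum(M(:))   % 2/6 = 0.3333
```

This reproduces the 0.333 from the quoted example, because only the first and last points carry the same label in both vectors.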

But in most cases of clustering, the cluster labels aren't meaningful. So in
your example, simply swapping the names of clusters 2 and 3 would improve
your purity. Suppose you want that instead. Here's an example where I
compute the crosstab for Fisher's iris data using the cluster numbers given
by the cluster function.

load fisheriris
a = grp2idx(species);
d = pdist(meas);
z = linkage(d,'average');
b = cluster(z,3);
M = crosstab(a,b)
M =
0 0 50
0 50 0
36 14 0

But the clusters aren't lined up. Thanks to Yi Cao who wrote it, and Bill
Mueller who pointed it out to me, we can use the munkres function on the
MATLAB File Exchange to permute the rows:

p = munkres(-M);
[i,j] = find(p);
M = M(i,:)
M =
36 14 0
0 50 0
0 0 50

Now the proportion of correctly clustered points is quite high:

purity = sum(diag(M)) / sum(sum(M))
purity =
0.9067

-- Tom


Ting Su

Sep 30, 2009, 5:56:42 PM
Ali,

I don't know how you got 0.333 in your example. The cluster purity
should be 0.5 if you follow the definition of cluster purity that most
people use (for example, see
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html).

Tom's solution implicitly assumes that there is a one-to-one correspondence
between the known class labels and the clustering result. If we don't make
such an assumption, e.g., two clusters may be assigned to the same class,
then there is a simple way to compute the purity based on the confusion
matrix.

If you have the Statistics Toolbox, let's still use Tom's example:

load fisheriris
a = grp2idx(species);
d = pdist(meas);
z = linkage(d,'average');
b = cluster(z,3);
M = crosstab(a,b) % you can also use "confusionmat"
nc = sum(M,1);
mc = max(M,[],1);
purity = sum(mc(nc>0))/sum(nc)
purity =
0.9067

I get the same purity value as Tom's solution for this example. These two
methods may disagree if more than one cluster has the same majority class.
For example, consider a clustering solution that gives the following
confusion matrix:

0 50 0
0 48 2
14 0 36
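
To see the disagreement concretely, both measures can be computed for this matrix. This is only a sketch: it brute-forces the one-to-one matching with perms instead of calling munkres, which is practical only for a handful of clusters, and the variable names are illustrative.

```matlab
M = [ 0 50  0
      0 48  2
     14  0 36];
n = sum(M(:));

% Majority-class purity: each cluster (column) counts its largest class.
purityMajority = sum(max(M, [], 1)) / n      % (14 + 50 + 36)/150 = 0.6667

% One-to-one purity: try every assignment of classes (rows) to clusters (columns).
P = perms(1:size(M, 1));
best = 0;
for k = 1:size(P, 1)
    idx  = sub2ind(size(M), P(k, :), 1:size(M, 2));
    best = max(best, sum(M(idx)));
end
purityOneToOne = best / n                    % (0 + 50 + 36)/150 = 0.5733
```

Here clusters 1 and 3 both have class 3 as their majority class, so the one-to-one matching has to give one of them a different class, which lowers the score.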

Hope this helps.

-Ting

"ali " <rebel...@gmail.com> wrote in message

news:h9t6ui$ipb$1...@fred.mathworks.com...

ali

Oct 2, 2009, 3:21:04 AM
Thanks so much, that works for me.
I just want to ask something more.
I looked for entropy code but I can't find any on the net.
From what I found, it should be the opposite of purity and must be between 0 and 1, but
it doesn't work. Do you have entropy code for clusters?
Thanks for your attention.

ali

Oct 2, 2009, 3:54:02 AM
Entropy like this, for example.
How can I compute the cluster entropy for this?

a=[1;1;2;2;3;3]
b=[1;2;3;1;2;3]

By your answer, the entropy here must be 0.5, because you found a purity of 0.5.
