Again, Ali, what do you mean by purity? Try to define it using pencil and paper first.
Jos
The clustering results look like this for 2 clusters:
A = [1; 1; 2; 2]
and the resulting clustering is
B = [1; 1; 2; 2]
Then the result is correct, so purity is 1.
If the result is like this:
B = [2; 1; 2; 1]
purity is 0.5 because only two points are right.
OK?
---------------------------------------------------------------------
Still don't understand, but if you have some value for the cluster
that you consider "right" then why don't you just calculate the RMS
difference between the actual cluster data points and the "right"
value?
> The clustering results look like this for 2 clusters:
>
> A = [1; 1; 2; 2]
>
> and the resulting clustering is
>
> B = [1; 1; 2; 2]
>
> Then the result is correct, so purity is 1.
>
> If the result is like this:
>
> B = [2; 1; 2; 1]
>
> purity is 0.5 because only two points are right.
It is not really clear to me what "purity" means in general. Following your example, this could be a solution:
P = sum(A == B) / numel(A)
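To see what this formula gives on the two examples from the question, here is a minimal sketch. The variable names B1, B2, P1 and P2 are made up for illustration, and the sketch assumes that the cluster numbers in A and B mean the same thing, which is not guaranteed for clustering output.

```matlab
% Label vectors taken from the two examples in the question
A  = [1; 1; 2; 2];      % labels considered "right"
B1 = [1; 1; 2; 2];      % clustering that agrees everywhere
B2 = [2; 1; 2; 1];      % clustering that agrees in positions 2 and 3 only

% Fraction of positions where the two label vectors agree
P1 = sum(A == B1) / numel(A);
P2 = sum(A == B2) / numel(A);

fprintf('P1 = %.2f, P2 = %.2f\n', P1, P2);   % prints P1 = 1.00, P2 = 0.50
```

Both values match the numbers given in the question, but note that renaming the clusters in B would change the result.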
Good luck, Jan
--------------------------------------------------------------------
Well, you might think you could sort them and compare. Will that
work? Of course it won't work if you think about it. It won't work
for situations where the number of elements is the same, much less for
situations where a and b have different numbers of elements. So then,
how about if you count every time an element occurs? OK, now you're on
to something. What does that? The hist() function (histc() in the demo
below) does that! It will even handle situations where the number of
elements is different!
Try this demo code:
clc;
close all;
clear all;
workspace;

% Case 1: identical label vectors
a = [1 1 2 2]
b = [1 1 2 2]
allBins = union(a,b)                      % every label that occurs in a or b
binEdges = min(allBins) : max(allBins)
countsa = histc(a, binEdges)              % how often each label occurs in a
countsb = histc(b, binEdges)              % how often each label occurs in b
mismatches = sum(abs(countsa-countsb))

% Case 2: same labels in a different order
a = [1 1 2 2]
b = [2 2 1 1]
allBins = union(a,b)
binEdges = min(allBins) : max(allBins)
countsa = histc(a, binEdges)
countsb = histc(b, binEdges)
mismatches = sum(abs(countsa-countsb))

% Case 3: different numbers of elements
a = [1 1 2 2]
b = [2 2 1 1 3]
allBins = union(a,b)
binEdges = min(allBins) : max(allBins)
countsa = histc(a, binEdges)
countsb = histc(b, binEdges)
mismatches = sum(abs(countsa-countsb))
Or alternatively, how about
purity = sum(a) - sum(b)
Your definition of purity is evolving, or getting revealed, so
slowly. If you can explain it, then you can code it. We don't want
to keep guessing.
I want to explain purity again.
For example, the clustering result for labeled data with 6 data points is
[1; 2; 3; 1; 2; 3]
but the exact clusters must be
[1; 1; 2; 2; 3; 3]
so purity should be 0.333, because only two points are right. Do you understand now? If you don't, please send me your private e-mail address, because I would like to ask more there.
It is very important for me.
Ali, if you have the Statistics Toolbox you can use crosstab to produce a
matrix showing the number of points with each exact/assigned combination.
Then you could compute the proportion of elements on the diagonal to get
your purity measure.
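As a rough illustration of the diagonal idea on Ali's six-point example, here is a minimal sketch that runs without the Statistics Toolbox. It uses accumarray as a stand-in for crosstab (my substitution, valid only when the labels are positive integers), and the names exact and assigned are made up for illustration.

```matlab
exact    = [1; 1; 2; 2; 3; 3];   % the clusters Ali calls "exact"
assigned = [1; 2; 3; 1; 2; 3];   % the clustering result

% Count table: M(r,c) = number of points with exact label r and assigned label c
k = max([exact; assigned]);
M = accumarray([exact assigned], 1, [k k]);

% Proportion of points on the diagonal (same label in both vectors)
purity = sum(diag(M)) / sum(M(:));

disp(M);
fprintf('purity = %.4f\n', purity);   % prints purity = 0.3333
```

This reproduces the 0.333 from the question, because only the first and last points carry the same label in both vectors.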
But in most cases of clustering, the cluster labels aren't meaningful. So in
your example, simply swapping the names of clusters 2 and 3 would improve
your purity. Suppose you want that instead. Here's an example where I
compute the crosstab for Fisher's iris data using the cluster numbers given
by the cluster function.
load fisheriris
a = grp2idx(species);
d = pdist(meas);
z = linkage(d,'average');
b = cluster(z,3);
M = crosstab(a,b)
M =
0 0 50
0 50 0
36 14 0
But the clusters aren't lined up. Thanks to Yi Cao who wrote it, and Bill
Mueller who pointed it out to me, we can use the munkres function on the
MATLAB File Exchange to permute the rows:
p = munkres(-M);
[i,j] = find(p);
M = M(i,:)
M =
36 14 0
0 50 0
0 0 50
Now the proportion of correctly clustered points is quite high:
purity = sum(diag(M)) / sum(sum(M))
purity =
0.9067
-- Tom
I don't know how you get 0.333 in your example. The cluster purity
should be 0.5 if you follow the definition of cluster purity that most
people use (for example, see
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html).
Tom's solution implicitly assumes that there is a one-to-one correspondence
between the known class labels and the clustering result. If we don't make
such an assumption, e.g., two clusters may be assigned to the same class,
then there is a simple way to compute the purity based on the confusion
matrix.
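As a quick check of the 0.5 figure on Ali's six-point example, here is a minimal sketch that does not need the Statistics Toolbox. accumarray stands in for crosstab (my substitution, valid for positive integer labels), and the names exact and assigned are made up for illustration.

```matlab
exact    = [1; 1; 2; 2; 3; 3];   % known class labels
assigned = [1; 2; 3; 1; 2; 3];   % cluster labels from the clustering result

% Confusion matrix: rows are classes, columns are clusters
k = max([exact; assigned]);
M = accumarray([exact assigned], 1, [k k]);

% Each cluster is credited with the size of its largest class
majorityCounts = max(M, [], 1);
purity = sum(majorityCounts) / sum(M(:));

fprintf('purity = %.4f\n', purity);   % prints purity = 0.5000
```

Every cluster here holds two points from two different classes, so each one contributes one point and the total is 3 out of 6.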
If you have the Statistics Toolbox, let's still use Tom's example:
load fisheriris
a = grp2idx(species);
d = pdist(meas);
z = linkage(d,'average');
b = cluster(z,3);
M = crosstab(a,b) % you can also use "confusionmat"
nc = sum(M,1);
mc = max(M,[],1);
purity = sum(mc(nc>0))/sum(nc)
purity =
0.9067
I get the same purity value as Tom's solution for this example. The two
methods may disagree if more than one cluster has the same majority class,
for example, for a clustering solution that gives the following confusion
matrix:
0 50 0
0 48 2
14 0 36
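To make the disagreement concrete, here is a minimal sketch that computes both measures for this matrix. It assumes rows are classes and columns are clusters, and it replaces munkres with a brute-force search over all permutations (my substitution, only practical for a handful of clusters). The variable names are made up for illustration.

```matlab
M = [ 0 50  0
      0 48  2
     14  0 36];
n = sum(M(:));

% Majority-class purity: each cluster counts its largest class
purityMajority = sum(max(M, [], 1)) / n;

% One-to-one purity: try every assignment of classes to clusters
P = perms(1:size(M,1));
best = 0;
for r = 1:size(P,1)
    % Points matched when class P(r,c) is assigned to cluster c
    matched = sum(M(sub2ind(size(M), P(r,:), 1:size(M,2))));
    best = max(best, matched);
end
purityOneToOne = best / n;

fprintf('majority = %.4f, one-to-one = %.4f\n', purityMajority, purityOneToOne);
% prints majority = 0.6667, one-to-one = 0.5733
```

Clusters 1 and 3 both have class 3 as their majority, so the majority-class version credits 14 + 50 + 36 = 100 points, while the best one-to-one matching can credit only 50 + 36 = 86.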
Hope this helps.
-Ting
"ali " <rebel...@gmail.com> wrote in message
news:h9t6ui$ipb$1...@fred.mathworks.com...
a=[1;1;2;2;3;3]
b=[1;2;3;1;2;3]
For your answer, the entropy must be 0.5 there, because you found purity 0.5.