results tables question


dulu...@gmail.com

Apr 11, 2007, 11:00:45 AM
to senseinduction
I was looking at the results table and had a few questions and perhaps
one open ended observation...

Table 1 is the unsupervised evaluation, which was evaluated on the
test corpus (4851 instances). Was the scorer2 program used (presumably
after converting our cluster numbers into the format used in the
key...), or was a different scorer program used? Was the conversion of
the cluster numbers we reported done based on the distribution of
senses found in the key (that is, would our most frequent cluster C0
for a word be converted to the most frequent sense tag in the gold
standard data, and so forth)?

Table 2 is the supervised evaluation. My understanding of this is that
the responses on the test data are mapped to the senses in the training
data, and then precision/recall and F-measure are computed on the test
corpus (same 4851 instances as in Table 1?). How was that mapping done,
and how different would it be from looking at the distribution of
senses in the gold standard key data and mapping the most frequent
sense in the key to the most frequent cluster? Was scorer2 used after
these mappings were carried out?

I guess I am trying to decide if I can replicate these mappings that
you did, or if I should just ask for the mapping that you did. :) I
guess I should just ask for the mapping, and then I can run the
scoring program (assuming it was scorer2) with the provided key. If
the scorer program was something else, could that be made available?

Finally, one observation - I'm kind of surprised that 1 cluster per
word is so different between Table 1 and Table 2; it was 78.9 for
Table 1 and 55.8 for Table 2...it would seem that 1 cluster per word
should degrade to most frequent sense or nearly so in the supervised
evaluation...unless I am misunderstanding something about the mappings
that were done for Table 2, and probably I am...I am not asking for an
in-depth explanation of this, but if there is a simple answer it might
help me finesse out the differences between Table 1 and Table 2.

So...I guess to summarize, could we get the mappings of our system
results that were used to generate the numbers you have in Table 1 and
Table 2, and could the scoring program be made available (if it is not
scorer2)?

Thanks!
Ted

dulu...@gmail.com

Apr 11, 2007, 11:44:10 AM
to senseinduction
PS I realized I was a bit sloppy in this note regarding the mapping
that I theorized. For Table 1, would the mapping correspond to finding
the distribution of senses in the gold standard test data and mapping
those to the discovered senses, where the most frequent sense in the
gold standard test data is associated with the most frequent
discovered cluster for a word, and so forth? It seems clear, at least,
that Table 1 is based on "just" the test data.

Then, in Table 2, I am wondering if one simply takes the distribution
of senses in the training data, and then maps those to the discovered
senses in a similar way....

If that's the case, then differences between the Table 1 and Table 2
results would simply be indicative of different distributions of
senses in the test and training data, but I'm not sure if that's the
whole story.

And just to further clarify another question, are both Table 1 and
Table 2 based on scoring the 4851 test instances? Then is the only
difference between the two numbers the way that the mapping was done,
or are different pieces of data being scored there?

Thanks!
Ted

Aitor Soroa Etxabe

Apr 11, 2007, 11:56:07 AM
to sensein...@googlegroups.com
Hello,

On 2007/04/11, tped...@d.umn.edu wrote:


>
> I was looking at the results table and had a few questions and perhaps
> one open ended observation...
>
> Table 1 is the unsupervised evaluation, which was evaluated on the test
> corpus (4851 instances). Was the scorer2 program used (presumably after
> converting our cluster numbers into the format used in the key...), or was
> a different scorer program used?

In the unsupervised evaluation we measure the Fscore of the clustering
solution relative to the gold standard, as defined in

@techreport{clusteringcomparison,
title = {Comparison of Agglomerative and Partitional Document Clustering Algorithms},
author = {Ying Zhao and George Karypis},
number = {02-014},
publisher = {University of Minnesota},
url = {http://www.cs.umn.edu/tech_reports_upload/tr2002/02-014.pdf},
year = {2002},
biburl = {http://www.bibsonomy.org/bibtex/2aa6fb7e58c2dd28a3d54d78acd4973e0/gromgull},
keywords = {clustering comparison }
}

and is accessible at
http://www.cs.umn.edu/tech_reports_upload/tr2002/02-014.pdf

So, instead of the scorer2 program, we implemented a script which performs
the evaluation.
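
In rough Python terms, the FScore computation looks like this (a
minimal sketch following the definition in the report above, not the
actual evaluation script): for each gold-standard class we take the
best F value over all clusters, and average these weighted by class
size.

from collections import Counter

def fscore(gold, clusters):
    # gold and clusters map instance ids to a sense / cluster label
    n = len(gold)
    class_size = Counter(gold.values())
    cluster_size = Counter(clusters.values())
    # overlap[(i, j)]: instances of class i that were placed in cluster j
    overlap = Counter((gold[i], clusters[i]) for i in gold)
    total = 0.0
    for cls, n_i in class_size.items():
        best = 0.0
        for clu, n_j in cluster_size.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i
            best = max(best, 2 * p * r / (p + r))
        total += (n_i / n) * best
    return total

# a perfect clustering scores 1.0, regardless of cluster names
print(fscore({"i1": "S1", "i2": "S1", "i3": "S2"},
             {"i1": "C7", "i2": "C7", "i3": "C3"}))  # 1.0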

> Table 2 is the supervised evaluation. My understanding of this is that the
> responses on the test data are mapped to the senses in the training data,
> and then precision/recall and F-measure is computed on the test corpus
> (same 4851 instances as in Table 1?).

For the supervised evaluation, the training corpus is used for building a
matrix which relates clusters to senses, simply counting the times an
occurrence with sense s_{j} has been assigned cluster c_{i}. We then apply
this matrix to the test corpus instances to get a new key file where the
clusters have been mapped into senses. Finally, the scorer2 program was used
to get the recall measure (the coverage is 100%, since all the instances that
weren't tagged by the system are assigned to a newly generated dummy
cluster). See

http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1148466885/publikoak/texgraphs06

for more details.
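
In rough Python terms, and only for the simple case where each test
instance gets a single cluster, the mapping step works like this (a
sketch for illustration, not the actual script, which also handles
weighted cluster assignments):

from collections import defaultdict

def build_counts(train_clusters, train_senses):
    # n[s][c]: training instances with gold sense s assigned to cluster c
    n = defaultdict(lambda: defaultdict(int))
    for inst, c in train_clusters.items():
        n[train_senses[inst]][c] += 1
    return n

def map_test_clusters(test_clusters, n):
    # rewrite each test instance's cluster as the sense s maximizing
    # P(c | s) = n[s][c] / sum_c' n[s][c'] (ties broken arbitrarily)
    totals = {s: sum(cs.values()) for s, cs in n.items()}
    mapped = {}
    for inst, c in test_clusters.items():
        scores = {s: n[s].get(c, 0) / totals[s] for s in n}
        if max(scores.values()) > 0:  # clusters unseen in training get no sense
            mapped[inst] = max(scores, key=scores.get)
    return mapped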

> I guess I am trying to decide if I can replicate these mappings that you
> did, or if I should just ask for the mapping that you did. :) I guess I
> should just ask for the mapping, and then I can run the scoring program
> (assuming it was scorer2) with the provided key. If the scorer program was
> something else, could that be made available?

We have used three scripts:

- one that performs the unsupervised evaluation (Fscore)
- one that creates the supervised test keyfile as explained above
- scorer2

We plan to make available all these scripts (as well as the baseline
keyfiles) in a few days.

> Finally, one observation - I'm kind of surprised that 1 cluster per word
> is so different between Table 1 and Table 2, it was 78.9 for Table 1 and
> 55.8 for table 2...it would seem that 1 cluster per word should degrade to
> most frequent sense or nearly so in the supervised evaluation...unless I
> am misundstanding something about the mappings that were done for Table 2
> and probably I am...I am not asking for an in depth explanation of this,
> but if there is a simple answer it might help me finese out the
> differences between Table 1 and Table 2.

Yes, you are absolutely right. We had a bug in the supervised evaluation
script. The fact that the "1 cluster per word" and MFS baselines yielded such
different results raised the alarm this morning. We have fixed the bug and
sent the new results to the participants. Sorry for the inconvenience.

> So...I guess to summarize, could we get the mappings that were done of our
> system results to generate the numbers you have in Table 1 and Table 2,
> and could the scoring program be made available (if it is not scorer2).

As mentioned, in a few days we will make all the scripts publicly available
on the task website.

--
best
aitor
Pgp id: 0x5D6070F2

e.ag...@ehu.es

Apr 11, 2007, 5:15:31 PM
to Aitor Soroa Etxabe, sensein...@googlegroups.com

Hi all,

I just want to add that the evaluation was introduced in the task
description. I quote from
http://nlp.cs.swarthmore.edu/semeval/tasks/task02/description.shtml below.

best

eneko

------------------

Evaluation
Organizers will return the evaluation in two varieties:

a. clustering-style evaluation. We interpret the gold standard (GS) as a
clustering solution: all examples tagged with a given sense in the GS
form a class. The examples returned by participants that share the
"sense" tag with maximum weight form the clusters. We compare the
participants' clusters on the test data with the classes in the
gold standard, and compute the F-score as usual (Agirre et al. 2006). In
case of ties (or multiple sense tags in the GS), a new sense will be formed.

b. mapping to the GS sense inventory: organizers use the training/test split
of the data (as defined in task 17) to map the participants' "senses"
into the official sense inventory. Using this mapping, the organizers
convert the participants' results into the official sense inventory, and
compute the usual precision and recall measures. See (Agirre et al.
2006) for more details.

The first evaluation variety gives better scores to induced senses that
are most similar to the GS senses (e.g. a similar number of senses). The
second evaluation variety allows for comparison with other kinds of
systems. It does not necessarily favor systems inducing senses similar
to the GS. We have used this framework to evaluate graph-based
sense-induction techniques in (Agirre et al. 2006).

We strongly encourage participants to discuss and propose alternative
evaluation strategies, with the condition that they make use of the
available lexical-sample data.


--
----------------http://ji.ehu.es/eneko----------------
Eneko Agirre NOTE NEW E-MAIL
Informatika Fakultatea mailto:e.ag...@ehu.es
649 p.k. - 20.080 Donostia fax: (+34) 943 015590
Basque Country tel: (+34) 943 015019


dulu...@gmail.com

Apr 19, 2007, 2:59:03 PM
to senseinduction
Greetings Aitor,

Me again. :)

I was just trying to construct a simple example to better understand
the supervised evaluation, and wanted to run it by you to make sure I
had gotten it right, or see where I went wrong. FYI, beyond simply
understanding it, my reasons for doing this are to:

1) compare it to unsupervised purity, as I feel some sort of
connection might exist, although I'm not sure about that, and

2) compare it to the SenseClusters evaluation methodology, which is
unsupervised and relies on the manipulation of a confusion matrix
somewhat like the one you construct in the training phase...

More on the above if I work out any connections, etc.

Anyway, my understanding is that the results of clustering on the
training data are stored in a matrix, essentially a confusion matrix.
Suppose the true senses as shown by the gold standard are S1, S2, and
S3, and that we discover 3 clusters, C1, C2, C3.

     C1   C2   C3
S1    0   10    5
S2   10    5    5
S3    5    5    5

Now, these values in the cells show the number of times an instance
that is really Sx is found in Cy. So the 10 in cell S1,C2 means that
there are 10 instances that are "truly" sense 1 that are assigned to
cluster 2. Note that in this example there are a total of 50 instances
in the training data.

Now, I think these counts are converted into probabilities....

     C1    C2    C3
S1   0     .66   .33
S2   .5    .25   .25
S3   .33   .33   .33

Reading the value in S1,C2 again, we see that if the "true" sense of
an instance is sense 1, there is a 66% chance it will be assigned
cluster 2.
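
Just to pin down the arithmetic, here is the kind of row normalization
I mean, as a little numpy sketch of my own:

import numpy as np

# rows = gold senses S1..S3, columns = clusters C1..C3, counts from training
counts = np.array([[ 0., 10., 5.],
                   [10.,  5., 5.],
                   [ 5.,  5., 5.]])

# divide each row by its total, giving P(cluster | sense)
probs = counts / counts.sum(axis=1, keepdims=True)
print(probs.round(2))
# [[0.   0.67 0.33]
#  [0.5  0.25 0.25]
#  [0.33 0.33 0.33]]
# (I wrote .66/.33 in the table above, truncating instead of rounding)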

This matrix is then applied to the test data. The above counts and
probabilities are based on the training data. Note that Sx are the
true gold standard senses, and Cy are the clusters predicted by our
algorithm. The fact that we have 3 senses and 3 clusters is just an
accident of the example; we could have more or fewer clusters than
senses and things would work the same.

Then, suppose an instance in the test data is assigned to cluster 2.
The answer for this instance would be "weighted" by the above matrix
and fed to scorer2.

I guess what remains unclear is how that weighting occurs, and what
would be fed to scorer2. For the sake of simplicity, let's assume that
the clustering algorithm output cluster 2 with 100% certainty....

Given the probability matrix above that comes from the training data,
which I hope I got right...how would that test instance I assigned to
cluster 2 be weighted for presentation to scorer2?

I hope that this question isn't too tedious to answer, and that I have
gotten at least part of the process correct. :)

Thanks,
Ted

dulu...@gmail.com

Apr 19, 2007, 6:38:04 PM
to senseinduction
A bit more on the supervised evaluation...

I did some experimenting with this case below, and maybe I have
answered my question about how the supervised evaluation "works".

     C1    C2    C3
S1   0     .66   .33
S2   .5    .25   .25
S3   .33   .33   .33

Given the above table of probabilities, computed from the training data,
if a test instance is assigned to C1 we call that S2, if it is
assigned to C2 we call it S1, and if it is assigned to C3 we call it S1
(I think the code just returns one sense in the case of a tie).

In other words, create_supervised_keyfile.pl essentially does a lookup
on this table to find the highest-probability sense (Sx) for a
given cluster assignment (Cy). So the mapping created above is...

C1 -> S2
C2 -> S1
C3 -> S1, S3 [maybe only assigned to S1...?]

Then, the key file created by create_supervised_keyfile.pl is compared
to the actual official key file with scorer2 to give the accuracy
score reported as the supervised result.
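
In other words, something like this toy lookup (my own sketch, not the
actual create_supervised_keyfile.pl code):

probs = {   # P(cluster | sense) from the training data above
    "S1": {"C1": 0.00, "C2": 0.66, "C3": 0.33},
    "S2": {"C1": 0.50, "C2": 0.25, "C3": 0.25},
    "S3": {"C1": 0.33, "C2": 0.33, "C3": 0.33},
}

def cluster_to_sense(cluster):
    # pick the sense whose row has the highest value in this cluster's column;
    # ties (here C3: S1 vs S3 at .33) are broken by whichever sense comes first
    return max(probs, key=lambda s: probs[s][cluster])

for c in ("C1", "C2", "C3"):
    print(c, "->", cluster_to_sense(c))   # C1 -> S2, C2 -> S1, C3 -> S1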

Did I get this right?

Thanks,
Ted

Aitor Soroa Etxabe

Apr 20, 2007, 4:17:54 AM
to sensein...@googlegroups.com
On 2007/04/19, tped...@d.umn.edu wrote:
>
> Greetings Aitor,
>
> Me again. :)

;-)

> [...]


> Anyway, my understanding is that the results of clustering on the
> training data are stored in a matrix, essentially a confusion matrix.
> Suppose the true senses as shown by the gold standard are S1, S2, and
> S3, and that we discover 3 clusters, C1, C2, C3.
>
> C1 C2 C3
> S1 0 10 5
> S2 10 5 5
> S3 5 5 5
>

> [...]


>
> Now, I think these counts are converted into probabilities....
>
> C1 C2 C3
> S1 0 .66 .33
> S2 .5 .25 .25
> S3 .33 .33 .33
>

Yes, your analysis is right. This is the way we create what we call the
"mapping matrix".

Now, suppose the system returned cluster C2 for an instance of the test
corpus. We interpret this assignment by means of what we call a "cluster
score vector", which in this case will be Csv = (0, 1, 0)^{T}. To obtain
the sense, we multiply the mapping matrix by the cluster score vector,
which gives a "sense score vector" Ssv:

Ssv = M*Csv

And then we choose the sense with maximum score. In case of ties, we take
one sense arbitrarily (but ties don't occur very frequently). In this case,
Ssv = (.66, .25, .33)^{T}, so we choose S1.

Note that this procedure allows assigning more than one cluster to an
instance (as in soft clustering). Suppose we assign the cluster score vector
Csv = (.9, .3, .6)^{T}, i.e., C1 has a weight of 0.9, C2 has 0.3 and C3 has
0.6. Multiplying it by the matrix, we obtain

Ssv = (0.396, 0.675, 0.594)

so sense S2 will be assigned.
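
In numpy terms, the two cases above are simply the following (a rough
sketch of the arithmetic, not the evaluation script itself):

import numpy as np

# mapping matrix M: rows = senses S1..S3, columns = clusters C1..C3
M = np.array([[0.00, 0.66, 0.33],
              [0.50, 0.25, 0.25],
              [0.33, 0.33, 0.33]])

# hard assignment to C2: one-hot cluster score vector
print(M @ np.array([0., 1., 0.]))      # [0.66 0.25 0.33] -> argmax is S1

# soft assignment with weights 0.9, 0.3, 0.6 for C1, C2, C3
ssv = M @ np.array([0.9, 0.3, 0.6])
print(ssv)                             # [0.396 0.675 0.594] -> argmax is S2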

I hope this explanation helps in understanding the supervised evaluation, but
if you have more questions feel free to ask on the list.

best,
aitor

Ted Pedersen

Apr 20, 2007, 3:28:09 PM
to sensein...@googlegroups.com
Thanks Aitor,

This is very helpful as well. I'm going to take the liberty of trying
to make some general comments about the measures of evaluation, just
to make sure I'm not wildly off the mark in my thinking as that will
impact what I write in our final system report.

The simplest observation is that to get a high f-score you need to
find the same number AND the same type of clusters (agreeing with the
gold standard data), whereas with the supervised measure you need to
find the same type of clusters, but not the same number (as in the
gold standard data).

In that sense the supervised measure is more lenient: if you have 100
instances that belong to gold standard sense 1 and you happen to put
those into three different clusters that are relatively pure (do not
contain instances from other senses), you will get penalized by the
f-score but not by the supervised score. In fact this might be where
there is a connection from the supervised measure to purity, in that
the supervised measure is oriented towards finding any number of
relatively pure clusters, where you do better the more pure your
clusters are regardless of how many you might create, whereas the
f-score wants you to get exactly the same number of clusters as senses
AND have them be relatively pure (in order to get a good score).

So, both measures expect you to make the same kind of distinction,
that is, keeping these gold standard sense 1 instances separate from
instances of other senses. It's just that the f-score wants you to find
the same number of clusters as there are senses in the gold standard
data, but the supervised score doesn't care how many clusters you find,
as long as you keep the gold standard senses in relatively pure
clusters.

I had mentioned wanting to make sure I understood the supervised
measure in order to compare to what is done for evaluation in
SenseClusters, and I had a few thoughts on that as well. The
evaluation in SenseClusters is more like the f-score than the
supervised measure I think, in that we penalize the score very heavily
if we find some number of clusters that differs from the gold standard
number. However, it is similar to the supervised measure in that we do
create a mapping that is based on the most probable sense/cluster
combination.

Here's an example...suppose there are 3 true senses as observed in the
gold standard data (S1, S2, S3), and 4 discovered clusters (C1, C2,
C3, C4). Suppose that the confusion matrix for 100 instances that are
clustered looks like this....

     C1   C2   C3   C4
S1   10   30    0    5   =  45
S2   20    0   10    0   =  30
S3    0    0    5   20   =  25
------------------------------
     30   30   15   25     100

Now, in SenseClusters' evaluation framework we try to find the
optimal mapping of clusters to senses, such that the overall agreement
is maximized. This turns out to be an instance of the Assignment
Problem from Operations Research, so we take advantage of the Munkres
Algorithm to "solve" it, which really just amounts to rearranging
the columns in the matrix above such that the diagonal sum is
maximized, and then we map clusters to senses based on that
arrangement.

So, for the above matrix we'd end up with ...

     C2   C1   C4   C3**
S1   30   10    5    0   =  45
S2    0   20    0   10   =  30
S3    0    0   20    5   =  25
------------------------------
     30   30   25   15     100

So this means that

C2 -> S1
C1 -> S2
C4 -> S3

...and C3 doesn't map to anything, and that is where the penalty is
assessed. Those instances found in C3 are counted as being wrong. :)

In this case precision and recall are the same, since the clustering
algorithm has not declined to place any instances in a cluster (that
happens sometimes, and is essentially an "I don't know" answer from
the clustering algorithm); we'll assume there are 100 instances in
total in the data that we clustered.

So precision and recall are both (30+20+20)/100 = 0.70

And these are combined in the F-Measure as 2*P*R/(P+R) = 0.70

(The 30+20+20 comes from the diagonal of the matrix above).
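
For anyone who wants to try this, the column rearrangement can be
reproduced with an off-the-shelf Hungarian/Munkres solver; this is just
a scipy sketch of the idea, not the actual cluto2label.pl/label.pl
code:

import numpy as np
from scipy.optimize import linear_sum_assignment

# rows = gold senses S1..S3, columns = clusters C1..C4 (the matrix above)
counts = np.array([[10, 30,  0,  5],
                   [20,  0, 10,  0],
                   [ 0,  0,  5, 20]])

# maximize the matched counts; each sense is paired with a distinct cluster,
# and any leftover cluster (here C3) maps to nothing and counts as wrong
rows, cols = linear_sum_assignment(counts, maximize=True)
for s, c in zip(rows, cols):
    print(f"C{c + 1} -> S{s + 1}")     # C2 -> S1, C1 -> S2, C4 -> S3

print(counts[rows, cols].sum() / counts.sum())   # 0.7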

Now...in the supervised methodology, you'd compute a table of
probabilities something like this...

     C1    C2    C3    C4
S1   .22   .66   0     .11
S2   .66   0     .33   0
S3   0     0     .2    .8

If we assume that the clustering algorithm only outputs one single
answer per instance, the mapping would go like this...

C1->S2
C2->S1
C3->S2
C4->S3

...which is the same as what SenseClusters finds, except for the
C3->S2 mapping, which is not allowed by SenseClusters but allowed in
the supervised framework.

Now the only problem here is that I need some test data to evaluate
the supervised measure with, but I used it all up getting that table
up there. :) But, I have perhaps gone far enough to make the point
that I think the SenseClusters measure is somewhere in between F-score
and the supervised method.

The SenseClusters evaluation is like the F-score in that it doesn't
require a training/test split, and it is also very harsh in punishing
you for getting a different number of clusters than senses. The
F-score does some weighting based on the size of the clusters discovered,
which SenseClusters doesn't do, and SenseClusters is only "interested"
in the values in the confusion matrix that fall on the diagonal;
everything else is ignored.

However, it is like the supervised measure in that it creates a
mapping of clusters to senses, and in fact the mapping we get seems to
generally correspond to the supervised mapping, except that we only
map for as many senses as we have, and anything else is considered
wrong.

I ran the SenseClusters evaluation on the train+test solution for our
system (27,132 instances) and got a score of 56.36, which in this case
can be viewed as an accuracy number since precision and recall are
equal. So this is just the percentage of instances that we got
"right", according to the gold standard. The F-score gave us 63.1,
which is slightly higher, and is mentioned here to make it clear that
they are different. And if we assign each word to a single cluster in
the train+test (i.e., most frequent sense), we get 78.55, just to give
you an idea of how far away our results are from that. :)

Of course we don't have a supervised score for the train+test data,
but I'm planning to run the SenseClusters evaluation on the test data
as well, and then compare that with the f-score and supervised results
just to see how they vary. I would guess that the SenseClusters
evaluation will be somewhat lower than f-score, which is somewhat
lower than supervised score...

The SenseClusters evaluation I am describing can be done with a few
programs available in SenseClusters
(http://senseclusters.sourceforge.net), namely cluto2label.pl,
label.pl, and report.pl, in that order. Our solution and key files as
used in this task need to be converted to the SenseClusters format to
be used with these programs, but I'll package up those scripts and
make those available with the three programs above as a tar file if
anyone is interested in running this evaluation on their data.

In general the scores reported by the SenseClusters evaluation will be
lower for the reasons mentioned above. So a lower number from the
SenseClusters evaluation should only be viewed relative to other
numbers that come from the SenseClusters evaluation, otherwise you
will be making a comparison that is too harsh, as both the f-score and
supervised methodology tend to report somewhat higher numbers.

Thanks for listening, comments of course welcome. :)
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

Aitor Soroa Etxabe

Apr 24, 2007, 8:09:02 AM
to sensein...@googlegroups.com
Hi Ted,

we needed to put some thought into replying to you this time :-)

We will try to summarize your thoughts about the evaluation. But first we
thank you for your comments and analysis. Evaluation of induction systems is
not a clear-cut issue, and the relation between our supervised and FSCORE
evaluations is a subtle one.


You wrote:

> Having a connection to purity, btw, is not all that surprising, since when
> we compare to a gold standard, in the end we'd like to find clusters that
> are relatively pure, so all the different evaluation methods that compare
> against gold standards are oriented towards purity to varying
> degrees. It's just a question of to what degree. The main distinction that
> I can see right now is that f-score and the Senseclusters evaluation I
> mentioned previously insist upon finding the same number of clusters as
> senses in the gold standard data (or you pay dearly) while purity and the
> supervised evaluation don't punish you if you find a different number of
> clusters from the gold standard classes, as long as your clusters are
> relatively pure.

Yes, we think you certainly have a point here. There is a connection to
purity, but we want to mention a couple of issues.

First, we think that the supervised evaluation also measures whether the
clustering solution is consistent across train and test. In fact, consider
the "1 instance 1 cluster" baseline, where each instance is a single
cluster. The purity of the clustering is 100%. However, you can't even
perform a supervised evaluation because, after splitting the train and test
corpus, the labels of the test corpus won't appear in the training part, so
the mapping matrix will be useless and scorer2 will report coverage and
recall of 0%. In this case, you don't have a consistent labeling across
train and test, so the purity and supervised evaluations differ.

Second, purity measures the sense distribution within a cluster, whereas the
supervised evaluation measures recall in the supervised WSD sense. Let's
take an example: given a cluster C1 that is mapped in the training part to
S1, the supervised evaluation will map the instances tagged with C1 into S1
in the test part. If those instances in the test happen to have S2 in the
gold standard, you get 100% purity (all instances of cluster C1 are of the
same class, S2), but scorer2 will give you a recall of 0%, because you have
simply failed to label the instances correctly. So, the supervised evaluation
doesn't measure only the "purity" of a cluster, but also that the labels are
the _same_. This stems from the same point as above, i.e. the clustering
needs to be consistent across train and test with regard to the
gold-standard classes.


Regarding the random baseline and the MFS, you wrote:

> And so...I think the random baseline degrades to what is almost mfs in the
> supervised case...
>
> Is that right?? I think it's possible because given a random assignment of
> clusters, you will have very balanced distributions of senses, but one of
> them will usually win out (ties are rare) and so the mapping essentially
> reduces to 1 cluster per word...there must be a few ties as it's not
> identical to mfs, but it's very close.

Right. The random baseline decides 1) how many clusters are induced and 2)
the weight of every induced cluster per instance. If the number of induced
clusters is low, the supervised evaluation of the random baseline will be
close to MFS. Consider the example you gave, where only 2 clusters were
induced (C0 and C1). Then, every instance will be assigned a cluster score
vector with weights for clusters C0 and C1, (w0, w1). Since the weights are
randomly produced, given a very large training corpus the weights will
mimic the MFS. Thus, when multiplying the score vector by the mapping
matrix, the MFS will dominate over the other senses.

However, we also have the issue of consistency across train and test. The
more clusters we randomly induce, the smaller the probability that a cluster
tag in the test corpus previously appeared in the training part. Therefore,
the supervised evaluation will get lower recall scores than the MFS as the
number of induced clusters rises. In the limit, when the number of clusters
is close to or larger than the number of instances, the solution is like the
"1 instance 1 cluster" baseline, with coverage and recall close to 0%.

All in all, we think the supervised evaluation also measures a balance
in the number of clusters, as it is correlated with:

- purity (so it prefers small clusters, i.e. a large number of clusters)

- consistency across train and test (so it penalizes small clusters,
i.e. prefers a small number of clusters)

If a system fails to address both issues, it will get supervised recall
lower than the MFS.

Kind regards,

Aitor and Eneko

Ted Pedersen

Apr 25, 2007, 12:56:20 AM
to sensein...@googlegroups.com
Greetings,

Thanks very much for your thoughtful response. These are indeed
complex issues, and to be honest I wish I wasn't the only one engaging
in the discussion from the participants' side (hint to any lurkers out
there :), but I guess I'll keep pressing on a bit. Just a few more
thoughts really....

I agree, but this doesn't seem to favor the supervised measure. Let us
suppose that 1 instance - 1 cluster (assigning each instance to its
own cluster) is actually the correct solution as shown in the gold
standard. This is perhaps an unusual or degenerate case, but nonetheless
a measure should deal with it appropriately, and I think the
supervised measure in this case could report 0 or be undefined, while
the f-score would be a perfect 1.0 (as would purity). In that case at
least, the supervised measure appears to just be wrong, while the
f-score and purity are correct.

It seems that in general the supervised measure will fail when there
are more classes (senses in gold standard data) than there are
instances in the test portion of the data, and that doesn't seem like
a terribly usual case really.

A more general point is that the consistency of test and train doesn't
seem to have much to do with the clustering solution, since the test
and train partitions are made after clustering the full set of data.
In the sense induction task framework, we cluster one set of data
(consisting of, let's say, 100 instances). Then after clustering we
divide it into test and training portions for purposes of evaluation
(perhaps into a training portion of 75 instances and a test portion of
25 instances).

If the test and train are inconsistent, it either means that we have
selected the test instances in a peculiar way (e.g., if there are only
4 classes in the gold standard data and we only select one of them in
the test portion), or that we don't have enough data to make the test
set representative of the training set (e.g., if there are 20 clusters
in our 100 instances, a test set of 25 is perhaps too small to
represent all of them reasonably).

In either case though, it seems like the consistency between test and
train really doesn't reflect upon the overall clustering solution but
rather on the technique used by the person selecting the training and
test sets from the instances that were clustered, or perhaps on the
characteristics of the test data relative to the number of
classes/clusters (see above).

> Second, purity measures the sense distribution in a cluster, whereas the
> supervised evaluation measures recall in the supervised WSD sense. Let's
> take an example: given a cluster C1 that is mapped in the training part to
> S1, the supervised evaluation will map the instances tagged with C1 into S1
> in the test part. If those instances in the test happen to have S2 in the
> gold-standard you get 100% purity (all instances of cluster C1 are of the
> same class, S2), but scorer2 will give you a recall of 0%, because you just
> missed to label the instances correctly. So, the supervised evaluation don't
> measure only the "purity" of a cluster, but also that the labels are the
> _same_. This fact stems from the same fact as above, i.e. the clustering
> needs to be consistent across train and test with regard to the
> gold-standard classes.

I guess this amplifies the previous point: the test and training split
is an artifact of the evaluation, not the clustering. In the
evaluation used in the sense induction task, the training and test
portions of the data do not even exist separately until after
clustering. We clustered all 27,132 instances without thinking about a
test/train split, and then that split was made after the clustering
was done. It might be that this split could be made in such a way
that the training and test portions "disagree", but what does that
really tell us about the overall clustering solution?

Suppose we have 100 instances that are clustered in C1, where 50 fall
into S1 and 50 fall into S2 (where Sx are the gold standard senses).
Let's suppose that those 50/S1 instances are in the test portion, and
the 50/S2 instances are in the training portion. The supervised
measure would score this as 0.00, whereas in fact it seems like the
overall clustering solution is right half the time (.50). So this
doesn't seem to be measuring the overall clustering solution; rather,
it is measuring the nature of that training/test split, and I don't
quite see what that tells us about the clustering solution.

> Regarding the random baseline and the MFS, you wrote:
>
> > And so...I think the random baseline degrades to what is almost mfs in the
> > supervised case...
> >
> > Is that right?? I think it's possible because given a random assignment of
> > clusters, you will have very balanced distributions of senses, but one of
> > them will usually win out (ties are rare) and so the mapping essentially
> > reduces to 1 cluster per word...there must be a few ties as it's not
> > identical to mfs, but it's very close.
>
> Right. The random baseline decides 1) how many clusters are induced and 2)
> the weight of every induced cluster per instance. If the number of induced
> clusters is low, the supervised evaluation of the random baseline will is
> close to MFS. Consider the example you gave, where only 2 clusters were
> induced (C0 and C1). Then, every instance will be assigned with a cluster
> score vector with weights to clusters C0 and C1, (w0, w1). Being randomly
> produced weights, if we had a very large training corpus, the weights will
> mimic the MFS. Thus, when multiplying the score vector with the mapping
> matrix, the MFS will dominate over the other senses.

When I did my random clusters, I assigned no weights. In the case of
random2, I simply assigned each instance to cluster 1 or cluster 2
with 100% certainty. That scheme resulted in a supervised value that
was actually slightly better than MFS, and would have placed fourth in
the supervised evaluation (ahead of MFS and 3 participating teams).

MFS got 78.7, while random2 got 78.9 in the supervised evaluation.

What do we conclude from this?

> However, we also have the issue of consistency over train and test. The more
> clusters we randomly induce, the smaller is the probability that a cluster
> tag in the test corpus previously appeared in the training part. Therefore,
> the supervised evaluation will get lower recall scores than the MFS as the
> number of induced clusters raises. In the limit, when the number of clusters
> is close or larger to the number of instances, the solution is like the "1
> instance 1 cluster" baseline, with coverage and recall close to 0%.

I am skeptical of this. :)

I did a random50 experiment, where I assigned each instance to 1 of 50
different clusters (with no weighting, all 100%). This was approximately 50
different clusters per word, btw, and the total number of clusters was
approximately 4600, and the supervised score was .756. (The average number
of clusters per word was 46, which is where I get 4600 from rather than
5000.)

> All in all, we think the supervised evaluation also measures a balance
> in the number of clusters, as it is correlated with:
>
> - purity (so it prefers small clusters, i.e. a large number of clusters)
>
> - consistency across train and test (so it penalizes small clusters,
> i.e. prefers a small number of clusters)
>
> If a system fails to address both issues, it will get supervised recall
> lower than the MFS.

I understand the rationale you are describing, but I don't know that I
see it reflected in the actual performance of the measure. Random2
(.789) beat the MFS (.787), and Random50 isn't that far away (.756). I
think we would all agree that MFS is in general a challenging baseline
for unsupervised systems, and yet the fact that random generation of
clusters can approach or exceed it according to the supervised measure
just worries me quite a bit. I do not think it means our systems are
only as good as random methods, but that is the unfortunate message
that could be conveyed by results like this.

So....as I mentioned before, I do think there is a way to do the
"supervised" evaluation in such a way that you don't have to do a
test/train split after clustering (as is done now). I haven't fully
composed that note yet; it's not that complicated or brilliant really,
but it does resolve the issue of degenerate cases like 1 instance - 1
cluster being correct, I think. I'll post that in the next day or
two. However, it probably does not help with the random results, and
that I think is the more serious concern.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
