random baselines on supervised evaluation

Ted Pedersen

Apr 27, 2007, 5:05:06 PM
to senseinduction
Greetings all,

I just wanted to collect the random results I have been creating for
the supervised measure in the sense induction task into a single note.
Also, in some of my earlier notes I think I computed the f-score and
average-number-of-clusters values from the full 27,132 instances, so
here I use only the test portion of that data (4,851 instances).

The results are shown as randomX, where each instance of a word is
assigned a cluster numbered between 1 and X. random2 therefore means
that the instances of a word are assigned to either cluster 1 or
cluster 2. I have also included the MFS baseline.
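
For reference, here is a minimal sketch of how such a randomX baseline
could be generated (the instance identifiers and label format below are
made up for illustration, not the actual task key files):

import random

def random_x_key(instance_ids, x, seed=42):
    """Assign each instance of a word a cluster drawn uniformly from 1..x."""
    rng = random.Random(seed)
    return {iid: "cluster.%d" % rng.randint(1, x) for iid in instance_ids}

# Example: a random2 assignment for five hypothetical instances of one word.
ids = ["explain.v.%d" % i for i in range(1, 6)]
print(random_x_key(ids, 2))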

All of the results shown are on the test portion of the data (4,851
instances), where the average number of classes (senses in the gold
standard data) is 2.87.

avg-clusters is the average number of clusters in the test data across
the 100 words. You can get a good idea of the total number of clusters
in the test data by multiplying the avg-clusters value by 100.

            supervised  f-score  purity  entropy  avg-clusters
random2         78.9      59.7     80.0     43.5       1.98
MFS             78.7      78.9     79.8     45.4       1.00
random4         78.4      44.9     80.9     39.8       3.92
random3         78.3      50.0     80.6     41.5       2.95
random50        75.6      17.9     90.0     15.5      24.12
random100       73.7      14.7     93.2     10.2      30.72

The values are sorted in descending order of the supervised score. If
we superimpose the supervised results from the participating systems
onto this table, we see the following:

             supervised  f-score  purity  entropy  avg-clusters
RANK 1           81.6
RANK 2           80.6
RANK 3           79.5
random2          78.9      59.7     80.0     43.5       1.98
MFS/RANK 4       78.7      78.9     79.8     45.4       1.00
RANK 5           78.6
RANK 6           78.5
random4          78.4      44.9     80.9     39.8       3.92
random3          78.3      50.0     80.6     41.5       2.95
random50         75.6      17.9     90.0     15.5      24.12
random100        73.7      14.7     93.2     10.2      30.72

What conclusions do we draw from these numbers? I think the first and
most important one is that I don't really know what the supervised
score is telling me.

As I look at the f-score, I can see that the solution generated by
randomX is nothing like the gold standard solution.

As I look at purity, I can see that the distribution of senses in the
gold standard data is pretty skewed (high purity of MFS tells me that)
and that as the number of clusters increases, so does purity (which is
to be expected).

As I look at entropy, I see that it declines with the number of
clusters, which again is expected...
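
For concreteness, here is a small sketch of how purity and entropy can
be computed from the cluster/sense counts, assuming the usual
cluster-weighted definitions (this is only an illustration and may
differ in minor details from the task scorer):

import math
from collections import Counter

def purity_entropy(assignments):
    """assignments: list of (cluster_label, gold_sense) pairs, one per instance.
    Purity of a cluster is the fraction taken by its largest gold sense;
    entropy is normalised by the log of the number of gold senses; both
    are weighted by cluster size."""
    n = len(assignments)
    senses = {g for _, g in assignments}
    clusters = {}
    for c, g in assignments:
        clusters.setdefault(c, Counter())[g] += 1
    purity = entropy = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        purity += (size / n) * (max(counts.values()) / size)
        if len(senses) > 1:
            e = -sum((k / size) * math.log(k / size) for k in counts.values())
            entropy += (size / n) * (e / math.log(len(senses)))
    return purity, entropy

# Tiny example: two clusters over six instances with a skewed sense distribution.
pairs = [("c1", "s1"), ("c1", "s1"), ("c1", "s2"),
         ("c2", "s1"), ("c2", "s1"), ("c2", "s1")]
print(purity_entropy(pairs))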

As I look at the supervised score, I see that randomly assigning each
instance of a word to 1 of 2 clusters gets the same supervised score as
MFS (in fact a little better). The purity and entropy of random2 and
MFS are nearly the same as well; the only place where we really see a
difference is the f-score, and there the difference is huge.

             supervised  f-score  purity  entropy  avg-clusters
random2          78.9      59.7     80.0     43.5       1.98
MFS/RANK 4       78.7      78.9     79.8     45.4       1.00

So...if it weren't for the f-score, we might think that random2 and
MFS were about the same, but of course they aren't...

I guess I will leave that as a question, what exactly is the
supervised measure really measuring?

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

e.ag...@ehu.es

Apr 27, 2007, 7:15:26 PM
to Ted Pedersen, sensein...@googlegroups.com

hi all,

ted, thanks for all the data, and sorry to have been silent...

I have been travelling during the week, and we now have a long weekend
in the Basque Country. We'll come back on Wednesday with more material
for the discussion. We are doing some further experiments, and we hope
to be able to shed some light on the nature of the evaluation results
then.

One of the things we would like to do is run significance tests (I
suspect the difference between random2 and MFS is not significant).
Unfortunately we are both very busy with an important project deadline,
and might not be able to do so. Any help would be appreciated :-)

In the meantime, a few observations:

1) A small correction to a previous posting by Ted: 1 instance 1
cluster does not get a perfect FScore:

FScore: 0.095
Entropy: 0.000
Purity: 1.000

FScore is computed using precision and recall, and while precision is 1
(as is purity), recall is much lower (but still over 0). Note that this
FScore is consistent with the supervised result (0 recall).

2) random2. As Ted observed, random2 in the supervised evaluation is
only beaten by 3 systems, and is very close to MFS. Note that random2
also has better entropy than any of the participant systems, and is
very close to the entropy of MFS. We had already predicted that random2
would be very close to MFS.

3) random100 still gets a lot of information from the mapping (remember
that the mapping inherently feeds MFS information into any clustering
solution) by mere chance. If you increase the number of clusters (1000,
10000), the score will continue to decrease, as the chance of tagging
train and test instances with the same cluster diminishes.

4) Regarding the train/test split. We used the same train/test split as
defined by the task 17 organizers. The reason is to be able to compare
with other supervised/unsupervised systems that participate in the
lexical sample subtask of task 17. We expected that the split would be
random, but it seems that this is not the case. We are not happy with
this, but in any case, note that all our induction systems (as well as
the supervised ones) are affected by the different distributions in
train/test. In the following days we plan to do a random split and
check what influence it has on the ranking. Note that this way we lose
comparability with regard to supervised systems.

> I guess I will leave that as a question, what exactly is the
> supervised measure really measuring?

It measures what it intends to: the recall when the clustering is (1)
mapped into classes using a corpus tagged with both clusters and gold
senses, and (2) then used to re-tag another test corpus via the mapping.
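
To illustrate, here is a toy sketch of those two steps (simplified to a
majority mapping per cluster; the actual create_supervised_keyfile.pl
may weight the mapping differently):

from collections import Counter

def supervised_recall(train, test):
    """train, test: lists of (cluster_label, gold_sense) pairs for one word.
    Step 1: learn a cluster -> sense mapping from the training part
            (here, the sense each cluster co-occurs with most often).
    Step 2: re-tag the test part through the mapping and return recall.
    Test instances whose cluster was never seen in training stay untagged."""
    by_cluster = {}
    for cluster, sense in train:
        by_cluster.setdefault(cluster, Counter())[sense] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    correct = sum(1 for cluster, sense in test if mapping.get(cluster) == sense)
    return correct / len(test)

# Toy example: two induced clusters over a word with a skewed sense distribution.
train = [("c1", "s1")] * 8 + [("c1", "s2")] * 2 + [("c2", "s2")] * 3
test = [("c1", "s1"), ("c1", "s1"), ("c2", "s2"), ("c1", "s2")]
print(supervised_recall(train, test))  # 0.75 on this toy data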

We think it's a useful measure. For instance, it shows that even if the
participant systems' clustering solutions were not able to beat the
1-cluster solution in FScore, some of them learn useful information
which carries over into the supervised scores.

Have a nice weekend!

eneko and aitor


--
----------------http://ji.ehu.es/eneko----------------
Eneko Agirre NOTE NEW E-MAIL
Informatika Fakultatea mailto:e.ag...@ehu.es
649 p.k. - 20.080 Donostia fax: (+34) 943 015590
Basque Country tel: (+34) 943 015019


Ted Pedersen

Apr 28, 2007, 10:37:58 AM
to sensein...@googlegroups.com
Thanks Eneko, these are certainly interesting and important
observations. Time is short here as well, so I'll comment over the
course of a few notes....

On 4/27/07, e.ag...@ehu.es <e.ag...@ehu.es> wrote:
>
> In the meantime, a few of observations:
>
> 1) A small correction for a previous posting by ted: 1 instance 1
> cluster does not get perfect FSCORE:
>
> FScore: 0.095
> Entropy: 0.000
> Purity: 1.000
>
> FScore is computed usin precision and recall, and while precision is 1
> (as purity), recall is much lower (but still over 0). Note that this
> FSCORE is consistent with the supervised result (0 recall).
>

My observation about 1 instance 1 cluster was perhaps misunderstood.
My point was hypothetical; that is, let's assume that 1 instance 1
cluster was the correct assignment (as reflected in the gold standard
data). Then the f-score would be 1.00 and the supervised measure would
assign it 0 or be undefined. To be clear, 1 instance 1 cluster means
that each instance is assigned to its own cluster. The f-score of 0.095
you mention above is based on the gold standard data as given, which is
a different situation.

Suppose that every instance is assigned its own cluster, and that this
is the correct answer according to the gold standard solution. Suppose
that there are 10 instances, with 10 clusters created and 10 correct
classes....

     C1  C2  C3  C4  C5  C6  C7  C8  C9  C10
S1    1
S2        1
S3            1
S4                1
S5                    1
S6                        1
S7                            1
S8                                1
S9                                    1
S10                                      1

Let's suppose the above is the confusion matrix generated by the
clustering solution over the 10 instances (zeros are not shown to save
keystrokes) and let's also suppose that this is the correct
answer/assignment as reflected in the gold standard data. In this
case, the f-score is 1.00 because the cluster assignments perfectly
agree with the gold standard classes.
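
To spell out the arithmetic, here is a sketch of the usual
class-weighted clustering f-score (the best F over all clusters for
each gold class, weighted by class size), which I assume is essentially
what the task scorer computes; on the diagonal matrix above it comes
out to 1.0:

from collections import Counter

def clustering_fscore(assignments):
    """assignments: list of (cluster_label, gold_class) pairs.
    For each gold class, take the best F-measure over all clusters
    (precision = overlap / cluster size, recall = overlap / class size),
    then weight by class size."""
    n = len(assignments)
    cluster_sizes = Counter(c for c, _ in assignments)
    class_sizes = Counter(g for _, g in assignments)
    overlap = Counter(assignments)  # (cluster, class) -> count
    total = 0.0
    for cls, csize in class_sizes.items():
        best = 0.0
        for clu, size in cluster_sizes.items():
            o = overlap[(clu, cls)]
            if o:
                p, r = o / size, o / csize
                best = max(best, 2 * p * r / (p + r))
        total += (csize / n) * best
    return total

# The 10-instance, 10-cluster, 10-class example above:
diag = [("C%d" % i, "S%d" % i) for i in range(1, 11)]
print(clustering_fscore(diag))  # 1.0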

If you create a test/train split, the supervised method will get 0 or
fail, because the clusters (and classes) seen in the training data do
not occur in the test data, and vice versa.

So, in general, the point was that the supervised measure will fail if
the number of classes approaches or exceeds the number of instances in
the test portion, while measures that do not attempt to make a
train/test split do not have that limitation.

Now, this is clearly a point of at most theoretical interest for the
sense induction data, where in general the same classes occur in both
the test and train portions. However, since 1 cluster per instance is a
common baseline in clustering evaluation, it seems like this is a
significant limitation of the supervised measure.

Cordially,

Ted Pedersen

Apr 28, 2007, 2:11:30 PM
to sensein...@googlegroups.com, senseclus...@lists.sourceforge.net
Greetings again,

On 4/27/07, e.ag...@ehu.es <e.ag...@ehu.es> wrote:

> 4) Regarding the train/test split. We used the same train/test split as
> defined by task 17 organizers. The reason is to be able to compare with
> other supervised/unsupervised systems that participate in the lexical
> sample subtask of task 17. We expected that the split would be random,
> but it seems that it's not the case. We are not happy with this, but in
> any case, note that all our induction systems (as well as the supervised
> ones) are affected by the different distributions in train/test. In the
> following days we plan to do a random split, and check what influence it
> has in the ranking. Note that this way we loose comparability with
> regard to supervised systems.

This is, I think, my central concern about the supervised measure in
the sense induction task. It seems tempting to compare the supervised
sense induction measure to the scores of supervised systems as used in
the lexical sample task (#17), or perhaps to unsupervised measures that
also use MFS as a baseline, such as the unsupervised f-score or the
SenseClusters score.

However, I think this results in a flawed comparison.

While MFS produces the same value across the supervised sense
induction measure, the traditional supervised measure from task 17, the
unsupervised f-score, the SenseClusters score, and so on, other
important baselines or cases do not. For example, the scores generated
by random2 "move around" as you go from one evaluation measure to
another, as do the scores of the 1 cluster - 1 instance example I
mentioned previously.

This shows, at least in an intuitive way, that we are dealing with
measures that are rather different from each other and should not be
compared directly. Each of these measures may have interesting points
to make on its own, but I do not think they can be compared directly or
even indirectly.

For example, if you entered random2 in the English lexical sample task
(#17), it scores about .28. In the supervised sense induction task it
scores .78. Somehow this difference seems important. Now, it is not
enough to say that because random2 behaves differently these are
different measures that can't be compared...but it's sort of what got
me started on the thought process that follows below...

So, what is the problem with comparing the supervised sense induction
measure with the f-score or the SenseClusters score? I think the
mapping step in the supervised measure is clearly a supervised
learning step, since the mapping is based on knowledge of the correct
classes given in the training data, and then this knowledge is used to
alter the results of the clustering.

If you look at the results of create_supervised_keyfile.pl, the
distribution of clusters is often radically different from what the
clustering algorithm originally assigned. For example, when the
"clustering algorithm" is random2, each word has 2 clusters, and the
distribution of the clusters is relatively balanced. However, after
creating the mapping from the training data and building the new key
file, the distribution of the answers for the clustering is radically
different, and in fact for most words there is just 1 cluster that
occurs most of the time. This is why a random baseline as used in task
17 fares so poorly: there the original random values are presented to
the scoring algorithm, while in the sense induction task the random
values are adjusted prior to scoring based on information from the
training data, and they end up converging towards MFS.
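
A quick simulation makes that convergence visible (the sense
proportions below are made up for illustration, not taken from the task
data):

import random
from collections import Counter

rng = random.Random(0)

def simulate(n_train=800, n_test=200, sense_probs=(0.78, 0.22), n_clusters=2):
    """Random clustering plus majority mapping on a skewed two-sense 'word'."""
    senses = ["s%d" % i for i in range(len(sense_probs))]
    train = [(rng.randint(1, n_clusters), rng.choices(senses, weights=sense_probs)[0])
             for _ in range(n_train)]
    test = [(rng.randint(1, n_clusters), rng.choices(senses, weights=sense_probs)[0])
            for _ in range(n_test)]
    # Every random cluster co-occurs mostly with the most frequent sense,
    # so the mapped answers collapse onto MFS.
    by_cluster = {}
    for c, s in train:
        by_cluster.setdefault(c, Counter())[s] += 1
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in by_cluster.items()}
    acc = sum(mapping.get(c) == s for c, s in test) / n_test
    mfs = max(Counter(s for _, s in test).values()) / n_test
    return acc, mfs

print(simulate())  # mapped random2 accuracy vs. MFS on the same test sample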

In the supervised sense induction evaluation, in effect the clustering
algorithm is being augmented with a supervised learning step, and
while that is perhaps a reasonable thing to do, it does not seem
reasonable to compare the scores that come from such a process with
scores like the f-score or the SenseClusters score that do not have
the benefit of such a step.

What is the problem then with comparing the supervised sense induction
measure with the supervised results from the lexical sample task (#17)?
Here I think the problem is one of procedure, and it is perhaps more
subtle than the case of comparing to the f-score and SenseClusters
score. However, the nub of the problem is that the clustering was
carried out on the full 27,132 instances. This means that we saw the
test data as defined by the lexical sample task during
learning/clustering, and that makes what we did quite different.

The task 17 participants that used supervised learning were given the
training data (22,281 instances) and then applied whatever model they
learned to the test data after the fact (4,851 instances). I think that
to compare our results with the task 17 results, we would need to
cluster the 22,281 instances in the training data, and then from that
apply a learning algorithm (the mapping procedure done in
create_supervised_keyfile.pl is essentially a simple learning
algorithm) to the results on the training data to build a model that we
could then apply to the test data.

So I think the problem is that we did not hold the test data out of
the clustering process, which was in effect how we were generating
training data. Now, would the results have been radically different? I
don't know. Perhaps not. But, perhaps they would be, because we'd have
a much smaller number of instances per word, and so there is a
potential for quite a few differences in how we would cluster the
training data. But, if we had held the test data out of the clustering
process, there would then be no doubt that we could compare the
results on the test data in a kind of supervised learning experiment
where the training data is created by clustering, which seems to be
the goal of the supervised measure. I do think that is a good and
interesting goal, btw, since in the end we would like to replace
manual sense tagging with clustering.

However, I really don't feel we can make that comparison to task 17
here, because we didn't keep the test data reserved for evaluation; it
was part of the clustering/learning. This would be analogous to giving
the supervised systems all 27,132 instances to learn a model from, and
then applying that model to a subset of the data on which they learned.

I think more closely following the traditional procedure for
supervised learning when doing the supervised sense induction
evaluation would actually be quite a bit clearer, and would allow for
direct comparison to other supervised methods. It would also make it
all the clearer that such a measure should not be compared to the
f-score or SenseClusters measure, which can fall below MFS even for
rather good results. Supervised learning, as we know, generally has MFS
as a lower bound, and so we might view exceeding MFS with the
supervised sense induction evaluation in that light.

The other very significant problem with comparing to task 17 is that
it would be tempting to say something like "the best of our
unsupervised systems attained recall of 0.81, and the best of the
supervised systems attained recall of 0.89 (those are the actual
numbers, I think...), therefore this shows that unsupervised systems
are nearly as good as supervised." Now, I do not mean to suggest that
this is the intention of the organizers; I am only pointing out a
potential pitfall here, a possible error of interpretation by those who
view these results more casually than we do. I think what the
unsupervised sense induction scores really could show (had the test
data truly been withheld) is that if you use clustering to create
training data, then applying a supervised learning algorithm to such
data can give you results at the level of MFS or a bit above, and
that's a very different kind of claim, one that perhaps makes the
nature of the evaluation a bit clearer. Unfortunately, I think the
procedural issue I mentioned prevents us from making such a claim,
which is a pity, because it is a rather interesting one.

So...if the motivation behind the supervised sense induction measure
is to make comparisons to supervised learning systems, I really do
think we need to "handle" the data in the same way the supervised
systems do: do our clustering/tagging on the training data, and then
build models with that clustered training data that we then apply to
the test data.

Also, I do think it's important to recognize that the supervised sense
induction measure really does include a learning step that alters the
answers of the clustering algorithm, and so it really shouldn't be used
as a basis for comparison with unsupervised measures, like the f-score
or the SenseClusters score, that don't adjust the answers of the
clustering algorithm.

Cordially,

Eneko Agirre

May 4, 2007, 9:01:57 AM
to sensein...@googlegroups.com

hi again,

at last we have had the time to run some further tests here. Sorry for
the delay, but running the task is turning out to be a bigger workload
than we expected.


1) SMALL BUG IN THE SUPERVISED EVALUATION SCRIPT

The first thing is that we found a bug in the supervised scoring script.
For a given instance, if after mapping the clusters using the training
part the mapping returned 0 for all senses (i.e. the test cluster did
not appear in the training part), the script wrongly chose the first
sense. This mainly affects random runs with hundreds of clusters,
giving the misleading impression that those random runs were performing
well.
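
In other words, when a test cluster does not appear in the training
part, the instance should simply get no sense from the mapping and
count against recall. A minimal sketch of that intended behaviour
(illustration only, with made-up labels, not the actual Perl code):

def map_instance(cluster, mapping):
    """Return the mapped sense for a test instance, or None if its cluster
    was never seen in the training part (the buggy script defaulted to the
    first sense in that case)."""
    return mapping.get(cluster)

# Toy mapping learned from a training part that only saw clusters 1 and 2.
mapping = {1: "sense_a", 2: "sense_b"}
print(map_instance(1, mapping))    # 'sense_a'
print(map_instance(734, mapping))  # None -> unanswered, counts against recall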

It also slightly affects some participating systems, but the ranking is
not affected at all (we also include random runs with 1000 and 10000
clusters per word, respectively):

             All   Nouns  Verbs
-           81.6    86.8   75.7
-           80.6    84.5   76.2
-           79.1    82.5   75.3
MFS         78.7    80.9   76.2
-           78.5    80.7   76.0
-           77.7    81.6   73.3
-           77.1    80.5   73.3
rand1000    24.8
rand10000    3.6

We think you can figure out which your system is, as the ranking has
not changed. Still, we will send an e-mail to the affected teams so
they can replace the affected figures in their system paper with the
correct ones. Please let us know if you have trouble meeting the
deadline for this reason.


You can find the corrected file at:
http://ixa2.si.ehu.es/semeval-senseinduction/create_supervised_keyfile.pl

Ted, if you wish, you can download the new script from the webpage and
re-score the random runs you had. Or, if you prefer, you can send us
those random runs and we will re-score them.

Just to be clear, our own system is below the MFS, so we don't have
any egocentric interest in this evaluation ranking :-)


2) OFFICIAL TRAIN/TEST SPLIT

Regarding the train/test split, we now know that it wasn't random at
all. The lexical sample organizers changed the usual policy (from
Senseval-1 to 3) of doing random splits, and followed a scheme that is
usual in parsing and SRL: they took sections 02-21 of the WSJ for
training and sections 01 and 22-24 for test. This explains why some of
the participant systems fall below MFS in the supervised evaluation. As
an aside, I participated in the lexical sample supervised task myself,
and was not aware of this change in the way the train/test split was
done.

Our apologies; this is a major factor that affects the way the
evaluation is done. BTW, this also affects supervised systems, as their
results degrade as well. In a way, this kind of split is more
realistic, in that the test examples and the train examples come from
different documents.

We still think that this is a valid evaluation measure, one which
measures the ability (or not) of a clustering solution to improve on
the MFS baseline via the mapping. We agree with Ted that external
factors are at play here, but this is the reason why we use the
supervised evaluation as a complementary measure to the unsupervised
FScore. We will make sure that we mention these issues clearly in the
paper.


3) ON USING SUPERVISED EVALUATION TO COMPARE INDUCTION AND SUPERVISED
ML SYSTEMS

> However, I really don't feel we can make that comparison to task 17
> here, because we didn't keep the test data reserved for evaluation, it
> was a part of the clustering/learning. This would be analogous to
> giving the supervised systems all 27,132 instances to learn a model
> from, and then applying that model to a subset of the data upon which
> they learned.
>

In fact, supervised systems do have access to both the training part
and the test part. Using the unannotated test part to improve
supervised results is an active area of research, and some systems are
able to improve their results this way. The design of the lexical
sample task allows for this, and perhaps some systems are actively
using it.

We have always stressed that the induction results are a combination of:
- the sense induction algorithm
- the mapping algorithm, which feeds strong MFS information

In fact, if such a system had participated in SemEval task 17, it would
have been classified as semi-supervised.

We now include the results from task 17, together with the most
notable supervised and unsupervised systems:

                         All
best sup                88.7
best semisup            85.1
best induction (+MFS)   81.6
MFS                     78.7
best unsup              53.8

We think this information is highly relevant in order to compare among
different types of WSD systems.

We also wanted to measure how supervised systems fare on the
unsupervised FSCORE, but we have not received the system outputs of
the supervised systems yet.


4) RANDOM TRAIN/TEST SPLITS

Since we suspected that the train/test split was not random, we wanted
to know what would have happened if we had taken it at random. We have
produced new keys and a new script, available at the website:

http://ixa2.si.ehu.es/semeval-senseinduction/sup_eval_2.sh
http://ixa2.si.ehu.es/semeval-senseinduction/senseinduction.random82test.key
http://ixa2.si.ehu.es/semeval-senseinduction/senseinduction.random82train.key
http://ixa2.si.ehu.es/semeval-senseinduction/senseinduction.random50test.key
http://ixa2.si.ehu.es/semeval-senseinduction/senseinduction.random50train.key

Usage (in a single line):

sup_eval_2.sh your_system.key anydir senseinduction.random50train.key senseinduction.random50test.key

The first two key files (random82*) follow the same proportion of
examples as the official split (82/18); the last two (random50*) follow
a fifty-fifty split.
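
If you want to generate a split of your own, the idea is simply to
shuffle the instances and cut at the desired proportion. A minimal
sketch (assuming one instance per line; the file name is a placeholder,
and a proper split should really be done per target word):

import random

def random_split(lines, train_fraction=0.82, seed=123):
    """Shuffle the instance lines and cut them into train/test portions.
    For the task data this should be done word by word, so that every
    target word keeps roughly the same train/test proportion."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

with open("senseinduction.key") as f:   # placeholder file name
    train, test = random_split(f.read().splitlines())
print(len(train), len(test))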

These are the results:

        official  82/18  50/50
-         81.6     82.2   81.6
-         80.6     80.1   79.6
-         79.1     79.9   79.7
MFS       78.7     78.4   78.3
-         78.5     79.0   78.8
-         77.7     81.3   81.0
-         77.1     77.9   78.3


As you can see, once the split is random, all the induction systems
fare better than MFS (the one exception happened to be a
knowledge-based WSD system, not an induction system). The ranking
changes as well, so some induction systems do better with a random
split, but the best system stays the same in all splits. The 50/50 and
82/18 splits produce nearly the same ranking.

And these are the sense averages:

        official  82/18  50/50
train     3.6      3.63   3.41
test      2.87     2.86   3.44


The smaller test set gets fewer senses on average, which is explained
by the fact that many senses occur with low frequency.

We will try to include this extra information in the paper, but the
official results will stay the same.


5) EXPLANATION OF THE RANDOM RESULTS IN THE SUPERVISED TEST

The supervised evaluation scheme already injects some knowledge into
any clustering solution, as the mapping learned from the training part
has the MFS bias built into it. This is why random2 (a random choice
between two clusters per word) is (modulo small variations) equal to
MFS. When the random solution chooses among more clusters (1000,
10000) it starts to degrade, as many of the clusters that are applied
in the test are not seen in the training. Still, it suffices that, by
chance, a small number of clusters appear in both train and test for
the inherent MFS bias to show up in those cases. Note also that,
contrary to previous Senseval evaluations, some target words have more
than a thousand occurrences, which makes the probability of a cluster
occurring in both train and test quite large.
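
A back-of-the-envelope estimate of that probability, under uniform
random assignment over X clusters (illustrative numbers only, not task
data):

def p_cluster_seen_in_train(x, n_train):
    """Probability that the random cluster assigned to one test instance
    also occurs among n_train training instances, when clusters are drawn
    uniformly from 1..x."""
    return 1.0 - (1.0 - 1.0 / x) ** n_train

# For a word with 1000 training occurrences (some targets have even more):
for x in (2, 100, 1000, 10000):
    print(x, round(p_cluster_seen_in_train(x, 1000), 3))
# prints roughly: 2 1.0, 100 1.0, 1000 0.632, 10000 0.095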


Thanks for your patience and interest!

best

eneko and aitor

PS. We just noticed that Ted sent at least one message to another
list: senseclus...@lists.sourceforge.net. We had not been aware of it.
Should we forward this message to them as well? Perhaps you can do it
for us?

---------------------http://ji.ehu.es/eneko------------------
Eneko Agirre                     PLEASE NOTE NEW E-MAIL:
Informatika Fakultatea           mailto: e.ag...@ehu.es
649 p.k. - 20.080 Donostia       fax: (+34) 943 015590
Euskal Herria / Basque Country   tel: (+34) 943 015019
