As mentioned, I've been doing some thinking on the supervised
evaluation measure used in the sense induction task, and I think there
is a way to perform it without having to make a test / train split after
clustering. This would then allow us to score the entire sense
induction task data (27,132 instances) much like the f-score, purity,
and entropy, and allow us to compare clustering results across the
whole of the data that was clustered. There are also some difficulties
with the test / train split (I think) that this resolves, and I think
the measure still captures, in some respects at least, the intent of
the original measure that uses the train / test split.
Let us go back to the example I've been using, where we have three
"true" senses / classes in the gold standard data, and 4 discovered
clusters. The confusion matrix looks like this...
      C1   C2   C3   C4
S1    10   30    0    5   =  45
S2    20    0   10    0   =  30
S3     0    0    5   20   =  25
     ----------------------
      30   30   15   25     100
There are 100 instances in this data, and in order to do the
supervised evaluation we'd have to divide it up into test and train
splits, and proceed as has been described where we learn a mapping
matrix from the training data and apply that to the test.
However, why not simply do an evaluation much like the supervised
measure based on the matrix above? Let's suppose I went ahead and
computed the mapping table of probabilities as is done in the
supervised case (each count divided by its sense/row total)...
      C1   C2   C3   C4
S1   .22  .66    0  .11
S2   .66    0  .33    0
S3     0    0   .2   .8
So the mapping here would be
C1->S2
C2->S1
C3->S2
C4->S3
Now, rather than applying this to some other set of data, why not just
apply it to those same 100 instances?
If you do that... in effect you are saying that the instances in C1
that correspond to S2 are correct, those in C2 that correspond to S1
are correct, those in C3 that correspond to S2 are correct, and those
in C4 that correspond to S3 are correct.
This would give you 20+30+10+20 correctly assigned instances out of
100, which is 80/100 or .80.
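To make that concrete, here is a small Python sketch (just my own
illustration of the arithmetic above, not anyone's official scoring
code) that builds the probability table from the confusion matrix,
derives the cluster-to-sense mapping, and scores the same 100
instances:

# counts[sense][cluster], taken from the confusion matrix above
counts = {
    "S1": {"C1": 10, "C2": 30, "C3": 0,  "C4": 5},
    "S2": {"C1": 20, "C2": 0,  "C3": 10, "C4": 0},
    "S3": {"C1": 0,  "C2": 0,  "C3": 5,  "C4": 20},
}
total = sum(sum(row.values()) for row in counts.values())       # 100

# probability table: each count divided by its sense (row) total
prob = {s: {c: n / sum(row.values()) for c, n in row.items()}
        for s, row in counts.items()}

# map each cluster to the sense with the highest probability in its column
clusters = list(counts["S1"])
mapping = {c: max(counts, key=lambda s: prob[s][c]) for c in clusters}
print(mapping)    # {'C1': 'S2', 'C2': 'S1', 'C3': 'S2', 'C4': 'S3'}

# apply the mapping back to the same 100 instances: an instance counts
# as correct if its cluster maps to its gold sense
correct = sum(counts[mapping[c]][c] for c in clusters)
print(correct, correct / total)    # 80 0.8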
This does in fact look a lot like purity, but it's different in that
we are not weighting by the size of the clusters. What we seem to be
getting here is an accuracy-type measure that does not penalize you
for getting the number of clusters wrong. In other words, the fact
that sense 2 (S2) was divided between two clusters (C1 and C3) does
not "hurt" the result.
One problem this fixes with the existing supervised method arises when
the number of senses (classes) in the gold standard data approaches or
exceeds the size of the test data. The simplest example of that is
found in the 1 cluster per instance baseline. If we assume that this
is the correct clustering as found in the gold standard data, then the
supervised method as it stands (with a test / train split) will score
that as zero even if the clustering is done perfectly, because the
test and train data do not contain the same senses/classes. In the
modified version I show above there is no such problem, and if we
correctly assign 1 instance per cluster we'll get a perfect score (if
that is what the gold standard indicates as being correct).
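To illustrate, here is a rough Python sketch of that baseline under a
test / train split. I'm simplifying the supervised procedure down to
its mapping step (each cluster goes to the sense it co-occurs with
most often in train, and a test instance whose cluster never appears
in train gets no credit), which I believe is enough to show the
problem:

from collections import Counter, defaultdict

def split_score(train, test):
    # train/test are lists of (cluster, sense) pairs, one per instance;
    # this is a sketch, not the actual task scoring code
    seen = defaultdict(Counter)
    for cluster, sense in train:
        seen[cluster][sense] += 1
    # each cluster maps to its most frequent sense in the train portion
    mapping = {c: dist.most_common(1)[0][0] for c, dist in seen.items()}
    # test instances whose cluster was never seen in train get no credit
    correct = sum(1 for c, s in test if mapping.get(c) == s)
    return correct / len(test)

# 1 cluster per instance, and (for this example) 1 gold sense per
# instance, so the clustering is "perfect" relative to the gold standard
data = [("c%d" % i, "s%d" % i) for i in range(100)]
train, test = data[:50], data[50:]

print(split_score(train, test))   # 0.0 -- no test cluster occurs in train
# the modified measure above would map each cluster to its own sense and
# score all 100 instances correct, i.e. 1.0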
Another issue this will address is the case where the test and train
splits are somehow unrepresentative of each other. In my view, if the test
and train splits are not representative, that reflects more on the
method that created the split, or on the nature of the data (see case
above, for example) than it does on the clustering algorithm, and it's
not really fair to score the clustering algorithm based on the
splitting of the test and train data, since the clustering algorithm
didn't do that.
The clustering algorithm in the supervised framework, as used in the
sense induction task, sees the train+test data as one big data set,
and only after clustering is the test data "pulled out". So if that
pulling is done in a curious way that leads to a non-representative
test / train split, or if there is really not enough data to pull out
a representative test sample, the score will reflect this and end up
reflecting on the clustering algorithm, even though the clustering
algorithm had nothing to do with the creation of the test / train split.
As a simple example of that, suppose we find that 100 instances belong
in a single cluster (C1). Let's assume that the true class/sense
assignments of these 100 instances is that 50 of them belong to S1,
and the other 50 belong to S2.
If we do a random sampling of the 100 instances to create the
test/train split, we will most likely find that the distribution of
true senses/classes in the test and train split are approximately the
same, and the supervised score will reflect that. Suppose we have 50
instances in the test, and 50 in the train. If we sample randomly,
perhaps there would be 26 S1 instances and 24 S2 instances in the
train, and 24 S1 instances and 26 S2 instances in the test
(distributions balanced). In this case, C1 would map to S1, and when
processing the test data the 24 S1 instances would be assigned
correctly, so the score would reflect that and come out around 50%
(assuming no other clusters are present, etc.).
However, suppose we do the test / train split such that the 50 S1
instances are in train, and the 50 S2 instances are in test. In that
case, the supervised measure will assign a score of 0, since C1 would
be mapped to S1 based on the train data, and all the S2 instances in
the test data would therefore be assigned incorrectly. So the overall
score for this split is 0.
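Spelling that arithmetic out (again just a sketch, with everything in
the single cluster C1):

def one_cluster_score(train_counts, test_counts):
    # train_counts/test_counts say how many S1/S2 instances of cluster C1
    # fall in each portion; C1 is mapped to whichever sense dominates train
    mapped = max(train_counts, key=train_counts.get)
    return test_counts[mapped] / sum(test_counts.values())

# balanced (random-looking) split: C1 -> S1, 24 of 50 test instances correct
print(one_cluster_score({"S1": 26, "S2": 24}, {"S1": 24, "S2": 26}))  # 0.48

# unrepresentative split: C1 -> S1 again, but the test is all S2
print(one_cluster_score({"S1": 50, "S2": 0}, {"S1": 0, "S2": 50}))    # 0.0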
The problem I see here is that the clustering result is the same in
both cases; the difference in the supervised scores reflects the fact
that in the first case (random sample) the test and train sets are
representative, and in the second case they are not. This is an
interesting point, but it has nothing to do with the clustering
solution.
Now, it should be pointed out that in the sense induction task the
test and train sets are not wildly different; however, there are
differences. The simplest way to see that is that the average number
of classes/senses in the test portion is 2.9, while in the test+train
data it is 3.8.
There are a few classes in the test data that don't occur in the
train, and probably a few in the train that don't occur in the test.
However, the test data is at least somewhat representative of the
train, so the supervised score that we see now based on the train /
test split is probably at least in the ballpark of what we'd see with
the method I'm describing above. That said, I think the differences in
the train / test split do affect the scoring, and that a score based
on the entire set of data that was clustered might in some sense be
fairer (since the train / test split really doesn't have anything to
do with the clustering solution, and changing that split would no
doubt change the scores and even the rankings of systems).
Now, this modified version of the supervised method will still score
random results pretty high (as the current supervised method does).
However, random methods can have rather high purity, especially as
the number of clusters grows, so that isn't too surprising, and
perhaps it just makes the connection to purity more clear.
Finally, I should point out that in the case where the number of
clusters and number of senses are the same, at least when there are no
weights attached to answers (and just 1 cluster is assigned per
instance) this seems to be equivalent to the SenseClusters evaluation
measure I've mentioned. The difference is that if the number of
clusters is more than the number of senses, the supervised measure
doesn't penalize that, while the SenseClusters measure does.
In this case...
      C1   C2   C3   C4
S1    10   30    0    5   =  45
S2    20    0   10    0   =  30
S3     0    0    5   20   =  25
     ----------------------
      30   30   15   25     100
The SenseClusters measure would map clusters and senses as shown here:
      C2   C1   C4
S1    30   10    5
S2     0   20    0
S3     0    0   20
C2->S1
C1->S2
C4->S3
...and attain an accuracy of (30+20+20)/100 = 70/100 = .70. This is
less than the modified supervised measure score of .80, and reflects
the penalty assessed for having an "extra" cluster, at least relative
to the gold standard data. In general the SenseClusters evaluation
gives somewhat lower scores than the supervised method (either the
existing one or the new one) because of this penalty that it imposes
for getting the number of clusters "wrong".
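For what it's worth, the mapping shown above can be found mechanically
as a maximum-weight one-to-one assignment over the confusion matrix.
Here is a small sketch of that (my own illustration; I'm assuming a
reasonably recent SciPy, and that the maximal assignment reproduces
the SenseClusters mapping):

import numpy as np
from scipy.optimize import linear_sum_assignment

senses   = ["S1", "S2", "S3"]
clusters = ["C1", "C2", "C3", "C4"]
counts = np.array([[10, 30,  0,  5],
                   [20,  0, 10,  0],
                   [ 0,  0,  5, 20]])

# one-to-one pairing of senses and clusters maximizing the total count;
# with 4 clusters and 3 senses, one cluster (C3 here) is left unmapped
rows, cols = linear_sum_assignment(counts, maximize=True)
for r, c in zip(rows, cols):
    print(clusters[c], "->", senses[r])    # C2 -> S1, C1 -> S2, C4 -> S3

print(counts[rows, cols].sum() / counts.sum())   # 0.7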
So, that's the alternative I was thinking about, and some of the
motivations for it. I do hope it makes some sense.
Thanks,
Ted
You may note that there has been a bit of a fork in the discussion on
the supervised sense induction measure, in that in a few of the most
recent notes I am talking about the "procedure" of supervised learning
and how to handle the test / train split, and that the test data
shouldn't be a part of the initial clustering, and so forth. This is a
different set of concerns than is expressed in the note below, and I
thought I should point out the distinction.
The thing that has had me really puzzled with the supervised measure
is that it clusters the full 27,132 instances, and then breaks it into
train and test sets and only directly scores the 4,851 instances in
the test data. That had me searching for a way to evaluate the full
27,132 instance clustering solution in a way that might be similar in
spirit at least to the supervised measure, which is what I describe
below.
So, the method below was meant to be an unsupervised formulation of
the supervised sense induction evaluation method, and I guess at this
point I see that it seems to be addressing something different from
what motivates the supervised sense induction evaluation method, and
so might be somewhat tangential to the discussion.
Perhaps what it points out is that the supervised sense induction
method probably should either be unsupervised (like the one below, and
evaluate the full 27,132 instances) or supervised, and hold the test
data out until the actual moment of evaluation. I think right now the
fact that it clusters all the data and then scores a part of it for
evaluation makes it neither unsupervised nor supervised (in the
technical sense of those terms at least) and is what makes it hard to
directly compare to anything else.
Cordially,
Ted