I have reached what I find to be a mildly strange conclusion. :)
I generated some random answer files, where I randomly assigned
instances to 1 of 4 clusters per word. I did this for the test+train
data (27,132 instances). I ran the unsupervised evaluation on this and
got the following:
f-score  .378
purity   .788
entropy  .433
Then I went ahead and did the supervised evaluation, and for that I
got .784. I noticed the test file generated by
create_supervised_keyfile was almost (but not quite) the same as a
1clusterPerWord keyfile....
And so... I think the random baseline degrades to what is almost MFS
(the most frequent sense baseline) in the supervised case...
Is that right?? I think it's possible because, given a random
assignment of clusters, each cluster will have roughly the same
distribution of senses, but one of them will usually win out (ties are
rare), and so the mapping essentially reduces to 1 cluster per
word... there must be a few ties, since it's not identical to MFS, but
it's very close.
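To convince myself of this, here's a little toy simulation (the word,
sense names, and counts are all made up, and this is just a sketch of
the majority-mapping step, not the actual supervised scorer) showing
how a mapping learned over random clusters collapses to 1 cluster per
word:

```python
import random
from collections import Counter

random.seed(0)

# Toy word: 60 instances of sense s1, 30 of s2, 10 of s3,
# so the MFS baseline on this word is 0.60.
senses = ["s1"] * 60 + ["s2"] * 30 + ["s3"] * 10

# Randomly assign each instance to 1 of 4 clusters.
clusters = [random.randrange(4) for _ in senses]

# Label each cluster with its majority sense.
mapping = {}
for c in range(4):
    counts = Counter(s for s, cl in zip(senses, clusters) if cl == c)
    if counts:
        mapping[c] = counts.most_common(1)[0][0]

# Because random clusters mirror the word's overall sense
# distribution, each cluster usually maps to the most frequent
# sense -- i.e. the mapping collapses to 1 cluster per word,
# and accuracy lands at (or very near) the MFS baseline.
accuracy = sum(mapping[cl] == s for s, cl in zip(senses, clusters)) / len(senses)
print(mapping)
print(round(accuracy, 2))
```

On this toy data nearly every run maps all four clusters to s1, which
is exactly the collapse described above.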
Thanks,
Ted
http://www.d.umn.edu/~tpederse/senseclusters_scorer.zip
This includes two very simple programs that reformat the solution and
key files into the format expected by SenseClusters, as well as the
three programs from SenseClusters that do the evaluation. Note that
this does not include all of SenseClusters; it is intended just to run
the evaluation software, given that a key file and a solution file (in
Senseval format) are already available.
There is a single script that you run with your solution file and key
file as input, and from that you will get the SenseClusters
evaluation score for your results.
I have also included some data where I have run this script, including
MFS on the test data, and random on train+test.
The random data was created by assigning each instance of a word 1 of
4 possible clusters. There is a small program included that lets you
generate your own random data. The file random4.txt is in fact the
exact file that I ran unsup_eval.pl and sup_eval.pl on to get the
results mentioned in other postings.
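For anyone who wants to see the idea before downloading, here is a
sketch of what such a random-answer generator might look like. This is
not the program included in the zip, and the word and instance ids are
hypothetical; a real run would read the instance ids from the
test+train data, and the answer-line format is simplified
Senseval-style:

```python
import random

random.seed(123)

K = 4  # clusters per word, as in the random4.txt experiment

# Hypothetical instance ids -- a real run would read these
# from the test+train data rather than hard-coding them.
instances = {
    "art.n":   ["art.n.1", "art.n.2", "art.n.3"],
    "plant.n": ["plant.n.1", "plant.n.2"],
}

# One answer line per instance: word, instance id, and a
# randomly chosen cluster label.
lines = []
for word, ids in instances.items():
    for inst in ids:
        label = "%s.c%d" % (word, random.randrange(1, K + 1))
        lines.append("%s %s %s" % (word, inst, label))

print("\n".join(lines))
```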
So, please do give this a try if you are interested, and let me know
if you have any questions.
Enjoy,
Ted
Anyway, here's what the unsupervised evaluation yielded on train+test
with 50 random clusters per word ...
f-score  .092
purity   .814
entropy  .330
Then I ran the supervised evaluation and got .756...
In the case of 4 random clusters per word, the supervised score was
.784, and MFS is .787, so as the results get more and more random they
do fall away from MFS, but very slowly it seems... and it doesn't look
like the "floor" for the supervised measure is much below MFS.
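The slow fall makes sense if you simulate the supervised setup with a
train/test split. The sketch below is again a toy (made-up skewed
sense distribution, hard majority mapping, unseen clusters simply
counted wrong -- all simplifying assumptions, not the official
scorer), but it shows the same qualitative trend: near MFS for small
numbers of random clusters, sagging slowly as clusters shrink and
their majorities get noisier:

```python
import random
from collections import Counter

random.seed(7)

# Toy word with a skewed sense distribution (80/15/5),
# so the MFS baseline on this word is 0.80.
senses = ["s1"] * 80 + ["s2"] * 15 + ["s3"] * 5

def supervised_random_score(k, trials=300):
    """Average supervised-style score of a random k-clustering."""
    total = 0.0
    for _ in range(trials):
        # Randomly cluster, then split into train and test.
        data = [(s, random.randrange(k)) for s in senses]
        random.shuffle(data)
        train, test = data[:80], data[80:]
        # Learn the cluster -> sense mapping on train only.
        votes = {}
        for s, c in train:
            votes.setdefault(c, Counter())[s] += 1
        mapping = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
        # Score on test; clusters unseen in train count as wrong
        # (a simplifying assumption).
        correct = sum(mapping.get(c) == s for s, c in test)
        total += correct / len(test)
    return total / trials

results = {k: supervised_random_score(k) for k in (2, 4, 50)}
for k in (2, 4, 50):
    print(k, round(results[k], 3))
```

With 2 or 4 random clusters the score sits right at the toy MFS of
.80, while 50 tiny clusters drop noticeably further, consistent with
.789/.784 versus .756 above.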
I don't know exactly what this tells us, but I think it does suggest
we need to interpret the significance of being at or near MFS very
differently when we are dealing with the supervised measure versus
other measures such as the f-score or the one in SenseClusters.
Anyway, I decided to do one more, and that was random assignment where
there were 2 possible clusters per word. I think more or less
the results are consistent with those found for 4 and 50 random
clusters per word...
Unsupervised (test+train)
f-score .560
purity .786
entropy .441
Supervised .789
SenseClusters .4837 (senseclusters_scorer)
So, I think that's about enough of this for now. :) Have I totally
messed up here, and botched the random evaluation of the supervised
measure, or are these results about what would be expected...?
Thanks,
Ted
-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse