I'm still mulling over the random baseline results, which for the
unsupervised evaluation on the test portion were:
Fscore 37.9, Purity 86.1, Entropy 27.7
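
For reference, here's a minimal sketch of how purity and entropy are
commonly computed from a clusters-by-senses confusion matrix. I'm
assuming the standard definitions here, so the official scorer's
formulas (and any normalization) may differ in the details.

# Sketch of purity and entropy over a clusters-by-senses confusion
# matrix, assuming the usual definitions; the official task scorer
# may differ in small ways.
import math

def purity(confusion):
    # confusion[j][i] = number of instances in cluster j with gold sense i
    total = sum(sum(row) for row in confusion)
    return 100.0 * sum(max(row) for row in confusion) / total

def entropy(confusion):
    total = sum(sum(row) for row in confusion)
    num_senses = len(confusion[0])
    weighted = 0.0
    for row in confusion:
        size = sum(row)
        if size == 0:
            continue
        h = -sum((n / size) * math.log(n / size, 2) for n in row if n > 0)
        # normalize by log2(number of gold senses), weight by cluster size
        weighted += (size / total) * (h / math.log(num_senses, 2))
    return 100.0 * weighted

# Toy example: two clusters, three gold senses
conf = [[40, 5, 5],
        [2, 30, 18]]
print(purity(conf), entropy(conf))  # 70.0 and roughly 65.6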
The very high level of purity raises an interesting question, I think,
as to how well random would fare according to the supervised
evaluation. Are those numbers available?
The conjecture I've been playing with concerns the connection
between the supervised evaluation method and purity. I do think there
is a connection, because the supervised evaluation does not punish
you for finding more clusters than there are classes/senses in the
gold standard data, but rather focuses on rewarding cluster purity.
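
Here's a rough sketch of the kind of cluster-to-sense mapping I have
in mind. I'm assuming each cluster is simply mapped to the gold sense
it most often co-occurs with in the training portion, and that unseen
clusters back off to some default; the official scorer may well do
the mapping differently.

# Rough sketch of a supervised evaluation via cluster-to-sense
# mapping; the majority mapping and the backoff are my assumptions.
from collections import Counter, defaultdict

def learn_mapping(train_clusters, train_senses):
    # train_clusters[i], train_senses[i]: cluster id and gold sense
    # of training instance i
    counts = defaultdict(Counter)
    for c, s in zip(train_clusters, train_senses):
        counts[c][s] += 1
    return {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}

def supervised_accuracy(mapping, test_clusters, test_senses, backoff):
    correct = sum(1 for c, s in zip(test_clusters, test_senses)
                  if mapping.get(c, backoff) == s)
    return 100.0 * correct / len(test_senses)

Note that splitting a pure cluster in two doesn't hurt here: both
halves map to the same sense, so the accuracy is unchanged. That's
why I suspect this evaluation behaves much more like purity than
like f-score.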
Having a connection to purity, btw, is not all that surprising,
since when we compare to a gold standard we'd ultimately like to find
clusters that are relatively pure, so all of the evaluation methods
that compare against gold standards are oriented towards purity to
some degree. It's just a question of how much. The main distinction
I can see right now is that the f-score and the Senseclusters
evaluation I mentioned previously insist upon finding the same number
of clusters as there are senses in the gold standard data (or you pay
dearly), while purity and the supervised evaluation don't punish you
for finding a different number of clusters than there are gold
standard classes, as long as those clusters are relatively pure.
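
Here's a toy illustration of that distinction, assuming the
set-matching f-score (each gold class gets the best F of any single
cluster, weighted by class size), which is what I believe the
unsupervised evaluation uses. Splitting pure clusters leaves purity
at 100 but drags the f-score down:

def fscore(clusters, classes):
    # clusters, classes: lists of sets of instance ids
    total = sum(len(gold) for gold in classes)
    score = 0.0
    for gold in classes:
        best = 0.0
        for clus in clusters:
            overlap = len(gold & clus)
            if overlap == 0:
                continue
            p = overlap / len(clus)
            r = overlap / len(gold)
            best = max(best, 2 * p * r / (p + r))
        score += (len(gold) / total) * best
    return 100.0 * score

classes = [set(range(0, 50)), set(range(50, 100))]        # two gold senses
two_clusters = [set(range(0, 50)), set(range(50, 100))]   # matches the gold
four_clusters = [set(range(0, 25)), set(range(25, 50)),   # same instances,
                 set(range(50, 75)), set(range(75, 100))] # each sense split

print(fscore(two_clusters, classes))   # 100.0
print(fscore(four_clusters, classes))  # about 66.7, yet purity is still 100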
In any case, I think the supervised numbers for random would be
quite interesting, as would general information about the average
number of clusters found by each system (including random). That
combination of information would let us tease apart the tradeoffs
between what we see in the f-score/purity/entropy and what we see in
the supervised evaluation. Presumably, as the average number of
clusters discovered by the systems increases, f-scores fall while
purity might increase? Or maybe not. :) But I think the number of
clusters discovered is an important part of the evaluation.
And then finally, I think the f-score/purity/entropy on the entire
27,132 instances remains an important piece of the puzzle, since the
test and training splits don't seem to have really been randomly
generated (and I understand this data was created outside of this
task, so they might have had different motivations). In our task at
least, if we were generating that test/train split to do the
evaluation, I think we'd cluster all 27,132 instances and then
randomly draw some portion of those to be the test data. The
characteristics of the test data would then be more representative of
the data as a whole, and I wouldn't worry so much that what we see on
the test portion will be different from what we see on the train
portion. However, given the difference in the average number of
senses/classes between the test data and the training data (2.9
versus 3.8, I think), there is some reason for concern, especially
since f-score is very sensitive to this number of clusters/senses
issue.
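
Just to make the idea concrete, here's a minimal sketch of the kind
of split I mean (the sizes and names are only for illustration):

import random

def random_split(instance_ids, test_size, seed=42):
    # draw a random test set after clustering everything
    rng = random.Random(seed)
    test = set(rng.sample(instance_ids, test_size))
    train = [i for i in instance_ids if i not in test]
    return train, sorted(test)

# e.g., cluster all 27,132 instances first, then draw 4,812 as test
all_ids = list(range(27132))
train_ids, test_ids = random_split(all_ids, 4812)

With a draw like that, statistics such as the average number of
senses/classes per word ought to come out about the same on the test
portion as on the data set as a whole.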
Oh, one last point. I think there is a way to do the supervised
evaluation without using the training/test split. I'll write about
that in a separate note, but I think that might make it possible to do
the exact same evaluations on the 27,132 instance data set
(train+test) as well as on the 4,812 instance (test) set.
Cordially,
Ted