Thanks for this clarification, yes, indeed it helps. I decided to move this to
the group discussion, as it might be of general interest. I removed the portions
of our previous notes that identify our score. So, regarding the evaluation: in
trying to summarize which of the instances are used in which type of evaluation,
would it be correct to say something like the following?
In the unsupervised evaluation, only the 4,851 test instances are scored, and
the other instances (the "training" portion) are simply ignored (they have no
impact on the FScore at all).
In the supervised evaluation, the training instances are used to create a
mapping with which the 4,851 test instances are scored.
I do understand the desire to have the scoring between the two measures
consistent with respect to what is being scored (that is, the 4,851 test
instances). However, from the participants' point of view, the evaluation data
was the full 27,132 instances, and we weren't aware of any training/test split,
or at least didn't make use of that fact. In our case at least, we selected
features, etc. from the full 27,132 instances and then clustered them. So I
wonder if having results that are based on the full set of 27,132 instances is
also of interest?
I might suggest that the simplest thing to do would be to compute purity,
entropy, and FScore over the 27,132 instances as a third evaluation, with the
understanding that this score is on the full set of evaluation data as seen by
the participants, while the other two evaluations are on the test data as used
in the English lexical sample task. One interesting issue that this might bring
to light is the degree to which the test and training+test sets perform
differently.
Also, in our case at least, this leads to a "cleaner" evaluation in that we
derived our features from all 27,132 instances, we clustered all of them, and we
didn't really treat test/train instances differently at all; in fact, we viewed
all 27,132 instances as evaluation data.
Now, I understand you may not be able to do this for all participants in the
available time.
If I ran the unsupervised scoring program using my answer file as submitted
(with 27,132 instances) and the senseinduction.key file (again with 27,132
instances), would I get accurate scores? If so, I can report those in my paper
and address some of the above.
As a second curiosity, if both evaluation measures are being run on the same
4,851 test instances, I wonder why 1 cluster per word for unsupervised and MFS
for supervised come out slightly differently. In unsupervised the FScores are
78.9 (all), 80.7 (n), and 76.8 (v), while in supervised they are 78.7 (all),
80.9 (n), and 76.2 (v). Any thoughts on where the difference comes from?
Cordially,
Ted
On 4/16/07, Aitor Soroa <a.s...@ehu.es> wrote:
> Hello,
>
> On 2007/04/15, Ted Pedersen wrote:
> > Hi Aitor, I reran the scoring script and recreated the results you
> > report. However, I'm not sure I understand what is happening there....the
> > scoring program gets all 27,132 instances that were clustered by my
> > program, and the key being used is just for the test portion of that
> > (4,851).
> >
> > So...maybe two questions arise...
> >
> > I'm not sure I see what is being scored with the FScore here - is it the
> > 27,132 instances that we clustered? Or the 4,851 instance subset of those?
>
> Only the test instances. We could score the whole clustering solution, but, as
> the supervised scoring is done exclusively on the test part (the train part
> being used to map clusters to senses), we decided to perform the unsupervised
> evaluation in the same way.
>
> > My understanding of the F-measure is that it is weighted based on the
> > distribution of the "senses" in the gold standard clusters - from
> > which data set is that distribution taken?
>
> For calculating the FScore, we first calculate the F-measure of each pair
> (Cluster, Class):
>
>
>                     2 * P(c_{i}, s_{j}) * R(c_{i}, s_{j})
> F(c_{i}, s_{j}) =  ---------------------------------------
>                        P(c_{i}, s_{j}) + R(c_{i}, s_{j})
>
> where c_{i} is cluster i, s_{j} is class (sense) j, P(c_{i}, s_{j}) is the
> precision of the pair (the fraction of the instances in cluster i that belong
> to class j), and R(c_{i}, s_{j}) is its recall (the fraction of the instances
> of class j that are in cluster i).
>
> The FScore of cluster c_{i} is the maximum across the different classes:
>
> FScore(c_{i}) = max_{s_{j}} F(c_{i}, s_{j})
>
> and the FScore of the whole clustering solution is:
>
> FScore = \sum_{i=1}^{c} #{c_{i}}/n FScore(c_{i})
>
> where c is the number of clusters, #{c_{i}} is the size of cluster i, and n
> is the total number of instances in the clustering solution.
>
> To summarise, we don't use any a priori distribution of the classes. We just
> calculate the FScore of the clustering solution against the gold standard
> classes.
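>
> In case it helps, here is a rough sketch of that computation in Python (just
> an illustration of the formulas above, not the actual scorer code; the input
> dictionaries mapping instance ids to cluster / sense labels are made-up names):
>
> from collections import defaultdict
>
> def fscore(clusters, classes):
>     """FScore of a clustering solution against the gold standard classes."""
>     by_cluster = defaultdict(set)
>     by_class = defaultdict(set)
>     for inst, c in clusters.items():
>         by_cluster[c].add(inst)
>     for inst, s in classes.items():
>         by_class[s].add(inst)
>     n = len(clusters)              # size of the clustering solution
>     total = 0.0
>     for members in by_cluster.values():
>         best = 0.0
>         for gold in by_class.values():
>             overlap = len(members & gold)
>             if overlap == 0:
>                 continue
>             p = overlap / len(members)   # precision of the (cluster, class) pair
>             r = overlap / len(gold)      # recall of the (cluster, class) pair
>             best = max(best, 2 * p * r / (p + r))
>         total += (len(members) / n) * best   # weight each cluster by its size
>     return total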
>
> I don't know if this explanation answers your questions, but I hope I have
> shed some light on how we compute the FScore measure.
>
>
> Best,
>                                aitor
>
> > (BTW, in my previous message I think I had the input files reversed...)
> >
> > How I recreated your results....
> >
> > marimba(71): perl ../scorers/unsup_eval.pl
> > ../Submitted/12.new.answers.txt senseinduction_test.key
> >    FScore: 0.661 clust_average: 1.360 class_average: 2.870
> > marimba(72): perl ../scorers/unsup_eval.pl -p n
> > ../Submitted/12.new.answers.txt senseinduction_test.key
> >    FScore: 0.671 clust_average: 1.714 class_average: 2.886
> > marimba(73): perl ../scorers/unsup_eval.pl -p v
> > ../Submitted/12.new.answers.txt senseinduction_test.key
> >    FScore: 0.650 clust_average: 1.169 class_average: 2.862
> >
>
> --
> take care
>                                aitor
> Pgp id: 0x5D6070F2
>
-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
> I might suggest that the simplest thing to do would be to compute purity,
> entropy, and FScore over the 27,132 instances as a third evaluation, with the
> understanding that this score is on the full set of evaluation data as seen by
> the participants, while the other two evaluations are on the test data as used
> in the English lexical sample task. One interesting issue that this might bring
> to light is the degree to which the test and training+test sets perform
> differently.
Yes, we are happy with that. The fact that the unsupervised evaluation is
applied exclusively over the test corpus should have been stressed on the
task website. We apologize for this. We plan to include the results of the
systems over the whole corpus in the task description paper if we have time.
> Also, in our case at least, this leads to a "cleaner" evaluation in that we
> derived our features from all 27,132 instances, we clustered all of them, and
> we didn't really treat test/train instances differently at all; in fact, we
> viewed all 27,132 instances as evaluation data.
> 
> Now, I understand you may not be able to do this for all participants in
> the available time. If I ran the unsupervised scoring program using my
> answer file as submitted (with 27,132 instances) and the
> senseinduction.key file (again with 27,132 instances), would I get accurate
> scores? If so, I can report those in my paper and address some of the
> above.
Yes, this is right. To compute the FScore over the whole corpus, just run the
unsup_eval.pl script with your full answer file as the first parameter and the
complete gold standard key as the second:
% unsup_eval.pl your_system_train_test.key ../keys/senseinduction.key
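The output should have the same format as the test-only runs earlier in this
thread, i.e. a single line reporting FScore, clust_average and class_average.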
> As a second curiosity, if both evaluation measures are being run on the same
> 4,851 test instances, I wonder why 1 cluster per word for unsupervised and MFS
> for supervised come out slightly differently. In unsupervised the FScores are
> 78.9 (all), 80.7 (n), and 76.8 (v), while in supervised they are 78.7 (all),
> 80.9 (n), and 76.2 (v). Any thoughts on where the difference comes from?
They are scored following different criteria: the unsupervised evaluation is
scored with the unsup_eval.pl program (FScore), and the supervised one with
scorer2 (which computes recall). They basically measure different things, so I
think it's normal that they yield different results. The fact that they give
almost the same figures looks more like a coincidence.
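As a toy illustration (made-up numbers, not the real data): suppose a word has
10 test instances, 7 of them tagged with its most frequent sense. With one
cluster per word, the FScore formula gives 2 * 0.7 * 1.0 / (0.7 + 1.0) ≈ 0.82
for that word (precision 0.7 and recall 1.0 against the most frequent sense),
while the supervised MFS recall for the same word is simply 7/10 = 0.70. So
even on exactly the same instances the two measures need not give the same
number.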
best,
			aitor