Question on FScore from 2010 versus F1 in 2013


Ted Pedersen

Apr 6, 2013, 9:25:46 AM
to semeval-2013-ws...@googlegroups.com
I've gone back and run my random baselines with the SemEval 2010
evaluation code (which can be found here:
http://www.cs.york.ac.uk/semeval2010_WSI/datasets.html ), and I've
gotten FScores which are quite a bit different from the F1 values in
the 2013 evaluation scheme.

randx means senses assigned randomly from x possible values (in a balanced way)
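(For reference, a baseline like rand5 can be generated along these lines; this is just a sketch, not my exact code, and the instance ids and output format are only assumed to match the task key files.)

import random

def random_baseline(instance_ids, k, seed=42):
    """Assign each instance one of k pseudo-senses, cycling through the
    senses over a shuffled instance list so the assignment stays balanced."""
    rng = random.Random(seed)
    shuffled = list(instance_ids)
    rng.shuffle(shuffled)
    return {inst: (i % k) + 1 for i, inst in enumerate(shuffled)}

# Example: rand5 over a few hypothetical instance ids, printed in the
# three-column key format used in this task.
assignment = random_baseline(["polaroid.n.%d" % i for i in range(1, 11)], k=5)
for inst, sense in sorted(assignment.items()):
    print("polaroid.n", inst, "polaroid.n.%d" % sense)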

Here's what we get with the 2010 FScore, which I think is intuitively
appealing, since as the results get more and more random, the FScore
declines.
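(For concreteness, here is a sketch of one common set-matching formulation of a clustering FScore: for each gold class, take the best F1 against any induced cluster, weighted by class size. I'm not claiming this is exactly what the 2010 scorer implements, but it gives the flavor.)

from collections import defaultdict

def cluster_fscore(gold, system):
    """Set-matching FScore sketch. gold and system map instance -> label.
    For each gold class take the best F1 over the induced clusters, and
    weight by gold class size. Illustrative only; the official 2010
    scorer may differ in its details."""
    def groups(labeling):
        g = defaultdict(set)
        for inst, lab in labeling.items():
            g[lab].add(inst)
        return g
    gold_classes, sys_clusters = groups(gold), groups(system)
    total = sum(len(c) for c in gold_classes.values())
    score = 0.0
    for g in gold_classes.values():
        best = 0.0
        for c in sys_clusters.values():
            overlap = len(g & c)
            if overlap:
                p, r = overlap / len(c), overlap / len(g)
                best = max(best, 2 * p * r / (p + r))
        score += (len(g) / total) * best
    return score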

rand2
Total FScore:0.41490940079629784
rand5
Total FScore:0.2516604650480077
rand10
Total FScore:0.15045198335753995
rand25
Total FScore:0.07007328126488682
rand50
Total FScore:0.040660188124065404

Here are the results from F1 2013, which I sent in a previous note but
reproduce here...

rand2
average F1 = 0.54890625
rand5
average F1 = 0.56734375
rand10
average F1 = 0.59671875
rand25
average F1 = 0.66890625
rand50
average F1 = 0.761875

In this case the values get better as the results get more random,
which seems a little counter-intuitive. I say "better" since the gold
data reports a result of 1.00 on F1, so that's the best case.

I guess what I'm wondering is whether the 2010 FScore could be run on
the submitted data sets and provided as part of the evaluation as
well. I think it provides something "different" as an evaluation
measure, and it seems like it might be important to consider that. It
would also make direct comparison to 2010 possible (on different sorts
of data, granted, but it still might be interesting to see how that
compares). There was also quite a lot of discussion about the measures
in the 2010 evaluation, and I think in some respects the FScore seemed
to have a lot of good qualities.

Thanks!
Ted

Roberto Navigli

Apr 6, 2013, 9:55:44 AM
to semeval-2013-ws...@googlegroups.com
Hi Ted,

Our way of calculating P & R is indeed different (if you're interested, have a look at the CL 2013 paper). I think it would be good to have those figures too, yes. Daniele?

Best,
Roberto

--
=====================================
Roberto Navigli
Dipartimento di Informatica
SAPIENZA Universita' di Roma
Via Salaria, 113 (now in: Viale Regina Elena 295)
00198 Roma Italy
Phone: +39 0649255161 - Fax: +39 06 8541842
Home Page: http://wwwusers.di.uniroma1.it/~navigli
=====================================

Ted Pedersen

Apr 6, 2013, 9:22:00 PM
to semeval-2013-ws...@googlegroups.com
I went ahead and ran my systems through the 2010 evaluation code, and
got the following (the FScore values come from that code):

**task11.duluth.sys1.pk2.txt
Total FScore:0.46527007617398203
average F1 = 0.56828125
average Rand Index = 0.5217509920634921
average Adj Rand Index = 0.05735046354576045
average Jaccard Index = 0.3179005365877712
============ average number of created clusters: 2.53
============ average cluster size: 26.453333333333333

**task11.duluth.sys7.pk2.txt
Total FScore:0.4588783486791854
average F1 = 0.5878125
average Rand Index = 0.5204117063492063
average Adj Rand Index = 0.0678018932041495
average Jaccard Index = 0.3103464720228176
============ average number of created clusters: 3.01
============ average cluster size: 25.159619047619056

**task11.duluth.sys9.pk2.txt
Total FScore:0.35564006060925074
average F1 = 0.57015625
average Rand Index = 0.546329365079365
average Adj Rand Index = 0.02590204039221355
average Jaccard Index = 0.22242468819393144
============ average number of created clusters: 3.32
============ average cluster size: 19.839999999999986

We see some interesting things here, I think. If you look at the
rankings of the systems according to FScore and F1, they are
dramatically different, particularly with respect to sys9, which F1
seems to like but FScore really doesn't...

FScore
sys1 46.53
sys7 45.89
sys9 35.56**

F1
sys7 58.78
sys9 57.02**
sys1 56.83

And I just happened to note that Jaccard seems to correlate fairly
well with FScore, both in the ranking and in the size of the gap
between sys7 and sys9 (a sketch of the pairwise measures I have in
mind follows the Jaccard numbers).

Jaccard
sys1 31.79
sys7 31.03
sys9 22.24
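
(In case it helps, the pairwise definitions I have in mind for Rand and Jaccard are roughly the following; a sketch only, computed over all instance pairs, whereas the task scorer reports per-word averages.)

from itertools import combinations

def pairwise_rand_jaccard(gold, system):
    """Rand and Jaccard indices over instance pairs (sketch).
    gold and system map instance id -> label; every gold instance is
    assumed to appear in the system labeling."""
    both = gold_only = sys_only = neither = 0
    for x, y in combinations(sorted(gold), 2):
        same_gold = gold[x] == gold[y]
        same_sys = system[x] == system[y]
        if same_gold and same_sys:
            both += 1
        elif same_gold:
            gold_only += 1
        elif same_sys:
            sys_only += 1
        else:
            neither += 1
    total = both + gold_only + sys_only + neither
    rand = (both + neither) / total if total else 0.0
    denom = both + gold_only + sys_only
    jaccard = both / denom if denom else 0.0
    return rand, jaccard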

So, no conclusions at this point, just some numbers. I do think,
though, that the rankings and scores from FScore will be quite a bit
different, so that could be pretty interesting.

More as it develops...
Ted

Roberto Navigli

Apr 7, 2013, 1:55:07 AM
to semeval-2013-ws...@googlegroups.com
Ted, thank you so much for this much-needed analysis. I must say that in our opinion the most important scores are S-recall@K and S-precision@r, since they provide a real benefit to the user viewing the results. All other measures suffer from uncertainty... but if you feel that we should also include FScore in our paper (maybe the final version in a couple of weeks), we can do that. We will also run the scorer and give you comparative tables, so we can decide whether it makes sense.

All best,
Roberto


Ted Pedersen

Apr 7, 2013, 10:52:23 AM
to semeval-2013-ws...@googlegroups.com
Hi Roberto,

Thanks for looking at this. As part of this analysis I have, sadly,
realized that my systems did not rank the results (as I had intended
them to do). So, I'm afraid my S-recall and S-precision values should
be regarded as a kind of random baseline; the order of my results is
simply that of the input. But I do agree that S-recall and S-precision
have some very appealing properties and provide a nice way to consider
the results. I will see if I can get some post-competition results
for S-recall and S-precision at least.

Thanks,
Ted

Ted Pedersen

Apr 7, 2013, 6:09:43 PM
to semeval-2013-ws...@googlegroups.com
Greetings yet again. :)

I was looking at the gold standard data, and perhaps I've
misunderstood something. I had the impression that the output should
be ranked in order to get S-recall and S-precision scores. However, in
the gold standard data what I see, I think, is output that is sorted
by "sense" and then by "instance", as shown below (in a short excerpt
from the 2010 version of the gold standard).

I feel like I am misunderstanding something here about S-precision and
S-recall, but I guess the most immediate question is whether I can
simply "sort" my output (first by sense, then by instance id) and then
re-score to get valid S-precision and S-recall. If so, that's great,
as I had a much more complicated scenario in mind!

polaroid.n polaroid.n.1 polaroid.n.1
polaroid.n polaroid.n.2 polaroid.n.1
polaroid.n polaroid.n.3 polaroid.n.1
polaroid.n polaroid.n.4 polaroid.n.1
polaroid.n polaroid.n.5 polaroid.n.1
polaroid.n polaroid.n.9 polaroid.n.1
polaroid.n polaroid.n.10 polaroid.n.1
polaroid.n polaroid.n.11 polaroid.n.1
polaroid.n polaroid.n.13 polaroid.n.1
polaroid.n polaroid.n.14 polaroid.n.1
polaroid.n polaroid.n.23 polaroid.n.1
polaroid.n polaroid.n.32 polaroid.n.1
polaroid.n polaroid.n.39 polaroid.n.1
polaroid.n polaroid.n.49 polaroid.n.1
polaroid.n polaroid.n.50 polaroid.n.1
polaroid.n polaroid.n.61 polaroid.n.1
polaroid.n polaroid.n.6 polaroid.n.2
polaroid.n polaroid.n.16 polaroid.n.2
polaroid.n polaroid.n.20 polaroid.n.2
polaroid.n polaroid.n.21 polaroid.n.2
polaroid.n polaroid.n.22 polaroid.n.2
polaroid.n polaroid.n.26 polaroid.n.2
polaroid.n polaroid.n.31 polaroid.n.2
polaroid.n polaroid.n.38 polaroid.n.2
polaroid.n polaroid.n.42 polaroid.n.2
polaroid.n polaroid.n.53 polaroid.n.2
polaroid.n polaroid.n.56 polaroid.n.2
polaroid.n polaroid.n.58 polaroid.n.2
polaroid.n polaroid.n.62 polaroid.n.2
polaroid.n polaroid.n.7 polaroid.n.3
polaroid.n polaroid.n.12 polaroid.n.3
polaroid.n polaroid.n.15 polaroid.n.3
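
(For what it's worth, I read these key files as three whitespace-separated columns: target word, instance id, and sense label. A sketch of how I group them, assuming that layout:)

from collections import defaultdict

def read_key(path):
    """Read a key file with lines like
    'polaroid.n polaroid.n.5 polaroid.n.1' and return, per target word,
    a mapping from sense label to the set of instance ids assigned to it.
    Assumes the three-column layout shown above."""
    senses = defaultdict(lambda: defaultdict(set))
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            word, instance, sense = parts[0], parts[1], parts[2]
            senses[word][sense].add(instance)
    return senses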

Thanks!
Ted

Ted Pedersen

Apr 7, 2013, 8:15:03 PM
to semeval-2013-ws...@googlegroups.com
Following up on the S-recall and S-precision discussion: I re-ordered
my output via the following command, which gives me, I think,
something sorted like the gold data; that is, it is sorted first by
sense (so sense n.1 comes first, then n.2, etc.) and then by instance
number.

sort -k3,3 -s task11.duluth.sys1.pk2.txt > new.sys1.pk2
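
(A note on what that does: sort -k3,3 -s orders lines by the third column only, i.e. the sense label, compared lexically, and the -s keeps the original input order within each sense. If you also wanted instances sorted numerically within each sense, something like the following would do it; a sketch, assuming the same three-column format.)

def sort_key_lines(lines):
    """Sort key-file lines first by sense label (column 3), then by the
    numeric instance suffix of column 2 (e.g. 'polaroid.n.10' -> 10).
    Sketch only; assumes the three-column format shown earlier."""
    def key(line):
        cols = line.split()
        return (cols[2], int(cols[1].rsplit(".", 1)[-1]))
    return sorted(lines, key=key)

with open("task11.duluth.sys1.pk2.txt") as f:
    for line in sort_key_lines(f.read().splitlines()):
        print(line)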

When I scored new.sys1.pk2 relative to the gold data, S-precision and
S-recall changed a little, but not really that much. My other scores
(F1, ARI, etc.) all stayed the same, so apparently I made no major
mistakes. :)

The new values for S-recall (up to k=30) are shown on the left, and
the old (submitted) values are on the right.

NEW ---------------- OLD
k   S-recall         k   S-recall
+++++++++++++++++++++++++++++++++
1 0.1807 1 0.1807
2 0.2406 2 0.2414
3 0.2810 3 0.2851
4 0.3457 4 0.3310
5 0.3777 5 0.3711
6 0.4141 6 0.4141
7 0.4550 7 0.4412
8 0.4771 8 0.4805
9 0.5067 9 0.5096
10 0.5425 10 0.5329
11 0.5642 11 0.5614
12 0.5801 12 0.5801
13 0.6128 13 0.6029
14 0.6294 14 0.6298
15 0.6518 15 0.6478
16 0.6681 16 0.6680
17 0.6771 17 0.6783
18 0.6869 18 0.6886
19 0.7032 19 0.7022
20 0.7112 20 0.7124
21 0.7264 21 0.7238
22 0.7375 22 0.7354
23 0.7449 23 0.7505
24 0.7597 24 0.7597
25 0.7676 25 0.7714
26 0.7834 26 0.7833
27 0.7945 27 0.7929
28 0.8000 28 0.8008
29 0.8120 29 0.8117
30 0.8181 30 0.8181

The same is true for S-precision, new on the left, old on the right...

NEW ---------------------- OLD
r       S-precision        r       S-precision
+++++++++++++++++++++++++++++++++++++++++++++++
0.4000 0.4113 0.4000 0.4321
0.4500 0.4209 0.4500 0.4126
0.5000 0.4053 0.5000 0.4008
0.5500 0.3321 0.5500 0.3172
0.6000 0.3141 0.6000 0.3131
0.6500 0.3080 0.6500 0.2953
0.7000 0.2754 0.7000 0.2673
0.7500 0.2541 0.7500 0.2537
0.8000 0.2393 0.8000 0.2451
0.8500 0.2279 0.8500 0.2198
0.9000 0.2163 0.9000 0.2177
0.9500 0.2038 0.9500 0.2048
1.0000 0.1563 1.0000 0.1563

So, I think my main question is whether I'm understanding what I
should be doing here in order to get meaningful results from S-recall
and S-precision. It seems like I had a different idea of what it meant
for the results to be ranked, but the impression I have (mainly from
the gold data) is that this ranking mostly consists of organizing the
output in order by sense, and then by instance id...? Is that
accurate, or is there something more that should be happening?
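
(For what it's worth, my reading of S-recall@K, borrowed from the subtopic retrieval literature, is the fraction of distinct gold senses covered by the top K results. A minimal sketch of that reading follows; the official scorer may well differ.)

def s_recall_at_k(ranked_instances, gold_senses, k):
    """Subtopic-style recall at K (sketch): the fraction of distinct gold
    senses covered by the first K results of the system's ranking.
    ranked_instances: instance ids in ranked order.
    gold_senses: dict mapping instance id -> gold sense label."""
    all_senses = set(gold_senses.values())
    covered = {gold_senses[i] for i in ranked_instances[:k] if i in gold_senses}
    return len(covered) / len(all_senses) if all_senses else 0.0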

Also, I ran the gold data through the evaluation code and got very
high scores (I think) on S-precision and S-recall, so that's what led
me to think it was just an ordering issue, which I resolved, I think,
via my sort command...

Thanks!
Ted