interpreting S-precision and S-recall


Ted Pedersen

Apr 14, 2013, 4:43:51 PM
to semeval-2013-ws...@googlegroups.com
A question about S-precision and S-recall. In my system I didn't rank my
results, but rather just output them in order of the instance number (like
the following made-up example).

dog.n dog.n.1 dog.n.2
dog.n dog.n.2 dog.n.2
dog.n dog.n.3 dog.n.1
dog.n dog.n.4 dog.n.1
dog.n dog.n.5 dog.n.3
dog.n dog.n.6 dog.n.1
dog.n dog.n.7 dog.n.3

I'm thinking that means my S-precision and S-recall scores don't
have much validity. However, when I look at the gold standard key, what
I see is something like what I show below, where the "correct" answers
are sorted by cluster number (third column) and then by instance
number (second column).

dog.n dog.n.3 dog.n.1
dog.n dog.n.4 dog.n.1
dog.n dog.n.6 dog.n.1
dog.n dog.n.1 dog.n.2
dog.n dog.n.2 dog.n.2
dog.n dog.n.5 dog.n.3
dog.n dog.n.7 dog.n.3

It would be very easy for me to sort my answer file that way, and in
fact when I do, my S-precision and S-recall scores change, but only
very slightly.
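
(For reference, the re-sort I'm describing is just something like the
following throwaway Python sketch. The file name and the three-column
layout are assumptions based on the examples above; this isn't the
official scorer or anything.)

def index(s):
    # trailing numeric index, e.g. "dog.n.12" -> 12
    return int(s.rsplit(".", 1)[-1])

def sort_key(line):
    # columns: lemma, instance id, cluster id (e.g. "dog.n dog.n.1 dog.n.2")
    lemma, instance, cluster = line.split()
    # sort by cluster id, then instance id, numerically on the trailing index
    return (index(cluster), index(instance))

with open("answers.txt") as f:          # assumed file name
    lines = [l.strip() for l in f if l.strip()]

for line in sorted(lines, key=sort_key):
    print(line)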

So I guess I'm wondering: don't S-precision and S-recall depend on
the instances being ordered in terms of some measure of "quality"
rather than just by cluster id and then instance id? I have looked
through the answer key (2010 format) and *think* that it's just
organized by cluster id and then instance id (like the example immediately
above).

Am I missing something here? I don't know what to make of S-precision
and S-recall at this point.

Thanks for any advice!

Cordially,
Ted

Daniele Vann

Apr 16, 2013, 5:04:35 AM
to semeval-2013-ws...@googlegroups.com, tped...@d.umn.edu

Ted Pedersen

Apr 16, 2013, 7:52:22 AM
to semeval-2013-ws...@googlegroups.com
Thanks for sharing this, although I guess what I see in that thread is
part of my question. In that discussion it says that the ordering of
the instances within each cluster is important (because it affects the
"flattening" process used in S-recall and S-precision). However, in
the gold standard data, the order of the instances in the cluster is
simply based on the instance number. That doesn't seem to have
much to do with the "quality" or confidence of the clustering, and so
I'm puzzled as to why the gold standard data is ordered that way. For
example (this is a simple made-up example, but it reflects what I
see in the gold standard data):

dog.n dog.n.3 dog.n.1
dog.n dog.n.4 dog.n.1
dog.n dog.n.6 dog.n.1
dog.n dog.n.1 dog.n.2
dog.n dog.n.2 dog.n.2
dog.n dog.n.5 dog.n.3
dog.n dog.n.7 dog.n.3

In the gold standard data, what I see appears to be sorted
by the third column (sense) and then by the second column (instance id).
I really don't see how that fits with the idea of ordering the
within-cluster results based on some notion of confidence or quality.

"In order to perform the flattening procedure, WSD/WSI must provide
snippets in each cluster already sorted by the confidence according to
which the snippet belongs to the cluster, and must rank clusters
according to their diversity."

This is from the task page, but what I see in the gold data does not
seem to be organized by confidence, unless by some miracle confidence
is associated with the instance number. :) So my question really is,
how does the gold standard data reflect an ordering based on
"confidence", or what detail am I missing?

Thanks!
Ted

Ted Pedersen

Apr 23, 2013, 2:44:45 PM
to semeval-2013-ws...@googlegroups.com
So I'm still puzzling over S-precision and S-recall. I do understand
they are intended as the "featured" measures of the evaluation, and I'd
like to do more with them, but the truth is I feel like I'm missing
something vital in understanding them.

I've commented on how the 2010-format key simply appears to sort
the gold standard answers by subtopic (which I sometimes refer to as
cluster id) and then by result id (which I sometimes refer to as
instance number). I decided to look at the 2013-format key, just
to make sure there wasn't something different, but I saw the same
thing. Here are the first few lines of STRe1.txt (the gold standard).

1.1 1.1
1.1 1.2
1.1 1.3
1.1 1.4
1.1 1.5
1.1 1.9
1.1 1.10
1.1 1.11
1.1 1.13
1.1 1.14
1.1 1.23
1.1 1.32
1.1 1.39
1.1 1.49
1.1 1.50
1.1 1.61
1.2 1.6
1.2 1.16
1.2 1.20
1.2 1.21

What we see here is the subtopic ID and then the result ID. You will
see that the result IDs are sorted numerically within each cluster,
which doesn't make sense to me, since my understanding is that results
within a cluster are to be sorted by confidence.

My question is really just this: how can I see the confidence
ordering of the result IDs in the gold standard data? I'd like to
understand this better, but I keep running into this issue of how the
result IDs are ordered. They just look like they are ordered by result
ID within the clusters, and I don't see how that can reflect any
notion of confidence.

I'd really like to figure this out before the camera-ready deadline
(which is April 29, I think), so any help of any kind would be most
appreciated!

Thanks!
Ted.

Jey Han Lau

Apr 23, 2013, 8:05:18 PM
to semeval-2013-ws...@googlegroups.com
Hi Ted,

The rationale for taking the confidence of instances in
system-induced clusters into account is that collapsing the system-induced
clusters into a list uses the order of the instances: putting the instances
you have high confidence in first within each cluster means the
collapsed list has a higher chance of covering more unique gold
senses early on. The gold senses, however, do not need to be ordered.

To give a toy example, say the gold standard is (3 senses, 6 instances):
1.1 1.1
1.1 1.2
1.1 1.3
1.2 1.4
1.3 1.5
1.3 1.6

And your system-induced clusters (without taking confidence into account):
1.1 1.1
1.2 1.2
1.2 1.3
1.2 1.4
1.3 1.5
1.3 1.6

This would yield the collapsed list: (1.1), (1.2), (1.5), (1.3), (1.6), (1.4)
The number of distinct senses covered for each k (i.e. recall@k) would be:
k=1: 1
k=2: 1
k=3: 2
k=4: 2
k=5: 2
k=6: 3

However, if the system had sorted the instances within each cluster by confidence and gave:
1.1 1.1
*1.2 1.4*
1.2 1.2
1.2 1.3
1.3 1.5
1.3 1.6

The clusters are essentially the same, but the order of instances has
changed in induced cluster 1.2. This would give the collapsed list:
(1.1), (1.4), (1.5), (1.2), (1.6), (1.3)
And the recall@k would be:
k=1: 1
k=2: 2
k=3: 3
k=4: 3
k=5: 3
k=6: 3

Which is a much better result.
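
In case it's useful, here is a rough Python sketch of the collapsing and
recall@k computation, just reproducing the toy numbers above. I'm assuming
the collapse takes the first instance of each cluster, then the second of
each, and so on, which is what the example does; this isn't the official
scorer.

def flatten(clusters):
    # round-robin collapse: first instance of each cluster, then the
    # second of each, and so on
    flat = []
    rank = 0
    while any(rank < len(c) for c in clusters):
        flat.extend(c[rank] for c in clusters if rank < len(c))
        rank += 1
    return flat

def recall_at_k(flat, gold_sense, k):
    # number of distinct gold senses covered by the first k results
    return len({gold_sense[inst] for inst in flat[:k]})

# gold standard from the toy example: instance -> sense
gold_sense = {"1.1": "1.1", "1.2": "1.1", "1.3": "1.1",
              "1.4": "1.2", "1.5": "1.3", "1.6": "1.3"}

# system clusters, unranked vs. confidence-sorted within cluster 1.2
unranked = [["1.1"], ["1.2", "1.3", "1.4"], ["1.5", "1.6"]]
ranked   = [["1.1"], ["1.4", "1.2", "1.3"], ["1.5", "1.6"]]

for clusters in (unranked, ranked):
    flat = flatten(clusters)
    print(flat)
    print([recall_at_k(flat, gold_sense, k) for k in range(1, 7)])

Running this gives the two collapsed lists and the recall@k values
[1, 1, 2, 2, 2, 3] and [1, 2, 3, 3, 3, 3] shown above.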

Hope that helps.


Cheers,
Jey Han

Ted Pedersen

Apr 23, 2013, 9:54:57 PM
to semeval-2013-ws...@googlegroups.com
Hi Jey Han,

Wow. That was brilliant. Thank you! I could feel the scales dropping
away from my eyes as I read your explanation. It all makes good sense,
many many thanks!!

Cordially,
Ted

Jey Han Lau

Apr 23, 2013, 10:05:46 PM
to semeval-2013-ws...@googlegroups.com
All good. Important that the measures make sense since we're working
in a word sense application.

Alright. That wasn't too funny.


Jey Han