SSF2013 metrics

shun

Nov 26, 2015, 11:03:59 PM
to TREC-KBA
Hello.

For SSF 2013, only the first level (DOCS) is a valid comparison across systems. I would like to know the reason for this.
Why aren't the other metrics (OVERLAP, FILL, DATE_HOUR) used to compare across systems?

Thanks,
Shun

John R. Frank

Nov 27, 2015, 7:54:17 AM
to TREC-KBA
> For SSF 2013, only the first level (DOCS) is a valid comparison across
> systems. I want to know this reason. Why aren't other metrics (OVERLAP,
> FILL, DATE_HOUR) used to compare across systems?

The recall of all SSF systems was generally under 10%, and each layer of
metrics after the first used only an individual system's TPs from the
previous layer. Thus, only the first layer (DOCS) considered the same
inputs for all runs.
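To make the point above concrete, here is a minimal sketch of layered scoring in which each system's next layer sees only that system's own true positives. The document IDs, system names, and the `docs_layer` helper are all hypothetical and are not taken from the KBA scorer or its data.

```python
# Hypothetical illustration: every ID and name here is invented, not KBA data.
truth_docs = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}

# Two made-up runs; "x*" IDs stand for documents absent from the truth set.
runs = {
    "system_a": {"d1", "d2", "x1"},
    "system_b": {"d9", "x2", "x3"},
}


def docs_layer(run_docs, truth):
    """Score the first layer against the full truth set shared by all runs."""
    tps = run_docs & truth
    precision = len(tps) / len(run_docs) if run_docs else 0.0
    recall = len(tps) / len(truth) if truth else 0.0
    return tps, precision, recall


# Each system's true positives become the only input to its next layer.
layer_inputs = {}
for name, run_docs in runs.items():
    tps, precision, recall = docs_layer(run_docs, truth_docs)
    layer_inputs[name] = tps
    print(f"{name}: DOCS P={precision:.2f} R={recall:.2f}, "
          f"next layer scored on {sorted(tps)}")

# With low recall, the inputs to the later layers may not overlap at all.
shared = layer_inputs["system_a"] & layer_inputs["system_b"]
print("documents scored for both systems at the next layer:", sorted(shared))
```

In this toy case the two systems carry disjoint sets of documents into the second layer, so their later-layer scores are computed over different inputs.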

Let me know if you have more questions.

jrf

shun

Nov 27, 2015, 10:17:08 AM
to TREC-KBA
Thank you for your reply.

I think all metrics could be used to compare across systems if each metric were computed over an individual system's entire run instead of only that system's TPs from the previous layer.
Why isn't each system's entire run used for all metrics?

Thanks,
shun

On Friday, November 27, 2015 at 21:54:17 UTC+9, John R. Frank wrote:

John R. Frank

Nov 27, 2015, 10:27:42 AM
to TREC-KBA

> I think that all metrics can be used to compare across systems if all
> metrics use all of individual system's run instead of an individual
> system's TPs from the previous layer.

That's probably true in theory.


> Why aren't all of individual system's run used for all metrics?

IIRC, in actuality, the numbers were indistinguishable from zero because
the recall was already so low that the filtering effect of each layer
reduced the number of TPs too much to be statistically meaningful.
Instead of an error analysis, it became an anecdotal success analysis.
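As a rough back-of-the-envelope sketch of that filtering effect, the snippet below multiplies an assumed retention fraction through the four layers. The truth-set size and the 10% retention per layer are invented for illustration and are not SSF 2013 results; only the layer names come from this thread.

```python
# Hypothetical numbers only: not taken from the SSF 2013 results.
layers = ["DOCS", "OVERLAP", "FILL", "DATE_HOUR"]
truth_items = 1000   # assumed size of the truth set
retention = 0.10     # assumed fraction of items surviving each layer


def surviving_tps(n_items, fraction, n_layers):
    """Return the expected TP count left after each successive layer."""
    counts = []
    remaining = float(n_items)
    for _ in range(n_layers):
        remaining *= fraction
        counts.append(remaining)
    return counts


counts = surviving_tps(truth_items, retention, len(layers))
for layer, count in zip(layers, counts):
    print(f"{layer}: about {count:g} true positives left")
```

Under these assumptions the expected count drops below one item by the last layer, which is the kind of sample size that cannot support a statistical comparison.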

It's also important to keep in mind that the assessing task was very hard,
and therefore had incomplete coverage, especially when compared with the
much better recall on the CCR task.

jrf
