KBA scoring tool - toy KBA system


nacho

Jul 21, 2014, 1:34:00 PM
to trec...@googlegroups.com
Hi John,

I saw that there's a scoring tool available, https://github.com/trec-kba/kba-scorer. Is that the version that you will run, or are you planning to update it for this year's competition?
I also read that the toy KBA system will be updated. Is that right?
Regards

nacho

John R. Frank

Jul 21, 2014, 1:42:30 PM
to trec...@googlegroups.com
> I saw that there's a scoring tool
> available, https://github.com/trec-kba/kba-scorer. Is that the version
> that you will run, or are you planning to update it for this year's
> competition?

I need to test it on 2014 data, but for CCR it should "just work."

I have a list of changes partially started for SSF.



> I also read that the toy KBA system will be updated. Is that right?

Yes, I plan to update it to more closely resemble, and maybe replace, the
"oracle baseline" that we used last year.

We'll try to get a timetable for both of those circulated in the next
week or so.



jrf

nacho

Jul 22, 2014, 2:36:50 AM
to trec...@googlegroups.com
Hi John,

I just ran the scorer on the "after-cutoff" data, using it as if it were the data generated from a run. I expected to get F1 = 1 but didn't; I got something like the following for the vital-only task. I may have run the script with the wrong params. Not sure. This is what I tried:

python -m kba.scorer.ccr --cutoff-step 1 runs/ trec-kba-2014-07-11-ccr-and-ssf.after-cutoff.tsv >& 2014-kba-runs-vital-ccr.log &

    micro_average: max(F_1(avg(P), avg(R))): 0.912
    micro_average: max(avg(SU)):  0.936
    macro_average: max(F_1(avg(P), avg(R))): 0.580
    macro_average: max(avg(SU)):  0.542
    weighted_average: max(F_1(avg(P), avg(R))): 0.004
    weighted_average: max(avg(SU)):  0.004


I added the --include-useful flag and got the following:

    micro_average: max(F_1(avg(P), avg(R))): 0.999
    micro_average: max(avg(SU)):  0.999
    macro_average: max(F_1(avg(P), avg(R))): 0.961
    macro_average: max(avg(SU)):  0.960
    weighted_average: max(F_1(avg(P), avg(R))): 0.005
    weighted_average: max(avg(SU)):  0.005

Shall I try with different parameters?
Regards

nacho

John R. Frank

Jul 22, 2014, 7:17:12 AM
to trec...@googlegroups.com
Nacho,

Thanks for the heads up.

The scorer needs to learn about the new per-entity TTR boundary:

https://github.com/trec-kba/kba-scorer/blob/master/src/kba/scorer/ccr.py#L138
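
Roughly, the single global cutoff check there needs to become something like the sketch below; the names CUTOFF_BY_ENTITY, target_id, and timestamp are illustrative, not the actual ccr.py code:

    # Hypothetical per-entity TTR boundary check. Today the scorer
    # compares every assertion against one global cutoff; with the 2014
    # data each entity gets its own boundary.
    CUTOFF_BY_ENTITY = {}  # target_id -> cutoff in epoch seconds, from truth data

    def in_evaluation_period(target_id, timestamp):
        """Keep only assertions after this entity's own TTR boundary."""
        cutoff = CUTOFF_BY_ENTITY.get(target_id)
        return cutoff is not None and int(timestamp) > cutoff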

I'll get to this later this week.

jrf

nacho

Jul 28, 2014, 1:11:39 PM
to trec...@googlegroups.com
Hi John,

Did you get a chance to look at this?
I've been generating some basic runs and scoring them with my own tool. I'm getting "too good to be true" numbers, so I must be doing something wrong, but I wanted to double-check with the official one.
Thanks

nacho

John R. Frank

Jul 28, 2014, 9:40:54 PM
to trec...@googlegroups.com

KBAers,

This commit updates the KBA CCR (vital filtering) scorer:

https://github.com/trec-kba/kba-scorer/commit/9a8cc11b6e0f55f1382b23fb4bd1aaea7b289d3c

To properly show 100% on the truth data, one must run with
--require-positives=4 and --any-up.

--any-up resolves overlapping judgments using the heuristic that if *any*
assessor said it is vital, then it is vital.

--require-positives=4 matches the requirement that there be at least 5
true positives in the total data set, with 20% of them in the TTR (the
file named "...before-cutoff..."), which leaves at least 4 in the
evaluation period.
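
For intuition, here is a rough sketch of those two heuristics. It assumes judgments arrive as (stream_id, target_id, rating) tuples with vital encoded as rating 2; that encoding is an assumption for illustration, not the scorer's literal API:

    from collections import defaultdict

    def resolve_any_up(judgments):
        # --any-up: when assessors overlap on a (stream_id, target_id)
        # pair, the highest rating wins, so any vital vote means vital.
        best = {}
        for stream_id, target_id, rating in judgments:
            key = (stream_id, target_id)
            best[key] = max(best.get(key, rating), rating)
        return best

    def entities_with_enough_positives(resolved, min_positives=4, vital=2):
        # --require-positives=4: keep only entities with at least
        # min_positives vital judgments in the evaluation period.
        counts = defaultdict(int)
        for (stream_id, target_id), rating in resolved.items():
            if rating >= vital:
                counts[target_id] += 1
        return set(e for e, n in counts.items() if n >= min_positives)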


Nacho -- please let me know if you see any issues with this, and we look
forward to seeing your too-good-to-be-true runs.


Regards,
John



Jingtian Jiang

Aug 4, 2014, 9:59:43 PM
to trec...@googlegroups.com
Hi John,

Since there are 71 entities for CCR, our systems only output results for those 71 entities. However, there are 109 entities in the truth data, so the score is biased if we use the ...after-cutoff.tsv directly: the macro F1 = 0.652. If we filter the extra 38 entities out of the ...after-cutoff.tsv, then the macro F1 = 0.971.
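
As a rough sketch, this is the kind of filtering we applied; the TSV layout and the target_id column index are assumptions about the truth file, not its documented format:

    # Hypothetical filter: keep only truth rows whose target entity is
    # one of the 71 CCR query entities; target_col=3 is an assumed
    # column index for target_id in the truth TSV.
    def filter_truth(truth_path, out_path, query_entities, target_col=3):
        with open(truth_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                if line.startswith("#"):
                    fout.write(line)  # preserve comment lines
                    continue
                parts = line.rstrip("\n").split("\t")
                if len(parts) > target_col and parts[target_col] in query_entities:
                    fout.write(line)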

John R. Frank

Aug 4, 2014, 10:29:47 PM
to trec...@googlegroups.com


The --require-positives=4 flag filters out those 38 entities and also four
more, so that only 67 are used in the evaluation.

John

Jingtian Jiang

Aug 5, 2014, 1:00:50 AM
to trec...@googlegroups.com
Yes, I know that. The current scorer sums the P/R of the 67 entities and then divides the sums by 109, right?
So I wonder: should we remove the 38+4 entities after loading the annotations and before loading the system runs?

John R. Frank

Aug 5, 2014, 1:09:49 AM
to trec...@googlegroups.com
Hi Jingtian,

I think you are referring to this place where the averaging divides by
num_entities:

https://github.com/trec-kba/kba-scorer/blob/master/src/kba/scorer/_metrics.py#L217

When I run the scorer with --require-positives=4 --any-up, I see these two
log messages:

computed macro_average using num_entities=67
computed weighted_average using num_entities=67
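
That is, the averaging denominator is the count of retained entities. Roughly, with illustrative names rather than the actual _metrics.py code:

    # Macro averaging over only the retained entities; per_entity maps
    # target_id -> (precision, recall) after --require-positives filtering.
    def macro_average(per_entity):
        n = len(per_entity)  # 67 after filtering, not 109
        avg_p = sum(p for p, _ in per_entity.values()) / n
        avg_r = sum(r for _, r in per_entity.values()) / n
        f1 = 2 * avg_p * avg_r / (avg_p + avg_r) if (avg_p + avg_r) else 0.0
        return avg_p, avg_r, f1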

Am I misunderstanding you? If there is a bug in the scorer, we would be
very glad to uncover it.


John



On Mon, 4 Aug 2014, Jingtian Jiang wrote:

> Yes, I know that. The current scorer sums the P/R of the 67 entities and
> then divides the sums by 109, right? So I wonder: should we remove the

Jingtian Jiang

Aug 5, 2014, 1:36:56 AM
to trec...@googlegroups.com
Yes, that's what I mean. But I got messages like the ones below:
computed macro_average using num_entities=105
computed weighted_average using num_entities=105

BTW: my runs contain 109 entities.