KBA scoring tool - toy KBA system


nacho

Jul 21, 2014, 1:34:00 PM
to trec...@googlegroups.com
Hi John,

I saw that there's a scoring tool available, https://github.com/trec-kba/kba-scorer. Is that the version that you will run, or are you planning to update it for this year's competition?
I also read that the toy KBA system will be updated. Is that right?
Regards

nacho

John R. Frank

Jul 21, 2014, 1:42:30 PM
to trec...@googlegroups.com
> I saw that there's a scoring tool
> available, https://github.com/trec-kba/kba-scorer. Is that the version
> that you will run, or are you planning to update it for this year's
> competition?

I need to test it on 2014 data, but for CCR it should "just work."

I have a list of changes partially started for SSF.



> I also read that the toy KBA system will be updated. Is that right?

Yes, I plan to update it to more closely resemble, and maybe replace, the
"oracle baseline" that we used last year.

We'll try to get a timetable for both of those circulated in the next
week or so.



jrf

nacho

Jul 22, 2014, 2:36:50 AM
to trec...@googlegroups.com
Hi John,

I just ran the scorer on the "after-cutoff" data, using it as if it were the data generated from a run. I expected to get F1 = 1 but didn't; I got something like the following for the vital-only task. I may have run the script with the wrong params. Not sure. This is what I tried:

python -m kba.scorer.ccr --cutoff-step 1 runs/ trec-kba-2014-07-11-ccr-and-ssf.after-cutoff.tsv >& 2014-kba-runs-vital-ccr.log &

    micro_average: max(F_1(avg(P), avg(R))): 0.912
    micro_average: max(avg(SU)):  0.936
    macro_average: max(F_1(avg(P), avg(R))): 0.580
    macro_average: max(avg(SU)):  0.542
    weighted_average: max(F_1(avg(P), avg(R))): 0.004
    weighted_average: max(avg(SU)):  0.004


I added the --include-useful flag and got the following:

    micro_average: max(F_1(avg(P), avg(R))): 0.999
    micro_average: max(avg(SU)):  0.999
    macro_average: max(F_1(avg(P), avg(R))): 0.961
    macro_average: max(avg(SU)):  0.960
    weighted_average: max(F_1(avg(P), avg(R))): 0.005
    weighted_average: max(avg(SU)):  0.005

Shall I try with different parameters?
Regards

nacho

John R. Frank

Jul 22, 2014, 7:17:12 AM
to trec...@googlegroups.com
Nacho,

Thanks for the heads up.

The scorer needs to learn about the new per-entity TTR boundary:

https://github.com/trec-kba/kba-scorer/blob/master/src/kba/scorer/ccr.py#L138
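
Roughly, the single global cutoff check there needs to become something like the sketch below; the names CUTOFF_BY_ENTITY, target_id, and timestamp are illustrative, not the actual ccr.py code:

    # Hypothetical per-entity TTR boundary check. Today the scorer
    # compares every assertion against one global cutoff; with the 2014
    # data each entity gets its own boundary.
    CUTOFF_BY_ENTITY = {}  # target_id -> cutoff in epoch seconds, from truth data

    def in_evaluation_period(target_id, timestamp):
        """Keep only assertions after this entity's own TTR boundary."""
        cutoff = CUTOFF_BY_ENTITY.get(target_id)
        return cutoff is not None and int(timestamp) > cutoff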

I'll get to this later this week.

jrf

nacho

Jul 28, 2014, 1:11:39 PM
to trec...@googlegroups.com
Hi John,

Did you get a chance to look at this?
I've been generating some basic runs and scoring them with my own tool. I'm getting "too good to be true" numbers, so I must be doing something wrong, but I wanted to double-check with the official one.
Thanks

nacho

John R. Frank

Jul 28, 2014, 9:40:54 PM
to trec...@googlegroups.com

KBAers,

This commit updates the KBA CCR (vital filtering) scorer:

https://github.com/trec-kba/kba-scorer/commit/9a8cc11b6e0f55f1382b23fb4bd1aaea7b289d3c

To properly show 100% on the truth data, one must run with
--require-positives=4 and --any-up.

--any-up resolves overlapping judgments using the heuristic that if *any*
assessor said it is vital, then it is vital.

--require-positives=4 matches the requirement that there be at least 5
true positives in the total data set, with 20% of them in the TTR (the
file named "...before-cutoff..."), which leaves at least 4 in the
evaluation period.
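
For intuition, here is a rough sketch of those two heuristics. It assumes judgments arrive as (stream_id, target_id, rating) tuples with vital encoded as rating 2; that encoding is an assumption for illustration, not the scorer's literal API:

    from collections import defaultdict

    def resolve_any_up(judgments):
        # --any-up: when assessors overlap on a (stream_id, target_id)
        # pair, the highest rating wins, so any vital vote means vital.
        best = {}
        for stream_id, target_id, rating in judgments:
            key = (stream_id, target_id)
            best[key] = max(best.get(key, rating), rating)
        return best

    def entities_with_enough_positives(resolved, min_positives=4, vital=2):
        # --require-positives=4: keep only entities with at least
        # min_positives vital judgments in the evaluation period.
        counts = defaultdict(int)
        for (stream_id, target_id), rating in resolved.items():
            if rating >= vital:
                counts[target_id] += 1
        return set(e for e, n in counts.items() if n >= min_positives)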


Nacho -- please let me know if you see any issues with this, and we look
forward to seeing your too-good-to-be-true runs.


Regards,
John



Jingtian Jiang

Aug 4, 2014, 9:59:43 PM
to trec...@googlegroups.com
Hi John,

Since there are 71 entities for CCR, our systems only output results for those 71 entities. However, there are 109 entities in the truth data, so the score is biased if we use the ...after-cutoff.tsv directly: the macro F1 = 0.652. If we filter the extra 38 entities out of the ...after-cutoff.tsv, then the macro F1 = 0.971.
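
As a rough sketch, this is the kind of filtering we applied; the TSV layout and the target_id column index are assumptions about the truth file, not its documented format:

    # Hypothetical filter: keep only truth rows whose target entity is
    # one of the 71 CCR query entities; target_col=3 is an assumed
    # column index for target_id in the truth TSV.
    def filter_truth(truth_path, out_path, query_entities, target_col=3):
        with open(truth_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                if line.startswith("#"):
                    fout.write(line)  # preserve comment lines
                    continue
                parts = line.rstrip("\n").split("\t")
                if len(parts) > target_col and parts[target_col] in query_entities:
                    fout.write(line)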

John R. Frank

Aug 4, 2014, 10:29:47 PM
to trec...@googlegroups.com


The --require-positives=4 flag filters out those 38 entities and also four
more, so that only 67 are used in the evaluation.

John

Jingtian Jiang

Aug 5, 2014, 1:00:50 AM
to trec...@googlegroups.com
Yes, I know that. The current scorer sums the P/R of the 67 entities and then divides the sums by 109, right?
So I wonder: should we remove the 38+4 entities after loading the annotations and before loading the system runs?

John R. Frank

Aug 5, 2014, 1:09:49 AM
to trec...@googlegroups.com
Hi Jingtian,

I think you are referring to this place where the averaging divides by
num_entities:

https://github.com/trec-kba/kba-scorer/blob/master/src/kba/scorer/_metrics.py#L217

When I run the scorer with --require-positives=4 --any-up, I see these two
log messages:

computed macro_average using num_entities=67
computed weighted_average using num_entities=67
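
That is, the averaging denominator is the count of retained entities. Roughly, with illustrative names rather than the actual _metrics.py code:

    # Macro averaging over only the retained entities; per_entity maps
    # target_id -> (precision, recall) after --require-positives filtering.
    def macro_average(per_entity):
        n = len(per_entity)  # 67 after filtering, not 109
        avg_p = sum(p for p, _ in per_entity.values()) / n
        avg_r = sum(r for _, r in per_entity.values()) / n
        f1 = 2 * avg_p * avg_r / (avg_p + avg_r) if (avg_p + avg_r) else 0.0
        return avg_p, avg_r, f1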

Am I misunderstanding you? If there is a bug in the scorer, we would be
very glad to uncover it.


John



On Mon, 4 Aug 2014, Jingtian Jiang wrote:

> Yes, I know that. The current scorer sums the P/R of the 67 entities and
> then divides the sums by 109, right? So I wonder: should we remove the

Jingtian Jiang

Aug 5, 2014, 1:36:56 AM
to trec...@googlegroups.com
Yes, that's what I mean. But I got messages like the ones below:
computed macro_average using num_entities=105
computed weighted_average using num_entities=105

BTW: my runs contain 109 entities.