[TREC KBA] truth data and filtered corpus (fwd)


John R. Frank

Jul 16, 2014, 3:49:27 PM
to trec...@googlegroups.com


The note below has been sent to all registered KBA teams. If you did not
receive it and the attached tarball, please reach out to me offlist.

jrf




---------- Forwarded message ----------
Date: Wed, 16 Jul 2014 15:46:50 -0400 (EDT)
From: John R. Frank <j...@mit.edu>
To: John R. Frank <j...@mit.edu>
Subject: [TREC KBA] truth data and filtered corpus


KBAers,


Thanks for your patience.


1) The official KBA 2014 truth data and scripts are attached. See README.txt



2) The filtered corpus is here:

http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html#kba-streamcorpus-2014-v0_3_0-kba-filtered

Stats: 20,494,260 StreamItems (text files) stored in 2,022,998 chunk files,
totaling 639GB (xz-compressed).



3) Runs are due by *Wednesday September 10th*

We *strongly* encourage you to submit runs earlier.

You can submit as many runs as you like.

We plan to fill the pooled assessment budget with results from each team's
best-scoring systems (scored using the attached truth data).


Please send questions to the trec...@googlegroups.com list


Regards,
John & The KBA Organizers

Vibou

Jul 17, 2014, 3:21:17 AM
to trec...@googlegroups.com
Hi John! Thanks very much for all this amazing work!

I have some questions, though:

I believe that the filtered corpus has been built using surface-form names of entities. Is it only the canonical name (from the yaml file), or did you also consider manually set variant names?

Is it worth using the full corpus to search with our own variant names ... ? If so, what happens if some documents from the full corpus end up in my run? Are those documents going to be evaluated manually, or would they just be ignored?

Last question on entities. There is a field external_profile in the yaml file. I guess we are allowed to use those web pages to build an entity profile from them. Could you maybe provide the html documents so that everybody starts with the same information? We never know whether a document will change depending on when you crawl it. Don't you think?

Best Regards,

Vincent.

John R. Frank

Jul 17, 2014, 5:24:10 AM
to trec...@googlegroups.com

Vincent,

Thanks for highlighting these issues.


> I believe that the filtered corpus has been built using surface-form
> names of entities. Is it only the canonical name (from the yaml file),
> or did you also consider manually set variant names?

We considered name variants and also all the strings that appeared in the
slot fills.

There *are* documents in the larger corpus that probably refer to the
target entities. We had to drop some of them in our semi-principled
sampling (and hacking) to cut the corpus size down from 11TB to 639GB,
because we want people to have enough cycles (mentally and
computationally) to actually try to use the Serif tagging data, which is
quite good.


> Is it worth using the full corpus to search with our own variant names
> ... ? If so, what happens if some documents from the full corpus end up in my run?

You can include stream_ids from the full corpus in your run submissions.
We might possibly filter them out before scoring.


> Are those documents going to be evaluated manually, or would they just be
> ignored?

They will be ignored in the second round of pooled assessing. I think we
have to do that in order to spend the assessor budget most fairly. If it
turns out that we have "extra" assessor time, we could consider adding them
in. Either way, we'd have to keep them separate from official eval sets
for scoring all systems.

The kba-streamcorpus-2014-v0_3_0-kba-filtered corpus is the *official*
corpus for the CCR and SSF tasks.

For the Accelerate & Create task--- anything goes! Use the full 16TB
corpus, if you'd like :-)



> Last question on entities. There is a field external_profile in the yaml
> file. I guess we are allowed to use those web pages to build an entity
> profile from them. Could you maybe provide the html documents so that
> everybody starts with the same information? We never know whether a
> document will change depending on when you crawl it. Don't you
> think?

See the two examples in the README.txt:

"... As with all of the KBA 2014 query entities, this entity is *defined*
by the documents the refer to it (rating=1 or rating=2) before the
training_time_range_end and also by its home page, which is provided as a
URL in the external_profile. A CCR system may only use the URLs in
external_profile, if it can ensure that it obeys the "no future info" rule
by accessing only versions of the content from before each date hour being
processed. This rule is straightforward to obey for the 27 entities with
Wikipedia URLs, and may also possible to obey using tools like the WayBack
Machine http://archive.org/help/wayback_api.php

"This second example is a person entity that did not have enough vitals to
be used as a query target in Vital Filtering (CCR). However, his profile
contains many slot fills, so he is a good SSF query. For SSF, this entity
is defined by *all* of the truth data that mentions him, and also the
external profiles. It is valid for an SSF system to use all of the truth
data from before and after the cutoff and also external profiles. (If we
find that this is sufficient to make SSF "easy," then there will be much
rejoicing and we will make it harder next year :-)"
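As an aside on the WayBack Machine suggestion in the first example above: here is a minimal sketch of checking for a snapshot that predates a given date hour, using the availability API described at the wayback_api.php page linked there. This is not part of the official KBA tooling; the helper name is made up, and it assumes Python 3.

    import json
    import urllib.request
    from datetime import datetime, timezone

    def wayback_url_before(url, epoch_seconds):
        # Ask the Wayback availability API for the capture closest to the
        # given moment.  The API returns the *closest* snapshot, which can be
        # later than requested, so re-check the timestamp to respect the
        # "no future info" rule.
        stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y%m%d%H%M%S")
        query = "http://archive.org/wayback/available?url=%s&timestamp=%s" % (url, stamp)
        with urllib.request.urlopen(query) as resp:
            data = json.load(resp)
        snap = data.get("archived_snapshots", {}).get("closest")
        if snap and snap.get("timestamp", "") <= stamp:
            return snap["url"]
        return None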


Does that answer your question?


John

nacho

Jul 18, 2014, 12:37:47 PM
to trec...@googlegroups.com
Hi John,

Quick question regarding the ground truth data. 
Is there a simple and fast way to get the file name that each of the examples in the truth data was computed from?
I see that each entry in the ground truth has streamId and folder, among other things. To get the specific file name where it appears, should I go through the folder, process the files, and see which one the streamId is in?
Maybe there's a faster and more direct way to do it?
Regards

nacho

Vibou

Jul 18, 2014, 1:49:56 PM
to trec...@googlegroups.com
Thanks a lot John. This answers all of my questions :D

Vibou

Jul 18, 2014, 1:53:56 PM
to trec...@googlegroups.com
The stream_id is composed of a timestamp and a hashcode. The first part (before the dash) is a timestamp, so you can take this and transform it into a date to obtain the name of the folder. Watch out, though, to use GMT+0; otherwise you may go to the wrong folder.

For the chunk, you unfortunately have to go through all chunks from that folder until you find the document with the correct stream-id.
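For the timestamp-to-folder part, something like this minimal sketch should work (the function name and the example stream_id are made up for illustration; assuming Python 3):

    from datetime import datetime, timezone

    def folder_for_stream_id(stream_id):
        # A stream_id looks like "{epoch_seconds}-{docid}"; the corpus
        # directories are named YYYY-MM-DD-HH in GMT+0.
        epoch = int(stream_id.split("-", 1)[0])
        return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d-%H")

    # e.g. folder_for_stream_id("1326334442-abc123...") -> "2012-01-12-02"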

Hope this helps.

Not sure about the GMT+0; John, could you confirm?

and...@diffeo.com

Jul 18, 2014, 2:35:49 PM
to trec...@googlegroups.com
There is now a chunk <-> half docid map in s3 at:

    http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-kba-filtered_chunk-to-docid.map

This will make it easier and faster to identify a chunk file containing a particular
stream item.

In the map file, each line contains a mapping from chunk to a half
document id. The format for each line is as follows:

    YYYY-MM-DD-HH/{chunk-name}.sc.xz.gpg#{first 16 characters of docid}

This can be used to locate chunks containing specific stream items.
Note that a stream id is the concatenation of a timestamp and a
document id. e.g.,

    {timestamp}-{docid}

You can look up the document id in the map using the first 16
characters of `{docid}`.
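For instance, a minimal sketch of that lookup (find_chunks is a hypothetical helper name, and it assumes the map file has been downloaded locally):

    def find_chunks(map_path, stream_id):
        # Each map line is "YYYY-MM-DD-HH/{chunk-name}.sc.xz.gpg#{half docid}".
        # A stream_id is "{timestamp}-{docid}", so match on the first 16
        # characters of the docid portion.
        half_docid = stream_id.split("-", 1)[1][:16]
        matches = []
        with open(map_path) as f:
            for line in f:
                chunk, _, docid_prefix = line.rstrip("\n").rpartition("#")
                if docid_prefix == half_docid:
                    matches.append(chunk)
        return matches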

- Andrew

Vibou

Jul 21, 2014, 3:43:42 AM
to trec...@googlegroups.com
Hi John,

I think there is a mistake in the YAML file. For the entity https://kb.diffeo.com/Corisa_Bell, the external_profile value is "g/". Is that normal?

regards,

Vincent.

Vibou

Jul 21, 2014, 3:49:56 AM
to trec...@googlegroups.com
I think another mistake crept in:

John R. Frank

Jul 21, 2014, 5:13:44 AM
to trec...@googlegroups.com
Vincent,

Thanks for catching that. Here are the two external profiles that should
be there:

http://www.corisabell.com/

https://twitter.com/CorisaBell


When we generate another official data release, these will be fixed in it.
(I don't have a planned date for that yet, but it might happen before the
end of August.) In the meantime, everyone can use this email as an
addition to that entity's slot values.


jrf

John R. Frank

Jul 21, 2014, 5:15:21 AM
to trec...@googlegroups.com


Yes, you are right. The assessors sometimes modified the external
profiles. We'll check more carefully for this in future exports.

jrf






John R. Frank

Jul 21, 2014, 9:51:10 AM
to trec...@googlegroups.com
> Not sure about the GMT+0; John, could you confirm?

Correct. All timestamps are GMT+0.


jrf

张东旭

Jul 21, 2014, 11:36:33 AM
to trec...@googlegroups.com
Hi, John. It's amazing to see the filtered corpus so that we can stop building that big index ourselves!
I have seen the yaml text. Are those 109 entities the whole query set for this year?

On Thursday, July 17, 2014 at 3:49:27 AM UTC+8, John R. Frank wrote:

John R. Frank

Jul 21, 2014, 11:42:48 AM
to trec...@googlegroups.com
> Hi, John. It's amazing to see the filtered corpus so that we can stop
> building that big index ourselves!

Yes. Hopefully this will allow everyone to focus on using the text in
more sophisticated ways to identify vital documents and select slot fill
texts.


> I have seen the yaml text. Are those 109 entities the whole query set for
> this year?

Yes, that's right: 109 entities. It is a bit smaller than last year,
because we had slightly less assessor time this year, and also spent more
of it on gathering slot fills.

These counts can put it in context:

$ egrep -ve '^#' trec-kba-2014-07-11-ccr-and-ssf.before-cutoff.tsv | cut -f 6 | sort | uniq -c
481 0
4354 1
3196 -1
951 2


$ egrep -ve '^#' trec-kba-2014-07-11-ccr-and-ssf.after-cutoff.tsv | cut -f 6 | sort | uniq -c
1914 0
15565 1
1614 -1
3262 2



jrf

张东旭

Jul 24, 2014, 7:29:37 AM
to trec...@googlegroups.com
Hi, John. I have a question.
When you mentioned
"We considered name variants and also all the strings that appeared in the slot fills,"
for instance, if there is a slot "New York", do you mean that every doc that contained "New York" would be in the filtered corpus?

And also, could you tell us what those name variants are exactly?



On Thursday, July 17, 2014 at 5:24:10 PM UTC+8, John R. Frank wrote:

John R. Frank

Jul 24, 2014, 12:39:07 PM
to trec...@googlegroups.com
On Thu, 24 Jul 2014, 张东旭 wrote:

> Hi, John. I have a question. When you mentioned "We considered name
> variants and also all the strings that appeared in the slot fills,"
>
> for instance, if there is a slot "New York", do you mean that every doc
> that contained "New York" would be in the filtered corpus?

No, that would be much larger than this 20M doc sample. The
...-kba-filtered corpus is a *sample* of the larger streamcorpus. The KBA
Vital Filtering and SSF tasks are focused on this sample.


> And also, could you tell us what those name variants are exactly?

We used a combination of information from the assessors doing slot fills,
spans found by Serif in rated documents, and text from external profiles.

If you can tell me more about what you are trying to do, I can try to
provide something useful.


John

张东旭

Jul 24, 2014, 9:49:12 PM
to trec...@googlegroups.com
Sorry, I didn't say it clearly.

What I understand (maybe not correctly) is that each doc in this *filtered* corpus has at least one query in it, which means I don't have to search for the docs that mention the query, which also means that query expansion is not as useful as before.
But I didn't know the meaning of "all the strings that appeared in the slot fills", so I asked the question before.

Am I right?

On Friday, July 25, 2014 at 12:39:07 AM UTC+8, John R. Frank wrote:

John R. Frank

Jul 25, 2014, 4:51:38 AM
to trec...@googlegroups.com

> each doc in this *filtered* corpus has at least one query in it

The filtered corpus of 20m docs is a high-recall set. It's not the
highest possible recall set from the 1.2b doc corpus, but it's pretty
high. Even though the assessors probably didn't touch every document that
should be rated=1 or 2, they probably touched a large fraction of them.
I predict that it was *at least* 20% of them, which means that the
filtered corpus is ~100x larger than the complete true positive set of
rating=1 or 2.

This means that instead of filtering the 1.2b docs by ~10,000x, your
system has to filter by ~100x. That's an aggregate number, i.e. across
all entities.


> query expansion is not that useful

The entities are all pretty small, and there are ~100 entities. So, for
*each* entity, you need to filter the 20m doc corpus by ~10,000x.

Whether that means query expansion is useful depends on your
implementation strategy.

I hope this explanation helps. Please feel free to ask more questions.


John

Vibou

Jul 25, 2014, 8:27:31 AM
to trec...@googlegroups.com
Hi everyone. 

For the ground truth files, is it best to consider the lowest class in overlapping assessments, or the highest?

In the previous scorer I think you only considered the lowest; am I wrong?

Best regards.

John R. Frank

Jul 25, 2014, 9:12:31 AM
to trec...@googlegroups.com


For (target_id, stream_id) pairs that have multiple judgments, there are
two general heuristics for resolving disagreement to get a single rating:
take min or max of the relevance ratings.

We have *not* done "annotation merging," in which a special assessor
resolves conflicts to give a final judgment.

If you want to simulate such a "final judgment," I think the most accurate
heuristic is to take the maximum, not the minimum.

For scoring, we'll study both heuristics.
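To illustrate the two heuristics, here is a minimal sketch (resolve_ratings is a hypothetical helper; it assumes you have already parsed (target_id, stream_id, rating) tuples out of the truth TSV):

    from collections import defaultdict

    def resolve_ratings(judgments, use_max=True):
        # judgments: iterable of (target_id, stream_id, rating) tuples.
        # Taking max approximates a "final judgment"; min is the more
        # conservative reading of assessor disagreement.
        grouped = defaultdict(list)
        for target_id, stream_id, rating in judgments:
            grouped[(target_id, stream_id)].append(rating)
        pick = max if use_max else min
        return {key: pick(ratings) for key, ratings in grouped.items()}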

jrf

张东旭

Jul 29, 2014, 5:36:12 AM
to trec...@googlegroups.com
Hi, John.
Is there a handbook of BBN's Serif tags or more details about tagging rules? I'm confused by tags like NPA, NPP, NX, SBAR, etc.

Second question. The yaml file gives the slot offsets but doesn't give the type of each slot. I don't know if it's proper to ask, but how can I use this information to train different models for different slots, since I can't use the slots in the yaml file for training? Maybe I have to find some patterns manually?



On Thursday, July 17, 2014 at 3:49:27 AM UTC+8, John R. Frank wrote:

John R. Frank

Jul 29, 2014, 9:27:14 AM
to trec...@googlegroups.com

> Hi, John. Is there a handbook of BBN's Serif tags or more details about
> tagging rules? I'm confused by tags like NPA, NPP, NX, SBAR, etc.

See the Serif section here:

http://trec-kba.org/kba-stream-corpus-2014.shtml

I'll seek more detailed docs.



> Second question. The yaml file gives the slot offsets but doesn't give
> the type of each slot. I don't know if it's proper to ask, but how can I
> use this information to train different models for different slots, since
> I can't use the slots in the yaml file for training? Maybe I have to
> find some patterns manually?

The slots are mostly covered by the TAC KBP slot ontology:

https://github.com/trec-kba/streamcorpus/blob/master/if/TAC_2013_KBP_Slot_Descriptions_1.0.pdf

so the many LDC corpora used in TAC KBP are useful training data.
You might also look at RelationFactory:

http://www.aclweb.org/anthology/E/E14/E14-2023.pdf

https://github.com/beroth/relationfactory


jrf

John R. Frank

Jul 29, 2014, 1:34:29 PM
to trec...@googlegroups.com

> Is there a handbook of BBN's Serif tags or more details about tagging
> rules? I'm confused by tags like NPA, NPP, NX, SBAR, etc.

Here is more info on the output from SERIF:

Most of the tags used by SERIF should be described here:
https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html

There are a few that are non-standard. Two that come to mind are:

NPP --> used for names

NPA --> base noun phrase

The names/entity/relation annotations should follow mostly from the ACE
2005 guidelines:

https://www.ldc.upenn.edu/collaborations/past-projects/ace/annotation-tasks-and-specifications



Let us know if you have further questions.


jrf

Jingtian Jiang

Jul 31, 2014, 9:55:50 PM
to trec...@googlegroups.com
Hi John,

I just had a question: I found that there are 71 entities in the training set but 109 entities in the testing set. So how do we "define" an entity if it is not in the training set?
Even for the entity https://kb.diffeo.com/Jamie_Lawson, not only is it out of the training set, but all the annotated docs in the testing set are unknown or non-referent. I wonder, is there a way to know what this entity is about?

John R. Frank

Aug 1, 2014, 9:07:00 PM
to trec...@googlegroups.com
> python construct_time_progression_per_entity.py  trec-kba-2014-07-11-ccr-and-ssf.profiles.yaml trec-kba-2014-07-11-vital-histogram.tsv  trec-kba-2014-07-11-ccr-and-ssf-query-topics.json

I just sent a tar archive to all registered teams including these two
files, so that you don't have to generate them.



> I just had a question: I found that there are 71 entities in the training
> set but 109 entities in the testing set. So how do we "define" an entity
> if it is not in the training set?

...before-cutoff.tsv has the 71 CCR query entities

...after-cutoff.tsv has the 71 CCR query entities and 38 more SSF query
entities, which do have training_time_range_end==None


If you play with the histograms and scorer, you might notice that the
scorer actually only uses 67 entities, because there are four that
meet the requirement of having five ratings but they are not all different
documents --- we sent the same text to multiple assessors.

This bit of python shows how to count to 67:

>>> from collections import defaultdict
>>> vitals = defaultdict(set)
>>> for line in open('trec-kba-2014-07-11-ccr-and-ssf.after-cutoff.tsv'):
...     if line.startswith('#'): continue
...     parts = line.split()
...     if int(parts[5]) == 2:
...         vitals[parts[3]].add(parts[2])

## parts[3] is target_id
## parts[2] is stream_id
## parts[5] is relevance rating

This is in the "after-cutoff" so we want to see at least 4 vitals, which
is 80% of 5 vitals and the other one is in the training data, i.e.
"before-cutoff"

>>> over_4 = list()
>>> for target_id, stream_ids in vitals.items():
...     if len(stream_ids) >= 4:
...         over_4.append(target_id)
...
>>> len(over_4)
67


The four entities that bring the count up to 71 are:

https://kb.diffeo.com/Josh_Vander_Vies
https://kb.diffeo.com/Mike_Kluse
https://kb.diffeo.com/Nolan_Watson
https://kb.diffeo.com/Genaveve_Starr

These four are valid CCR queries, but official scoring may decide to
ignore them.

If you use the scorer to load the after-cutoff.tsv file, it will ignore
the entities that do not have at least 5 vital documents in the total
ground truth data. Entities with fewer than five vitals are not CCR
queries.

Here is how to run the scorer:

python -m kba.scorer.ccr --any-up --require-positives=4 runs trec-kba-2014-07-11-ccr-and-ssf.after-cutoff.tsv

For entities that are included in CCR, at least 20% of the vitals are in
...before-cutoff.tsv, so the remainder must be at least 4, which is
enforced by --require-positives=4

For people building automatic CCR systems, this scoring tool is really the
only way that you should load the ...after-cutoff.tsv data.


*All* of the entities are valid SSF targets. The upcoming SSF scorer
release will clarify whether any of the 109 entities need to be dropped
from SSF.



John

Jingtian Jiang

Aug 3, 2014, 7:53:42 PM
to trec...@googlegroups.com
Thanks very much for your clear explanation, John.

Jingtian