updated tarball with training data for SSF and fixed stream_ids for CCR training examples

John R. Frank

Jul 19, 2013, 2:25:12 PM
to trec...@googlegroups.com, stream...@googlegroups.com

This updated tarball is now available:


It includes important updates on:

- plots describing the kba-streamcorpus data volumes per hour

- NEW: training data for KBA SSF

- assessor guidelines for both KBA SSF and CCR

- statistics on KBA SSF evaluation data

- fixed stream_ids for KBA CCR training data

Here is an excerpt from the updated README.rst in the tarball:

KBA 2013 has two tasks:
1) Cumulative Citation Recommendation (CCR), and
2) Streaming Slot Filling (SSF)

CCR is a document filtering task, and SSF is a slot filling task.

In studying CCR, many people realized that a large fraction of "vital"
documents can be explained with a sentence of the form:

"The entity's _____ attribute acquired this value: ____."
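For concreteness, one such slot fill could be represented as a small record like the following sketch. The entity, slot name, value, and stream_id here are all invented for illustration; the actual SSF submission format is defined in the assessor guidelines in the tarball:

```python
# A hypothetical slot-fill record matching the template sentence above:
# "The entity's <slot> attribute acquired this value: <value>."
slot_fill = {
    "entity": "ExampleCorp",           # target entity (invented)
    "slot": "FoundedBy",               # attribute / slot name (invented)
    "value": "Jane Doe",               # newly acquired value (invented)
    "stream_id": "1327990000-abc123",  # document evidencing the change (invented)
}

def render(fill):
    """Render the record back into the template sentence."""
    return ("The entity's %s attribute acquired this value: %s."
            % (fill["slot"], fill["value"]))
```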

In fact, it is an interesting research question to identify vital
documents that do not fit this pattern. These changing entity profiles
reflect real-world events, which often appear as spikes in this time
series plot of all of the vital documents across the entire
seventeen-month time range, which depicts both training and evaluation
ground truth data:


The underlying corpus time series is plotted here:


The CCR task requires coreference resolution of entity mentions. The SSF
task requires coreference resolution of both entities and slot fills.
See assessor guidelines for details.

One of KBA's goals is to attract researchers from both information
retrieval and natural language understanding. CCR naturally caters to IR,
and SSF to NLU. By weaving the two together, we hope to foster
collaboration between these two communities.

The KBA Organizers

John R. Frank

Aug 20, 2013, 12:31:52 PM
to trec...@googlegroups.com
> Can someone explain to me why there are two annotation files and how we
> are supposed to use them? The two files that I found confusing are:
> trec-kba-ccr-judgments-2013-04-08.before-cutoff.filter-run.txt and
> trec-kba-ccr-judgments-2013-07-08.before-cutoff.filter-run.txt

Answering this in trec-kba, moving streamcorpus to BCC.

No worries, this is not going to cause you any problems.

The first file had some stream_ids that were not discoverable, because the
epoch_ticks in the stream_id differed from the released corpus. I think
the number of non-discoverable stream_ids was 95.

The second file includes all stream_ids within 3700 seconds of an
annotated stream_id that have the same doc_id, i.e. the same abs_url.
Therefore, these "extra" examples are mostly duplicative, and just make it
more likely that you can locate the training example documents.
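The expansion described above can be sketched in a few lines of Python. This sketch assumes the usual streamcorpus convention that a stream_id has the form "<epoch_ticks>-<doc_id>"; the function and the in-memory corpus listing are illustrative, not the organizers' actual script:

```python
WINDOW = 3700  # seconds, per the judgment-file expansion described above

def parse_stream_id(stream_id):
    """Split 'epoch_ticks-doc_id' into (int epoch_ticks, doc_id)."""
    ticks, doc_id = stream_id.split("-", 1)
    return int(ticks), doc_id

def expand_judgments(annotated_ids, corpus_ids):
    """Return every corpus stream_id whose doc_id matches an annotated
    stream_id and whose timestamp falls within WINDOW seconds of it."""
    extra = set()
    for ann in annotated_ids:
        ann_ticks, ann_doc = parse_stream_id(ann)
        for sid in corpus_ids:
            ticks, doc = parse_stream_id(sid)
            if doc == ann_doc and abs(ticks - ann_ticks) <= WINDOW:
                extra.add(sid)
    return extra
```

Because near-in-time copies of the same doc_id share the same abs_url, the stream_ids this collects are mostly duplicates of the annotated documents, which is what makes the second judgment file easier to match against the released corpus.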

Let us know if you have other questions.
