updated tarball with training data for SSF and fixed stream_ids for CCR training examples


John R. Frank

Jul 19, 2013, 2:25:12 PM
To: trec...@googlegroups.com, stream...@googlegroups.com
KBAers,

This updated tarball is now available:

http://trec.nist.gov/act_part/tracks/kba/2013/trec-kba-ccr-and-ssf-2013-07-19.95d535403012fcb2447ce3389f764ebe.tar.gz

It includes important updates on:

- plots describing the kba-streamcorpus data volumes per hour

- NEW: training data for KBA SSF

- assessor guidelines for both KBA SSF and CCR

- statistics on KBA SSF evaluation data

- fixed stream_ids for KBA CCR training data


Here is an excerpt from the updated README.rst in the tarball:

KBA 2013 has two tasks:
1) Cumulative Citation Recommendation (CCR), and
2) Streaming Slot Filling (SSF)

CCR is a document filtering task, and SSF is a slot filling task.

In studying CCR, many people realized that a large fraction of "vital"
documents can be explained with a sentence of the form:

"The entity's _____ attribute acquired this value: ____."


In fact, it is an interesting research question to identify vital
documents that do not fit this pattern. These changing entity profiles
reflect real-world events, which often appear as spikes in this time
series plot of all of the vital documents across the entire
seventeen-month time range, which depicts both training and evaluation
ground truth data:

trec-kba-ccr-2013-vital-doc-counts-per-entity-per-hour.pdf


The underlying corpus time series is plotted here:

kba-streamcorpus-2013-v0_2_0-source-counts-per-hour.pdf


The CCR task requires coreference resolution of entity mentions. The SSF
task requires coreference resolution of both entities and slot fills.
See assessor guidelines for details.

One of KBA's goals is to attract researchers from both information
retrieval and natural language understanding. CCR naturally caters to IR,
and SSF to NLU. By weaving the two together, we hope to foster
cross-pollination.




Regards,
The KBA Organizers

John R. Frank

Aug 20, 2013, 12:31:52 PM
To: trec...@googlegroups.com
> Can someone explain to me why there are two annotation files and how we
> are supposed to use them? The two files that I found confusing are:
> trec-kba-ccr-judgments-2013-04-08.before-cutoff.filter-run.txt and
> trec-kba-ccr-judgments-2013-07-08.before-cutoff.filter-run.txt


Answering this in trec-kba, moving streamcorpus to BCC.

No worries, this is not going to cause you any problems.

The first file had some stream_ids that were not discoverable, because the
epoch_ticks in the stream_id differed from the released corpus. I think
the number of non-discoverable stream_ids was 95.

The second file includes all stream_ids within 3700 seconds of an
annotated stream_id that have the same doc_id, i.e. the same abs_url.
Therefore, these "extra" examples are mostly duplicative, and just make it
more likely that you can locate the training example documents.
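To make the expansion rule concrete, here is a minimal sketch of how the
"extra" judgments could be derived. It assumes stream_ids have the form
"<epoch_ticks>-<doc_id>" and that the helper names and the in-memory lists
are illustrative, not part of the released tooling:

```python
def parse_stream_id(stream_id):
    """Split a stream_id of the form '<epoch_ticks>-<doc_id>'."""
    ticks, doc_id = stream_id.split("-", 1)
    return int(ticks), doc_id

def expand_judgments(annotated_ids, corpus_ids, window=3700):
    """Return corpus stream_ids sharing a doc_id with an annotated
    stream_id and whose epoch_ticks differ by at most `window` seconds."""
    # Index the annotated epoch_ticks by doc_id (i.e. by abs_url).
    by_doc = {}
    for sid in annotated_ids:
        ticks, doc_id = parse_stream_id(sid)
        by_doc.setdefault(doc_id, []).append(ticks)
    # Keep every corpus stream_id within the time window of a match.
    extra = []
    for sid in corpus_ids:
        ticks, doc_id = parse_stream_id(sid)
        if any(abs(ticks - t) <= window for t in by_doc.get(doc_id, [])):
            extra.append(sid)
    return extra
```

For example, with an annotated id "1000-abc", a corpus id "4500-abc" (3500
seconds away, same doc_id) would be kept, while "5000-abc" (4000 seconds
away) and "1000-def" (different doc_id) would not.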


Let us know if you have other questions.


John