TREC kba-streamcorpus-2013-v0_2_0 released

j...@mit.edu

Apr 7, 2013, 2:05:51 AM

to stream...@googlegroups.com

All,

The TREC kba-streamcorpus-2013-v0_2_0 is linked from here:

http://aws-publicdatasets.s3.amazonaws.com/trec/kba/index.html

It contains 1,040,520,595 documents in 11,948 hourly directories from October 2011 into February 2013. See the link above for more details on the substreams.

The total size of the data after XZ compression and GPG encryption is 7096486977581 bytes, or 6.45 TiB (about 7.10 TB in decimal units).
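
As a quick sanity check on these figures, here is a minimal sketch of the arithmetic. The 6.45 figure is the byte count divided by 2^40 (binary terabytes); dividing by 10^12 instead gives roughly 7.10. The function and key names are made up for illustration; only the three input numbers come from this message.

```python
# Figures quoted in the announcement above.
TOTAL_BYTES = 7096486977581
TOTAL_DOCS = 1040520595
HOURLY_DIRS = 11948


def summarize(total_bytes, total_docs, hourly_dirs):
    """Return derived sizes as a dict (key names are illustrative only)."""
    return {
        "tib": total_bytes / 2**40,        # binary terabytes
        "tb": total_bytes / 10**12,        # decimal terabytes
        "docs_per_dir": total_docs / hourly_dirs,
        "bytes_per_doc": total_bytes / total_docs,
    }


if __name__ == "__main__":
    for name, value in summarize(TOTAL_BYTES, TOTAL_DOCS, HOURLY_DIRS).items():
        print(f"{name}: {value:,.2f}")
```

The per-directory and per-document values are plain averages over the whole corpus, so they say nothing about how unevenly the substreams are distributed across hours.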

The data is stored in concatenated streamcorpus.thrift messages, which you can read about here:

https://github.com/trec-kba/streamcorpus

Documents identified as English by the Chromium Compact Language Detector have been tagged with LingPipe from Alias-I. The resulting NER and coreference chains have been attached to the tokenization generated by Python's NLTK and are stored in StreamItem.body.sentences["lingpipe"] on each thrift message. See the streamcorpus interface definitions on GitHub for further details.

Special thanks to these sponsors for contributing content and tools:

http://spinn3r.com/

http://alias-i.com/

http://arxiv.org/

To get the GPG decryption key, send this data use restriction agreement to NIST:

http://trec.nist.gov/data/kba.html
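
Once you have the key, each file has to be unwrapped in the reverse order of how it was packed: GPG decryption first, then XZ decompression, which leaves the concatenated thrift bytes. The following is only a rough sketch of that order using Python's standard library. It assumes the `gpg` binary is on your PATH and that the key has already been imported into your keyring; the helper names are hypothetical and are not part of the streamcorpus package.

```python
import lzma
import subprocess


def gpg_decrypt(path):
    """Run `gpg --decrypt` on one file and return the decrypted bytes.

    Assumes gpg is installed and the decryption key is already in the
    local keyring; raises CalledProcessError if gpg fails.
    """
    result = subprocess.run(
        ["gpg", "--decrypt", path],
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout


def xz_decompress(data):
    """Undo the XZ layer; the result is the raw concatenated thrift bytes."""
    return lzma.decompress(data)


def open_chunk(path):
    """Decrypt, then decompress, a single file (hypothetical helper)."""
    return xz_decompress(gpg_decrypt(path))
```

This holds a whole file in memory at once, which keeps the sketch short; for bulk processing, a shell pipeline or streaming reads would probably be a better fit. Parsing the resulting bytes into StreamItem messages is left to the streamcorpus tools linked above.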

We are working on a v0_3_0 with even more NLP metadata, which we hope to finish in the next month or so, cloud willing.

Enjoy!

The KBA Organizers

weitai.z...@gmail.com

Apr 14, 2013, 11:13:47 PM

to stream...@googlegroups.com

John,

I am eager to know which part of the corpus is the annotation.

thanks.

On Sunday, April 7, 2013 at 2:05:51 PM UTC+8, j...@mit.edu wrote:

j...@mit.edu

Apr 19, 2013, 7:54:00 PM

to stream...@googlegroups.com

The english-and-unknown-language subset of the corpus is now posted. See link here:

http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html

jrf

Tom Kenter

Jul 25, 2013, 7:45:53 AM

to stream...@googlegroups.com

Dear all,

I have a question about this.

I guess some of the files/chunks mentioned in the ground truth may simply not be present in the english-and-unknown-language subset.

But actually I am having a hard time finding any of the chunks mentioned in trec-kba-ccr-judgments-2013-07-08.before-cutoff.chunk-to-stream_id-and-sizes.txt.

Is this just bad luck, or am I missing or misunderstanding something here?

Cheers,

Tom

John R. Frank

Jul 25, 2013, 2:16:50 PM

to Tom Kenter, stream...@googlegroups.com

> But actually I am having a hard time finding any of the chunks mentioned
> in trec-kba-ccr-judgments-2013-07-08.before-cutoff.chunk-to-stream_id-and-sizes.txt.

The paths in that file are to the full corpus, not the stripped corpus,
which has two hashes in each file name.

See here:

http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html

John
