TREC kba-streamcorpus-2013-v0_2_0 released


j...@mit.edu

Apr 7, 2013, 2:05:51 AM
to stream...@googlegroups.com
All,

The TREC kba-streamcorpus-2013-v0_2_0 is linked from here:


It contains 1,040,520,595 documents in 11,948 hourly directories from October 2011 into February 2013.  See link above for more details on substreams.

The total size of the data after XZ compression and GPG encryption is 7,096,486,977,581 bytes, or about 6.45 TiB.
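The byte count and the terabyte figure agree if the latter is read in binary units (1 TiB = 1024^4 bytes), as this quick Python check shows:

```python
# Convert the quoted byte count to binary terabytes (TiB = 1024**4 bytes).
total_bytes = 7096486977581
tib = total_bytes / 1024**4
print(round(tib, 2))  # → 6.45
```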

The data is stored in concatenated streamcorpus.thrift messages, which you can read about here:


Documents identified as English by the Chromium Compact Language Detector have been tagged with LingPipe from Alias-i; the resulting NER and coreference chains are attached to tokenizations generated by Python's NLTK and stored in StreamItem.body.sentences["lingpipe"] on each thrift message.  See the streamcorpus interface definitions on GitHub for further details.
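For readers new to the schema, here is a rough sketch of walking those sentences to collect named entities. The classes below are simplified stand-ins for the thrift-generated StreamItem/Sentence/Token types (field names follow the streamcorpus interface definitions, but this is not the real library):

```python
from dataclasses import dataclass, field
from typing import Optional

# Simplified stand-ins for the thrift-generated streamcorpus types;
# the real classes come from the streamcorpus package / .thrift files.
@dataclass
class Token:
    token: str
    entity_type: Optional[str] = None   # e.g. "PER", "ORG" from LingPipe NER

@dataclass
class Sentence:
    tokens: list = field(default_factory=list)

@dataclass
class Body:
    # tagger_id -> list of Sentence, e.g. body.sentences["lingpipe"]
    sentences: dict = field(default_factory=dict)

@dataclass
class StreamItem:
    body: Body = field(default_factory=Body)

def named_entities(si):
    """Collect (token, entity_type) pairs from the "lingpipe" tagging."""
    pairs = []
    for sent in si.body.sentences.get("lingpipe", []):
        for tok in sent.tokens:
            if tok.entity_type is not None:
                pairs.append((tok.token, tok.entity_type))
    return pairs

si = StreamItem()
si.body.sentences["lingpipe"] = [
    Sentence(tokens=[Token("John", "PER"), Token("works"),
                     Token("at"), Token("MIT", "ORG")])
]
print(named_entities(si))  # → [('John', 'PER'), ('MIT', 'ORG')]
```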

Special thanks to these sponsors for contributing content and tools: 





To get the GPG decryption key, send this data use restriction agreement to NIST:



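Once the key arrives, opening a chunk means GPG-decrypting and then XZ-decompressing it. The chunk name in the comment below is hypothetical, and the gpg stage needs the NIST-issued key, so this minimal Python sketch exercises only the XZ stage with stand-in bytes:

```python
import lzma

# Real pipeline, once NIST has sent back the key (chunk name is hypothetical):
#   gpg --decrypt chunk.sc.xz.gpg | xz --decompress > chunk.sc
# Here we exercise only the XZ stage with stand-in bytes.
raw = b"concatenated streamcorpus.thrift messages would be here"
compressed = lzma.compress(raw)        # stands in for chunk.sc.xz
assert lzma.decompress(compressed) == raw
```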
We are working on a v0_3_0 with even more NLP metadata, which we hope to finish in the next month or so, cloud willing.


Enjoy!


The KBA Organizers

weitai.z...@gmail.com

Apr 14, 2013, 11:13:47 PM
to stream...@googlegroups.com
John,
I am eager to know which part of the corpus contains the annotations.
Thanks.

On Sunday, April 7, 2013 at 2:05:51 PM UTC+8, j...@mit.edu wrote:

j...@mit.edu

Apr 19, 2013, 7:54:00 PM
to stream...@googlegroups.com
The english-and-unknown-language subset of the corpus is now posted.  See link here:

http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html


jrf

Tom Kenter

Jul 25, 2013, 7:45:53 AM
to stream...@googlegroups.com
Dear all,

I have a question about this.
I would guess that some files/chunks mentioned in the ground truth may simply not be present in the english-and-unknown-language subset.

But actually I am having a hard time finding any of the chunks mentioned in trec-kba-ccr-judgments-2013-07-08.before-cutoff.chunk-to-stream_id-and-sizes.txt.

Is this just bad luck or am I missing/misunderstanding something here...?!?

Cheers,

Tom

John R. Frank

Jul 25, 2013, 2:16:50 PM
to Tom Kenter, stream...@googlegroups.com
> But actually I am having a hard time finding any of the chunks mentioned
> in trec-kba-ccr-judgments-2013-07-08.before-cutoff.chunk-to-stream_id-and-sizes.txt.

The paths in that file are to the full corpus, not the stripped corpus,
which has two hashes in each file name.

See here:

http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html


John