All,
The TREC kba-streamcorpus-2013-v0_2_0 is linked from here:
It contains 1,040,520,595 documents in 11,948 hourly directories, spanning October 2011 into February 2013. See the link above for more details on substreams.
The total size of the data after XZ compression and GPG encryption is 7,096,486,977,581 bytes, which is 6.45 TiB (about 7.10 TB in decimal units).
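For anyone planning their download pipeline, here is a minimal sketch of undoing the two layers for a single file. It assumes the layers were applied in the order stated above (XZ first, then GPG), so decryption comes first; that the `gpg` binary is installed and already holds the decryption key; and that holding one file in memory is acceptable. The function names are hypothetical and not part of any corpus tooling.

```python
import lzma
import subprocess


def decrypt_gpg(path):
    """Return the decrypted bytes of one GPG-encrypted file.

    Assumes the `gpg` command-line tool is on PATH and the corpus key
    has already been imported into the local keyring.
    """
    # `gpg --decrypt FILE` writes the plaintext to stdout.
    result = subprocess.run(
        ["gpg", "--decrypt", path],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if gpg fails
    )
    return result.stdout


def decompress_xz(blob):
    """Return the decompressed bytes of an in-memory XZ stream."""
    return lzma.decompress(blob)


def load_plain_bytes(path):
    """Undo both layers: GPG decryption first, then XZ decompression."""
    return decompress_xz(decrypt_gpg(path))
```

The bytes returned by `load_plain_bytes` would then be the serialized thrift messages. For large files a streaming pipeline is probably a better fit than reading everything into memory.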
The data is stored in concatenated streamcorpus.thrift messages, which you can read about here:
Documents identified as English by the Chromium Compact Language Detector have been tagged with LingPipe from Alias-i. The resulting NER tags and coreference chains are attached to the tokenization generated by Python's NLTK and stored in StreamItem.body.sentences["lingpipe"] on each thrift message. See the streamcorpus interface definitions on GitHub for further details.
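As a rough illustration of how the tagging might be consumed, the sketch below groups tagged tokens into coreference chains for one StreamItem. The path StreamItem.body.sentences["lingpipe"] comes from the description above. The attribute names `tokens`, `token`, `entity_type`, and `equiv_id` are assumptions about the interface definitions and should be checked against them. The demo uses plain stand-in objects with placeholder entity labels rather than real thrift messages.

```python
from collections import defaultdict
from types import SimpleNamespace


def coref_chains(stream_item, tagger_id="lingpipe"):
    """Group the tagged tokens of one StreamItem by coreference chain id.

    Assumed field names (verify against the interface definitions):
    sentence.tokens, token.token, token.entity_type, token.equiv_id.
    """
    chains = defaultdict(list)
    body = stream_item.body
    sentences = (body.sentences or {}).get(tagger_id, []) if body else []
    for sentence in sentences:
        for tok in sentence.tokens:
            # Assumption: tokens outside any entity have entity_type None.
            if tok.entity_type is None:
                continue
            chains[tok.equiv_id].append((tok.entity_type, tok.token))
    return dict(chains)


def _tok(text, entity_type=None, equiv_id=-1):
    """Build a stand-in token for the demo (not a real thrift Token)."""
    return SimpleNamespace(token=text, entity_type=entity_type, equiv_id=equiv_id)


# Stand-in StreamItem with two sentences; "PER" is a placeholder label.
demo_item = SimpleNamespace(
    body=SimpleNamespace(
        sentences={
            "lingpipe": [
                SimpleNamespace(tokens=[_tok("Alice", "PER", 0), _tok("met"), _tok("Bob", "PER", 1)]),
                SimpleNamespace(tokens=[_tok("She", "PER", 0), _tok("smiled")]),
            ]
        }
    )
)

for chain_id, mentions in sorted(coref_chains(demo_item).items()):
    print(chain_id, mentions)
```

Documents that were not identified as English would have no "lingpipe" entry, which is why the lookup falls back to an empty list.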
Special thanks to these sponsors for contributing content and tools:
To get the GPG decryption key, send this data use restriction agreement to NIST:
We are working on a v0_3_0 with even more NLP metadata, which we hope to finish in the next month or so, cloud willing.
Enjoy!
The KBA Organizers