Re: MD5 checksums for kba-2014-clean compressed files?
John R. Frank
Jan 14, 2015, 2:10:24 PM
to stream...@googlegroups.com
This question might come up for other people interested in using the TREC
KBA StreamCorpora described here:
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html
> What's the first hex string in the filename for?
Those are the hashes from earlier versions of the corpus.
When we transformed KBA 2013 into KBA 2014, we kept the 2013 hash and
appended the new 2014 hashes.
When we did Serif tagging, we added a third.
When we filtered for KBA, we added a fourth. Here is the progression for
one chunk. Notice the reduction in its document count:
500-->500-->397-->15
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/2011-10-13-23/social-500-7876e61d5367986c0371ffbd4f9b3c5f.sc.xz.gpg
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0/2011-10-13-23/social-500-7876e61d5367986c0371ffbd4f9b3c5f-c9a8a596287cdfaf93e7bbdb5209390e.sc.xz.gpg
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/2011-10-13-23/social-397-7876e61d5367986c0371ffbd4f9b3c5f-c9a8a596287cdfaf93e7bbdb5209390e-f165f6c95962a89c0bf0b95f64dfed16.sc.xz.gpg
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-kba-filtered/2011-10-13-23/social-15-7876e61d5367986c0371ffbd4f9b3c5f-c9a8a596287cdfaf93e7bbdb5209390e-f165f6c95962a89c0bf0b95f64dfed16-22d0fbfc295559b6e7855fa2a3b2ff72.sc.xz.gpg
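To make the naming scheme concrete, here is a rough Python sketch that splits one of the chunk names above into its source, document count, and chain of hashes. The function name parse_chunk_name and the regular expression are hypothetical, inferred only from the four example names in this message (a source word, a count, then one or more 32-character hex strings joined by dashes); other chunks in the corpus may not fit it.

```python
import posixpath
import re
from urllib.parse import urlparse

# Assumed pattern, inferred from the four example names above:
#   <source>-<doc count>-<hash>[-<hash>...].sc.xz.gpg
_CHUNK_RE = re.compile(
    r"^(?P<source>[A-Za-z_]+)"
    r"-(?P<count>\d+)"
    r"-(?P<hashes>[0-9a-f]{32}(?:-[0-9a-f]{32})*)"
    r"\.sc\.xz\.gpg$"
)


def parse_chunk_name(url_or_name):
    """Return (source, doc_count, hashes) for a chunk URL or bare file name.

    hashes is ordered oldest first: the leftmost hash comes from the
    earliest corpus version, and each later processing step appended one.
    """
    # Keep only the last path component, so full URLs and bare names both work.
    name = posixpath.basename(urlparse(url_or_name).path)
    match = _CHUNK_RE.match(name)
    if match is None:
        raise ValueError("unrecognized chunk name: %r" % name)
    return (
        match.group("source"),
        int(match.group("count")),
        match.group("hashes").split("-"),
    )
```

Applied to the four URLs above, the number of hashes grows by one at each step while the first hash stays the same, which is one way to trace a filtered chunk back to its 2013 original.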
jrf