11 missing labeled stream_ids


John R. Frank

Aug 29, 2014, 1:50:47 AM

to trec...@googlegroups.com

KBAers,

The 11 StreamItems indicated by the partial doc_id at the end of each of
these lines have a CCR judgment, but were filtered from the final corpus
(probably because they lacked sufficient NER tags).

These will *not* be included in scoring of runs.

2011-11-29-22/news-26-48d036f505c13d66eda1c4e139592355-8b45959e7446d1a1d8971739d51ba164-dd2f5a7ef75902302c8e76db3ba92426-a94fb537583a73f60b4f93f097b3efa5.sc.xz.gpg#f57abf663179f288
2011-12-08-00/news-82-7290ada51f0cebd7bc17a48cb3d79074-c00af3df77b830c9fba6792d78f6d7f4-46bf26d7725dea7720ca3ce02467b820-f29555cae507d5bd018011ffa47c3efa.sc.xz.gpg#f8e141f5db4e003a
2011-12-12-19/news-41-d91d8b7ad69f1496833feea5e70d720b-06dfa6fd3b35c11c8e41648b7a23a5cb-11cca29be5add330750f48e788cad92e-095d6585ec7ee76ba88c6b509718c0bc.sc.xz.gpg#47243919cf90a10e
2012-01-06-20/news-55-8900083cdeff04d83e7d5d16a6f83577-40ee29106008295fe2c6a6fd0da9b1f5-2790ce35f9147da2289934ce8e529aa8-c4b9ca189b9447149463071d149ef231.sc.xz.gpg#3dc8d981d49976e3
2012-01-10-23/news-44-386e04872f980e34ff1d2816089bef4b-97b32e9de624654cceeda77dd4f4de8e-ea7eb734175725c606fad7def68349fc-0ebe5c86072509371af00b88aafcbc5f.sc.xz.gpg#04013b061f737c74
2012-01-22-21/news-20-e8bac2e42af0b2d8b5529955935ab2b8-bff5343c5d11c7e648ca083256ed52b6-9527d3efc77cd72cef2e26b5b3b39414-7ea87994dee687af49a11769bb3eb1c0.sc.xz.gpg#9a81becae20c3324
2012-01-22-21/news-37-64730d3d93cc4d96f99441024c0bd94e-03d87caeee39faa38909e20799b63c3d-80542fdca003e1224330a6b0d49cd92a-8e5b625fb8e366e628a1d2f39b974c59.sc.xz.gpg#3522e287008852e9
2012-01-23-08/news-22-930cb1833fb3f3a94f0660cd0f8503c3-956f2bf36bc54a7fcd54549168c71670-29a06904a69ef8a9d412b9d88293ea93-5d98ef8addca5cbe36e2306cf15666e6.sc.xz.gpg#d867718f61d09a15
2012-01-23-16/news-63-79a6eed43b023530115d857072ec6f9e-ea6ca05ba3078b6453bdc16e339498d5-cca8cb2cb779ad975981cb1362bfde3c-c161fcfd89bc0e5a04e02f0d83bfe68e.sc.xz.gpg#1b605f28922c7e70
2012-01-23-21/news-55-683ed88bf9d3b437a0e53fea2662a9ef-8c8b1cae4a820322912153706b37be69-2a0f8a6ce93ab6609551ca7d337d74aa-4eae7fd9e68b861853601788ce780cbd.sc.xz.gpg#e7d532cc4f4b744e
2012-02-02-23/news-38-22bf508c6a6b2d35ece6e7fc916846cc-57f039d8d7dbe69dc7728b7c5fd639a6-91a6f8ff24a3eb55b6d9572aa543699f-c3819350fc5e12fe62589eded6b97354.sc.xz.gpg#626e3004fe312f6c
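For anyone who wants to drop these from their own truth data before scoring locally, here is a rough sketch of how the lines above could be parsed and matched. It assumes a stream_id has the form "<epoch_ticks>-<doc_id>" and that the partial doc_id is the first 16 hex characters of the doc_id; the helper names parse_entry and is_excluded, and the stream_ids in the example, are made up for illustration.

```python
def parse_entry(line):
    """Split '<date-hour dir>/<chunk file>#<partial doc_id>' into three parts."""
    path, half_docid = line.strip().rsplit('#', 1)
    date_hour, chunk_name = path.split('/', 1)
    return date_hour, chunk_name, half_docid


def is_excluded(stream_id, missing_half_docids):
    """True if the stream_id's doc_id starts with one of the missing partial ids.

    Assumption: stream_id looks like '<epoch_ticks>-<doc_id>'.
    """
    doc_id = stream_id.split('-', 1)[1]
    return doc_id[:16] in missing_half_docids


# First line of the list above, split across string literals for readability.
entry = ('2011-11-29-22/news-26-48d036f505c13d66eda1c4e139592355-'
         '8b45959e7446d1a1d8971739d51ba164-dd2f5a7ef75902302c8e76db3ba92426-'
         'a94fb537583a73f60b4f93f097b3efa5.sc.xz.gpg#f57abf663179f288')
date_hour, chunk_name, half_docid = parse_entry(entry)
missing = {half_docid}

# Hypothetical stream_ids: the timestamps and the trailing 16 characters are
# invented purely to exercise the prefix match.
print(is_excluded('1322607600-f57abf663179f288' + '0' * 16, missing))
print(is_excluded('1322607600-' + '1' * 32, missing))
```

In practice the `missing` set would be built from all 11 lines rather than just the first one.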

The attached file is the subset of the map file that corresponds to
StreamItems with a rating level assigned by assessors. The complete map
of chunk path to half-doc_id is still here (unchanged from earlier
posting):

http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-kba-filtered_chunk-to-docid.map

Here's an example Python snippet for working with this map file:

#!/usr/bin/env python2

from __future__ import print_function
import argparse
import sys

if __name__ == '__main__':
    p = argparse.ArgumentParser()
    p.add_argument('chunk_to_si', metavar='MAP_FILE')
    p.add_argument('labeled_docids', metavar='LABELED_DOCIDS_FILE')
    args = p.parse_args()

    # Each map line is "<chunk path>#<half doc_id>"; index it by half doc_id.
    chunk_to_si = {}
    for line in open(args.chunk_to_si):
        s3path, half_docid = line.strip().split('#')
        chunk_to_si[half_docid] = s3path

    # Look up each labeled doc_id by its first 16 characters.
    for docid in open(args.labeled_docids):
        docid = docid.strip()[0:16]
        cpath = chunk_to_si.get(docid, None)
        if cpath is None:
            print('%s not in map!' % docid, file=sys.stderr)
        else:
            print('%s#%s' % (cpath, docid))
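If you go on to fetch and decrypt the chunks, it may be handy to group the labeled doc_ids by chunk path so that each chunk only needs to be opened once. The following is only a sketch of that idea built on the same map format; the function name group_by_chunk and the sample paths and ids are hypothetical, not taken from the corpus.

```python
from collections import defaultdict


def group_by_chunk(map_lines, labeled_docids):
    """Return ({chunk path: [half doc_ids]}, [half doc_ids not in the map])."""
    # Same index as the script above: half doc_id -> chunk path.
    half_to_chunk = {}
    for line in map_lines:
        line = line.strip()
        if not line:
            continue
        path, half_docid = line.rsplit('#', 1)
        half_to_chunk[half_docid] = path

    by_chunk = defaultdict(list)
    not_in_map = []
    for docid in labeled_docids:
        half_docid = docid.strip()[:16]
        path = half_to_chunk.get(half_docid)
        if path is None:
            not_in_map.append(half_docid)
        else:
            by_chunk[path].append(half_docid)
    return dict(by_chunk), not_in_map


# Hypothetical sample data in the same "<chunk path>#<half doc_id>" format.
sample_map = [
    '2011-11-29-22/chunk-a.sc.xz.gpg#aaaaaaaaaaaaaaaa',
    '2011-11-29-22/chunk-a.sc.xz.gpg#bbbbbbbbbbbbbbbb',
    '2011-12-08-00/chunk-b.sc.xz.gpg#cccccccccccccccc',
]
sample_labeled = [
    'aaaaaaaaaaaaaaaa' + '0' * 16,
    'bbbbbbbbbbbbbbbb' + '0' * 16,
    'dddddddddddddddd' + '0' * 16,
]
grouped, not_in_map = group_by_chunk(sample_map, sample_labeled)
for path, halves in sorted(grouped.items()):
    print(path, halves)
print('not in map:', not_in_map)
```

With the real files you would pass open(MAP_FILE) and open(LABELED_DOCIDS_FILE) in place of the sample lists.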

Regards,
John