duplicate doc_id and stream_id in kba-streamcorpus-2013-v0_2_0

John R. Frank

May 3, 2013, 2:24:15 PM
to stream...@googlegroups.com
> did anyone notice same document occurring multiple times in corpus with
> same streamid ?

The doc_id is the md5 hash of the abs_url, and is not unique, because the
same page may be revisited multiple times.

The stream_id is "%d-%s" % (epoch_ticks, doc_id) and is unique up to one
second, which means essentially always unique. We also aggressively
rejected frequent refetches of pages, e.g. some spinn3r substreams can
recheck a page thirty times in an hour because it keeps popping up on an
RSS feed for an auction or similar site.
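
A minimal sketch of the two identifiers described above. The function
names are hypothetical, and the UTF-8 encoding of the URL and the hex form
of the digest are assumptions, not a statement of how the corpus tooling
computes them:

```python
import hashlib


def make_doc_id(abs_url: str) -> str:
    # Assumption: the URL is hashed as UTF-8 bytes and rendered as hex.
    return hashlib.md5(abs_url.encode("utf-8")).hexdigest()


def make_stream_id(epoch_ticks: int, abs_url: str) -> str:
    # Same format string as in the post; %d keeps whole seconds only,
    # which is why the id is unique only up to one second.
    return "%d-%s" % (epoch_ticks, make_doc_id(abs_url))


# Two fetches of the same URL share a doc_id but get distinct stream_ids.
url = "http://example.com/page"
first = make_stream_id(1367605455, url)
second = make_stream_id(1367609055, url)
```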

The same stream_id may nevertheless appear multiple times in the corpus.
Of the 1.040 billion StreamItems of the kba-streamcorpus-2013,
approximately 1.025 billion have a unique stream_id. The duplicates are in
a small set of chunk files that slipped through a version of our validation
process that double checks the non-atomic-write nature of S3 uploading. We
may clean these out in the future. There were more duplicates in the
kba-stream-corpus-2012.
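
Until those chunk files are cleaned out, a consumer can drop repeats while
reading. This is only a sketch: first_occurrences is a hypothetical helper,
and it keeps every key it has seen in memory, so it suits one chunk or a
subset rather than the whole corpus at once.

```python
from typing import Callable, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def first_occurrences(
    items: Iterable[T], key: Callable[[T], Hashable]
) -> Iterator[T]:
    # Yield each item whose key has not been seen before, in input order.
    seen = set()
    for item in items:
        k = key(item)
        if k in seen:
            continue  # a repeat of an earlier stream_id: skip it
        seen.add(k)
        yield item


# Toy data: (stream_id, payload) pairs standing in for StreamItems.
items = [("1-a", "x"), ("2-a", "y"), ("1-a", "z")]
unique = list(first_occurrences(items, key=lambda pair: pair[0]))
```

With real StreamItems the key would instead read each item's stream_id.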


jrf
