John R. Frank
May 3, 2013, 2:24:15 PM
to stream...@googlegroups.com
> did anyone notice same document occurring multiple times in corpus with
> same streamid ?
The doc_id is the md5 hash of the abs_url. It is not unique, because the
same page may be revisited multiple times.
The stream_id is "%d-%s" % (epoch_ticks, doc_id) and is unique to within
one second, which in practice means essentially always unique. We also
aggressively rejected frequent refetches of pages. For example, some
spinn3r substreams can recheck a page thirty times in an hour because it
keeps popping up on an RSS feed for an auction or similar site.
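
As a rough illustration of the two identifiers described above, here is a
minimal sketch. The function names, the UTF-8 encoding of the URL, the use
of the hex digest, and the example URL and timestamps are my assumptions
for illustration, not necessarily what the corpus tooling does.

```python
import hashlib


def make_doc_id(abs_url):
    """Return the md5 hash of abs_url as a hex string.

    Assumption: the URL is encoded as UTF-8 and the hex digest is used.
    """
    return hashlib.md5(abs_url.encode("utf-8")).hexdigest()


def make_stream_id(epoch_ticks, abs_url):
    """Build a stream_id using the "%d-%s" format given in this post."""
    return "%d-%s" % (epoch_ticks, make_doc_id(abs_url))


# Hypothetical URL fetched at two hypothetical times (seconds since epoch).
url = "http://example.com/some/page"
first_fetch = make_stream_id(1367000000, url)
second_fetch = make_stream_id(1367003600, url)

# Same page, so the doc_id part is identical; the stream_ids differ
# because the fetch times differ by more than one second.
print(first_fetch)
print(second_fetch)
```

Two fetches of the same page within the same second would produce the same
stream_id under this scheme, which is the "unique to within one second"
caveat.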
Even so, the same stream_id may appear multiple times in the corpus. Of the
1.040 billion StreamItems of the kba-streamcorpus-2013, approximately 1.025
billion have a unique stream_id. These duplicates are in a small set of
chunk files that slipped through a version of our validation process that
double checks for the non-atomic-write nature of S3 uploading. We may clean
these out in the future. There were more duplicates in the
kba-stream-corpus-2012.
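
If the duplicates matter for your system, one way to handle them is to
keep only the first item seen for each stream_id. The sketch below is
generic Python, not part of any corpus tooling. The helper names are
hypothetical, and it assumes the stream_ids for the portion you are
processing fit in memory.

```python
from collections import Counter


def duplicate_stream_ids(stream_ids):
    """Return {stream_id: count} for every stream_id seen more than once."""
    counts = Counter(stream_ids)
    return {sid: n for sid, n in counts.items() if n > 1}


def dedupe(items, key=lambda item: item):
    """Yield items in order, skipping any whose key was already seen."""
    seen = set()
    for item in items:
        k = key(item)
        if k in seen:
            continue
        seen.add(k)
        yield item


# Hypothetical stream_ids; the doc_id part is shortened for readability.
sample = ["1367000000-aaa", "1367000000-aaa", "1367003600-aaa", "1367000000-bbb"]

dups = duplicate_stream_ids(sample)
unique = list(dedupe(sample))
print(dups)
print(unique)
```

For whole StreamItems rather than bare strings, pass something like
key=lambda si: si.stream_id to dedupe.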
jrf