StreamItem.stream_time.zulu_timestamp always corresponds to the
directory containing the chunk.
You can verify that using the attached python and this index file:
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-chunk-path-to-stream-id-time.txt.xz
Here's how to run the attached python:
xzcat /home/jrf/2013-corpus-time-stamps/kba-streamcorpus-2013-v0_2_0-chunk-path-to-stream-id-time.txt.xz | python ~/2013-corpus-time-stamps/verify-timestamps.py &>~/2013-corpus-time-stamps/verify-timestamps.log
uniq -c /home/jrf/2013-corpus-time-stamps/verify-timestamps.log
2222638 correct
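For readers without the attachment, here is a minimal sketch of the kind of check verify-timestamps.py performs. The input parsing is an assumption (each index line pairing a chunk path with a stream_id, the path containing a YYYY-MM-DD-HH directory, and the stream_id beginning with its epoch_ticks value); the attached script is authoritative.

```python
import time

# Hedged sketch of the timestamp check; the index-line layout assumed here
# (chunk path, then stream_id beginning with epoch_ticks) is not confirmed
# by the email -- see the attached verify-timestamps.py for the real check.
def check_line(line):
    path, stream_id = line.split()[:2]
    epoch_ticks = int(stream_id.split('-')[0])
    # find the date-hour directory component, e.g. "2012-05-01-12"
    date_hour = next(p for p in path.split('/') if len(p) == 13 and p[4] == '-')
    expected = time.strftime('%Y-%m-%d-%H', time.gmtime(epoch_ticks))
    return 'correct' if expected == date_hour else 'incorrect'

# example: 1335873600 is 2012-05-01 12:00:00 UTC
print(check_line('2012-05-01-12/news.sc.xz 1335873600-deadbeef'))
```

Piping the xzcat output through such a script and counting the results with uniq -c is what produced the "2222638 correct" line above.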
Three related issues:
1) Approximately 25% of the epoch_ticks values are off by 3600 seconds
as a result of an incorrect timezone conversion.
2) 95 of the examples (<1%) in the KBA 2013 training data are hard to
find because their timestamps are off by 3600 +/- 600 seconds, where the
+/- comes from reducing rapid duplicates. We will publish an updated list
of the training data to make these easy to find.
3) clean_html and clean_visible processing was run for all documents
classified as English by the CLD, and many (but not all!) of the documents
classified as "unknown" by the CLD. However, this process sometimes
failed to generate clean_html and clean_visible. When it failed on
documents that were shown to assessors, we *re-ran* those documents, which
in some cases created duplicates of those documents. This means that some
of the stream_ids that appear in the KBA training examples appear more
than once in the corpus, and some of those multiple instances may not have
clean_html or clean_visible. For any document in the training/evaluation
data, there should be at least one instance with clean_visible.
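To illustrate issue 1 concretely, here is a sketch of flagging epoch_ticks that disagree with the (correct) zulu_timestamp by the one-hour timezone error. The zulu_timestamp string format assumed here ("YYYY-MM-DDTHH:MM:SS..." in UTC) is an assumption, not taken from the email.

```python
import calendar
import time

# Sketch of detecting issue 1: epoch_ticks off by 3600 seconds relative
# to the correct zulu_timestamp. The timestamp string format is assumed.
def ticks_offset(epoch_ticks, zulu_timestamp):
    t = time.strptime(zulu_timestamp[:19], '%Y-%m-%dT%H:%M:%S')
    return epoch_ticks - calendar.timegm(t)

# a correct pair has offset 0; issue-1 items show an offset of +/-3600
print(ticks_offset(1335877200, '2012-05-01T12:00:00.0Z'))
```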
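One way to cope with issue 3 when consuming the corpus: when a stream_id appears more than once, prefer an instance that has clean_visible. The sketch below represents StreamItems as plain dicts, which is an assumption for illustration (the real corpus uses Thrift objects).

```python
# Sketch of handling issue 3: among duplicate instances of a stream_id,
# keep one that has clean_visible if any does. StreamItems are modeled as
# dicts here (an assumption; the corpus uses Thrift StreamItem objects).
def pick_instances(stream_items):
    best = {}
    for si in stream_items:
        sid = si['stream_id']
        have = best.get(sid)
        if have is None or (not have.get('clean_visible') and si.get('clean_visible')):
            best[sid] = si
    return best

items = [
    {'stream_id': 'a', 'clean_visible': None},
    {'stream_id': 'a', 'clean_visible': 'some visible text'},
]
print(pick_instances(items)['a']['clean_visible'])
```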
These issues should not cause any serious problems. Please let us know
what questions you have.
John