stream item timestamps: stream_time.zulu_timestamp vs stream_time.epoch_ticks


Fernando Diaz

Jun 30, 2013, 6:09:36 PM
to stream...@googlegroups.com

It was reported that stream_time.epoch_ticks were sometimes off by a few hours due to crawl machine settings. We tested to see if stream_time.zulu_timestamp suffered from the same error.

TREC Temporal Summarization organizers have done a pass confirming that stream_time.zulu_timestamp seems to give accurate time stamps in GMT. We are looking for secondary confirmation.

Thanks.

Fernando Diaz

John R. Frank

Jul 1, 2013, 12:36:24 PM
to stream...@googlegroups.com
StreamItem.stream_time.zulu_timestamp does always correspond to the
directory containing the chunk.

You can verify that using the attached python and this index file:

http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-chunk-path-to-stream-id-time.txt.xz


Here's how to run the attached python:

xzcat /home/jrf/2013-corpus-time-stamps/kba-streamcorpus-2013-v0_2_0-chunk-path-to-stream-id-time.txt.xz | python ~/2013-corpus-time-stamps/verify-timestamps.py &>~/2013-corpus-time-stamps/verify-timestamps.log

uniq -c /home/jrf/2013-corpus-time-stamps/verify-timestamps.log

2222638 correct
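
For readers without the attachment, the directory check can be sketched roughly as follows. This is not the attached script. It assumes that chunk directories are named by date and hour (YYYY-MM-DD-HH) and that zulu_timestamp looks like 2012-04-25T13:45:12.000000Z; the function name is hypothetical.

```python
from datetime import datetime

# Assumed zulu_timestamp layout, e.g. "2012-04-25T13:45:12.000000Z".
ZULU_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def directory_matches_zulu(dir_name, zulu_timestamp):
    """Return True if the date-hour directory name (assumed to be
    YYYY-MM-DD-HH) agrees with the hour of the zulu_timestamp."""
    parsed = datetime.strptime(zulu_timestamp, ZULU_FORMAT)
    return parsed.strftime("%Y-%m-%d-%H") == dir_name
```

A stream item whose zulu_timestamp falls in a different hour than its directory would return False here.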



Three related issues:

1) approximately 25% of the epoch_ticks values are off by 3600 seconds as
a result of incorrect timezone conversion.

2) 95 of the training examples (<1%) in the KBA 2013 training data are
hard to find because they are off by 3600 +/- 600 seconds, where the +/-
comes from reducing rapid duplicates. We will publish an updated list of
the training data to make it easy to find these.

3) clean_html and clean_visible processing was run for all documents
classified as English by the CLD, and many (but not all!) of the documents
classified as "unknown" by the CLD. However, this process sometimes
failed to generate clean_html and clean_visible. When it failed on
documents that were shown to assessors, we *re-ran* those documents, which
in some cases created duplicates of those documents. This means that some
of the stream_ids that appear in the KBA training examples appear more than
once in the corpus, and some of those multiple instances might not have clean_html or
clean_visible. For any document in the training/evaluation data, there
should be at least one instance with clean_visible.
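
A minimal sketch of how issue 1 could be checked for a single stream item, by recomputing the epoch seconds from zulu_timestamp and comparing with epoch_ticks. It assumes zulu_timestamp has the form 2012-04-25T13:45:12.000000Z; the helper names are hypothetical and not part of any library.

```python
import calendar
from datetime import datetime

# Assumed zulu_timestamp layout, e.g. "2012-04-25T13:45:12.000000Z".
ZULU_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def zulu_to_epoch(zulu_timestamp):
    """Convert a zulu (UTC) timestamp string to whole epoch seconds."""
    parsed = datetime.strptime(zulu_timestamp, ZULU_FORMAT)
    # timegm interprets the time tuple as UTC, unlike time.mktime,
    # which would apply the local timezone of the machine.
    return calendar.timegm(parsed.timetuple())


def epoch_offset(zulu_timestamp, epoch_ticks):
    """Seconds by which epoch_ticks differs from the zulu_timestamp."""
    return int(epoch_ticks) - zulu_to_epoch(zulu_timestamp)
```

An item affected by issue 1 would show an offset of 3600 or -3600 rather than 0.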


These issues should not cause any serious problems. Please let us know
what questions you have.


John

verify-zulu-timestamp.py.gz

Craig Willis

Jul 1, 2013, 1:23:02 PM
to John R. Frank, stream...@googlegroups.com
Thank you, John.

On Jul 1, 2013, at 11:36 AM, John R. Frank wrote:
> This means that some of stream_id that appear the KBA training examples appear >1 times in the corpus, and some of those multiple instances might not have clean_html or clean_visible. For any document in the training/evaluation data, there should be at least one instance with clean_visible.


This explains what I'm seeing.

In these cases, should the stream_ids be the same or just the doc_ids? I'm looking at a specific case, and indeed one instance has empty clean_html/clean_visible. The doc_ids are the same, but the stream_ids are different because of the stream_time.

All the best,
Craig Willis

John R. Frank

Jul 1, 2013, 10:36:19 PM
to stream...@googlegroups.com
> In these cases, should the stream_ids be the same or just the doc_ids?
> I'm looking at a specific case, and indeed one instance has empty
> clean_html/clean_visible. The doc_ids are the same, but the stream_ids
> are different because of the stream_time.

Right, there are some documents with two instances with epoch_ticks that
differ by 3600 seconds, and only one with clean_html + clean_visible.
Since some of these were labeled, we're updating the truth data file to at
least point to both, if not the one with clean_html + clean_visible.
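
Until the updated file is out, a rough way to spot such pairs in a list of stream_ids might look like this. It is only a sketch: it assumes a stream_id is the epoch_ticks value and the doc_id joined by a hyphen, and the function names and the 600-second tolerance parameter are illustrative.

```python
from collections import defaultdict


def split_stream_id(stream_id):
    """Split an assumed "<epoch_ticks>-<doc_id>" stream_id into its parts."""
    ticks, doc_id = stream_id.split("-", 1)
    return int(ticks), doc_id


def find_shifted_duplicates(stream_ids, shift=3600, tolerance=600):
    """Return pairs of stream_ids that share a doc_id and whose
    epoch_ticks differ by roughly `shift` seconds."""
    by_doc = defaultdict(list)
    for stream_id in stream_ids:
        ticks, doc_id = split_stream_id(stream_id)
        by_doc[doc_id].append((ticks, stream_id))

    pairs = []
    for instances in by_doc.values():
        instances.sort()
        for i, (first_ticks, first_id) in enumerate(instances):
            for later_ticks, later_id in instances[i + 1:]:
                if abs((later_ticks - first_ticks) - shift) <= tolerance:
                    pairs.append((first_id, later_id))
    return pairs
```

Each returned pair can then be inspected to see which instance carries clean_html and clean_visible.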


John

Craig Willis

Jul 3, 2013, 5:33:20 PM
to John R. Frank, stream...@googlegroups.com

> Right, there are some documents with two instances with epoch_ticks that differ by 3600 seconds, and only one with clean_html + clean_visible. Since some of these were labeled, we're updating the truth data file to at least point to both, if not the one with clean_html + clean_visible.


Just to confirm, there will be an updated version of trec-kba-ccr-judgments-2013-04-08.before-cutoff.filter-run.txt?

Thank you,
Craig

John R. Frank

Jul 3, 2013, 6:40:01 PM
to stream...@googlegroups.com
correct. next week or earlier if possible.


jrf

Craig Willis

Jul 18, 2013, 2:58:48 PM
to stream...@googlegroups.com

>> Just to confirm, there will be an updated version of trec-kba-ccr-judgments-2013-04-08.before-cutoff.filter-run.txt?
>
> correct. next week or earlier if possible.


Just checking, has an updated file been published?

Thank you,
Craig