data updates


Fernando Diaz

Jun 12, 2015, 5:11:34 PM
to tre...@googlegroups.com

A few clarifications on the released data:
  1. The directories pointed to in this version of the filtered set are named after older event ids.  The id of the file (e.g. "27.txt") is the correct id to assign those documents to.  A corrected, less confusing naming scheme will be released shortly.
  2. A new test events file has been posted which uses the format described in the guidelines.  The original release omitted the query and type fields.  Participants should ignore the previously released file.

Fernando Diaz

Jun 15, 2015, 7:51:24 AM
to tre...@googlegroups.com


The filtered data set archive has been updated to use S3 directory names consistent with official topic ids.  The older paths still exist and the article content is the same.  However, you may want to re-download to be on the safe side.  





Jeroen Vuurens

Jun 15, 2015, 2:25:51 PM
to tre...@googlegroups.com
Hi,

We are supposed to process the articles in the order they arrive in. As far as I can see, the articles in the collection do not contain an attribute to easily preserve the exact crawled order when map-reducing, just the crawl time which is not unique. Is it ok to process the articles in order of crawl time, not worrying about the order of articles that have the same crawl time?

Thanks, Jeroen 

Fernando Diaz

Jun 15, 2015, 2:52:49 PM
to Jeroen Vuurens, tre...@googlegroups.com

Please use stream_time.zulu_timestamp.  You can break ties however you'd like.
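A minimal sketch of one way to get a reproducible order, assuming (hypothetically) that each article has already been reduced to a tab-separated line of `zulu_timestamp<TAB>stream_id`. That input layout and the `order_by_stream_time` name are illustrative and not part of the corpus tooling. Provided all timestamps use the same fixed-width ISO 8601 format, a byte-wise sort puts them in chronological order, and the second key only serves to make the tie-break deterministic:

```shell
#!/bin/sh
# Hypothetical helper: reads "zulu_timestamp<TAB>stream_id" lines on stdin
# and writes them ordered by timestamp, with ties broken by stream_id.
order_by_stream_time() {
    # LC_ALL=C forces byte-wise comparison so the result does not depend on locale.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2
}
```

Any other tie-break would do equally well; sorting on a second key just keeps repeated runs identical.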





agia...@gmail.com

Jun 18, 2015, 6:19:53 AM
to tre...@googlegroups.com
Hi, the majority of the files for topic 32 are not accessible (I get the error "ERROR 403: FORBIDDEN"). Could you please tell us what to do about that?

Thanks,
Anastasia

Fernando Diaz

Jun 18, 2015, 6:31:44 AM
to agia...@gmail.com, tre...@googlegroups.com
We had ACL issues this weekend but they are fixed now. When did you try this?


Richard McCreadie

Jun 18, 2015, 1:26:50 PM
to tre...@googlegroups.com, agia...@gmail.com

Hi,

We should have fixed the problem with topic 32.

We have been experiencing some issues with S3 this year, so let us know if you observe any other problems.

RichardM

Jeroen Vuurens

Jun 18, 2015, 2:37:53 PM
to tre...@googlegroups.com
Hi,

Well, downloading is *very* slow. We are downloading the entire set (estimated 10TB), and a few days ago we estimated that at the current speed it would take about 15 days in total (we're about halfway). Our bandwidth is not the issue, by the way. I can't monitor the process myself (an admin has to do it), but are the files for the topics coming from the same folder as the entire set, or were they stored separately? If they came out of the entire collection, we'll have to check what we didn't get and retry those.

Thanks, Jeroen

Richard McCreadie

Jun 18, 2015, 3:48:28 PM
to tre...@googlegroups.com
Hi Jeroen,

I would not recommend downloading the entire KBA corpus since, as you observed, it is huge.

Instead, we have pre-filtered the KBA corpus for participants, like last year. The files listed for each topic in TREC-TS-2015F.tar.gz are from the original KBA corpus, but were subject to retrieval-based filtering and should not total more than a few hundred GB.

For instance, to get all of the files for topic 26, you can extract the TREC-TS-2015F.tar.gz archive and run (on Unix):

   while read -r l; do wget -x "$(echo "$l" | sed 's|s3://|http://s3.amazonaws.com/|')"; done < 26.txt
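
For checking what did not arrive and retrying it, here is a rough sketch. It assumes the files were fetched with `wget -x` from the current directory, so that each one lands under `s3.amazonaws.com/<bucket>/<key>`, and that keys contain no characters wget rewrites in local file names. The function names `s3_to_http` and `list_missing` are made up for illustration:

```shell
#!/bin/sh
# Map an s3:// URL to the public HTTP endpoint used in the command above.
s3_to_http() {
    printf '%s\n' "$1" | sed 's|^s3://|http://s3.amazonaws.com/|'
}

# Print the HTTP URL of every entry in a topic list (e.g. 26.txt) whose
# local copy is missing or empty. Assumes the wget -x directory layout.
list_missing() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue                  # skip blank lines
        local_path="s3.amazonaws.com/${url#s3://}"
        [ -s "$local_path" ] || s3_to_http "$url"
    done < "$1"
}
```

Running `list_missing 26.txt | wget -x -i -` would then retry only the missing files, since `-i -` makes wget read URLs from standard input.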

RichardM