TREC StreamCorpus 2014 released -- 1.2B docs, rich NLP tagging, >18 months of contiguous news, web, blogs


John R. Frank

May 30, 2014, 4:40:58 PM
to stream...@googlegroups.com

For immediate release: TREC StreamCorpus 2014 released


TREC 2014 Registration closes *very soon*:

https://ir.nist.gov/trecsubmit.open/application.html


To get the truth data from TREC tracks using this corpus, you must
register. Registration is free and non-binding.


Highlights from the new TREC StreamCorpus 2014:

* 1.2 billion documents total, 41% in English

* 13,663 contiguous hours of web, news, blog, and forum content

* ~300 million documents tagged by BBN's Serif: NER, sentence parsing, and within-doc coreference chains

* Further details and discussion at https://groups.google.com/forum/#!forum/streamcorpus


Corpus:

http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html



TREC tracks using this data and/or coordinating topic development:

Knowledge Base Acceleration
http://trec-kba.org/

Temporal Summarization
http://www.trec-ts.org/

Microblog
http://trec.nist.gov/pubs/call2014.html



Regards,
The KBA Organizers


Shun

Jun 2, 2014, 6:51:13 AM
to stream...@googlegroups.com
Hello.

Are the contents of TREC StreamCorpus 2014 from 2011-10-05-00 through 2013-02-13-23 the same as the 2013 corpus?
In other words, can we use the 2013 corpus for the 2014 tasks?

Shun


On Saturday, May 31, 2014 at 5:40:58 AM UTC+9, John Frank wrote:

张东旭

Jun 2, 2014, 10:41:13 AM
to stream...@googlegroups.com
Hi.

It's great to find the corpus available. Thank you!

When can we get the RSA key file?
Also, is there an English-only version of the corpus available for download?

Don

 

张东旭

Jun 2, 2014, 11:05:03 PM
to stream...@googlegroups.com
Hi,
another question:
Is there any difference between the 300,000 tagged documents and the other untagged docs? (I mean, how did you choose those 300,000 docs? Or were they chosen randomly?)

Don
 

John R. Frank

Jun 3, 2014, 12:58:26 AM
to stream...@googlegroups.com
> When could we get the rsa key file?

You must request the RSA key from NIST:

http://trec.nist.gov/data/kba.html



> And also, is there an English-only corpus version available  for us to
> download?

Yes, for the KBA track of TREC, we are generating a subset of the corpus
that is somewhat smaller than English-only. I will post an update on this
in a couple of days.


jrf

John R. Frank

Jun 3, 2014, 1:00:06 AM
to stream...@googlegroups.com
> Are the contents of TREC StreamCorpus 2014 from 2011-10-05-00 through
> 2013-02-13-23 the same as the 2013 corpus? In other words, can we use
> the 2013 corpus for the 2014 tasks?

The 2014 corpus fixes the stream_id errors from the 2013 corpus, so you'll
want the new one. Assuming you are doing the KBA track tasks, there will
be a stripped-down version of the corpus available.

jrf

Shun

Jun 3, 2014, 1:08:23 AM
to stream...@googlegroups.com
Thank you for your answer.
Our team is doing the KBA and TS track tasks.
Where can we get the stripped-down version of the corpus?

Shun

On Tuesday, June 3, 2014 at 2:00:06 PM UTC+9, John Frank wrote:

张东旭

Jun 3, 2014, 10:02:42 PM
to stream...@googlegroups.com
Thanks for your reply. Here is another question:
Is there any difference between the 300,000 tagged documents and the other untagged docs, since they are quite a small part of the whole corpus? (I mean, how did you choose those 300,000 docs? Or did you sample randomly?)

Don


On Tuesday, June 3, 2014 at 12:58:26 PM UTC+8, John Frank wrote:

John R. Frank

Jun 3, 2014, 10:03:51 PM
to stream...@googlegroups.com

We'll provide stats on the subset. They are the documents for which Serif
generated tags, so not at all random.

jrf

张东旭

Jun 4, 2014, 5:36:21 AM
to stream...@googlegroups.com
Oh, I'm sorry I didn't say it clearly. I mean: why were those 300 thousand documents chosen for tagging instead of other docs?

And here are another two questions (I hope this isn't a bother):
1. Can we download those 300,000 tagged docs separately?
2. Since we participated in the 2013 KBA and TS tracks, we already have the 2013 corpus. Because our storage space is limited, it would be a great help if we could use the 2013 corpus; otherwise we will have to spend a long time downloading and dealing with storage problems. So, is there a big enough change in stream_ids or other aspects that we cannot submit our answers correctly without the 2014 corpus? (Hopefully we can download the tagged corpus separately, or maybe just use another tool like the Stanford parser.)

thank you.

Don 

On Wednesday, June 4, 2014 at 10:03:51 AM UTC+8, John Frank wrote:

John R. Frank

Jun 4, 2014, 6:04:24 AM
to stream...@googlegroups.com
> Oh, I'm sorry I didn't say it clearly. I mean: why were those 300 thousand documents chosen for tagging instead of other docs?

It is 300 million. These are English documents with sufficient content.
We will provide more explanatory stats.


> 1. Can we download those 300,000 tagged docs separately?

Yes, we are working on making the 300M easily accessible separately.


> is there a big enough change in stream_ids or other aspects
> that we cannot submit our answers correctly without the 2014 corpus?

We added ~200M more documents to the end of the corpus to align with the
microblog track's corpus. You could easily fetch only this portion by
getting date_hour dirs after the end of the 2013 corpus.
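As a sketch of that approach (plain Python, nothing KBA-specific; the hourly directory names follow the corpus layout, e.g. 2011-10-05-00, and per the thread the 2013 corpus ran through 2013-02-13-23):

```python
from datetime import datetime, timedelta

def date_hour_dirs(start, end):
    """Yield hourly directory names like "2013-02-14-00", inclusive.

    start/end are "YYYY-MM-DD-HH" strings matching the corpus layout.
    """
    fmt = "%Y-%m-%d-%H"
    t = datetime.strptime(start, fmt)
    stop = datetime.strptime(end, fmt)
    while t <= stop:
        yield t.strftime(fmt)
        t += timedelta(hours=1)

# The 2013 corpus ended at 2013-02-13-23, so the 2014-only portion
# starts at 2013-02-14-00; enumerate just those dirs to fetch:
new_dirs = list(date_hour_dirs("2013-02-14-00", "2013-02-14-05"))
```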

About 25% of the 2013 stream_ids had the wrong timestamp such that they
were in the wrong date_hour directory. These are now corrected in the
2014 corpus.

The new corpus stores the old stream_ids inside each document, so it should
be possible to generate a mapping between them. We haven't tried to compile
this mapping yet; assuming it works out, you could use the 2013 corpus and
then map the stream_ids.
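For anyone building such a mapping: a quick sketch of deriving the date_hour directory from a stream_id, assuming the standard "<epoch_ticks>-<doc_id>" stream_id layout (a document filed under a different directory than this derivation gives is exactly the kind of timestamp error being corrected):

```python
from datetime import datetime, timezone

def date_hour_dir(stream_id):
    """Return the date_hour directory a stream_id's timestamp maps to.

    Assumes the standard "<epoch_ticks>-<doc_id>" stream_id layout,
    with epoch_ticks in seconds since the Unix epoch (UTC).
    """
    epoch_ticks, _doc_id = stream_id.split("-", 1)
    t = datetime.fromtimestamp(int(epoch_ticks), tz=timezone.utc)
    return t.strftime("%Y-%m-%d-%H")

# First hour of the corpus window:
# date_hour_dir("1317772800-<doc_id>") -> "2011-10-05-00"
```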


> (Hopefully we can download the tagged corpus separately, or
> maybe just use another tool like the Stanford parser)

The corpus includes data from the BBN Serif tagger: parse trees and
within-doc coref chains.

While the Stanford CoreNLP deterministic sieve within-doc coref algorithm
generates nice results, in our tests it took ~10 sec/doc.


John

张东旭

Jun 4, 2014, 6:51:14 AM
to stream...@googlegroups.com
Thank you very much for your patient reply; it's really helpful! Our team looks forward to the separate download for our further work.

On Wednesday, June 4, 2014 at 6:04:24 PM UTC+8, John Frank wrote: