> Oh, I'm sorry, I didn't say it clearly. I mean, why were those 300 thousand documents chosen to be tagged instead of other docs?
It is 300 million. These are English documents with sufficient content.
We will provide more explanatory stats.
> 1.Can we download those 300,000 tagged docs separately?
Yes, we are working on making the 300M easily accessible separately.
> Is there a big change in stream_id or other aspects,
> such that we can't submit our answer correctly without the 2014 corpus?
We added ~200M more documents to the end of the corpus to align with the
microblog track's corpus. You could easily fetch only this portion by
getting date_hour dirs after the end of the 2013 corpus.
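If it helps, here is a minimal sketch of selecting only the newly added
date_hour directories. The cutoff value and corpus path are hypothetical
placeholders, and it assumes date_hour names sort chronologically as plain
strings:

    import os

    CUTOFF = "2013-02-13-23"          # hypothetical end of the 2013 corpus
    CORPUS_ROOT = "/path/to/corpus"   # local mirror or mounted copy

    def new_date_hours(root, cutoff):
        # date_hour strings of the form YYYY-MM-DD-HH sort lexicographically
        # in chronological order, so a plain string comparison is enough.
        for name in sorted(os.listdir(root)):
            if name > cutoff:
                yield name

    for dh in new_date_hours(CORPUS_ROOT, CUTOFF):
        print(dh)   # fetch or process only the newly added portion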
About 25% of the 2013 stream_ids had the wrong timestamp such that they
were in the wrong date_hour directory. These are now corrected in the
2014 corpus.
The new corpus stores the old stream_ids inside each item, so it should be
possible to generate a mapping between the old and new ids. We haven't
tried to compile this mapping yet, but it should be possible. Assuming
that it is, you could use the 2013 corpus and then map the stream_ids
(a rough sketch follows).
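As a rough illustration, the mapping could be compiled by walking the 2014
chunk files and pairing each item's new stream_id with the old one stored in
it. The sketch below assumes the streamcorpus Python package; the
old_stream_id accessor is a hypothetical placeholder, since I haven't
checked which field the old id actually lives in:

    import glob
    from streamcorpus import Chunk

    def build_mapping(chunk_glob):
        # {old stream_id -> new stream_id}
        mapping = {}
        for path in glob.glob(chunk_glob):
            for si in Chunk(path):
                # hypothetical field name; adjust to wherever the 2014
                # items actually carry the old id
                old_id = getattr(si, "old_stream_id", None)
                if old_id:
                    mapping[old_id] = si.stream_id
        return mapping

    mapping = build_mapping("/path/to/2014-corpus/*/*.sc.xz")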
> (Hopefully we can download the tagged corpus separately or
> maybe just use other tool like standford parser)
The corpus includes data from the BBN Serif tagger, including parse trees
and within-doc coref chains.
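For reference, reading those annotations from a chunk file might look like
the sketch below, again assuming the streamcorpus Python package and that
the annotations are stored under tagger_id 'serif' (please verify against
your own chunk files):

    from streamcorpus import Chunk

    for si in Chunk("/path/to/some-chunk.sc.xz"):
        for sent in si.body.sentences.get("serif", []):
            for tok in sent.tokens:
                # equiv_id groups tokens into within-doc coref chains
                # (assumed to be -1 when a token is in no chain)
                if tok.equiv_id != -1:
                    print(si.stream_id, tok.token, tok.equiv_id)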
While the Stanford CoreNLP deterministic sieve within-doc coref algorithm
generates nice results, our tests with it took ~10 sec/doc.
John