The NYT annotated corpus


Aaron Swartz

Nov 21, 2008, 7:32:11 PM
to get-t...@googlegroups.com
http://corpus.nytimes.com/

The New York Times Annotated Corpus is a collection of over 1.8
million articles annotated with rich metadata published by The New
York Times between January 1, 1987 and June 19, 2007.

With over 650,000 individually written summaries and 1.5 million
manually tagged articles, The New York Times Annotated Corpus has the
potential to be a valuable resource for a number of natural language
processing research areas, including document summarization, document
categorization and automatic content extraction.

The corpus is provided as a collection of XML documents in the News
Industry Text Format (NITF). Developed by a consortium of the world's
major news agencies, NITF is an internationally recognized standard
for representing the content and structure of news documents. To
learn more about NITF please visit the NITF website.

Highlights of The New York Times Annotated Corpus include:

* Over 1.8 million articles written and published between January
1, 1987 and June 19, 2007.
* Over 650,000 article summaries written by the staff of The New
York Times Index Department.
* Over 1.5 million articles manually tagged by The New York Times
Index Department with a normalized indexing vocabulary of people,
organizations, locations and topic descriptors.
* Over 275,000 algorithmically-tagged articles that have been hand
verified by the online production staff at NYTimes.com.
* Java tools for parsing corpus documents from XML into memory-resident
objects.
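The released tools are Java, but the gist of parsing an NITF document is easy to sketch with Python's standard library. This is a rough illustration only: the element names below (`title`, `body.content`, `classifier`) follow the general shape of NITF markup, not the corpus's exact schema, and `parse_article` is a hypothetical helper, not part of the released tools.

```python
# Sketch: parse an NITF-style XML article into a plain Python dict.
# Element names are illustrative guesses at NITF structure, not the
# corpus's exact schema.
import xml.etree.ElementTree as ET

SAMPLE = """<nitf>
  <head>
    <title>Example Headline</title>
    <docdata>
      <identified-content>
        <classifier type="descriptor">Economics</classifier>
        <classifier type="descriptor">Politics</classifier>
      </identified-content>
    </docdata>
  </head>
  <body>
    <body.content>
      <p>First paragraph of the article.</p>
      <p>Second paragraph.</p>
    </body.content>
  </body>
</nitf>"""

def parse_article(xml_text):
    """Return the headline, index descriptors, and body paragraphs."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("head/title"),
        "descriptors": [c.text for c in
                        root.findall(".//classifier[@type='descriptor']")],
        "paragraphs": [p.text for p in root.findall(".//body.content/p")],
    }

article = parse_article(SAMPLE)
print(article["title"])        # Example Headline
print(article["descriptors"])  # ['Economics', 'Politics']
```

Note that NITF tag names like `body.content` contain a dot; `ElementTree` path expressions handle these fine, since `.` is only special as a whole path segment.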

Michael E. Driscoll

Nov 21, 2008, 11:39:23 PM
to get.theinfo
Aaron -

Thanks for the info, but unless I'm missing something, actually
getting one's hands on the corpus data looks like a non-trivial task.

The link you provided (http://corpus.nytimes.com/) redirects to a
Google group that provides an informational overview, but no data.

There is a message in the Newsgroup from Evan Sandhaus at the NYT
which reads:

"We are releasing the data in conjunction with the Linguistic Data
Consortium."

The link provided is:

http://ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19

Available on DVD for $300.00. According to the LDC web site "these
discs are the only practical media for data collections involving
hundreds or thousands of megabytes (MB)". That's news to me.

Mike

_______________________
Michael E. Driscoll, Ph.D.
www.dataspora.com/blog

Brendan O'Connor

Nov 21, 2008, 11:54:26 PM
to get-t...@googlegroups.com
Yeah seriously -- just put it all up on S3!

One person pointed out that people probably like to distribute through
the LDC when there are license agreements involved (e.g. Google does it
for their n-grams corpus), in terms of having the LDC handle the
bureaucratic hassle. (Which is lame of course; it could be done
electronically too.)
http://behind-the-enemy-lines.blogspot.com/2008/11/social-annotation-of-nyt-corpus.html

Brendan
[ anyall.org ]

Josh Tauberer

Nov 25, 2008, 7:08:55 AM
to get-t...@googlegroups.com
In defense of the LDC (which really is quite rare for me)... They deal
with lots and lots and lots of linguistic databases and have been doing
this for a very long time, and I can empathize with their need to have a
simple and relatively cheap distribution method like DVDs that probably
works well for most of their subscribers. For LDC, it's not a matter of
how to distribute a single DVD's worth --- this is probably
one-hundredth of their total data or less.

Since I'm a Penn grad student and the LDC is at Penn, I have free access
to the database (for research purposes). :)

--
- Josh Tauberer
- GovTrack.us

http://razor.occams.info

"Yields falsehood when preceded by its quotation! Yields
falsehood when preceded by its quotation!" Achilles to
Tortoise (in "Godel, Escher, Bach" by Douglas Hofstadter)

Jonathan Gray

Dec 10, 2008, 2:54:57 PM
to get-t...@googlegroups.com
I've created a package page for this on CKAN:

http://ckan.net/package/read/nyt-corpus

It's a shame they can't just put it on archive.org :-)

J.

Bryan Bishop

Dec 10, 2008, 5:25:43 PM
to get-t...@googlegroups.com, kan...@gmail.com, Jonathan Gray
On Wed, Dec 10, 2008 at 1:54 PM, Jonathan Gray wrote:
> I've created a package page for this on CKAN:
>
> http://ckan.net/package/read/nyt-corpus

Jonathan, how serious is CKAN? Does it have a shell program just like
cpan does? Does it do dependencies and other package maintenance
tasks?

- Bryan
http://heybryan.org/
1 512 203 0507

Rufus Pollock

Dec 15, 2008, 7:05:11 AM
to get-t...@googlegroups.com, kan...@gmail.com, Jonathan Gray
On Wed, Dec 10, 2008 at 10:25 PM, Bryan Bishop <kan...@gmail.com> wrote:
>
> On Wed, Dec 10, 2008 at 1:54 PM, Jonathan Gray wrote:
>> I've created a package page for this on CKAN:
>>
>> http://ckan.net/package/read/nyt-corpus
>
> Jonathan, how serious is CKAN? Does it have a shell program just like
> cpan does? Does it do dependencies and other package maintenance
> tasks?

Yes, there is a shell program (datapkg), though it is currently in a
fairly rudimentary state:

http://www.okfn.org/datapkg/
http://pypi.python.org/pypi/datapkg/0.1

If you have python and easy_install you can do: $ easy_install datapkg

Dependency tracking is not yet implemented but is planned. I would
also point out that, given that most knowledge (data/content/...)
'packages' are not even available for download in a simple form
(tar.gz, ...), dependency support is perhaps not the top priority
(though I definitely think it is important).

Our rough project plan:

1. Basic metadata support in an online service with support for entry
of data by non-coders
2. Download urls so that automated registration and downloading can take place
3. Basic installation (using download urls etc)
4. Support for 'wrapper-packages' (e.g. packages that have no content
but wrap a web-api or which come from another system e.g. apt)
5. Dependency tracking

So far we have done 1 + 2 (ckan.net), and 3 is working in basic form
in datapkg. We are working on 4 + 5 and would welcome any suggestions
or assistance :)

Rufus
