The New York Times Annotated Corpus
The New York Times Annotated Corpus is a collection of over 1.8 million articles, annotated with rich metadata, that were published by The New York Times between January 1, 1987 and June 19, 2007.
With over 650,000 individually written summaries and 1.5 million manually tagged articles, The New York Times Annotated Corpus has the potential to be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization and automatic content extraction.
The corpus is provided as a collection of XML documents in the News Industry Text Format (NITF). Developed by a consortium of the world’s major news agencies, NITF is an internationally recognized standard for representing the content and structure of news documents. To learn more about NITF please visit the NITF website.
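Because the documents are plain XML, they can be read with any standard XML parser. The following is a minimal sketch using only the JDK's built-in DOM parser; it is not the parsing tool distributed with the corpus. The sample document is a hypothetical, hand-written NITF-style fragment, and the class and method names (NitfSketch, parse, title, paragraphs) are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NitfSketch {
    // Hypothetical NITF-style fragment written for this example;
    // it is not an actual document from the corpus.
    static final String SAMPLE =
        "<nitf><head><title>Example Headline</title></head>"
      + "<body><body.content><block class=\"full_text\">"
      + "<p>First paragraph.</p><p>Second paragraph.</p>"
      + "</block></body.content></body></nitf>";

    // Parse an XML string into an in-memory DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Return the text of the first <title> element, or "" if there is none.
    static String title(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        return titles.getLength() == 0 ? "" : titles.item(0).getTextContent();
    }

    // Return the trimmed text of every <p> element, in document order.
    static List<String> paragraphs(Document doc) {
        NodeList nodes = doc.getElementsByTagName("p");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(title(doc));
        System.out.println(paragraphs(doc));
    }
}
```

Actual corpus files carry much more metadata than this fragment, and may include a document type declaration that this sketch does not handle specially.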
Highlights of The New York Times Annotated Corpus include:
- Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.
- Over 650,000 article summaries written by the staff of The New York Times Index Department.
- Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
- Over 275,000 algorithmically tagged articles that have been hand-verified by the online production staff at NYTimes.com.
- Java tools for parsing corpus documents from XML into memory-resident objects.
To learn more about The New York Times Annotated Corpus please read the PDF Overview.