The New York Times Annotated Corpus is a collection of over 1.8
million articles annotated with rich metadata published by The New
York Times between January 1, 1987 and June 19, 2007.
With over 650,000 individually written summaries and 1.5 million
manually tagged articles, The New York Times Annotated Corpus has the
potential to be a valuable resource for a number of natural language
processing research areas, including document summarization, document
categorization and automatic content extraction.
The corpus is provided as a collection of XML documents in the News
Industry Text Format (NITF). Developed by a consortium of the world's
major news agencies, NITF is an internationally recognized standard
for representing the content and structure of news documents. To
learn more about NITF please visit the NITF website.
Highlights of The New York Times Annotated Corpus include:
* Over 1.8 million articles written and published between January
1, 1987 and June 19, 2007.
* Over 650,000 article summaries written by the staff of The New
York Times Index Department.
* Over 1.5 million articles manually tagged by The New York Times
Index Department with a normalized indexing vocabulary of people,
organizations, locations and topic descriptors.
* Over 275,000 algorithmically tagged articles that have been
hand-verified by the online production staff at NYTimes.com.
* Java tools for parsing corpus documents from XML into a
memory-resident object.
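The corpus ships Java tools for this step. As a rough illustration of what parsing a document into an in-memory object involves, here is a minimal Python sketch. The sample document is hypothetical and heavily trimmed, the element names used (hl1, abstract, block class="full_text") are assumptions based on NITF conventions rather than a verified corpus file, and parse_article is an illustrative helper, not part of the corpus tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed NITF-style document. Real corpus files
# carry far more metadata and may differ in detail.
SAMPLE_NITF = """<nitf>
  <head>
    <title>Example Headline</title>
    <meta name="publication_year" content="1987"/>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
      <abstract><p>A one-sentence summary.</p></abstract>
    </body.head>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""


def parse_article(xml_text):
    """Return a plain dict holding the main fields of one article."""
    root = ET.fromstring(xml_text)
    # <meta name="..." content="..."/> pairs become a simple lookup table.
    meta = {m.get("name"): m.get("content") for m in root.iter("meta")}
    # The article body is assumed to live in <block class="full_text">.
    full_text = root.find(".//block[@class='full_text']")
    if full_text is None:
        paragraphs = []
    else:
        paragraphs = [p.text or "" for p in full_text.findall("p")]
    return {
        "headline": root.findtext(".//hl1"),
        "summary": root.findtext(".//abstract/p"),
        "meta": meta,
        "paragraphs": paragraphs,
    }


article = parse_article(SAMPLE_NITF)
print(article["headline"], len(article["paragraphs"]))
```

A dict keeps the sketch short; for real work a small class per article would make the manually assigned tags easier to query.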
Jonathan, how serious is CKAN? Does it have a shell program just like
cpan does? Does it do dependencies and other package maintenance
tasks?
Yes, there is a shell program (datapkg), though it is currently
fairly rudimentary:
http://www.okfn.org/datapkg/
http://pypi.python.org/pypi/datapkg/0.1
If you have Python and easy_install you can do:
  $ easy_install datapkg
Dependency tracking is not yet implemented but is planned. I would
also point out that, given that most knowledge (data/content/...)
'packages' are not even available for download in a simple form
(tar.gz, ...), dependency support is perhaps not the top priority
(though I definitely think it is important).
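To make "dependency tracking" concrete, it would amount to something like the following minimal sketch: a depth-first walk that lists a package after everything it depends on. This is an illustration only, not datapkg code; the package names, the dictionary format and the install_order function are all hypothetical.

```python
# Hypothetical package index: name -> names of the packages it depends on.
# Neither the names nor the format come from CKAN or datapkg.
DEPENDENCIES = {
    "nyt-topic-model": ["nyt-corpus", "stopwords"],
    "nyt-corpus": ["nitf-schema"],
    "stopwords": [],
    "nitf-schema": [],
}


def install_order(package, dependencies):
    """Return package names so that each one follows its dependencies."""
    order = []
    done = set()
    in_progress = set()  # packages on the current path, for cycle detection

    def visit(name):
        if name in done:
            return
        if name in in_progress:
            raise ValueError("dependency cycle involving %r" % name)
        in_progress.add(name)
        for dep in dependencies.get(name, []):
            visit(dep)
        in_progress.discard(name)
        done.add(name)
        order.append(name)

    visit(package)
    return order


print(install_order("nyt-topic-model", DEPENDENCIES))
# ['nitf-schema', 'nyt-corpus', 'stopwords', 'nyt-topic-model']
```

An installer would then fetch and unpack the packages in that order, which is why download URLs have to come first.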
Our rough project plan:
1. Basic metadata support in an online service with support for entry
of data by non-coders
2. Download URLs so that automated registration and downloading can take place
3. Basic installation (using download URLs etc.)
4. Support for 'wrapper-packages' (i.e. packages that have no content
but wrap a web API or which come from another system, e.g. apt)
5. Dependency tracking
So far we have done 1 + 2 (ckan.net), and 3 is working in basic form
in datapkg. We are working on 4 + 5 and would welcome any suggestions
or assistance :)
Rufus