Announcing Proposed Changes to New York Times RDF Documents


Evan Sandhaus

Nov 5, 2009, 11:00:24 AM
to The New York Times Linked Open Data Community
Well, it's been a week since we released our data, and, in that time,
we've received a great deal of feedback and advice from the community
about what we did right and what we need to change. After reading
many of the blog posts/comment threads and personally corresponding
with several members of this community, we are pleased to announce
several changes to our RDF documents. A complete example of the
proposed changes may be found at http://data.nytimes.com/sample/N66220017142656459133.rdf.
Please note that these changes have not yet gone live on
data.nytimes.com.

Below I describe each change in detail.

For the sake of my examples, let's pretend we have a resource with the
URI "http://data.nytimes.com/foo" that is served from a file named
"http://data.nytimes.com/foo.rdf".

The most significant change is the partitioning of the resource into
two separate resources. Where before we had a single resource
"http://data.nytimes.com/foo", we now have two resources:

http://data.nytimes.com/foo
http://data.nytimes.com/foo.rdf

All the licensing and rights triples have been moved to the
"http://data.nytimes.com/foo.rdf" resource. Making this change preserves
the licensing constraints attached to the document, but does not propagate
them to other data sets through the owl:sameAs triples in the
"http://data.nytimes.com/foo" resource.
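
A minimal sketch of the split described above, with triples modeled as plain Python tuples rather than with a real RDF library. The license, attribution, and external-dataset URIs are placeholders invented for illustration; only the two data.nytimes.com URIs come from the example in this message.

```python
# Illustrative sketch only: triples are plain (subject, predicate, object) tuples.
THING = "http://data.nytimes.com/foo"      # the resource being described
DOC = "http://data.nytimes.com/foo.rdf"    # the RDF document that describes it

# Placeholder objects (assumptions, not taken from the real data set).
LICENSE = "http://example.org/some-license"
ATTRIBUTION = "http://example.org/attribution"
EXTERNAL = "http://example.org/other-dataset/foo"

graph = [
    (DOC, "cc:license", LICENSE),            # rights triples sit on the document
    (DOC, "cc:attributionURL", ATTRIBUTION),
    (THING, "owl:sameAs", EXTERNAL),         # identity links sit on the thing
]

def predicates_of(subject, triples):
    """Return the set of predicates asserted about `subject`."""
    return {p for s, p, o in triples if s == subject}

doc_predicates = predicates_of(DOC, graph)
thing_predicates = predicates_of(THING, graph)
```

Because no cc: predicate has the "foo" resource as its subject, a consumer that merges "foo" with an owl:sameAs target does not carry the license statements along with it.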

We have also made several other important changes.

0) The predicates 'time:start' and 'time:end' have been replaced with
'nyt:first_use' and 'nyt:last_use' respectively. The intent of the
'time:[start|end]' triples was to express the time a subject heading
was first and last used in the Times. Unfortunately, these triples
were ambiguous, so we have decided to extend the 'time:[start|end]'
predicates with our own predicates, which (once we document them) we
will define to have the above semantics.

1) The 'nyt:topicPage', 'cc:attributionURL', and 'cc:license' triples
now refer to resource URIs, rather than literal URLs.

2) The incorrectly stated 'cc:Attribution' predicate has been replaced
with the correct 'cc:attributionURL' predicate.

3) The incorrectly stated 'cc:License' predicate has been replaced
with the correct 'cc:license' predicate. (capitalization)

4) We have resolved issues with content negotiation on our server.

5) An XML declaration was added to the top of the document.

6) The Freebase namespace declaration
'xmlns:fb="http://rdf.freebase.com/ns/"' was removed from the RDF
declaration as it is not used in any statements contained in our document.
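
For anyone holding a copy of the old documents, the predicate renames in items 0, 2, and 3 could be applied with something like the sketch below. It is only an illustration: the `RENAMES` table and the `migrate` helper are hypothetical names, triples are plain tuples, and the sample subject and object values are made up.

```python
# Hypothetical helper: maps the old predicate names to the proposed ones.
RENAMES = {
    "time:start": "nyt:first_use",
    "time:end": "nyt:last_use",
    "cc:Attribution": "cc:attributionURL",
    "cc:License": "cc:license",
}

def migrate(triples):
    """Return a new list of triples with old predicates renamed.

    Predicates not listed in RENAMES pass through unchanged.
    """
    return [(s, RENAMES.get(p, p), o) for s, p, o in triples]

# Made-up sample data, for illustration only.
old = [
    ("ex:foo", "time:start", "1990-01-01"),
    ("ex:foo", "cc:License", "ex:some-license"),
    ("ex:foo", "skos:prefLabel", "Foo"),
]
new = migrate(old)
```

Note that this covers only the renames. Item 1 (literal URLs becoming resource URIs) changes the type of the object, not the predicate name, so it would need separate handling.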

Please let us know what you think about these changes; your input is
very welcome.

All the best,

Evan Sandhaus
Semantic Technologist
New York Times Research + Development
@kansandhaus

Pius Uzamere

Nov 5, 2009, 11:27:42 AM
to nyt_linked...@googlegroups.com
This looks much better!  I've got some feedback about time, though.

I'd recommend "nyt:latest_use" rather than "nyt:last_use" unless you literally mean that the date corresponds to the last time the person will ever be mentioned.

Also, assuming we're talking about the latest use, this statement is only truly meaningful in the context of when this designation was last updated.  This points to the more general need for adding semantics for when the resource was added and last modified.  Off the top of my head, the Dublin Core properties of "created" and "modified" should work.

Best,
Pius

Kingsley Idehen

Nov 5, 2009, 11:28:34 AM
to nyt_linked...@googlegroups.com
Evan,

A tweak suggestion re. the resource:
<http://data.nytimes.com/colbert_stephen_per> that will apply to similar
.rdf resources (the triple containers).

<http://data.nytimes.com/sample/N66220017142656459133.rdf> needs to be
connected to entity/resource descriptions that it hosts.

One pattern (of course there are others) would be:
<http://data.nytimes.com/sample/N66220017142656459133.rdf>
foaf:primaryTopic <http://data.nytimes.com/colbert_stephen_per> .

The above can be expressed via HTML+RDFa or even just as an entry in
<link/>, if adding at source is in any way problematic.

Examples to show implications of not establishing this vital relationship:

1. http://tr.im/EeV4 - Page that attempts to describe the resource
<http://data.nytimes.com/sample/N66220017142656459133.rdf>
2. http://tr.im/EeUR -- A Browser Page that explores the contents of
<http://data.nytimes.com/sample/N66220017142656459133.rdf>

--


Regards,

Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com


Richard Cyganiak

Nov 5, 2009, 2:06:28 PM
to nyt_linked...@googlegroups.com
Evan, this is great news! Looks much better.

Just a quick +1 from me on Kingsley's proposal below: Assuming you have

http://data.nytimes.com/foo
http://data.nytimes.com/foo.rdf

then it's a nice thing to add the following triple:

</foo.rdf> foaf:primaryTopic </foo> .

This makes it easier for clients that happen to get hold of just the
</foo.rdf> URI to identify the “primary resource” of the file. This
helps browsers and visualizers to show the right stuff to users.
Several linked data browsers (at least OpenLink Data Explorer,
Tabulator and Disco) will deliver a better user experience if this
triple is present.
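
As a rough sketch of what a client might do with that triple: given only the document URI, look up its foaf:primaryTopic to find the resource to display. Triples are modeled here as plain tuples, and the `primary_topic` helper is a hypothetical name, not the API of any of the browsers mentioned above.

```python
FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"

DOC = "http://data.nytimes.com/foo.rdf"
THING = "http://data.nytimes.com/foo"

graph = [
    (DOC, FOAF_PRIMARY_TOPIC, THING),
]

def primary_topic(doc_uri, triples):
    """Return the object of the first foaf:primaryTopic triple for doc_uri.

    Returns None when the document does not declare a primary topic.
    """
    for s, p, o in triples:
        if s == doc_uri and p == FOAF_PRIMARY_TOPIC:
            return o
    return None

topic = primary_topic(DOC, graph)
missing = primary_topic("http://data.nytimes.com/bar.rdf", graph)
```

Without the triple, the lookup returns None and the client has to guess which of the subjects in the file is the one the user cares about.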

It's a bit unfortunate that the primaryTopic predicate is in the FOAF
namespace -- it really doesn't have anything to do with personal
profiles or social networking, but it somehow has emerged as the
quasi-standard way of doing this.

Best,
Richard

Hugh Glaser

Nov 6, 2009, 7:40:20 PM
to The New York Times Linked Open Data Community
Hi Evan,
Great to see the progress.
Did you get any further with my suggestion that the sameAs data is of
a different character to the other data, and therefore should be
separated out, especially from the point of view of license?
This is the sort of stuff that brings traffic to the site, so surely
you want to make it CC0 or something like that?
best
Hugh