Several questions


Will Fitzgerald

Jan 4, 2010, 9:55:08 AM
to The New York Times Linked Open Data Community
The main page states there are 5,000 people in the list, but I think
there are only 4,794 in the RDF file. Why the difference?

How were the entries on the list decided? Why is (e.g.) Franz Kafka on
the list, but not (e.g.) Jesus Christ?

Does the Times distinguish individual people from groups of people
internally? For example, the "Gambino Crime Family" is an entry in the
list.

How does the Times decide which canonical name to use?
- For example, why is RFK "Kennedy, Robert Francis", but his son is
"Kennedy, Robert F Jr"?
- For example, why are some nobles given their titles? ("Charles,
Prince of Wales", "Abdullah II, King of Jordan")
- For example, why "Teresa (Mother)" ?

Does the Times keep standard aliases for its entries anywhere? (eg.
"RFK/Robert F Kennedy/Robert Francis Kennedy")?

Will Fitzgerald

Oh, and thanks for a great resource!

Evan Sandhaus

Jan 4, 2010, 10:55:58 AM
to The New York Times Linked Open Data Community
Will,

Thanks for posting your several questions; I'll attempt several
answers.

> The main page states there are 5,000 people in the list, but I think
> there are only 4,794 in the RDF file. Why the difference?

Reviewing the database and my logs for the manual mapping process, it
seems that the internal tool we were using for the mappings was
incorrectly reporting the total number of people mapped. (The query
was probably missing a GROUP BY.) This bug has since been addressed and
the tool now accurately reports that we've mapped 4,993 people. I
apologize for this minor inconsistency.
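
For readers curious how a missing GROUP BY can inflate a count like this, here is a minimal sketch. It assumes a hypothetical `mappings` table with one row per (person, target) link; the table, column names, and sample rows are invented for illustration and are not the Times' actual schema or query.

```python
import sqlite3


def count_people(rows):
    """Return (naive_count, grouped_count) for a hypothetical mappings table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per link from a person to an external dataset.
    conn.execute("CREATE TABLE mappings (person TEXT, target TEXT)")
    conn.executemany("INSERT INTO mappings VALUES (?, ?)", rows)

    # Counts mapping rows, so a person with two links is counted twice.
    naive = conn.execute("SELECT COUNT(*) FROM mappings").fetchone()[0]

    # Groups by person first, so each person is counted once.
    grouped = conn.execute(
        "SELECT COUNT(*) FROM (SELECT person FROM mappings GROUP BY person)"
    ).fetchone()[0]

    conn.close()
    return naive, grouped


# Illustrative rows only.
rows = [
    ("Kafka, Franz", "dbpedia"),
    ("Kafka, Franz", "freebase"),
    ("Kennedy, Robert Francis", "dbpedia"),
]
print(count_people(rows))
```

With these sample rows the naive query counts three mapping rows while the grouped query counts two people, which is the kind of over-count described above.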

> How were the entries on the list decided? Why is (e.g.) Franz Kafka on
> the list, but not (e.g.) Jesus Christ?

The reason for this is that although we aggregate articles about Jesus
Christ
(http://topics.nytimes.com/topics/reference/timestopics/people/j/jesus_christ/index.html)
and approximately 10,500 other names, our approach to sampling this
data unexpectedly excluded the "Jesus Christ" subject heading. As we
continue to map our personal names subject headings, you'll see this
subject heading and many more appear in the data.

As for the rest of your questions - I will refer them to our indexing
staff and have them respond on this forum.

Thanks again for your interest in our data.

All the best,

Evan

Will Fitzgerald

Jan 4, 2010, 11:02:40 AM
to The New York Times Linked Open Data Community
Thanks for your kind replies, Evan!

Do you know how the Times decides which people to aggregate topics on?
(Eg, why John Paul II, but not John Paul I?)

Will

web gurl

Jan 4, 2010, 11:16:09 AM
to nyt_linked...@googlegroups.com
Hi all~
I'm a rather recent lurker here, but very interested in this project. You mentioned aggregating under a single name. How is that form determined (e.g., when to use the title and when not?) Is it by frequency of use in the NYT db or are you using a thesaurus of some sort?
Have you considered using the freely available VIAF (Virtual Authority File) as a source for determining what the form of the name should be? VIAF is a project by the library/metadata community to pull together records of names from a variety of sources, including the Vatican Library and the Library of Congress. Name authority (the library lingo) attempts to resolve naming issues and conventions (such as when to use the suffix or title...) by establishing one form of the name as the "authorized" one.

Here is the example for Robert Kennedy:
http://viaf.org/viaf/9882357

Just curious.
Robin Fay
http://contentdivergent.blogspot.com

Kristi

Jan 4, 2010, 7:47:56 PM
to The New York Times Linked Open Data Community
To follow up on Evan's response:

1.) Yes, we do make a distinction between individuals and groups. We
would use the term "Gambino Crime Family" for articles about the
family/organization in general but if an article focused on a specific
member such as John Gotti, we would use "Gotti, John A" for the son,
or "Gotti, John J" for the father.

2.) When choosing a canonical name, there is no single rule. We try
to follow these general guidelines:

-- Comply with Times Style: i.e. match the name used in Times
articles.

-- Avoid collisions: In cases of ambiguous names, favor longer, more
explicit name variations, i.e. if the person has a middle initial or
suffix such as Jr/Sr/I/II/III, use it.

-- Be clear: The name should be recognizable to our indexers. If, in
using a longer unambiguous variation, we end up with a variation that
is quite different from what the paper is using, we include the alias/
nickname in quotation marks, e.g. if the paper mentions “Bill Thomas”
for the former congressman, we index him as Thomas, William Marshall
"Bill".

3.) In general, we have avoided titles on personal names because they
may change often over a person's career/lifespan. We tend to make an
exception for nobles because their titles change less. We append the
title after the name by which the person is known, which for nobles
tends to be the first name.

4.) Yes, we maintain authority files that contain all variations for a
given name.
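
As a rough illustration of guidelines 2 and 3, the name forms quoted in this thread can be assembled as in the sketch below. The function names `index_name` and `noble_name` and their parameters are hypothetical, and the formatting is inferred only from the examples mentioned here ("Kennedy, Robert F Jr", Thomas, William Marshall "Bill", "Charles, Prince of Wales"); it is not the indexing staff's actual tooling or a complete statement of Times style.

```python
def index_name(surname, given, suffix=None, nickname=None):
    """Build a 'Surname, Given [Suffix] ["Nickname"]' heading (assumed format)."""
    parts = [given]
    if suffix:
        # Suffixes such as Jr/Sr/II help avoid collisions between namesakes.
        parts.append(suffix)
    if nickname:
        # The alias the paper actually uses goes in quotation marks.
        parts.append(f'"{nickname}"')
    return f"{surname}, {' '.join(parts)}"


def noble_name(known_name, title):
    """Append a title after the name by which a noble is known (assumed format)."""
    return f"{known_name}, {title}"


print(index_name("Kennedy", "Robert F", suffix="Jr"))
print(index_name("Thomas", "William Marshall", nickname="Bill"))
print(noble_name("Charles", "Prince of Wales"))
```

The two Gotti headings from point 1 come out the same way, for example `index_name("Gotti", "John A")`.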



Evan Sandhaus

Jan 4, 2010, 8:14:34 PM
to The New York Times Linked Open Data Community
Robin,

Thanks for posting, happy to have you as part of the group.

To your questions:

> How is that form determined

On this issue, I'll refer you to Kristi's excellent reply.

This is a very interesting resource.

> Have you considered using the freely available VIAF...

Not yet, but it sounds like a great resource. Does there exist any
mapping from VIAF to freebase/wikipedia/dbpedia? If so, then it
should be relatively straightforward to divine links from our data to
the appropriate VIAF resources.
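
To make the idea concrete, here is a minimal sketch of deriving links through a shared intermediate identifier. It assumes two lookup tables exist: our data mapped to DBpedia, and a hypothetical VIAF-to-DBpedia mapping (whether such a mapping is actually available is the open question above). The `nyt:example-...` identifiers and the pairings are placeholders for illustration, not real mapping data.

```python
def link_via_shared_target(nyt_to_dbpedia, viaf_to_dbpedia):
    """Join two mappings on their shared DBpedia URI, giving NYT id -> VIAF URIs."""
    # Invert the VIAF mapping so it can be looked up by DBpedia URI.
    dbpedia_to_viaf = {}
    for viaf_uri, dbpedia_uri in viaf_to_dbpedia.items():
        dbpedia_to_viaf.setdefault(dbpedia_uri, []).append(viaf_uri)

    # Keep only the NYT entries whose DBpedia URI also appears on the VIAF side.
    return {
        nyt_id: dbpedia_to_viaf[dbpedia_uri]
        for nyt_id, dbpedia_uri in nyt_to_dbpedia.items()
        if dbpedia_uri in dbpedia_to_viaf
    }


# Placeholder data: the nyt: identifiers are invented for this example.
nyt_to_dbpedia = {
    "nyt:example-kennedy": "http://dbpedia.org/resource/Robert_F._Kennedy",
    "nyt:example-kafka": "http://dbpedia.org/resource/Franz_Kafka",
}
viaf_to_dbpedia = {
    "http://viaf.org/viaf/9882357": "http://dbpedia.org/resource/Robert_F._Kennedy",
}

links = link_via_shared_target(nyt_to_dbpedia, viaf_to_dbpedia)
print(links)
```

Entries with no counterpart on the VIAF side (the Kafka placeholder here) are simply left out rather than guessed at.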

All the best,

Evan


Tom Morris

Jan 4, 2010, 8:39:19 PM
to nyt_linked_open_data
On Mon, Jan 4, 2010 at 11:16 AM, web gurl <georgia...@gmail.com> wrote:

> Have you considered using the freely available VIAF (Virtual Authority File)
> as a source for determining what the form of the name should be?

> ...


> Here is the example for Robert Kennedy:
> http://viaf.org/viaf/9882357

What is the license for this data? There doesn't appear to be any
link to a license on that page.

If it's anything like OCLC's other agreements, it's pretty far from
"freely available" and is probably not usable. Not only does the OCLC
prohibit the commercial use of the merged data, they prohibit their
members from transferring *their own data* to any entity that the OCLC
doesn't agree with (including all commercial establishments such as
the NYT).

Tom

Eric Hellman

Jan 4, 2010, 8:46:21 PM
to nyt_linked...@googlegroups.com
I've asked Thom Hickey to reply.

Eric
Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA




Jonathan Gray

Jan 4, 2010, 9:12:02 PM
to nyt_linked...@googlegroups.com
It would be great to add details of licensing/openness to:

http://ckan.net/package/viaf

As in:

http://ckan.net/package/nytimes-linked-open-data

Regarding licensing for data, these may be relevant:

http://www.opendatacommons.org/
http://creativecommons.org/choose/zero

We're also currently looking into an 'attribution' license
specifically meant for data.

--
Jonathan Gray

Community Coordinator
The Open Knowledge Foundation
http://www.okfn.org
