[ol] OpenLibrary data refresh 3 - Hobbit Edition


Ian Davis

May 4, 2009, 4:05:24 PM
to datain...@googlegroups.com
Hi all,

I'm uploading a refresh of the open library data that includes two main changes: linkage to the new LCSH dataset and a first crude stab at FRBRisation.

To provide the subject linkage I downloaded the dump of RDF from http://id.loc.gov/authorities/ and parsed it to create a local database mapping skos:prefLabel to URI (using gdbm). Then I modified my json2rdf.py script to use this database and look up the subjects.
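A minimal sketch of how such a label index could work, not the actual json2rdf.py code. It assumes the dump is in N-Triples form, uses the standard library's `dbm.dumb` as a portable stand-in for gdbm, and matches labels exactly. The function names and the sample concept URI are made up for illustration.

```python
import dbm.dumb
import os
import re
import tempfile

# Matches an N-Triples line of the form: <subject> <skos:prefLabel> "label"@en .
# Escape sequences inside the label are kept as-is (no unescaping in this sketch).
PREF_LABEL = re.compile(
    r'^<([^>]+)>\s+'
    r'<http://www\.w3\.org/2004/02/skos/core#prefLabel>\s+'
    r'"((?:[^"\\]|\\.)*)"'
)

def build_label_index(ntriples_lines, db_path):
    """Store prefLabel -> concept URI for every prefLabel triple seen."""
    with dbm.dumb.open(db_path, "c") as db:
        for line in ntriples_lines:
            match = PREF_LABEL.match(line)
            if match:
                uri, label = match.groups()
                db[label.encode("utf-8")] = uri.encode("utf-8")

def lookup_subjects(subjects, db_path):
    """Return {subject string: concept URI, or None when no label matches}."""
    result = {}
    with dbm.dumb.open(db_path, "r") as db:
        for subject in subjects:
            key = subject.encode("utf-8")
            result[subject] = db[key].decode("utf-8") if key in db else None
    return result

# Illustrative data only: the identifier below is a placeholder, not a real heading.
sample = [
    '<http://id.loc.gov/authorities/sh00000001#concept> '
    '<http://www.w3.org/2004/02/skos/core#prefLabel> "Fantasy fiction"@en .',
]

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "lcsh_labels")
    build_label_index(sample, db_path)
    matches = lookup_subjects(["Fantasy fiction", "No such heading"], db_path)
```

Exact string matching is the simplest choice and probably explains some misses: any difference in punctuation, case or subdivision order between the record and the LCSH label leaves the subject unlinked.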

An example of an item where all the subjects matched:

<http://ol.dataincubator.org/works/6102>

This worked less well:

<http://ol.dataincubator.org/works/5991>

And this didn't match any:

<http://ol.dataincubator.org/works/62514>

I haven't done any in-depth analysis of whether these are subjects that should have matched or if they are simply not available in LCSH.

For the FRBRisation I combined two approaches. For each edition record I created a Work resource and one Manifestation for each ISBN listed in the record. I remembered that Dublin Core had isVersionOf/hasVersion properties, which seemed to fit the semantics I wanted, so I used them to link Works directly to Manifestations, bypassing any Expression layer. The results of this approach were moderately successful, as can be seen from this work, which has 5 manifestations.

<http://ol.dataincubator.org/works/61705>
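A rough sketch of that first approach, assuming one Work per edition record and one Manifestation per ISBN. The works/ path comes from the URLs in this message; the manifestations/ path, the function names and the ISBN values are hypothetical.

```python
DCT = "http://purl.org/dc/terms/"
BASE = "http://ol.dataincubator.org/"

def edition_to_triples(work_id, isbns):
    """Link one Work to a Manifestation per ISBN, in both directions."""
    work = f"{BASE}works/{work_id}"
    triples = []
    for isbn in isbns:
        # Hypothetical URI pattern for manifestations.
        manifestation = f"{BASE}manifestations/{isbn}"
        triples.append((work, DCT + "hasVersion", manifestation))
        triples.append((manifestation, DCT + "isVersionOf", work))
    return triples

def to_ntriples(triples):
    """Serialise (subject, predicate, object) URI triples as N-Triples."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

# Placeholder ISBNs, for illustration only.
triples = edition_to_triples(61705, ["0000000001", "0000000002"])
output = to_ntriples(triples)
```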

The main problem is that there is no distinguishing information between the versions. To address this I grabbed a dump of ThingISBN (which is free for non-commercial use) and parsed it to create a lookup database of ISBN -> work id. I then used that to link Manifestations to Works, only creating a new work resource where there was no work id in ThingISBN.
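The lookup-with-fallback step could look something like the sketch below. It assumes the ThingISBN dump has already been loaded into a plain mapping of ISBN to work id, and that ISBNs missing from it share one locally minted work per edition record. Both the key format ("thingisbn-…", "local-…") and that sharing rule are assumptions, not taken from the real converter.

```python
import itertools

def assign_works(editions, isbn_to_work):
    """Map each ISBN to a work key, preferring ThingISBN work ids.

    editions: list of ISBN lists, one list per edition record.
    isbn_to_work: mapping of ISBN -> ThingISBN work id.
    """
    assignments = {}
    local_ids = itertools.count(1)
    for isbns in editions:
        fallback = None  # minted lazily, at most once per edition record
        for isbn in isbns:
            work_id = isbn_to_work.get(isbn)
            if work_id is not None:
                assignments[isbn] = f"thingisbn-{work_id}"
            else:
                if fallback is None:
                    fallback = f"local-{next(local_ids)}"
                assignments[isbn] = fallback
    return assignments

# Placeholder ISBNs and work ids, for illustration only.
lookup = {"111": "42", "333": "42"}
editions = [["111", "222"], ["333"], ["444", "555"]]
assignments = assign_works(editions, lookup)
```

Here "111" and "333" come from different edition records but end up on the same work because ThingISBN groups them, which is the extra linkage the record-by-record approach cannot provide.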

The coverage of ThingISBN turns out to be pretty good. Here's a work with 3 Manifestations:

<http://ol.dataincubator.org/works/60798>

Since I'm only working with 1% of the total open library corpus I wanted to increase the number of records that were likely to be related. I grepped the entire corpus for hobbit and tolkien and ran those records through the converter.
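For anyone without grep to hand, the same selection can be sketched in Python. This assumes one JSON record per line and a case-insensitive match. Whether the original grep ignored case is not stated, and the sample lines are invented.

```python
import re

KEYWORDS = re.compile(r"hobbit|tolkien", re.IGNORECASE)

def select_records(lines):
    """Return the raw record lines that mention either keyword."""
    return [line for line in lines if KEYWORDS.search(line)]

# Invented sample lines standing in for the dump.
sample_lines = [
    '{"title": "The Hobbit"}',
    '{"title": "Gardening for beginners"}',
    '{"title": "Letters", "by_statement": "J.R.R. Tolkien"}',
]
selected = select_records(sample_lines)
```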

Here's The Hobbit, which looks pretty good:

<http://ol.dataincubator.org/works/59650>

Here's another work that represents The Hobbit:

<http://ol.dataincubator.org/works/152602>

And the annotated Hobbit:

<http://ol.dataincubator.org/works/152611>

I also made some minor formatting changes for skos:prefLabel, which is now composed of title_prefix, title and subtitle. I also removed the dct:isPartOf properties that I added last time - that was my misunderstanding of the VoID convention on datasets. I did a tiny bit of work on formatting series titles but it needs a whole lot more work - I think a lot of information has been lost by open library in their conversion from MARC.
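A small sketch of how a label might be assembled from those three fields. The field names are the ones mentioned above; the separators (a space after the prefix, ": " before the subtitle) and the function name are assumptions rather than what the converter actually does.

```python
def make_pref_label(record):
    """Build a label from title_prefix, title and subtitle, skipping blanks."""
    parts = [
        part.strip()
        for part in (record.get("title_prefix"), record.get("title"))
        if part and part.strip()
    ]
    title = " ".join(parts)
    subtitle = (record.get("subtitle") or "").strip()
    if title and subtitle:
        return f"{title}: {subtitle}"
    return title or subtitle

label = make_pref_label(
    {"title_prefix": "The ", "title": "Hobbit", "subtitle": "or, There and back again"}
)
```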

Ian



Bruce D’Arcus

May 5, 2009, 1:13:01 PM
to Data Incubator


On May 4, 4:05 pm, Ian Davis <m...@iandavis.com> wrote:

> I'm uploading a refresh of the open library data that includes two main
> changes: linkage to the new LCSH dataset and a first crude stab at
> FRBRisation.

Nice Ian. Just a couple of minor comments:

1) as w/periodicals, would be nice to have publishers as URIs if
possible; maybe http://publishers.dataincurbator.org?

2) the versions/manifestations; do you have the input data to be able
to assign them a bibo type as well?

Bruce

Ian Davis

May 5, 2009, 3:54:21 PM
to datain...@googlegroups.com
On Tue, May 5, 2009 at 6:13 PM, Bruce D’Arcus <bda...@gmail.com> wrote:
> On May 4, 4:05 pm, Ian Davis <m...@iandavis.com> wrote:
>
> > I'm uploading a refresh of the open library data that includes two main
> > changes: linkage to the new LCSH dataset and a first crude stab at
> > FRBRisation.
>
> Nice Ian. Just a couple of minor comments:

Thanks :)

> 1) as w/periodicals, would be nice to have publishers as URIs if
> possible; maybe http://publishers.dataincurbator.org?

Yes, that's a good idea. Apart from the publisher data in open library and periodicals sets is there a good database we could target for conversion?

> 2) the versions/manifestations; do you have the input data to be able
> to assign them a bibo type as well?


The records contain a format indicator which could give hints. There is a list of the values I have come across so far at the end of <http://code.google.com/p/dataincubator/wiki/OpenLibrary>. It's not clear to me how to map them to specific classes - I need help here.


Ian

Ross Singer

May 5, 2009, 9:37:47 PM
to datain...@googlegroups.com
On Tue, May 5, 2009 at 3:54 PM, Ian Davis <m...@iandavis.com> wrote:
> Yes, that's a good idea. Apart from the publisher data in open library and
> periodicals sets is there a good database we could target for conversion?
>

We could probably get a good set of journal publisher data from the
CUFTS KB [1] or the old jake data (although it's now quite out of
date). Not sure of the licensing.

I assume the OL is already loaded with the LC bib records in archive.org?

-Ross.
1. http://cufts2.lib.sfu.ca/knowledgebase/

Leigh Dodds

May 6, 2009, 6:10:48 AM
to datain...@googlegroups.com
Hi Ross,

The CUFTS KB looks really useful, plenty of journals and titles in there.

Cheers,

L.

2009/5/6 Ross Singer <rossf...@gmail.com>



--
Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh...@talis.com
http://www.talis.com

Tom Pasley

May 6, 2009, 7:40:19 PM
to datain...@googlegroups.com
Ross,

Thanks for reminding me of CUFTS... your point about archive.org is well made.

Another place to watch (even though its content is also on archive.org) is the Biodiversity Heritage Library, which is getting a helping hand from our friends at OCLC: http://twitter.com/chrisfreeland/statuses/1668746642

Biodiversity Heritage Library is a good source for scientific articles, etc. if that's your area... mostly historic stuff, but useful for entomologists, etc...

cheers,

Tom

Ian Davis

May 6, 2009, 8:16:20 PM
to datain...@googlegroups.com
On Wed, May 6, 2009 at 2:37 AM, Ross Singer <rossf...@gmail.com> wrote:

> I assume the OL is already loaded with the LC bib records in archive.org?

AFAIK it contains the LC data - it certainly looks like LC data in a lot of the records (to my untrained eye and through the JSON filter).
