I'm uploading a refresh of the open library data that includes two main changes: linkage to the new LCSH dataset and a first crude stab at FRBRisation.
To provide the subject linkage I downloaded the dump of RDF from http://id.loc.gov/authorities/ and parsed it to create a local database mapping skos:prefLabel to URI (using gdbm). Then I modified my json2rdf.py script to use this database and look up the subjects.
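The lookup step described above might be sketched roughly as follows. This is an illustration, not the actual json2rdf.py code: it assumes the dump is available as N-Triples, reads labels with a simple regular expression rather than a full RDF parser (so escape sequences inside labels are not decoded), and the case and whitespace normalisation in `normalise` is an added assumption. The function names are hypothetical. Any dict-like store works, including one opened with Python's `dbm` module.

```python
import re

PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Matches lines of the form: <subject> <skos:prefLabel> "label"@lang .
TRIPLE = re.compile(
    r'^<([^>]+)>\s+<' + re.escape(PREF_LABEL) + r'>\s+"((?:[^"\\]|\\.)*)"'
)


def normalise(label):
    """Lower-case and collapse whitespace (an assumption, not from the post)."""
    return " ".join(label.lower().split())


def build_label_index(ntriples_lines, db):
    """Store normalised prefLabel -> concept URI in a dict-like database.

    Returns the number of prefLabel triples indexed.
    """
    count = 0
    for line in ntriples_lines:
        match = TRIPLE.match(line)
        if match:
            uri, label = match.groups()
            # dbm-style stores want bytes for keys and values
            db[normalise(label).encode("utf-8")] = uri.encode("utf-8")
            count += 1
    return count


def lookup_subject(db, heading):
    """Return the URI for a subject heading, or None when it does not match."""
    key = normalise(heading).encode("utf-8")
    return db[key].decode("utf-8") if key in db else None
```

For an on-disk index, `db` could be the object returned by `dbm.open(path, "c")`; a plain `dict` is enough for experimenting.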
An example of an item where all the subjects matched:
I haven't done any in-depth analysis of whether these are subjects that should have matched or if they are simply not available in LCSH.
For the FRBRisation I combined two approaches. For each edition record I created a Work resource and one Manifestation for each ISBN listed in the record. I remembered that Dublin Core had isVersionOf/hasVersion properties which seemed to fit the semantics I wanted, so I used them to link Works directly to Manifestations, bypassing any Expression layer. The results of this approach were moderately successful, as can be seen by this work which has 5 manifestations.
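A minimal sketch of this first approach, assuming triples are represented as plain tuples. The `http://example.org/` URI patterns and the function name are hypothetical placeholders, not the URIs used in the dataset; only the Dublin Core terms namespace is real.

```python
DCT = "http://purl.org/dc/terms/"
BASE = "http://example.org/"  # hypothetical base URI, for illustration only


def frbrise_edition(edition_key, isbns):
    """Create one Work per edition record and one Manifestation per ISBN.

    Works and Manifestations are linked directly in both directions with
    dct:isVersionOf / dct:hasVersion; there is no Expression layer.
    Returns the work URI and a list of (subject, predicate, object) tuples.
    """
    work = f"{BASE}works/{edition_key}"
    triples = []
    for isbn in isbns:
        manifestation = f"{BASE}manifestations/{isbn}"
        triples.append((manifestation, DCT + "isVersionOf", work))
        triples.append((work, DCT + "hasVersion", manifestation))
    return work, triples
```

Because the Work is keyed on the edition record, two records for the same book produce two separate Works under this scheme.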
The main problem is that there is no distinguishing information between the versions. To address this I grabbed a dump of ThingISBN (which is free for non-commercial use) and parsed it to create a lookup database of ISBN -> work id. I then used that to link Manifestations to Works, only creating a new Work resource where there was no work id in ThingISBN.
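The fallback logic could look something like the sketch below. It assumes the ThingISBN dump has already been parsed into a dict-like `isbn_to_work` mapping; the URI patterns, the `lt-`/`ol-` prefixes and the function name are hypothetical.

```python
BASE = "http://example.org/"  # hypothetical base URI, for illustration only


def assign_works(edition_key, isbns, isbn_to_work):
    """Map each ISBN in an edition record to a work URI.

    ISBNs known to the lookup share the work identified there, so editions
    from different records can end up under the same Work. ISBNs that are
    not in the lookup fall back to a Work minted from the edition record.
    """
    fallback = f"{BASE}works/ol-{edition_key}"
    assigned = {}
    for isbn in isbns:
        work_id = isbn_to_work.get(isbn)
        if work_id is not None:
            assigned[isbn] = f"{BASE}works/lt-{work_id}"
        else:
            assigned[isbn] = fallback
    return assigned
```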
The coverage of ThingISBN turns out to be pretty good. Here's a work with 3 Manifestations:
Since I'm only working with 1% of the total open library corpus I wanted to increase the number of records that were likely to be related. I grepped the entire corpus for hobbit and tolkien and ran those records through the converter.
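A rough Python equivalent of that grep step, assuming the corpus is stored one record per line. The case-insensitive matching is an assumption, and the function name is hypothetical.

```python
def select_records(lines, keywords=("hobbit", "tolkien")):
    """Yield every line that mentions any of the keywords, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    for line in lines:
        text = line.lower()
        if any(keyword in text for keyword in lowered):
            yield line
```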
I also made some minor formatting changes for skos:prefLabel, which is now composed of title_prefix, title and subtitle. I also removed the dct:isPartOf properties that I added last time - that was my misunderstanding of the voiD convention on datasets. I did a tiny bit of work on formatting series titles but it needs a whole lot more work - I think a lot of information has been lost by open library in their conversion from MARC.
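As an illustration of the label composition, here is one way the three fields could be combined. The post does not say how the parts are joined, so the separators used here (a space after the prefix, a colon before the subtitle) are assumptions, as is the function name.

```python
def make_pref_label(record):
    """Compose a label from title_prefix, title and subtitle.

    Missing or empty fields are skipped. The separators are assumed.
    """
    prefix = (record.get("title_prefix") or "").strip()
    title = (record.get("title") or "").strip()
    subtitle = (record.get("subtitle") or "").strip()
    label = " ".join(part for part in (prefix, title) if part)
    if subtitle:
        label = f"{label}: {subtitle}" if label else subtitle
    return label
```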
On May 4, 4:05 pm, Ian Davis <m...@iandavis.com> wrote:
> I'm uploading a refresh of the open library data that includes two main
> changes: linkage to the new LCSH dataset and a first crude stab at
> FRBRisation.
On Tue, May 5, 2009 at 6:13 PM, Bruce D’Arcus <bdar...@gmail.com> wrote:
> On May 4, 4:05 pm, Ian Davis <m...@iandavis.com> wrote:
> > I'm uploading a refresh of the open library data that includes two main
> > changes: linkage to the new LCSH dataset and a first crude stab at
> > FRBRisation.
Yes, that's a good idea. Apart from the publisher data in open library and periodicals sets is there a good database we could target for conversion?
> 2) the versions/manifestations; do you have the input data to be able
> to assign them a bibo type as well?
The records contain a format indicator which could give hints. There is a list of values I have come across so far at the end of <http://code.google.com/p/dataincubator/wiki/OpenLibrary>. It's not clear to me how to map them to specific classes - I need help here.
On Tue, May 5, 2009 at 3:54 PM, Ian Davis <m...@iandavis.com> wrote:
> Yes, that's a good idea. Apart from the publisher data in open library and
> periodicals sets is there a good database we could target for conversion?
We could probably get a good set of journal publisher data from the CUFTS KB [1] or the old jake data (although it's now quite out of date). Not sure of the licensing.
I assume the OL is already loaded with the LC bib records in archive.org?
Thanks for reminding me of CUFTS... your point about archive.org is well
made.
Another place to watch (even though the content is on there too) is
the Biodiversity Heritage Library, which is getting a helping hand from our
friends at OCLC: http://twitter.com/chrisfreeland/statuses/1668746642
The Biodiversity Heritage Library is a good source for scientific articles
if that's your area. It is mostly historic material, but useful for
entomologists and the like.
On Wed, May 6, 2009 at 1:37 PM, Ross Singer <rossfsin...@gmail.com> wrote:
> On Tue, May 5, 2009 at 3:54 PM, Ian Davis <m...@iandavis.com> wrote:
> > Yes, that's a good idea. Apart from the publisher data in open library and
> > periodicals sets is there a good database we could target for conversion?
> We could probably get a good set of journal publisher data from the
> CUFTS KB [1] or the old jake data (although it's now quite out of
> date). Not sure of the licensing.
> I assume the OL is already loaded with the LC bib records in archive.org?