Automating the production of LODLAM from a Collection Management System's OAI-PMH service


Conal Tuohy

Apr 6, 2016, 2:35:34 AM4/6/16
to Linked Open Data in Libraries, Archives, & Museums
List members interested in producing LOD who already have an OAI-PMH service (or some similar Web API) may want to read my latest blog post:

http://conaltuohy.com/blog/visualizing-government-archives-through-linked-data/

The post describes a gateway system which harvests XML records, converts them to RDF graphs using an XSLT stylesheet, and stores the graphs in a SPARQL graph store. The use of OAI-PMH allows for updates to records to propagate automatically into the RDF world. The use of XSLT to express a mapping between the source XML records and the desired RDF/XML allows for a simple declarative style of mapping (a crosswalk). Other OAI-PMH service providers could use the software to publish their own metadata of whatever schema, in whatever RDF ontology they wish, by modifying the XSLT.
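
To make the crosswalk step concrete, here is a minimal sketch in Python (illustrative only; the real gateway is a Java web application, lxml is assumed, and "crosswalk.xsl" and "record.xml" are hypothetical file names):

    # Apply an XSLT crosswalk to one harvested OAI-PMH record,
    # yielding RDF/XML ready for a graph store.
    from lxml import etree  # lxml bundles an XSLT 1.0 processor

    crosswalk = etree.XSLT(etree.parse("crosswalk.xsl"))  # hypothetical stylesheet
    record = etree.parse("record.xml")  # one record from a ListRecords response
    rdf_xml = crosswalk(record)         # the RDF/XML result tree
    print(etree.tostring(rdf_xml, pretty_print=True).decode())

Swapping in a different stylesheet is the only change another institution would need in order to target its own source schema and preferred ontology.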

I think there are a lot of collecting institutions with OAI-PMH provider software (or for whom setting up e.g. jOAI would not be a big step), and for whom writing (or commissioning) a bit of XSLT would also not be a big step. The idea of this software is to simplify the task of implementing LODLAM without requiring disruptive changes to institutions' existing IT systems.

The software is open source and deployed as a Java web application (e.g. in Tomcat).

Conal


Asa.Let...@prov.vic.gov.au

Apr 6, 2016, 7:22:01 PM4/6/16
to lod...@googlegroups.com
If enough orgs get on board, could this be the beginning of a federated search? Thanks for the passion, Conal!

Asa Letourneau
Online Engagement Officer
T:  03 9348 5759  F:  03 9348 5656
asa.let...@prov.vic.gov.au

Public Record Office Victoria
Victorian Archives Centre | 99 Shiel St North Melbourne VIC 3051
www.prov.vic.gov.au
(Please note that I work from home on Mondays with no remote access.)

PROV News | http://prov.vic.gov.au/publications/blog
Events | http://prov.vic.gov.au/whats-on
Battle to Farm | http://soldiersettlement.prov.vic.gov.au/

Our offices are located on the land of the Kulin Nations. We acknowledge and pay our respects to the Traditional Owners, past and present.

You must not copy, disclose, distribute, store or otherwise use this material without permission. Any personal information in this email must be handled in accordance with the Information Privacy Act 2000 (Vic) and applicable laws. If you are not the intended recipient, please notify the sender immediately and destroy all copies of this email and any attachments. The State does not accept liability in connection with computer viruses, data corruption, delay, interruption, unauthorised access or use.


Ethan Gruber

Apr 7, 2016, 8:44:03 AM4/7/16
to lod...@googlegroups.com
I agree that this could be a jumping-off point for wider aggregation. I wrote something similar for the Orbis Cascade Alliance to harvest archival photographs from OAI-PMH associated with finding aids, in order to enhance their EAD publication framework (the code is at https://github.com/Orbis-Cascade-Alliance/harvester). The harvesting is essentially OAI-PMH -> DPLA Metadata Application Profile, so that Orbis Cascade can easily make the move into being a DPLA Hub. Conal's harvesting mechanism might be adapted similarly, to harvest content and migrate it into a model that can feed Europeana or DPLA. The shortcoming in DPLA right now is that content hubs are the sole providers of data. My institution (the American Numismatic Society) produces good-quality RDF, but there's no way to get our data into DPLA at the moment because there's no functioning hub for the region.

Adapting and deploying harvesters like Conal's would lower the barriers to creating these feeders into larger-scale aggregations. The obstacle to these aggregations isn't necessarily a technical one, but a data-quality one. In my experience, OAI-PMH output mostly contains strings encapsulated in Dublin Core elements, and it is difficult to derive good linked data when there are so few links in the source data.

Ethan

Eric Lease Morgan

Apr 7, 2016, 8:50:36 AM4/7/16
to lod...@googlegroups.com

On Apr 7, 2016, at 2:44 PM, Ethan Gruber <ewg4...@gmail.com> wrote:

>> http://conaltuohy.com/blog/visualizing-government-archives-through-linked-data/
>
> https://github.com/Orbis-Cascade-Alliance/harvester


There is/was a similar “OAI to LOD” tool available a while ago:

oai2lod (https://github.com/behas/oai2lod) - This is a particular implementation of D2RQ Server. More specifically, this tool is an intermediary between OAI-PMH data providers and a linked data publishing system. Configure oai2lod to point to your OAI-PMH server and it will publish the server's metadata as linked data.

When I gave it a go a couple of years ago, it worked pretty well. I was impressed.


Eric Lease Morgan


Asa.Let...@prov.vic.gov.au

Apr 7, 2016, 8:15:12 PM4/7/16
to lod...@googlegroups.com
So good to hear about all these similar solutions. I think you make an excellent point about data quality, Ethan. I might be totally wrong, Conal, but I think our only option for linking out to other sources from our Archival Control model metadata would be to manufacture those links to DBpedia, e.g. linking the Charles Hotham lithograph record in our collection to this DBpedia page:

http://dbpedia.org/page/Charles_Hotham

or possibly to other archives, like State Records New South Wales, that historically have close affiliations, or maybe even to VIAF:

https://viaf.org/viaf/13174831/#Hotham,_Charles,_Sir,_1806-1855


Cheers


Asa Letourneau
Online Engagement Officer



Conal Tuohy

Apr 7, 2016, 9:12:45 PM4/7/16
to Linked Open Data in Libraries, Archives, & Museums
There have been a few tools produced to bridge between OAI-PMH and LOD - it's not an original idea of mine by any means. This is not even the first such bridge that I'VE written, and I'm aware of others, including one from as far back as 2006 by Stefano Mazzocchi at MIT: <http://simile.mit.edu/repository/RDFizers/oai2rdf/> (which uses XSLT 1.0 as the mapping language; the later tools we're discussing all use XSLT 2.0, which is a big step forward).

On a technical note: my work was motivated by problems I'd had earlier with tools that didn't scale to large OAI-PMH datasets. In particular, I was working with the National Library of Australia's EAC-CPF dataset, which has hundreds of thousands of records. Even the Public Record Office Victoria's dataset is quite large: there are only 32k records but they are individually large; an average of  recor). So my latest effort is designed to scale up to datasets of arbitrary size.

Performing an OAI-PMH harvest is, in theory, a recursive process of paging through a potentially very long linked list of responses (each response pointing to the next page). It's tempting to simply use recursion in your programming language to handle it, but your language would then need to optimize tail calls: if you want to scale up arbitrarily, you can't afford to have each OAI-PMH request consume additional memory, or you will eventually crash. If you implement the recursion in XSLT, in particular, you will be limited to harvesting no more data than can fit in your JVM's memory; I believe Ethan's code has this limitation.
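
By way of illustration, here is a hedged Python sketch of that iterative, non-recursive paging (standard library only; the endpoint URL is hypothetical and this is not the actual software):

    # Constant-memory OAI-PMH paging: a while loop driven by the
    # resumptionToken, rather than recursion, so no call stack grows.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url, metadata_prefix="oai_dc"):
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                page = ET.parse(response).getroot()
            for record in page.iter(OAI + "record"):
                yield record  # the caller processes and discards each record
            token = page.find(OAI + "ListRecords/" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                return  # no (or an empty) resumptionToken ends the harvest
            # a resumed request carries only the verb and the token
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # e.g.: for record in harvest("https://example.org/oai"): ...

Memory use is bounded by a single response page, however long the harvest runs.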

The oai2lod system has a different memory limitation, I believe, possibly due to building up a single RDF graph in memory?

I haven't done a lot of testing with my own product (yet), but it has no "call stack", and it stores each RDF graph by performing an HTTP PUT to a SPARQL graph store, so it keeps nothing in memory; profiling it shows that its memory consumption doesn't continue to grow as it runs.

Both Ethan's harvester and mine convert each OAI-PMH record into a distinct graph, and both use the SPARQL 1.1 Graph Store protocol (which dates from 2013) to store the graphs, which I think is a great simplification and a step forward architecturally. The use of a RESTful network protocol to manage the graphs (rather than, say, a Java API) provides another extension point for building on the system: e.g. a web proxy could distribute graphs to multiple, redundant stores, or could add provenance metadata, or trigger reasoning or updates to external indexes, etc.
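
To illustrate that Graph Store interaction, a minimal Python sketch (the store URL and graph-naming scheme are hypothetical; Fuseki, for instance, exposes such an endpoint):

    # Store one record's RDF as a named graph via the SPARQL 1.1 Graph
    # Store Protocol: an HTTP PUT with the graph URI passed as a query
    # parameter. PUT replaces the graph wholesale, which is what lets
    # updated OAI-PMH records propagate cleanly into the RDF world.
    import urllib.parse
    import urllib.request

    def put_graph(store_url, graph_uri, rdf_xml_bytes):
        url = store_url + "?" + urllib.parse.urlencode({"graph": graph_uri})
        request = urllib.request.Request(
            url,
            data=rdf_xml_bytes,
            method="PUT",
            headers={"Content-Type": "application/rdf+xml"},
        )
        with urllib.request.urlopen(request) as response:
            return response.status  # typically 201 if created, 204 if replaced

    # e.g. one graph per OAI-PMH record, named by the record identifier:
    # put_graph("http://localhost:3030/ds/data", "oai:example.org:123", rdf_bytes)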

Conal Tuohy

Apr 7, 2016, 9:29:08 PM4/7/16
to Linked Open Data in Libraries, Archives, & Museums
I think there are a few ways to deal with the general issue which Ethan brought up (that legacy metadata uses strings to identify things, whereas in LOD you need URIs).

I think what you may be suggesting, Asa, is that we can automate the creation of links like the ones you gave? If so, I totally agree. I did an experimental LOD service based on the Collections API of Museum Victoria late last year, and part of that was taking the taxonomic name of each specimen and querying DBpedia for it; it could then grab the DBpedia identifier for that species.

I wrote it up in a blog post at the time: http://conaltuohy.com/blog/museum-names/
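
For anyone curious, that reconciliation step can be sketched in a few lines of Python against DBpedia's public SPARQL endpoint (the exact-label query is illustrative only; matching real-world names robustly takes considerably more care):

    # Look up a name on DBpedia and return the matching resource URI, if any.
    import json
    import urllib.parse
    import urllib.request

    def dbpedia_uri_for(label, lang="en"):
        # naive exact-label match; real code should escape the label
        query = 'SELECT ?s WHERE { ?s rdfs:label "%s"@%s } LIMIT 1' % (label, lang)
        url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
            {"query": query, "format": "application/sparql-results+json"}
        )
        with urllib.request.urlopen(url) as response:
            results = json.load(response)
        bindings = results["results"]["bindings"]
        return bindings[0]["s"]["value"] if bindings else None

    # dbpedia_uri_for("Charles Hotham") might return
    # "http://dbpedia.org/resource/Charles_Hotham"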

Another technique (also mentioned in the blog post) is just to "mint" a new URI from the string: e.g. if you have the strings "kākā", "toucan", "butcher bird", you can mint URIs like "http://example.com/animal/k%C4%81k%C4%81", "http://example.com/animal/toucan", "http://example.com/animal/butcher%20bird", etc. (see the sketch below). Once you aggregate a bunch of graphs which refer to these URIs, and publish those URIs in a LOD environment, they will function as access points to all the resources which referred to them. Then you can independently assign other data to those URIs, linking them together with broader/narrower relationships, synonyms, etc. I wrote a blog post about that, too, actually:

http://conaltuohy.com/blog/taking-control-of-an-uncontrolled-vocabulary/
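
A hypothetical Python version of that minting, matching the example URIs above:

    # Mint a URI by percent-encoding a string under a fixed namespace.
    # The same string always yields the same URI, which is what makes
    # independently produced graphs converge on shared access points.
    from urllib.parse import quote

    def mint_uri(label, namespace="http://example.com/animal/"):
        return namespace + quote(label, safe="")

    for name in ["kākā", "toucan", "butcher bird"]:
        print(mint_uri(name))
    # http://example.com/animal/k%C4%81k%C4%81
    # http://example.com/animal/toucan
    # http://example.com/animal/butcher%20bird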

Stuart A. Yeates

Apr 7, 2016, 9:32:52 PM4/7/16
to lod...@googlegroups.com
It's worth noting that some OAI implementations output a form of MARC, and MARC has the huge attraction of supporting authority control.

Having said that, I'm not aware of any implementations that output authority control in OAI. OJS supports ORCID and outputs oai_marc (see for example http://ojs.lib.ucl.ac.uk/index.php/LaJ/oai?verb=ListRecords&metadataPrefix=oai_marc ), but last time I looked it doesn't output the $0 subfield (see https://www.loc.gov/marc/authority/ecadcntf.html ).

cheers
stuart 

--
...let us be heard from red core to black sky

Conal Tuohy

Apr 8, 2016, 12:42:23 AM4/8/16
to Linked Open Data in Libraries, Archives and Museums

On 8 April 2016 at 11:12, Conal Tuohy <conal...@gmail.com> wrote:
Even the Public Record Office Victoria's dataset is quite large: there are only 32k records but they are individually large; an average of  recor).

Whoops, I was distracted in the middle of that post (by loudly singing birds) and lost my place. What I'd meant to say was that there were 18000 RIF-CS format records totalling 86Mb; almost 5k of XML per record; quite large compared with typical oai_dc records.



Eric Lease Morgan

Apr 8, 2016, 2:37:17 AM4/8/16
to lod...@googlegroups.com

On Apr 8, 2016, at 3:12 AM, Conal Tuohy <conal...@gmail.com> wrote:

> ...On a technical note; my work was motivated by problems I'd had earlier with tools that didn't scale to large OAI-PMH datasets. n particular, I was working with the National Library of Australia's EAC-CPF dataset which has hundreds of thousands of records. Even the Public Record Office Victoria's dataset is quite large: there are only 32k records but they are individually large; an average of recor). So my latest effort is designed to build something to scale up to arbitrary size datasets…

Interesting conversation! And to echo the thread, yes, when it comes to RDF, scale is an issue. It is so much of an issue, I wonder about the short-term feasibility of RDF/linked data. After a while I suppose someone will be able to throw enough hardware at the problem to allow them to store, index, search, display, and analyze the billions and billions and billions and billions… of triples representing the “facts” of the Semantic Web. But right now, I don’t see that happening. Instead, maybe what somebody ought to do is define a relatively small closed world — say all things about a genre of fine art called “Fauvism”, or the writings of Ralph Waldo Emerson, or all things malaria, or the intellectual capital of a university — and then create a triple store to facilitate discovery and analysis. Hmmm… —ELM


Conal Tuohy

Apr 8, 2016, 2:48:41 AM4/8/16
to Linked Open Data in Libraries, Archives and Museums
Sure, loading the entire web of data into a single SPARQL store would be a big ask, but the kinds of datasets I'm thinking about are a lot smaller: e.g. for the Victorian archives we're talking less than a million triples, and most museums and similar institutions would be in the same ballpark, I'm sure. Even for national aggregators like Australia's Trove and New Zealand's DigitalNZ, we'd be talking maybe a few hundred million to a billion triples? That is quite doable with fairly cheap hardware these days.

https://www.w3.org/wiki/LargeTripleStores
 



