[Dspace-tech] OAI harvest not processing dc.identifier.uri correctly


Ben Ryan

Aug 26, 2015, 9:33:18 AM
to dspac...@lists.sourceforge.net
Hi,
    I have attempted to harvest from an OAI feed and am having some problems processing the dc.identifier.uri field.
    An example record from the feed is:
 
<record>
<header>
<identifier>oai:generic.eprints.org:9</identifier>
<identifier>http://humbox.ac.uk/9/</identifier>
<datestamp>2012-06-11T18:48:56Z</datestamp>
<setSpec>74797065733D7265736F75726365</setSpec></header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:title>Using EEBO to compare the quarto and Folio editions of Shakespeare's Henry V</dc:title>
<dc:identifier.uri>http://humbox.ac.uk/id/eprint/9</dc:identifier.uri>
<dc:creator>University, Matthew Steggle, Sheffield Hallam</dc:creator>
<dc:description>As EEBO has images of every book printed in England before 1700, it offers students studying Shakespeare the opportunity to look at both the quarto and Folio editions of his plays. By using EEBO to look at different editions of the same play we can start to think about the decisions made by editors when confronted with this dilemma of choice. Which version is best? We can also think about why these differences occur.</dc:description>
<dc:date>2005</dc:date>
<dc:type>Resource</dc:type>
<dc:type>NonPeerReviewed</dc:type>
<dc:format>application/msword</dc:format>
<dc:identifier>http://humbox.ac.uk/9/2/EEBO_Quarto___Folio_of_Henry_V.doc</dc:identifier>
<dc:identifier>Using EEBO to compare the quarto and Folio editions of Shakespeare's Henry V</dc:identifier>
<dc:relation>http://humbox.ac.uk/9/</dc:relation>
<dc:rights>Creative Commons Attribution Non-commercial Share Alike &lt;http://creativecommons.org/licenses/by-nc-sa/2.5/&gt;</dc:rights></oai_dc:dc></metadata></record>
 
The dc.identifier.uri field appears in the record.
 
When I view the item in the full view it shows the field as dc.identifier.uri    http://humbox.ac.uk/id/eprint/9
However, when I view the METS metadata (using http://localhost:8080/xmlui/metadata/handle/123456789/4216/mets.xml) it shows the field as
<dim:field element="identifier.uri" mdschema="dc">
</dim:field>
 
In the database the metadata field is recorded in the metadatavalue table with a metadata_field_id of 72, and the entry in the metadatafieldregistry table shows the element name as identifier.uri, because the field is unknown and I currently have harvester.unknownfield set to add.
 
Can anybody point me to where I should look to see why DSpace is not recognising the field? Is it because of pattern matching for handles?
 
Regards,
    Ben

Tim Donohue

Aug 26, 2015, 9:33:29 AM
to Ben Ryan, dspac...@lists.sourceforge.net
Hi Ben,

It sounds like you are trying to run an OAI-PMH Harvest of another site
(in this case it looks like an EPrints site) from the XMLUI interface.

It looks like the main issue here is that the external site is giving
you *invalid* "oai_dc" metadata. As the OAI-PMH protocol states,
"oai_dc" is supposed to contain only metadata of the form "dc.[element]":
http://www.openarchives.org/OAI/openarchivesprotocol.html

However, in this situation, there's a "dc.identifier.uri" field which is
Qualified Dublin Core (QDC) and not a valid oai_dc metadata field.

This field is misunderstood by the DSpace OAI-PMH harvester, as the
harvester expects all fields to be valid oai_dc metadata.

So, unfortunately, the main issue here is that the external site you are
harvesting is returning invalid metadata.
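
To see which fields in a feed fall outside plain "oai_dc", a quick check along these lines may help. This is only a minimal sketch in Python, not anything DSpace ships: the function name find_invalid_dc_elements and the trimmed SAMPLE record are made up for illustration, and the element list is the fifteen elements of the Dublin Core Metadata Element Set 1.1.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the Dublin Core Metadata Element Set 1.1.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def find_invalid_dc_elements(oai_dc_xml):
    """Return local names of dc: children that are not simple DC elements.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(oai_dc_xml)
    prefix = "{" + DC_NS + "}"
    invalid = []
    for child in root:
        if not child.tag.startswith(prefix):
            continue  # ignore anything outside the dc: namespace
        local_name = child.tag[len(prefix):]
        if local_name not in DC_ELEMENTS:
            invalid.append(local_name)
    return invalid


# Trimmed-down version of the record quoted earlier in this thread.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>Example title</dc:title>
<dc:identifier.uri>http://humbox.ac.uk/id/eprint/9</dc:identifier.uri>
<dc:identifier>http://humbox.ac.uk/9/</dc:identifier>
</oai_dc:dc>"""

print(find_invalid_dc_elements(SAMPLE))
```

For the sample above, the only element flagged is identifier.uri, which is the qualified field causing the trouble here.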

The only way I can think of to "hack" a fix on the DSpace side of things
would be to modify the crosswalk that DSpace is using to transform the
"oai_dc" metadata into its internal Qualified Dublin Core schema. The
crosswalk DSpace uses to perform this task is:
org.dspace.content.crosswalk.OAIDCIngestionCrosswalk

It is configured by default in your dspace.cfg as the crosswalk to use
whenever DSpace encounters "dc:" namespaced fields (which are what you
see in your "oai_dc" metadata output below). That configuration is in
this area of your dspace.cfg:
https://github.com/DSpace/DSpace/blob/master/dspace/config/dspace.cfg#L484

Here are a few options I can think of:

* You could create a *custom* crosswalk based on the
"OAIDCIngestionCrosswalk" that properly parses out this
"dc.identifier.uri" field and map it to the same field in DSpace. You'd
want to configure this modified crosswalk as being the one used for "dc"
metadata (see link above).

* OR, it *might* be possible to just configure DSpace's QDCCrosswalk
(which can crosswalk Qualified Dublin Core) as the "dc:" metadata
crosswalk. You'd probably only want to do this temporarily & you'd want
to test this on a Test/Development Server (as I've *never* tried this
and am not sure what would happen, so it may error out). To do that,
you'd change the "dc" crosswalk config to point at the QDCCrosswalk
class, e.g.

plugin.named.org.dspace.content.crosswalk.IngestionCrosswalk = \
...
org.dspace.content.crosswalk.QDCCrosswalk = dc, \
...
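
For the first option, the core of the change would be splitting a dotted name such as "identifier.uri" into an element and a qualifier before the value is stored. A real crosswalk would be a Java class inside DSpace; the sketch below only illustrates that splitting step in Python, and split_qualified_name is a hypothetical helper, not a DSpace method.

```python
def split_qualified_name(local_name):
    """Split 'identifier.uri' into ('identifier', 'uri').

    Unqualified names such as 'title' give ('title', None).
    Hypothetical helper for illustration only.
    """
    element, _, qualifier = local_name.partition(".")
    return element, (qualifier or None)


print(split_qualified_name("identifier.uri"))
print(split_qualified_name("title"))
```

With the split done, the value could be stored as dc.identifier.uri rather than under an unknown element literally named "identifier.uri".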

Good luck!

- Tim

Tim Donohue

Aug 26, 2015, 9:33:31 AM
to Ben Ryan, dspac...@lists.sourceforge.net
Ben,

One last thing to add. It's also possible that the external site may
provide other metadata formats that you can harvest via OAI-PMH.

So, even though their "oai_dc" format may be invalid, there may be
another format you can use.

To determine if other metadata formats are available for harvesting, you
can query the OAI-PMH interface using the "ListMetadataFormats" command
described at:
http://www.openarchives.org/OAI/openarchivesprotocol.html#ListMetadataFormats

For example, here's a list of all the metadata formats that are
available for harvesting from our demo.dspace.org server:
http://demo.dspace.org/oai/request?verb=ListMetadataFormats
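
If you would rather script that check, something like the following might do. It is a rough sketch in Python: list_formats_url and parse_metadata_prefixes are made-up names, and SAMPLE_RESPONSE is an invented, shortened response, not real output from any server. Fetching the URL (for example with urllib.request.urlopen) is left out so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_formats_url(base_url):
    """Build the ListMetadataFormats request URL for an OAI-PMH base URL."""
    return base_url + "?" + urlencode({"verb": "ListMetadataFormats"})


def parse_metadata_prefixes(response_xml):
    """Return every metadataPrefix found in a ListMetadataFormats response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter("{%s}metadataPrefix" % OAI_NS)]


# Invented, shortened response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>qdc</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

print(list_formats_url("http://demo.dspace.org/oai/request"))
print(parse_metadata_prefixes(SAMPLE_RESPONSE))
```

Any prefix other than oai_dc in the real response is a candidate format to try harvesting instead.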

- Tim

Benjamin Ryan

Aug 26, 2015, 9:33:38 AM
to Tim Donohue, Ben Ryan, dspac...@lists.sourceforge.net
Tim,
Thanks for the info.
This has me a little confused, as I was sure I picked up this fix from a reply on the mailing list. Am I right in thinking that DSpace needs the dc.identifier.uri field so that it can display a link to the harvested resource, instead of minting a new URI that links to the external handle system and is then resolved back to the DSpace resource by the handle server?
If harvesting plain oai_dc cannot support displaying a link back to the harvested resource, how is this normally achieved?

Regards,
Ben

------------------------------------------------------------------
Dr Ben Ryan
Jorum Technical Coordinator (Services)

5.12 Roscoe Building
The University of Manchester
Oxford Road
Manchester
M13 9PL
Tel: 0160 275 6039
E-mail: benjam...@manchester.ac.uk
------------------------------------------------------------------

Tim Donohue

Aug 26, 2015, 9:33:53 AM
to Benjamin Ryan, dspac...@lists.sourceforge.net, Ben Ryan
Hi Ben,

When harvesting items via OAI-PMH with "oai_dc", DSpace harvests all
external identifiers as "dc.identifier".

However, just before it saves the item, it will *search* through all
'dc.identifier' fields to attempt to find anything that looks to be a
Handle (it uses the 'harvester.acceptedHandleServer' config in the oai.cfg
file). If it finds something that looks like a Handle, then it assigns
that handle to the harvested item.

This logic occurs in the OAIHarvester class.
https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/harvest/OAIHarvester.java#L528

That area of the code uses the "extractHandle()" method which is what
searches all 'dc.identifier' fields for a string that looks like the
Handle:
https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/harvest/OAIHarvester.java#L605
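
The general idea of that search can be sketched as below. This is not the DSpace code (which is Java and may differ in detail), just an approximation in Python. The function name extract_handle mirrors the method mentioned above, but its signature, the regular expression, and the default server "hdl.handle.net" are all assumptions made for illustration.

```python
import re


def extract_handle(identifiers, accepted_servers=("hdl.handle.net",)):
    """Return the first 'prefix/suffix' handle found in the identifiers.

    An identifier only counts if it contains one of the accepted handle
    server names followed by '/prefix/suffix'. Returns None otherwise.
    Approximation for illustration, not the DSpace implementation.
    """
    for value in identifiers:
        for server in accepted_servers:
            pattern = re.escape(server) + r"/([^/\s]+/\S+)"
            match = re.search(pattern, value)
            if match:
                return match.group(1)
    return None


print(extract_handle(["http://humbox.ac.uk/9/",
                      "http://hdl.handle.net/123456789/4216"]))
print(extract_handle(["http://humbox.ac.uk/9/"]))
```

Under this logic, an EPrints URL such as http://humbox.ac.uk/9/ never matches, so no existing handle is picked up from a record like the one in this thread.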

Hopefully that better explains what is going on in the OAI Harvester.

- Tim

Benjamin Ryan

Aug 26, 2015, 9:35:12 AM
to Tim Donohue, dspac...@lists.sourceforge.net
Tim,
Thanks for the info, I was looking at the code as it happens.
I will get the OAI feed altered to remove the uri qualifier. I will probably also have to add some code to the class, as the OAI feed will contain a number of identifiers (sometimes handles, sometimes who knows what), with checks to see whether a supplied identifier is a valid URL that can be used to link to the harvested content.
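
The kind of check described here might look roughly like this. It is a sketch in Python rather than the Java that would actually go into the harvester class, and is_linkable_url and pick_link are hypothetical names. It only tests that a value parses as an http or https URL with a host; it does not test that the URL resolves.

```python
from urllib.parse import urlparse


def is_linkable_url(value):
    """True if value parses as an http(s) URL with a host part."""
    try:
        parsed = urlparse(value.strip())
    except ValueError:
        return False  # e.g. malformed IPv6 literal
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def pick_link(identifiers):
    """Return the first identifier that looks like a usable URL, else None."""
    for value in identifiers:
        if is_linkable_url(value):
            return value
    return None


# The two dc:identifier values from the example record in this thread.
identifiers = [
    "http://humbox.ac.uk/9/2/EEBO_Quarto___Folio_of_Henry_V.doc",
    "Using EEBO to compare the quarto and Folio editions of Shakespeare's Henry V",
]
print(pick_link(identifiers))
```

Taking the first match is a simplification: in the example record the first URL is the document file rather than the landing page, so a real check would need some rule for preferring one over the other.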

This is a result of the OAI-PMH spec being loose in this area (not surprising given the broad intent of OAI-PMH). It is not a problem for DSpace-to-DSpace harvesting, but it is when harvesting into DSpace from other platforms.

Regards,
Ben

------------------------------------------------------------------
Dr Ben Ryan
Jorum Technical Coordinator (Services)

5.13 Roscoe Building