Harvesting Elsevier Pure repository


Patrick Vranckx

Sep 7, 2021, 3:08:25 AM
to Dataverse Users Community
Hi,

My testbed runs Dataverse v5.5. I am trying to harvest an Elsevier Pure repo and get a Java error. The dataset seems correct to me.

For example:

<?xml version="1.0" encoding="UTF-8"?>

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2021-09-07T09:00:04Z</responseDate>
  <request metadataPrefix="oai_dc" set="openaire_cris_products" verb="ListRecords">https://pure.unamur.be/ws/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:pure.unamur.be:openaire_cris_products/c6de93dd-4a18-45f1-bf41-b92838f945ee</identifier>
        <datestamp>2018-06-11T08:52:47Z</datestamp>
        <setSpec>openaire_cris_products</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:ns2="http://purl.org/dc/terms/">
          <title xml:lang="fra">L'écrin aux trésors</title>
          <creator>Biart, Guy</creator>
          <creator>Wattecamps, Paul</creator>
          <creator>Delvaux, Jean</creator>
          <creator>Lepièce, Didier</creator>
          <creator>Duchateau, Edouard</creator>
          <publisher>FUNDP. Centre interfacultaire des médias de l'éducation</publisher>
          <date>2000</date>
          <type>other</type>
          <identifier>https://researchportal.unamur.be/en/publications/lecrin-aux-tresors(c6de93dd-4a18-45f1-bf41-b92838f945ee).html</identifier>
          <source>Biart , G , Wattecamps , P , Delvaux , J , Lepièce , D &amp; Duchateau , E , L'écrin aux trésors , 2000 , Produits numériques ou (audio)visuels , FUNDP. Centre interfacultaire des médias de l'éducation , Namur .</source>
          <language>fra</language>
          <rights>info:eu-repo/semantics/restrictedAccess</rights>
        </oai_dc:dc>
      </metadata>
    </record>
    ......
    ......
  </ListRecords>
</OAI-PMH>

and the harvest log says:

<record>
  <date>2021-09-07T06:10:56.869795Z</date>
  <millis>1630995056869</millis>
  <nanos>795000</nanos>
  <sequence>18444</sequence>
  <logger>edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean.unamur2021-09-07T08-10-55</logger>
  <level>SEVERE</level>
  <class>edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean</class>
  <method>logGetRecordException</method>
  <thread>227</thread>
  <message>Exception processing getRecord(), oaiUrl=https://pure.unamur.be/ws/oai, identifier=oai:pure.unamur.be:openaire_cris_products/c6de93dd-4a18-45f1-bf41-b92838f945ee, javax.ejb.EJBTransactionRolledbackException, Exception thrown from bean: java.lang.NullPointerException</message>
</record>

and the server.log:

[2021-09-07T08:13:48.650+0200] [Payara 5.2020.6] [WARNING] [AS-EJB-00056] [javax.enterprise.ejb.container] [tid: _ThreadID=227 _ThreadName=__ejb-thread-pool10] [timeMillis: 1630995228650] [levelValue: 900] [[
  A system exception occurred during an invocation on EJB ImportServiceBean, method: public edu.harvard.iq.dataverse.Dataset edu.harvard.iq.dataverse.api.imports.ImportServiceBean.doImportHarvestedDataset(edu.harvard.iq.dataverse.engine.command.DataverseRequest,edu.harvard.iq.dataverse.harvest.client.HarvestingClient,java.lang.String,java.lang.String,java.io.File,java.util.Date,java.io.PrintWriter) throws edu.harvard.iq.dataverse.api.imports.ImportException,java.io.IOException]]

[2021-09-07T08:13:48.650+0200] [Payara 5.2020.6] [WARNING] [] [javax.enterprise.ejb.container] [tid: _ThreadID=227 _ThreadName=__ejb-thread-pool10] [timeMillis: 1630995228650] [levelValue: 900] [[
 
javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: java.lang.NullPointerException
        at com.sun.ejb.containers.BaseContainer.mapLocal3xException(BaseContainer.java:2368)
......
......


How can I debug this issue?

Regards,

Patrick

Julian Gautier

Sep 7, 2021, 4:24:42 PM
to Dataverse Users Community
Hi Patrick,

I'm not sure what those logs are saying. I remember hearing a few times from developer colleagues that a "NullPointerException" on its own is very unhelpful, and I see it a couple of times in what you shared. There's a GitHub issue at https://github.com/IQSS/dataverse/issues/7546 where someone wrote about how they found more info after seeing "NullPointerException", although the actual cause of their errors might be different from what is happening in your harvesting attempt.

The problem, or one problem later on, might be that the Dataverse software can't verify that the identifier in the record is a DOI or Handle (which in this case it appears not to be). Last year, while improving the harvesting functionality a bit, the community discussed how the Dataverse software's harvesting code requires a DOI or Handle. The restriction was kept in place because removing it would have been more work than was required for harvesting from a particular repository whose records did have DOIs (the Dataverse software just couldn't recognize them).
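
As a quick way to check records before harvesting, here is a minimal Python sketch that lists which dc:identifier values in an oai_dc record look like a DOI or Handle. The two regular expressions and the function name persistent_identifiers are my own simplified assumptions for illustration; they are not the checks that the Dataverse harvesting code actually performs.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumed, simplified patterns for illustration only.
# The Dataverse software's own identifier parsing may differ.
DOI_RE = re.compile(r"^(doi:|https?://(dx\.)?doi\.org/)10\.\d{4,9}/\S+$", re.IGNORECASE)
HANDLE_RE = re.compile(r"^(hdl:|https?://hdl\.handle\.net/)[^/\s]+/\S+$", re.IGNORECASE)


def persistent_identifiers(oai_dc_xml):
    """Return the dc:identifier values that look like a DOI or a Handle."""
    root = ET.fromstring(oai_dc_xml)
    found = []
    # Only elements in the Dublin Core namespace are inspected, so the
    # OAI-PMH header <identifier> (a different namespace) is ignored.
    for element in root.iter(f"{{{DC_NS}}}identifier"):
        value = (element.text or "").strip()
        if DOI_RE.match(value) or HANDLE_RE.match(value):
            found.append(value)
    return found


# Trimmed copy of the metadata block from the record posted above.
SAMPLE = """<oai_dc:dc xmlns="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
  <title xml:lang="fra">L'écrin aux trésors</title>
  <identifier>https://researchportal.unamur.be/en/publications/lecrin-aux-tresors(c6de93dd-4a18-45f1-bf41-b92838f945ee).html</identifier>
</oai_dc:dc>"""

print(persistent_identifiers(SAMPLE))  # prints []
```

An empty list for the sample record fits the guess that its only identifier is a portal URL rather than a DOI or Handle.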

Patrick, is it possible to include a DOI or Handle in the records? Is that something we can encourage the folks managing the Elsevier Pure repo to do?

My impression from last year's discussions is that we can't say for sure whether this is a restriction that the Dataverse community is willing to change, or why. In the discussion I speculated about a few reasons for the restriction, but I haven't done the work to argue for or against it, and I don't think anyone else has yet either. I'm also not sure if this was simply a restriction inherited from some other software or piece of code.

If the restriction does wind up being a problem for the harvesting you're trying to do, could you create a GitHub issue about revisiting it? A shorter-term thing I'd suggest we do is document this restriction on the Managing Harvesting Clients page of the Admin Guide, maybe under the "What if a Run Fails?" section.

Regards,
Julian

Julian Gautier
Product Research Specialist, IQSS

James Myers

Sep 7, 2021, 5:39:53 PM
to dataverse...@googlegroups.com

FWIW: re: NullPointerExceptions in stack traces: The part of the log where you put … in the email should have more info about why there’s a null pointer.

In general, that stack trace has further sections that say ‘caused by’ followed by more lines. The relevant lines are then the ones that show an error in some edu.harvard.iq.dataverse.* classes. If you look at those lines, or post them, you/we can often see exactly which line in the code is where the trouble starts. In this case, I wouldn’t be surprised if that fits what Julian is saying and the root cause is Dataverse not finding the DOI/Handle (getting a null pointer), with the overall failure following as the code that requires that value fails in turn. (It would be good to check though; it’s possible that there’s some other missing/differently-formatted info that doesn’t match Dataverse’s expectations.)
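
If it helps, here is a rough Python sketch of that filtering: it finds the last ‘Caused by’ section in a single stack trace and keeps only the frames from Dataverse’s own package. The function name last_cause_summary is made up for this example, and everything in the sample trace after its first two lines is a hypothetical placeholder, not the real content of the elided log.

```python
def last_cause_summary(stack_trace, package="edu.harvard.iq.dataverse"):
    """Return (header, frames) for the last 'Caused by:' section of one trace.

    header is the 'Caused by: ...' line (or the first line of the trace if
    there is no such section); frames are the 'at ...' lines in that
    section that mention the given package.
    """
    lines = stack_trace.splitlines()
    start = 0
    for index, line in enumerate(lines):
        if line.lstrip().startswith("Caused by:"):
            start = index  # keep overwriting so the last section wins
    section = lines[start:]
    header = section[0].strip() if section else ""
    frames = [
        line.strip()
        for line in section[1:]
        if line.strip().startswith("at ") and package in line
    ]
    return header, frames


# The first two lines come from the server.log excerpt in this thread.
# The remaining lines are HYPOTHETICAL placeholders for illustration.
SAMPLE_TRACE = """javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: java.lang.NullPointerException
        at com.sun.ejb.containers.BaseContainer.mapLocal3xException(BaseContainer.java:2368)
Caused by: java.lang.NullPointerException
        at edu.harvard.iq.dataverse.example.HypotheticalClass.hypotheticalMethod(HypotheticalClass.java:1)
        at org.example.Container.invoke(Container.java:1)
"""

header, frames = last_cause_summary(SAMPLE_TRACE)
print(header)
for frame in frames:
    print("  " + frame)
```

This assumes the text passed in is one stack trace copied out of server.log, not the whole log file, since a full log would contain ‘Caused by’ sections from unrelated errors.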

Often people are reluctant to put a long stack trace in an email or an issue because it makes it less readable (probably true), but the right few lines are really useful in helping figure out what’s wrong. If you can find the last ‘caused by’ section and get the lines about Dataverse’s classes, that may be short enough to cut/paste. If not, one good practice would be to create an issue and then use GitHub’s file upload capability to attach the stack trace.

  -- Jim
