Harvesting client and server


Eunice Soh

Nov 5, 2021, 3:49:17 AM
to Dataverse Users Community
Hi!

My institute is trying to harvest from another Dataverse instance at https://researchdata.nie.edu.sg/oai but has not been successful.

Error message: "Invalid URL. Failed to establish connection and receive a valid server response."

2021-11-05_15-40-29.png


For testing purposes, we've harvested from CIMMYT (v4.20) and Harvard demo (v5.7) successfully. 

2021-11-05_15-38-45.png


Any idea what might be the issue?

I tried harvesting with three instances (two servers and one localhost instance) as the harvesting client. The same error is reproduced on all three; here are the messages from the harvesting client side:

 on localhost (running v5.7)

[2021-11-05T13:13:01.523+0800] [Payara 5.2021.7] [INFO] [] [edu.harvard.iq.dataverse.HarvestingClientsPage] [tid: _ThreadID=86 _ThreadName=http-thread-pool::http-listener-1(2)] [timeMillis: 1636089181523] [levelValue: 800] [[metadataformats: failed;Failed to execute listmetadataformats; No valid response received from the OAI server.]]

on DEV (running v5.5)

 [2021-11-05T13:21:11.891+0800] [Payara 5.2021.4] [INFO] [] [edu.harvard.iq.dataverse.HarvestingClientsPage] [tid: _ThreadID=95 _ThreadName=http-thread-pool::jk-conne  metadataformats: failed;Failed to execute listmetadataformats; No valid response received from the OAI server.]]

on PROD (running v5.5)

[2021-11-05T13:06:39.673+0800] [Payara 5.2021.4] [INFO] [] [edu.harvard.iq.dataverse.HarvestingClientsPage] [tid: _ThreadID=252 _ThreadName=http-thread-pool::jk-conn metadataformats: failed;Failed to execute listmetadataformats; No valid response received from the OAI server.]]

 

The harvesting server has its OAI server enabled and has run the dataset:

nie.png


Ancillary info

I also tried an R script:

library("OAIHarvester")
baseurl <- "https://researchdata.nie.edu.sg/oai"
x <- oaih_identify(baseurl)


Error message

Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  :
Opening and ending tag mismatch: meta line 1 and head [76]

 

 

Question

  • Any ideas on what the problem could be, and whether it is on the harvesting server side or the client side?
  • Any solutions on the matter?


A related post on Google Groups here: https://groups.google.com/g/dataverse-community/c/bDV5zLysg5E/m/JXCLYgBBAAAJ



Kind regards,
Eunice


Don Sizemore

Nov 5, 2021, 7:41:03 AM
to dataverse...@googlegroups.com
Hello,

One WARN I receive from https://researchdata.nie.edu.sg which I don't receive from other test servers:

"WARNING|Payara 5.2021.6|org.apache.http.client.protocol.ResponseProcessCookies|_ThreadID=93;_ThreadName=http-thread-pool::jk-connector(3);_TimeMillis=1636112208432;_LevelValue=900;|
  Invalid cookie header: "Set-Cookie: visid_incap_2569101=flfJEcWLQiOjvVUNmfuzZE8XhWEAAAAAQUIPAAAAAABwKY8sLfk33DA/z0A1MZdu; expires=Fri, 04 Nov 2022 14:53:05 GMT; HttpOnly; path=/; Domain=.nie.edu.sg". Invalid 'expires' attribute: Fri, 04 Nov 2022 14:53:05 GMT|"

If I attempt to visit the page in a web browser, I'm presented with a CAPTCHA.

Don


Philip Durbin

Nov 5, 2021, 11:21:43 AM
to dataverse...@googlegroups.com
Locally I added some extra debugging, which shows up in server.log like this:

[2021-11-05T10:57:47.097-0400] [Payara 5.2021.5] [INFO] [] [edu.harvard.iq.dataverse.HarvestingClientsPage] [tid: _ThreadID=118 _ThreadName=http-thread-pool::http-listener-1(2)] [timeMillis: 1636124267097] [levelValue: 800] [[
  metadataformats: failed;Failed to execute listmetadataformats; No valid response received from the OAI server: com.lyncode.xml.exceptions.XmlReaderException: com.ctc.wstx.exc.WstxParsingException: Unexpected close tag </html>; expected </META>.
 at [row,col {unknown-source}]: [7,13]]]

Then I realized that if I go to https://researchdata.nie.edu.sg/oai?verb=ListMetadataFormats in my browser (that seems to be the URL Dataverse is trying to reach), it looks fine, but if I use curl I get HTML instead of XML. The HTML looks something like this:

<html style="height:100%">
    <head>
        <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
        <meta name="format-detection" content="telephone=no">
        <meta name="viewport" content="initial-scale=1.0">
        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    </head>
    <body style="margin:0px;height:100%">
        <iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=31&xinfo=5-125272236-0%200NNN%20RT%281636125075220%200%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U18&incident_id=621000620230672156-539812140004740997&edet=12&cinfo=04000000&rpinfo=0&cts=4Z8w84cu8cyvHVnEd9s50dYc1bJa2j6Skm6WAtbUWzFIXJPSxyc6JoHw4uqxNkTt&mth=GET" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 621000620230672156-539812140004740997</iframe>
    </body>
</html>

Incapsula seems to be some kind of firewall. It may be blocking requests from my curl command (but not my browser) and from your harvesting client. Maybe you can talk to the people at your institution who manage it.
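
For anyone who wants to repeat the browser-versus-curl comparison from a script, here is a minimal Python sketch. It fetches the same ListMetadataFormats URL and reports whether the body parses as an OAI-PMH document or not. The function names and the optional user_agent parameter are made up for illustration, and sending a different User-Agent is only an assumption about what the firewall might key on. Note that urllib raises an HTTPError if the server answers with a 4xx or 5xx status.

```python
import sys
import urllib.request
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace of the root <OAI-PMH> element in OAI-PMH 2.0 responses.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def classify_oai_body(body: bytes) -> str:
    """Return 'oai-pmh', 'other-xml', or 'not-xml' for a response body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # An HTML challenge page usually fails here, for example because
        # of an unclosed tag such as <META ...>.
        return "not-xml"
    if root.tag == f"{{{OAI_NS}}}OAI-PMH":
        return "oai-pmh"
    return "other-xml"


def fetch_metadata_formats(base_url: str, user_agent: Optional[str] = None) -> bytes:
    """GET <base_url>?verb=ListMetadataFormats and return the raw body.

    user_agent is a hypothetical knob for comparing how the server treats
    different clients; it is not something Dataverse exposes.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(
        f"{base_url}?verb=ListMetadataFormats", headers=headers
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_oai.py https://example.org/oai
    print(classify_oai_body(fetch_metadata_formats(sys.argv[1])))
```

If this prints "not-xml" for a URL that looks fine in a browser, that points at something between the client and the OAI server (such as a firewall) rather than at the harvesting client itself.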

I hope this helps,

Phil

p.s. Here's the extra debugging I added (locally):

$ git diff src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
diff --git a/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java b/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
index d1aaea5079..55fa4cdc24 100644
--- a/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
+++ b/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
@@ -200,7 +200,7 @@ public class OaiHandler implements Serializable {
         try {
             mfIter = sp.listMetadataFormats();
         } catch (InvalidOAIResponse ior) {
-            throw new OaiHandlerException("No valid response received from the OAI server.");
+            throw new OaiHandlerException("No valid response received from the OAI server: " + ior.getLocalizedMessage());
         }
         
         List<String> formats = new ArrayList<>();





Eunice Soh

Nov 7, 2021, 8:30:01 PM
to Dataverse Users Community
Dear all,

Thank you so much for your helpful inputs!

Kind regards,
Eunice

Philip Durbin

Jul 10, 2024, 2:52:48 PM
to dataverse...@googlegroups.com
For anyone who is here because they saw "Invalid URL. Failed to establish connection and receive a valid server response" when creating a harvesting client: this error can also appear when there is a problem with the server's TLS certificate.

The error in server.log looks like this (if you add extra debugging like I did above):

"failed;Failed to execute listmetadataformats; No valid response received from the OAI server: io.gdcc.xoai.serviceprovider.exceptions.InvalidOAIResponse: io.gdcc.xoai.serviceprovider.exceptions.OAIRequestException: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target"
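
Since the same message in the UI has now turned out to have two different causes in this thread, here is a small Python sketch for triaging a server.log line by plain substring matching. It assumes the extra debugging is in place so that the underlying exception appears in the log. The two exception names come from the log excerpts in this thread; the function name and the wording of the causes are made up for illustration, and the list is not exhaustive.

```python
# Exception names are taken from the server.log excerpts in this thread;
# the descriptions are this sketch's own wording, not Dataverse output.
KNOWN_CAUSES = [
    (
        "SSLHandshakeException",
        "TLS problem: the client could not validate the server's certificate",
    ),
    (
        "WstxParsingException",
        "non-XML response: something (for example a firewall) returned HTML",
    ),
]


def triage(log_line: str) -> str:
    """Map a harvesting-client log line to a likely cause, if recognized."""
    for marker, cause in KNOWN_CAUSES:
        if marker in log_line:
            return cause
    return "unknown: no recognized exception name in this line"


if __name__ == "__main__":
    sample = (
        "No valid response received from the OAI server: "
        "javax.net.ssl.SSLHandshakeException: PKIX path building failed"
    )
    print(triage(sample))
```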


