Failing to load Zenodo set list when configuring harvesting client


Thomas Jouneau

Dec 1, 2021, 9:02:17 AM
to Dataverse Users Community
Hi

I know the Zenodo harvesting is not perfect. However, I once was able to
at least partially harvest some communities, and to properly display the
set list during the client configuration.

Now the configuration page shows no sets (the list stays empty). The server
log contains this:

[2021-12-01T14:52:27.731+0100] [Payara 5.2020] [WARNING] []
[edu.harvard.iq.dataverse.HarvestingClientsPage] [tid: _ThreadID=89
_ThreadName=http-thread-pool::jk-connector(2)] [timeMillis:
1638366747731] [levelValue: 900] [[
  Failed to execute ListSets;
com.lyncode.xoai.serviceprovider.exceptions.HttpException: Error
querying service. Returned HTTP Status Code: 500]]

Would you have any idea for a possible explanation?

The direct request https://zenodo.org/oai2d?verb=listSets works great.

Is there a (not too risky) way to manually specify the set in the database?

Thanks

Thomas

Philip Durbin

Dec 1, 2021, 9:28:48 AM
to dataverse...@googlegroups.com
Hi Thomas,

Harvesting operations like "create set" have been available via API for some time but we only recently got around to documenting them in 5.7. Rather than manipulating the database directly, I'd suggest giving them a try: https://guides.dataverse.org/en/5.8/api/native-api.html#managing-harvesting-server-and-sets

It's a bummer that more information isn't available beyond that 500 error. Please feel free to open an issue about better error handling. If you could include the steps you're taking so we can reproduce it, it would be much appreciated.

Thanks,

Phil


Thomas Jouneau

Dec 1, 2021, 9:36:55 AM
to dataverse...@googlegroups.com

Hi Philip

Thanks for this prompt answer, much appreciated.

I checked the documented API commands earlier, however if I'm not mistaken they only deal with the "OAI server" side, not the clients.

Are there any API commands that can interact with clients, and where do I find them?

Best

Thomas

Philip Durbin

Dec 1, 2021, 10:00:05 AM
to dataverse...@googlegroups.com
Yes, there are API commands for harvesting clients. I'm sorry to report that they aren't documented. (Please feel free to create an issue about this.) You may be able to figure them out from the Java code: https://github.com/IQSS/dataverse/blob/v5.8/src/main/java/edu/harvard/iq/dataverse/api/HarvestingClients.java



Thomas Jouneau

Dec 1, 2021, 10:17:59 AM
to dataverse...@googlegroups.com

Thanks! Coincidentally, I did check this just before your mail ;)

My use case would be to create a client manually with a JSON file.

I was thinking of something along these lines:

curl -H X-Dataverse-key:$API_TOKEN -X POST -H "Content-Type: application/json" $SERVER_URL/api/harvest/clients --upload-file client.json

with client.json structured like this (e.g. for the "univ-lorraine" set):

{
    "nickName": "zenodo_univ-lorraine",
    "dataverseAlias": "zenodo_univ-lorraine",
    "type": "oai",
    "harvestUrl": "https://zenodo.org/oai2d",
    "archiveUrl": "https://zenodo.org",
    "archiveDescription": "xxxxx",
    "metadataFormat": "oai_dc",
    "set": "user-univ-lorraine",
    "schedule": "none",
    "status": "inActive"
  }

Would something like this be likely to work?

Best

Thomas

Philip Durbin

Dec 1, 2021, 10:44:44 AM
to dataverse...@googlegroups.com
I think so? One thought is that you could create a client in the GUI and then download it as JSON via the API. I assume the format is the same (I didn't write this code).

Please keep us posted!

Thomas Jouneau

Dec 1, 2021, 10:56:58 AM
to dataverse...@googlegroups.com, Philip Durbin

It does not work; the server replies with:

{"status":"ERROR","code":405,"message":"API endpoint does not support this method. Consult our API guide at http://guides.dataverse.org.","requestUrl":"https://bac-dataverse.univ-lorraine.fr/api/v1/harvest/clients","requestMethod":"POST"}

The POST method seems supported in the source code you just linked to.

So, not sure what went wrong...

Thomas

Philip Durbin

Dec 1, 2021, 11:02:22 AM
to dataverse...@googlegroups.com
I think you have to put the nickName ("zenodo_univ-lorraine" in your example) as a path parameter like this (but still also in the JSON):


You also need the nickName there to edit with PUT, it seems, which makes sense.
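
A minimal sketch of the path-parameter form described above, using only Python's standard library. It builds the request without sending it. The endpoint shape `/api/harvest/clients/{nickName}` follows the description in this message; the server URL and API token are hypothetical placeholders, and nothing in this thread confirms that "create" actually succeeds this way.

```python
import json
import urllib.parse
import urllib.request


def build_client_request(server_url, api_token, client, method="POST"):
    """Build (but do not send) a request whose path ends with the client's nickName."""
    nickname = client["nickName"]
    # Assumption: the endpoint is /api/harvest/clients/{nickName}, as described above.
    path = "/api/harvest/clients/" + urllib.parse.quote(nickname, safe="")
    return urllib.request.Request(
        server_url.rstrip("/") + path,
        data=json.dumps(client).encode("utf-8"),  # nickName stays in the JSON too
        method=method,
        headers={
            "X-Dataverse-key": api_token,
            "Content-Type": "application/json",
        },
    )


# Hypothetical server and token, for illustration only.
request = build_client_request(
    "https://dataverse.example.edu",
    "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    {
        "nickName": "zenodo_univ-lorraine",
        "dataverseAlias": "zenodo_univ-lorraine",
        "type": "oai",
        "harvestUrl": "https://zenodo.org/oai2d",
        "set": "user-univ-lorraine",
    },
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Passing `method="PUT"` gives the edit form with the same URL shape.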

Thomas Jouneau

Dec 1, 2021, 11:39:11 AM
to dataverse...@googlegroups.com

You can't create a client with PUT. So I tried a slightly different approach, first creating zenodo_lmops (I changed the set) through the GUI and trying to complete the "set" field with PUT.

Sending this :

curl -H X-Dataverse-key:$API_TOKEN -X PUT -H "Content-Type: application/json" $SERVER_URL/api/harvest/clients/zenodo_lmops --upload-file client_modify.json

with my JSON file like this:

{   
    "nickName": "zenodo_lmops", 
    "set": "user-lmops"
  }

My Dataverse instance was not happy and went down (it's only a test instance, no harm done) :

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at 
 root@localhost to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
</body></html>

Did I break some rule?

Thomas

Philip Durbin

Dec 1, 2021, 4:06:27 PM
to dataverse...@googlegroups.com
The code seems to expect more JSON. Something like this:

{
  "dataverseAlias": "root",
  "nickName": "client1",
  "type": "oai",
  "harvestUrl": "https://demo.dataverse.org/oai",
  "archiveUrl": "https://demo.dataverse.org",
  "archiveDescription": "Description of the archive.",
  "metadataFormat": "oai_dc",
  "set": "foobar"
}

The guides should probably say something like this:

To edit a harvesting client you must supply a JSON file with existing and new fields. It is best to start with JSON from the API (see "List Harvesting Clients").
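
The "existing and new fields" advice can be sketched as a small merge step: take the JSON the API returned for the client, overlay the changes, and send the result. The list of fields kept below is taken from the example above; treating everything else (such as "schedule" or "status") as not editable is an assumption of this sketch, not a confirmed rule of the API.

```python
import json

# Assumption: these are the fields the edit call expects, copied from the example above.
EDITABLE_FIELDS = (
    "dataverseAlias", "nickName", "type", "harvestUrl",
    "archiveUrl", "archiveDescription", "metadataFormat", "set",
)


def prepare_edit_payload(existing, changes):
    """Merge new values into a client's existing JSON and keep only the fields above."""
    merged = {**existing, **changes}
    if merged.get("nickName") != existing.get("nickName"):
        raise ValueError("nickName identifies the client and should not change")
    return {key: merged[key] for key in EDITABLE_FIELDS if key in merged}


# Hypothetical JSON for a client created in the GUI without a set.
existing = {
    "nickName": "zenodo_lmops",
    "dataverseAlias": "lmops",
    "type": "oai",
    "harvestUrl": "https://zenodo.org/oai2d",
    "archiveUrl": "https://zenodo.org",
    "archiveDescription": "xxxxx",
    "metadataFormat": "oai_dc",
    "schedule": "none",
    "status": "inActive",
}

payload = prepare_edit_payload(existing, {"set": "user-lmops"})
print(json.dumps(payload, indent=2))
```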

https://github.com/IQSS/dataverse/issues/8267 is the issue I created to document these APIs but as I played with them I discovered I couldn't get "create" to work. The stuff I got working (list, edit) I put into a branch, in this commit: https://github.com/IQSS/dataverse/commit/1364349f56793c5a3a9d53bacbf7710872e43c90

I hope that helps,

Phil

Thomas Jouneau

Dec 2, 2021, 3:28:24 AM
to dataverse...@googlegroups.com

Hi Philip

Thanks a lot for this, the issue, the commit and your help.

I'm sorry to say that it still doesn't work (FWIW I'm running Dataverse 5.2).

I sent this :

curl -H X-Dataverse-key:$API_TOKEN -X PUT -H "Content-Type: application/json" $SERVER_URL/api/harvest/clients/zenodo_lmops --upload-file client.json

where client.json looks like this (I removed only the information about the last harvests):

{
    "nickName": "zenodo_lmops",
    "dataverseAlias": "lmops",
    "type": "oai",
    "harvestUrl": "https://zenodo.org/oai2d",
    "archiveUrl": "https://zenodo.org",
    "archiveDescription": "Moissonné depuis la collection LMOPS de l'entrepôt Zenodo. En cliquant sur ce jeu de données, vous serez redirigé vers Zenodo.",
    "metadataFormat": "oai_dc",
    "set": "user-lmops",
    "schedule": "none",
    "status": "inActive",
  }
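
One detail in the file above can be checked independently of the API: the comma after the "status" line leaves a trailing comma before the closing brace, which strict JSON parsers reject. Whether that is what produces the 500 here is not established in this thread. The sketch below uses only Python's standard `json` module, and the two sample strings are shortened stand-ins for the real file.

```python
import json


def check_json(text):
    """Return None if text is strict JSON, otherwise a short description of the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        return f"line {error.lineno}, column {error.colno}: {error.msg}"
    return None


# Shortened stand-ins for client.json, with and without the trailing comma.
with_trailing_comma = '{\n  "set": "user-lmops",\n  "status": "inActive",\n}'
without_trailing_comma = '{\n  "set": "user-lmops",\n  "status": "inActive"\n}'

print(check_json(with_trailing_comma))     # a parse error message
print(check_json(without_trailing_comma))  # None
```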

I got the same message (500) as before; however, I'm happy to report that the application did not go down this time.

The server.log file does not show anything particularly relevant, just stopping at :
[2021-12-02T09:00:30.635+0100] [Payara 5.2020] [INFO] [] [edu.harvard.iq.dataverse.api.HarvestingClients] [tid: _ThreadID=89 _ThreadName=http-thread-pool::jk-connector(1)] [timeMillis: 1638432030635] [levelValue: 800] [[
  retrieved Harvesting Client zenodo_lmops with the GetHarvestingClient command.]]

Sincerely

Thomas

Philip Durbin

Dec 2, 2021, 8:36:58 AM
to dataverse...@googlegroups.com
Bummer. Whelp, I guess I'd suggest changing the database directly. (I hate saying that.) I believe you'd want to modify the "harvestingset" column in the "harvestingclient" table. Here's more on that table: https://guides.dataverse.org/en/5.4/schemaspy/tables/harvestingclient.html
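
For illustration only, here is what a guarded version of that update might look like. Dataverse runs on PostgreSQL; the sketch uses Python's built-in `sqlite3` with an in-memory stand-in table so that it runs anywhere. The table and "harvestingset" column names come from the message above, while the "name" column used to find the client by nickname is an assumption to verify against the schema page linked above. With a PostgreSQL driver the placeholder syntax would differ.

```python
import sqlite3


def set_harvesting_set(connection, nickname, new_set):
    """Update harvestingset for exactly one client; roll back on any other row count."""
    cursor = connection.execute(
        # Assumption: the client's nickname is stored in a column called "name".
        "UPDATE harvestingclient SET harvestingset = ? WHERE name = ?",
        (new_set, nickname),
    )
    if cursor.rowcount != 1:
        connection.rollback()
        raise LookupError(f"expected 1 matching client, found {cursor.rowcount}")
    connection.commit()


# In-memory stand-in for the real table, reduced to the two columns used here.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE harvestingclient (name TEXT, harvestingset TEXT)")
connection.execute("INSERT INTO harvestingclient VALUES ('zenodo_lmops', NULL)")
connection.commit()

set_harvesting_set(connection, "zenodo_lmops", "user-lmops")
print(connection.execute("SELECT name, harvestingset FROM harvestingclient").fetchall())
```

Refusing to commit unless exactly one row matched is the point of the sketch: a mistyped nickname then changes nothing instead of failing silently.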

Yes, the issue I created was about adding docs. You are very welcome to open an issue or two about the bugs and problems you're seeing.

Thanks,

Phil

Thomas Jouneau

Dec 3, 2021, 5:12:32 AM
to dataverse...@googlegroups.com

Hi Philip,

Yes, this does look like a last resort. We might try it on a separate instance, but first I'd like to go back to the original problem (trying to use the API was only a workaround).

I tried to look more closely at why the GUI doesn't get the list of sets from Zenodo, and I'm puzzled.

The function works fine with PANGAEA and Ortolang, so there doesn't seem to be any problem apart from Zenodo. And again, the function USED to work with Zenodo, so this is new.

When looking at the log I find a 500, which doesn't tell us anything:

[2021-12-03T10:34:15.339+0100] [Payara 
5.2020] [WARNING] [] [edu.harvard.iq.dataverse.HarvestingClientsPage] 
[tid: _ThreadID=92 _ThreadName=http-thread-pool::jk-connector(3)] 
[timeMillis: 1638524055339] [levelValue: 900] [[

  Failed to execute ListSets; 
com.lyncode.xoai.serviceprovider.exceptions.HttpException: Error 
querying service. Returned HTTP Status Code: 500]]

When I simply request

https://zenodo.org/oai2d?verb=ListSets

from the console on the same VM, the list of sets is retrieved in XML without any problem.

(I asked the people who administer the VM here to check the port configuration etc., just in case, but that would not explain why only Zenodo fails.)

What would be your take on this?

Have a nice day

Thomas

Valentina Pasquale

Dec 6, 2021, 11:03:49 AM
to dataverse...@googlegroups.com
Dear Thomas, dear Philip,

I am also very interested in understanding why Zenodo harvesting is not working as before.

On November 16th, I sent an email to this Google group, asking for help in fixing some bugs, but it was definitely working, even if not perfectly. It wasn't displaying all sets, but some of them were correctly imported, and I could harvest one of them. Please see: https://mail.google.com/mail/u/0/?tab=rm&ogbl#inbox/FMfcgzGlksFnldWXGFTJPLSmxFFWpSfq.
During those days Zenodo was very slow and it went offline for some hours.
The day after (November 17th) it was no longer working and I encountered the same problem described by Thomas. I thought something had changed on the Zenodo side. I don't know whether this is a useful piece of information for Zenodo developers to understand what is going on.

Thanks.

Best regards,

Valentina
IIT Dataverse




Thomas Jouneau

Dec 6, 2021, 12:06:19 PM
to dataverse...@googlegroups.com

Dear Valentina, all,

Thanks for this. My experience matches yours exactly: we went from harvesting that worked intermittently to nothing at all.

The link you posted doesn't work: does it point to a Zenodo Google group?

Do you think it would be useful to draft a request to the Zenodo teams? Could we maybe gather the elements we have and start this?

The fact is that, as I posted earlier, the ListSets request sent manually to Zenodo (https://zenodo.org/oai2d/?verb=ListSets) works perfectly. So it would also help to know the exact form of the request sent by Dataverse.
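
Until the exact request is known, a manual request can at least be made repeatable and varied. The sketch below uses only Python's standard library: it builds an OAI-PMH ListSets URL and returns the HTTP status code even when the server answers with an error. This thread does not establish what Dataverse sends; request headers are just one thing that can differ between a browser or console request and one made by a library. The User-Agent value in the comments is hypothetical.

```python
import urllib.error
import urllib.parse
import urllib.request


def list_sets_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListSets URL, optionally continuing from a resumption token."""
    params = {"verb": "ListSets"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    return f"{base_url}?{urllib.parse.urlencode(params)}"


def probe(url, headers=None, timeout=30):
    """Send a GET request and return the HTTP status code, even for error responses."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


url = list_sets_url("https://zenodo.org/oai2d")
print(url)
# Network calls, left commented out so the sketch has no side effects:
# print(probe(url))
# print(probe(url, headers={"User-Agent": "hypothetical-client/1.0"}))
```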

Best

Thomas

Valentina Pasquale

Dec 7, 2021, 4:19:40 AM
to dataverse...@googlegroups.com
Dear Thomas,

I do not know why my link does not work. I was referring to the message I sent to this Dataverse Users Google Group on Nov 16th, whose subject was "OAI harvesting from Zenodo failures". Maybe this one works: https://groups.google.com/d/msgid/dataverse-community/63d93440-280a-4578-99c4-ddc9f32b4333n%40googlegroups.com?utm_medium=email&utm_source=footer
I have not written to the Zenodo team yet.

I agree that we need to know which request Dataverse sends to Zenodo, but I have no idea how to retrieve this information or debug it. Maybe @Philip or the Dataverse team can help us?
Once we have this information, we may be able to understand why it does not work and possibly open a ticket with Zenodo. I also tried the manual request and it works (although it returns only part of the sets).

Anyway, I am definitely interested in trying to fix this.

Thanks for the collaboration.

Best wishes,

Valentina



Philip Durbin

Dec 7, 2021, 11:56:07 AM
to dataverse...@googlegroups.com
Hello Party People,

Sorry, I was away writing code. Let's see what's cooking here.

Thomas, the only recent-ish change (from 4.20, a year and a half ago) I can think of is https://github.com/IQSS/dataverse/pull/6686 where we addressed an issue whereby that "list sets" verb was timing out. We truncate the list with the idea that you can configure the set via API (like you're trying to do now).

This probably has something to do with the "properly display the set list during the client configuration" part of your original question.

In addition, it surely relates to Valentina's comments in the other thread: "when setting the harvesting client, some times the list of available sets is completely empty, some other times it contains only part of the OAI sets (i.e. Zenodo communities) and in this case a warning is displayed, saying that not all sets have been retrieved due to time-out problems. Do you think that these issues could be solved anyway? e.g. by changing any setting?"

I'm not sure there's an easy solution to the timeout problem, the "too many sets" problem. I seem to remember that it could take 10 minutes for the full list of sets to download*. Maybe we could download lists of sets overnight or something? And tell the user that the list of sets is a few hours old? I'm just thinking out loud.
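
To make the "too many sets" trade-off concrete, here is a rough sketch of paging through ListSets with a cap on the number of pages, reporting whether the list was cut short. It is not how Dataverse implements the truncation (this thread does not show that code). The page-fetching function is passed in, and the canned XML below stands in for a real repository.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_sets(xml_text):
    """Return (set specs, resumption token or None) from one ListSets response."""
    root = ET.fromstring(xml_text)
    specs = [el.text for el in root.findall(".//oai:set/oai:setSpec", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return specs, token


def collect_sets(fetch_page, max_pages):
    """Follow resumption tokens for at most max_pages pages.

    fetch_page(token) must return the XML text of one response.
    Returns (set specs, truncated), where truncated is True if pages remain.
    """
    specs, token = [], None
    for _ in range(max_pages):
        page_specs, token = parse_list_sets(fetch_page(token))
        specs.extend(page_specs)
        if token is None:
            return specs, False
    return specs, True


def page(spec, token=""):
    """Canned single-set response used instead of a network call."""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListSets>'
        f"<set><setSpec>{spec}</setSpec><setName>{spec}</setName></set>"
        f"<resumptionToken>{token}</resumptionToken>"
        "</ListSets></OAI-PMH>"
    )


PAGES = {None: page("user-lmops", "t1"), "t1": page("user-univ-lorraine")}

print(collect_sets(PAGES.get, max_pages=1))
print(collect_sets(PAGES.get, max_pages=5))
```

The `truncated` flag is what would let an interface say "not all sets were retrieved" rather than showing an incomplete list as if it were complete.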

I don't think we have an open issue about improving the "too many sets" experience so please feel free to open one and let us know about it here.

As for other problems not related to "too many sets" perhaps we can tease them apart and create separate GitHub issues for them. I created https://github.com/IQSS/dataverse/issues/8267 about the lack of documentation for configuring harvesting clients via API. Some of those APIs (create, for example) don't work for me (nor Thomas, it seems), so that could be a separate issue. I'm confused why Thomas is seeing "Failed to execute ListSets" and a 500 error. That seems like an issue that could be investigated separately from a "rethink the too many sets" problem.

Valentina, it looks like you mentioned several problems in the other post. Danny wrote back with some questions and, it seems, asked you to create an issue. If you could do that and then reply on that thread, it would be most appreciated. (This thread is already getting unwieldy with 16 replies.)

To sum up, it seems like there are multiple issues here. We prefer to work in "small chunks" if we can, so please create issues at https://github.com/IQSS/dataverse/issues that we can prioritize and estimate.

I'm glad you're having (or at least had) success harvesting from Zenodo! This hasn't always worked well so from my perspective, kind of working is better than not working at all. Over time, it's sure to get better. Thanks for being on the frontier. It's a little wild out there.

Thanks,

Phil

* 10 minutes to get all the sets: https://github.com/IQSS/dataverse/issues/4964#issuecomment-591093229

Thomas Jouneau

Dec 8, 2021, 4:54:53 AM
to dataverse...@googlegroups.com

Hi Philip

I agree that "Failed to execute ListSets" and "too many sets" are two separate problems.

I just created two issues :

I'm not sure I can create an issue for the "too many sets" problem as it is obscured by the two others for now (in fact, being able to reproduce it again would be an improvement ;) ).

@Valentina, please feel free to comment on any of these issues if needed.

Best,

Thomas JOUNEAU
Université de Lorraine
Soutien aux données de la recherche
Direction de la Documentation - Mission appui recherche
B.U. Ile du Saulcy BP 20728
57045 Metz Cedex 01
Tél. : 03 72 74 10 27

Philip Durbin

Dec 8, 2021, 8:37:57 AM
to dataverse...@googlegroups.com
Hi Thomas,

Those are some really well written issues! A 500 error to try to reproduce and some exceptions to dig into.

Thank you!

Phil
