403-forbidden response from DataCite for dataset with many files


Iglezakis, Dorothea

Feb 13, 2020, 9:46:33 AM
to dataverse...@googlegroups.com

Dear Dataverse community,


we have a problem publishing a dataset with 131 attached files. Publication is aborted because of a 403 response from DataCite after about 25 to 89 successfully registered DOIs. After three attempts to publish, we now have a bunch of registered DOIs linking to files of a dataset that is still not published. Some of the files have three or four different DOIs pointing to them, while other files and the main dataset have no registered DOI at all. We have already contacted DataCite support for help.


Have any of you had similar problems (Issue #6562 sounds somewhat similar), or do you have any idea how to rescue this dataset? We could manually map the additional DOIs if the dataset were public, but we do not dare to start a new release attempt on the production system.


On our test system, which is connected to the test DOI service (mds.test.datacite.org), the dataset could be published without problems.


Thanks a lot and kind regards,


Dorothea from Stuttgart



Philip Durbin

Feb 13, 2020, 1:09:14 PM
to dataverse...@googlegroups.com
Hi Dorothea,

I don't have any experience with this problem but I can imagine it (especially with even more files in a dataset) and my first thought is the following workaround:

- Using DataCite's API, clean up (delete?) the bad DOIs to files, the ones that don't resolve.
- In Dataverse, turn off file-level DOIs.
- Publish the dataset
- In Dataverse, turn file-level DOIs back on
- For each file, one by one or in small batches, register a DOI for it via API: http://guides.dataverse.org/en/4.18.1/admin/dataverses-datasets.html#mint-a-pid-for-a-file-that-does-not-have-one
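The last step of the workaround could be scripted along these lines. This is only a minimal sketch: the endpoint path and the use of a plain GET are assumptions based on the linked 4.18.1 admin guide and should be checked against your Dataverse version, and `register_file_pids`, `batch_size`, and `pause_seconds` are hypothetical names chosen for illustration. It registers file PIDs in small batches, pauses between batches, and returns the file ids that did not succeed so they can be retried later.

```python
import time
from typing import Callable, Iterable, List
from urllib.request import Request, urlopen

# Assumption: path as described in the linked 4.18.1 admin guide
# ("Mint a PID for a File That Does Not Have One"); verify for your version.
REGISTER_PATH = "/api/admin/{file_id}/registerDataFile"


def batches(ids: Iterable[int], size: int) -> List[List[int]]:
    """Split the ids into consecutive chunks of at most `size` items."""
    ids = list(ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def http_get(url: str) -> int:
    """Send a GET request and return the HTTP status code."""
    with urlopen(Request(url, method="GET")) as resp:
        return resp.status


def register_file_pids(
    base_url: str,
    file_ids: Iterable[int],
    batch_size: int = 10,           # hypothetical default
    pause_seconds: float = 30.0,    # hypothetical default
    send: Callable[[str], int] = http_get,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Register a PID for each file id; return the ids that failed."""
    failed: List[int] = []
    chunks = batches(file_ids, batch_size)
    for n, chunk in enumerate(chunks):
        for file_id in chunk:
            url = base_url.rstrip("/") + REGISTER_PATH.format(file_id=file_id)
            try:
                status = send(url)
            except OSError:  # urllib raises HTTPError/URLError (OSError subclasses)
                status = None
            if status != 200:
                failed.append(file_id)
        if n < len(chunks) - 1:
            sleep(pause_seconds)  # give DataCite time between batches
    return failed
```

For example, `register_file_pids("http://localhost:8080", [101, 102, 103])` would try the three (made-up) database ids in one batch; any ids in the returned list can simply be passed to the function again.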

Obviously, this is not a great plan but it was the first that came to mind.

I was *just* having a beer with Martin Fenner at DataCite a couple weeks ago at PIDapalooza and he asked what APIs Dataverse wants. Does anyone happen to know if DataCite supports the idea of a "bulk registration" of many DOIs (let's say your 131 files) in a single HTTP request?

I hope this helps,

Phil

p.s. This issue seems related as well: https://github.com/IQSS/dataverse/issues/5283

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/66c0978f65b24cd6998581850dc74184%40ub.uni-stuttgart.de.



Dorothea Iglezakis

Feb 14, 2020, 5:15:23 AM
to Dataverse Users Community
Dear Phil,

thanks a lot, your answer is really helpful. By now, we have also received a response from DataCite:

"As far as I can tell from the logs, all your POST requests to /metadata did not error and did successfully create DOI's on DataCite, this matches with the fact you do have Registered DOI's in DataCite i.e. 10.18419/darus-513/1 has been registered.

The 403 I can see came from trying to request the DOI's i.e. "GET https://mds.datacite.org:443/metadata/10.18419/darus-513/1"
Is this call being made authenticated? (I can't tell from our logs) because if not, it's dependant on our public search index having that DOI, which usually happens in minutes but if the process tried to request too quickly then it is possible you'd receive a 403 (as although it's not found, we don't expose that instead we deny permission).

Is this the case that the repository doesn't publish the new DOI's in the repository until they are resolvable on DataCite?

We do limit around 3000 requests every 5 minutes from the same IP across all our API's, but I wouldn't expect to see the behaviour you're describing."

We can't really match this answer to the errors, but we will first rescue our dataset using your workaround.

Kind regards,

Dorothea

Philip Durbin

Feb 14, 2020, 11:53:02 AM
to dataverse...@googlegroups.com
The thing that's confusing me is that you're talking about GETs now but that issue you linked (#6562) is talking about POSTs in the stacktrace:

Caused by: java.lang.RuntimeException: Response code: 403, Access is denied
at edu.harvard.iq.dataverse.DataCiteRESTfullClient.postMetadata(DataCiteRESTfullClient.java:190)

So maybe we're talking about two different bugs? Here's what we need:

- a reliable way to reproduce the bug, hopefully using code from "develop"
- a detailed stacktrace from server.log from the failure

If the bug is easy to reproduce, it's easy to fix.

As we all know, reproducibility is hard. :)

I don't know if others read Jim's comment here but I agree that we need a plan or strategy for what to do when DataCite is down temporarily: https://github.com/IQSS/dataverse/issues/6562#issuecomment-585938590

Of course, DataCite may not have been down in this case. It's hard to know what's going on, unfortunately.
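One possible strategy for temporary failures is to retry with exponential backoff. The sketch below is not how Dataverse currently behaves; it only illustrates the idea. Treating 403 as possibly transient is an assumption drawn from this thread (DataCite says a too-early GET can return 403), and `call_with_retry`, `TRANSIENT`, `max_attempts`, and `base_delay` are hypothetical names.

```python
import time
from typing import Callable, Tuple

# Assumption: status codes worth retrying. 403 is included only because
# this thread suggests DataCite may return it for transient conditions.
TRANSIENT = {403, 502, 503, 504}


def call_with_retry(
    request: Callable[[], int],
    max_attempts: int = 5,       # hypothetical default
    base_delay: float = 2.0,     # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call `request` until it returns a non-transient status.

    `request` performs one HTTP call and returns its status code.
    Returns (last_status, number_of_attempts).
    """
    status = request()
    attempts = 1
    while status in TRANSIENT and attempts < max_attempts:
        # Delay doubles after each failed attempt: 2s, 4s, 8s, ...
        sleep(base_delay * 2 ** (attempts - 1))
        status = request()
        attempts += 1
    return status, attempts
```

If the final status is still in `TRANSIENT` after `max_attempts`, the caller would have to decide whether to abort the publication or leave the remaining DOIs for a later run.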

With regard to this question... "Is this the case that the repository doesn't publish the new DOI's in the repository until they are resolvable on DataCite?" ... my understanding is that when a dataset has a lot of files (and file DOIs are enabled), we have to wait and wait until all the files get their DOI before the dataset is published.

It makes me wonder... what if the files could get DOIs one by one *after* the dataset is published? That's basically what I'm suggesting in the workaround. It seems acceptable to you in this case but maybe in other cases, researchers really want all the DOIs for all the files in one shot when the dataset is published (the current behavior).

I hope this helps,

Phil


James Myers

Feb 14, 2020, 12:13:28 PM
to dataverse...@googlegroups.com

FWIW: From the code, it looks like all calls are made with the same HttpClient which has the authentication info, so DataCite’s idea that we might be making the GET without credentials doesn’t look to be true. With Phil’s note that the POST call also had a 403, it seems more likely that there was a temporary authentication issue affecting both/all calls.

 

-- Jim
