DOI attribution to Datasets Vs Datasets and individual files.

José Carvalho

Jul 19, 2022, 6:22:28 AM
to Dataverse Users Community
Hi!

One of our investigators has brought to our attention that having DOIs on both datasets and files generates separate entries on their ORCID dashboard, and that because of this the dashboard gets cluttered with (possibly) superfluous entries.
From the research we've done, most Dataverse installations are getting DOIs for both dataverses and files.
Could someone please give us feedback about the reasons for having DOIs for datasets and files (default behavior), or why you have chosen to 'disable' :FilePIDsEnabled?
Are there any advantages to having DOIs for files apart from being able to cite a file directly?
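
For context, :FilePIDsEnabled is a database setting that is changed through the Dataverse admin settings API. Below is a minimal sketch of the request that would turn it off, assuming an installation reachable at http://localhost:8080; the base URL and the helper name are assumptions for illustration, and the request is only built, not sent:

```python
import urllib.request


def build_file_pid_setting_request(base_url: str, enabled: bool) -> urllib.request.Request:
    """Build (but do not send) a PUT request for the :FilePIDsEnabled setting.

    The helper name and base_url are hypothetical; adjust them for your installation.
    """
    url = f"{base_url.rstrip('/')}/api/admin/settings/:FilePIDsEnabled"
    body = ("true" if enabled else "false").encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


req = build_file_pid_setting_request("http://localhost:8080", enabled=False)
print(req.get_method(), req.full_url, req.data.decode())
# An admin would apply it with urllib.request.urlopen(req),
# run from a host that is allowed to reach the admin API.
```

Building the request separately from sending it makes it easy to review what will be changed before touching a production installation.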

Regards,

José Carvalho - Univ. Aveiro

o.be...@fz-juelich.de

Jul 20, 2022, 5:47:08 AM
to Dataverse Users Community
Hi José,

while I can't speak for the team at IQSS, let me add my 2 cents.

- I agree with you that the default for file PIDs should be changed from enabled to off. People are starting to shove in more and more files, and the load (and money) from registering these at DataCite isn't negligible. BTW, we deactivated it in Jülich DATA for exactly these reasons.
- I do see there is value in being able to identify a single file within a dataset. Currently, the way handles and DOIs work, this is not possible with just the dataset DOI. It's very convenient to precisely identify a single file when referencing the data e.g. in workflow contexts and this is what DOIs are there for in the first place: simple persistent identifiers. Also note that there will be another thing adding more DOIs: https://github.com/IQSS/dataverse/issues/4499 (This is already present for Zenodo/InvenioRDM!) This means that ORCID will also need to apply a fix to the Dashboard in terms of filtering this stuff, providing a better UI mid-term.
- I truly believe that Dataverse doesn't do a good job of declaring good resource types when minting the PIDs. See also https://github.com/IQSS/dataverse/issues/7077. Properly used types would make it much easier for orgs like ORCID to filter for Datasets only - currently we are swamping everything with Datasets, which is just wrong.

Best,
Oliver

Julian Gautier

Jul 20, 2022, 7:58:07 AM
to Dataverse Users Community
Hi José,

The Harvard Dataverse Repository had file PIDs enabled for a bit, I think for the reasons Oliver cited. I remember hearing that they were turned off because they were causing some stability issues in the repository. I don't know the details. The team at IQSS has discussed registering PIDs for certain files in the repository.

The issue you raised about ORCID profile pages being flooded with entries for files and datasets, with no way to filter for just datasets, also occurs in an Elsevier product called Data Monitor. Details, including how they try to resolve this, are in the GitHub issue at https://github.com/IQSS/dataverse/issues/5086.

Regards,
Julian


Julian Gautier

Aug 3, 2022, 2:21:11 PM
to Dataverse Users Community
Hello again José,

My colleagues and I met today to speak a little about file PID registration, particularly in the Harvard Dataverse Repository, and https://github.com/IQSS/dataverse/issues/8889 was opened to discuss the Harvard repository's need to register file PIDs in datasets in a particular collection (while not registering file PIDs for all datasets).

During the meeting I brought up the discussion in this forum thread and we think that learning more about the research you mentioned - about the number of Dataverse installations with file PIDs turned on - would be helpful. When you wrote that "most Dataverse installations are getting DOIs for both [datasets] and files", could you share how you found this out?

As you suggested, hearing from more installations would help the community figure out how settings related to file PIDs could be improved, e.g. whether file PID registration should be turned off by default and which types of users (installation admins, collection admins, etc.) need to be able to assign PIDs to files.

Thanks!
Julian

Philipp at UiT

Aug 22, 2022, 7:19:17 AM
to Dataverse Users Community
Hi all,

At DataverseNO, we have enabled file-level DOIs because in general, granular PIDs are recommended. However, we're aware of the issues connected to file-level DOIs. In addition to the ones you have mentioned, I'd also like to add issues with publishing datasets with a large number of files; see this discussion thread: https://groups.google.com/u/1/g/dataverse-community/c/PXSnfyFNscA.

As for cluttering research output overviews like ORCID, I think this could be avoided if DataCite would introduce a resourceType for dataset files and Dataverse would implement this; see this GitHub issue: https://github.com/IQSS/dataverse/issues/5086.
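
The filtering idea above can be sketched as follows. This is only an illustration: the record structure, the placeholder DOIs, and the "DatasetFile" value are hypothetical (the thread proposes such a type; it is not an existing DataCite value), and no real ORCID or DataCite API is called:

```python
from dataclasses import dataclass


@dataclass
class Work:
    doi: str
    resource_type: str  # hypothetical field; "DatasetFile" is an assumed, not existing, value


def datasets_only(works):
    """Keep only the entries declared as datasets, dropping file-level entries."""
    return [w for w in works if w.resource_type == "Dataset"]


# Placeholder records: one dataset DOI and two file DOIs belonging to it.
works = [
    Work("10.1234/EXAMPLE", "Dataset"),
    Work("10.1234/EXAMPLE/FILE1", "DatasetFile"),
    Work("10.1234/EXAMPLE/FILE2", "DatasetFile"),
]
print([w.doi for w in datasets_only(works)])
```

If files are registered with the same type as datasets, as described earlier in this thread, all three records would pass the filter, which is the clutter problem being discussed.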

Best, Philipp

José Carvalho

Sep 20, 2022, 8:42:51 AM
to Dataverse Users Community
Hi Julian,

Sorry for the late reply.
Perhaps there was a bit of a misunderstanding. When I mentioned "most Dataverse installations are getting DOIs for both [datasets] and files" I was referring to the Dataverse installations we checked, not all the existing Dataverse installations. Sorry if I misled you.
Regarding the improvement of file PIDs settings, I will post the link to a collaborative document, where users can state their current installation settings and add input on the pros and cons of each scenario.

Regards,

José

José Carvalho

Sep 20, 2022, 8:58:02 AM
to Dataverse Users Community
Hi,

Following up on Julian's suggestion to hear from more installations in order to help the community figure out how settings related to file PIDs could be improved, I reached out to some Portuguese institutions to get their input on this issue.
We have put together a document stating some pros and cons of both scenarios (feedback from this thread has already been added to the document): DOI registration for file and dataset Vs dataset only.

We ask the community to check the document https://docs.google.com/document/d/1P1txmW0MUWm35AQPTDoh5Kl1X0m7HOZvXwXjX5wZAps and provide any feedback regarding this issue: pros and cons for each scenario, your current configuration for DOI attribution, and suggestions for improving this document.

Thanks in advance!

Regards,

José

Philip Durbin

Sep 20, 2022, 3:14:50 PM
to dataverse...@googlegroups.com
José, thanks for getting the conversation going in the community call earlier today!


We can continue discussing in this thread too, of course! I also feel like https://github.com/IQSS/dataverse/issues/5086 is absolutely on topic.

Thanks,

Phil

James Myers

Sep 21, 2022, 10:09:43 AM
to dataverse...@googlegroups.com

FYI:

I just merged an update to the dataverse-previewers repository which adds a betatest folder with an updated MapViewer and the new ZipPreviewer.

If you have deployed the dataverse-previewers from the v1.3 release via github.io (or deployed them locally), nothing will change.

If you want to try the Zip Previewer (which requires v5.12), the relevant curl command and additional setup instructions for S3 are at the end of the "Example Curl Commands to register previewers for Dataverse, version 5.2+" document.

Similarly, if you wish to use the new MapViewer, you should unregister the v1.3 version and register the betatest version (sample curl command also at the end of that doc).

Thanks to the community (@kaitlinnewson, @haarli) for these additions!

-- Jim

Julian Gautier

Oct 3, 2022, 2:13:06 PM
to Dataverse Users Community
I wonder if the pros and cons list you've started captures most if not all of the pros and cons for supporting PIDs at the file level.

In addition to asking folks from the different installations, we could also look at their published metadata. How many installations have file PIDs on all or some of their datasets? My dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DCDKZQ could be queried to find out. It's a snapshot from April 2022 of dataset metadata in most known Dataverse repositories and I plan to finish updating it this week. That might help fill the Installation Status table in your Google Doc.

It might also be helpful to start imagining what we could do with this info, which might help generate more specific questions. For example:
  • When installing the Dataverse software, should file PIDs be turned off by default? When groups are installing Dataverse, do any not realize that file PIDs are enabled by default?
  • Do installations need ways, other than the Dataverse API, to assign file PIDs to some datasets, and if so, which ones? For example, should dataset depositors be asked each time they deposit? Should it be a setting at the collection level? As someone else mentioned before, OSF has some workflow for letting depositors tell the platform to assign a DOI for their deposit, so that workflow could be explored.

Julian Gautier

Oct 11, 2022, 3:21:53 PM
to Dataverse Users Community
I looked at the metadata of 70 Dataverse installations (of the 88 known installations) and was able to see which ones have datasets with file PIDs and which ones don't. At least some of these installations are listed in your Installation Status table in the DOI: Dataset + File Vs. Dataset Only Google Doc.

49 installations have some or all datasets with file PIDs:
  • 28 of those installations have a mix of datasets with file PIDs and without file PIDs
  • 21 of those installations' datasets all have file PIDs
21 installations have no datasets with file PIDs:
  • 17 of those installations use the version of the Dataverse software (v4.9+) where PIDs can be registered for files
  • 4 of those installations are pre-v4.9 (do not use a version of the Dataverse software where PIDs can be registered for files)
This shows that at least 49 Dataverse installations (more than half of known Dataverse installations) have had or still have file PIDs turned on.
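
The counts above can be cross-checked with a little arithmetic. This is just a sanity check on the numbers quoted in this post; the variable names are mine:

```python
# Figures quoted in the post above.
known_installations = 88
examined_installations = 70
mixed, all_with_file_pids = 28, 21          # installations with some / all datasets having file PIDs
none_v49_plus, none_pre_v49 = 17, 4         # installations with no datasets having file PIDs

with_file_pids = mixed + all_with_file_pids
without_file_pids = none_v49_plus + none_pre_v49

# The two groups should add up to the number of installations examined.
assert with_file_pids == 49
assert without_file_pids == 21
assert with_file_pids + without_file_pids == examined_installations

share_of_known = with_file_pids / known_installations
share_of_examined = with_file_pids / examined_installations
print(f"{share_of_known:.1%} of known installations, {share_of_examined:.1%} of examined installations")
# prints "55.7% of known installations, 70.0% of examined installations"
```

So "more than half of known installations" holds even though 18 known installations were not examined.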

This is missing the "why" of course (such as what's listed in the DOI: Dataset + File Vs. Dataset Only Google Doc), but it supports that most known installations do have datasets with file PIDs.

I could list each of these installations and we could contact their admins to learn more.

I got these counts from the CSV file in the ZIP file that I've attached. It lists each published dataset version in these 70 installations and whether or not each version has files that have PIDs.
file_pids_in_dataverse_installations.csv.zip
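
A sketch of how the per-installation categories could be derived from a CSV like the attached one. The column names (installation, dataset_version, has_file_pids) and the sample rows are assumptions for illustration, not the actual layout or contents of the attached file:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real file's columns may be named differently.
SAMPLE = """installation,dataset_version,has_file_pids
Repo A,doi:10.1234/A1 v1.0,True
Repo A,doi:10.1234/A2 v1.0,False
Repo B,doi:10.1234/B1 v1.0,True
Repo C,doi:10.1234/C1 v1.0,False
"""


def classify(csv_text):
    """Label each installation 'all', 'mixed', or 'none' by its dataset versions' file PIDs."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        has_pids = row["has_file_pids"].strip().lower() == "true"
        seen[row["installation"]].add(has_pids)
    labels = {}
    for name, flags in seen.items():
        if flags == {True}:
            labels[name] = "all"
        elif flags == {False}:
            labels[name] = "none"
        else:
            labels[name] = "mixed"
    return labels


print(classify(SAMPLE))
```

Counting the labels then gives the "all", "mixed", and "none" totals; note that this sketch classifies by dataset version rows, which may differ slightly from counting distinct datasets.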