Hi all,
We have a follow-up question for the community on this and would appreciate any feedback. As a reminder, the issue on GitHub is:
As we work to complete the functionality for file DOIs, we are trying to decide what exactly persistent identifiers for files will look like. In particular, this discussion is about the local part of the identifier, so for a DOI like:
we are referring to what comes after the shoulder, "FK2/", i.e. the bolded text.
There has already been some discussion in the issue about whether file DOIs should be completely arbitrary or should be generated using the dataset DOI identifier as part of them. The latter would be for human readability / usability only; internally the DOI would be stored in the same manner, i.e. the system would not infer any meaning from it. There have been good arguments for both, so rather than limit Dataverse to one option, we plan to make this configurable.
We currently support a configuration option for how DOI identifiers (used by datasets) are generated:
- randomString, e.g. BXOJPJ, GKSTMU, MESSI1
- sequentialNumber, e.g. 10001, 10002, 10003
Our plan is to add another configuration option for file DOIs that controls whether they are dependent on or independent of the dataset DOI. Combining this with the existing setting, you would have 4 cases for files:
- dependent randomString, e.g. BXOJPJ/RHBISG
- dependent sequentialNumber, e.g. 10001/1
- independent randomString, e.g. RHBISG
- independent sequentialNumber, e.g. 10002
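
To make the four cases concrete, here is a rough Python sketch of how the two settings could combine. This is only an illustration, not the Dataverse implementation: the class `IdentifierMinter`, its parameter names, and the starting number 10001 are all made up for this example, and a real system would also need to check random strings for collisions.

```python
import random
import string
from itertools import count

ALPHABET = string.ascii_uppercase + string.digits


def random_local_id(rng, length=6):
    """Return a random identifier such as 'BXOJPJ' (no collision check here)."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


class IdentifierMinter:
    """Hypothetical minter combining the two settings discussed above."""

    def __init__(self, style, file_mode, shoulder="FK2/", start=10001, seed=None):
        if style not in ("randomString", "sequentialNumber"):
            raise ValueError(f"unknown style: {style}")
        if file_mode not in ("dependent", "independent"):
            raise ValueError(f"unknown file mode: {file_mode}")
        self.style = style
        self.file_mode = file_mode
        self.shoulder = shoulder
        self._rng = random.Random(seed)
        self._global_counter = count(start)  # shared by datasets and independent files
        self._per_dataset = {}               # dataset id -> counter for dependent files

    def _next_local(self):
        if self.style == "randomString":
            return random_local_id(self._rng)
        return str(next(self._global_counter))

    def mint_dataset(self):
        return self.shoulder + self._next_local()

    def mint_file(self, dataset_id):
        if self.file_mode == "independent":
            # Files draw from the same pool as datasets.
            return self.shoulder + self._next_local()
        if self.style == "sequentialNumber":
            # Dependent sequential files are numbered within their dataset.
            counter = self._per_dataset.setdefault(dataset_id, count(1))
            return f"{dataset_id}/{next(counter)}"
        return f"{dataset_id}/{random_local_id(self._rng)}"


if __name__ == "__main__":
    for style in ("randomString", "sequentialNumber"):
        for file_mode in ("dependent", "independent"):
            minter = IdentifierMinter(style, file_mode, seed=1)
            dataset = minter.mint_dataset()
            files = [minter.mint_file(dataset) for _ in range(2)]
            print(file_mode, style, dataset, files, minter.mint_dataset())
```

In this sketch, the independent sequentialNumber case shows the numbering gap between datasets: the two files take the numbers right after the first dataset, so the second dataset gets a number three higher than the first.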
So some questions are:
- we know of use cases for the first 3 (Harvard, SBGrid, and QDR, respectively); does anyone see a use case for the last choice? Is there any reason not to allow this particular combination? (It has been pointed out that it might be confusing / messy to have one dataset with DOI 10001 and the next with 10010, because the first dataset's files used up the identifiers 10002 - 10009.)
- are there any other formats that we should be handling that are not supported by the above?
- any other thoughts on any of this? :)
Thanks,
Gustavo