dataset types (software, workflow, etc.)

Philip Durbin

May 14, 2025, 11:22:30 AM

to dataverse...@googlegroups.com

Hello Dataverse enthusiasts!

I'm working on a talk about dataset types and I don't think we ever announced the proposal.

Here it is: https://docs.google.com/document/d/16RvGXmaPQK9DGsEEbrrFEu8mjUrN93YY_yh7ZONyMDI/edit?usp=sharing

Comments are enabled and feedback is welcome!

We've been working on dataset types incrementally, and they first appeared in Dataverse 6.4. Here are the latest docs as of 6.6: https://guides.dataverse.org/en/6.6/user/dataset-management.html#dataset-types

Here's the parent issue: https://github.com/IQSS/dataverse-pm/issues/307

Finally, there's also been discussion in Zulip. Please see #dev > datasetType (software, workflow, etc.)

Feedback is a gift! 🎁 Thanks in advance for any thoughts!

Phil

--

Philip Durbin
Software Developer for http://dataverse.org
http://www.iq.harvard.edu/people/philip-durbin

Philipp Conzett

Jun 22, 2025, 7:52:03 AM

to Dataverse Users Community

Thanks for sharing the proposal for designing and implementing support for research objects in Dataverse! I have a couple of comments and questions:

1. Terminology
I think we should consider alternative names for Dataset Type, e.g., Research Object Type, Resource Type or Digital Object Type. I suggest Research Object Type.

2. One single PID
The proposal implies that, ideally, each dataset should contain only objects of the same Research Object Type. So, if your study involves data, workflows, and software/code, you should archive these in (at least) three different datasets: one for the data, one for the workflows, and one for the software/code. This means you will end up with (at least) three different PIDs. Of course, links can be established between these datasets using, e.g., the Related Publication metadata field. However, I don't understand how this multiple-PID approach aligns with the statement in the Purpose section of the proposal: "This proposal describes a process for designing and implementing support for research objects in Dataverse software. Research objects are mechanisms for associating 'related resources about a scientific investigation so that they can be shared using a single identifier.' (Wikipedia)" In the example above, would this mean a fourth object would need to be created to link the three datasets together?

3. Levels of description
Essential parts of the discussion in the proposal, in particular about licensing, demonstrate the need for file-level support for richer metadata and terms of use / license (see most recently this discussion in the Dataverse Google group). This makes me wonder whether the discussion about support for multiple Research Object Types should start at the file level. At file level, there is less doubt as to which Research Object Type (data, software/code, workflow, ...) a given file represents and which Terms of Use should apply. This would suggest that information about Research Object Type and license should be applied at file level. Actually, this level could be called the Research Object level.

At the next level, which is currently called dataset, multiple files can be collected that may be of different Research Object Types (e.g., documentation, data and workflow) and have different Terms of Use. At this level, the metadata and licensing information could simply be an aggregation of the file-level information, in addition to the metadata and license of the documentation file(s) documenting the contained files collectively. We could call the object at this level a Research Object collection. This level could be used to cover the use case described above, i.e., using one PID to share a collection of different Research Object Types.

Yet one level up, we have repository collections, also called sub-dataverses (in some Dataverse installations, they might have the status of repositories). This level is optional. At the top level, we have the repository. The figure below summarizes the four levels:

Repository
(--- Repository collection)
--- Research Object collection (documentation, Research Objects)
--- Research Object (data, software/code, workflow, ...)
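
To make the four levels concrete, here is a minimal Python sketch of how they could be modelled. It is only an illustration of the idea above, not anything Dataverse implements: all class and field names are hypothetical, and the PID is a placeholder. The point it shows is that type and license live on each Research Object, and the collection simply aggregates them.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchObject:
    """File level: one object with its own type and license (hypothetical fields)."""
    name: str
    object_type: str  # e.g. "data", "software", "workflow", "documentation"
    license: str


@dataclass
class ResearchObjectCollection:
    """What is currently called a dataset: one PID, possibly mixed types."""
    pid: str
    objects: list[ResearchObject] = field(default_factory=list)

    def object_types(self) -> list[str]:
        # Aggregated from the file level; sorted for stable output.
        return sorted({o.object_type for o in self.objects})

    def licenses(self) -> list[str]:
        # Aggregated from the file level; sorted for stable output.
        return sorted({o.license for o in self.objects})


@dataclass
class RepositoryCollection:
    """Optional level (a sub-dataverse)."""
    name: str
    collections: list[ResearchObjectCollection] = field(default_factory=list)


@dataclass
class Repository:
    """Top level. The Repository collection level is optional, so Research
    Object collections may also sit directly under the repository."""
    name: str
    collections: list[ResearchObjectCollection] = field(default_factory=list)
    sub_collections: list[RepositoryCollection] = field(default_factory=list)


study = ResearchObjectCollection(
    pid="doi:10.0000/EXAMPLE",  # placeholder, not a real identifier
    objects=[
        ResearchObject("README.md", "documentation", "CC BY 4.0"),
        ResearchObject("survey.csv", "data", "CC0 1.0"),
        ResearchObject("analysis.py", "software", "MIT"),
    ],
)
print(study.object_types())
print(study.licenses())
```

With this shape, a single PID covers data, software and documentation, while each file keeps its own license.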

I'd be happy to discuss this further with the proposal team and the larger community.

Best, Philipp

Philipp Conzett

Jun 22, 2025, 8:05:12 AM

to Dataverse Users Community

In addition, I think we should consider if/how the implementation of multiple research object support aligns with emerging (de-facto) standards such as RO-Crate and Oxford Common File Layout.

Steven McEachern

Jun 24, 2025, 11:59:09 AM

to Dataverse Users Community

Hi all,

I think some careful consideration is needed here about (a) the structure, and (b) the "types" being proposed here. There is real potential for getting Dataverse structures that don't align well with some of the other structures repositories often need to align to (particularly for aggregators such as Google Dataset Search, or more locally to the CESSDA metadata requirements for the 11 CESSDA repositories using DV).

Regarding types, as Philipp notes, the term "dataset type" here will likely create confusion. I'm not sure "research object" is ideal here, but it's preferable to "dataset type" if code and documentation are being included. "Digital object" might be the better option here.

In terms of structure, you will also want to give consideration to how to align to some of the commonly used standards. RO-Crate and OCFL provide capacity for carrying metadata with data, but I think you want to consider what the levels are equivalent to here. The current DV structure lines up pretty well with the DDI-Codebook hierarchy:

- Dataverse (collection) =~ Series (or DDI StudyUnit)

- Dataset =~ Study

- File =~ File

Of course this is not the only standard alignment to consider. I'd be looking at what is the equivalence between your structural levels and (at a minimum) DDI-Codebook, DCAT 2.0/3.0 (https://www.w3.org/TR/vocab-dcat-2/) and schema.org Datasets (https://schema.org/Dataset). And probably by extension, Google Dataset Search structures (https://developers.google.com/search/docs/appearance/structured-data/dataset).
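
One way to keep track of these equivalences is a small lookup table. The Python sketch below is only an illustration: the DDI-Codebook column comes from the list above, while the DCAT and schema.org columns are my own assumption about the closest classes (dcat:Catalog, dcat:Dataset, dcat:Distribution; DataCatalog, Dataset, DataDownload) and would need checking against each standard. The names `LEVEL_ALIGNMENT` and `equivalent` are made up for this example.

```python
from typing import Optional

# Rows are current Dataverse levels; columns are the standards named above.
LEVEL_ALIGNMENT = {
    "Dataverse (collection)": {
        "DDI-Codebook": "Series",          # from the list above
        "DCAT": "dcat:Catalog",            # assumption
        "schema.org": "DataCatalog",       # assumption
    },
    "Dataset": {
        "DDI-Codebook": "Study",           # from the list above
        "DCAT": "dcat:Dataset",            # assumption
        "schema.org": "Dataset",           # assumption
    },
    "File": {
        "DDI-Codebook": "File",            # from the list above
        "DCAT": "dcat:Distribution",       # assumption
        "schema.org": "DataDownload",      # assumption
    },
}


def equivalent(level: str, standard: str) -> Optional[str]:
    """Return the aligned term, or None if no equivalence has been agreed."""
    return LEVEL_ALIGNMENT.get(level, {}).get(standard)


print(equivalent("Dataset", "DDI-Codebook"))
# A newly proposed level has no entry yet, which is exactly the gap to discuss.
print(equivalent("Research Object collection", "DCAT"))
```

Any new structural level would need a row in a table like this before aggregators can make sense of it.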

Happy to discuss further.

Cheers,
Steve

Janet McDougall - Australian Data Archive

Jun 26, 2025, 9:39:41 PM

to Dataverse Users Community

Good point, I agree, Philipp.
