Metadata change suggestion: Metadata Schema support for content URLs

Kristian Garza

Jan 26, 2022, 5:27:22 AM
to DataCite Metadata, meta...@datacite.org
Hi Metadata Working Group,

The following suggestion from one of our members was discussed at the last Open Hours session. It drew a significant number of comments and a lot of discussion, so we are forwarding all of the information for your consideration.

Best regards




Comments from Members

```
Should we do it? Yes, because it allows for direct file downloads. However, it does have implications for download tracking/COUNTER that will need to be considered.
What problem? This would allow us to programmatically download files using DataCite metadata. This is not currently possible unless the landing page includes machine-readable file download information, and then it is an extra request to make.
How would it help? We run a metadata harvester for a federated search portal. This harvester also downloads geospatial data files (based on file type) and extracts bounding box information. We can do this programmatically for repositories that have their own API with download URLs, but not for repositories where we harvest metadata using the DataCite API.
Who? Metadata and data aggregators
When? When we started downloading geospatial data files
How often? We would adapt our metadata harvester to use this metadata when available. Realistically, it would take time for repositories to start using this metadata field, so we may not benefit right away. But I think it is a step in the right direction.
Before you: We currently don't harvest geospatial files when we use the DataCite API.
This is partially possible with the Media API, but would be much better implemented directly in the metadata schema. This is really useful for reproducibility, and enables researchers to directly access data in computing environments like Jupyter notebooks via the DOI (https://github.com/pachterlab/CWGFLHGCCHAP_2021/blob/master/notebooks/CellAtlasAnalysis/starvation_Analysis.ipynb).
This is partially available through media types, but having it in the metadata standard would make it much easier to Aspekt. This has a huge potential for making data more reusable; I’ve utilized the existing media types implementation for accessing data directly in a Jupyter notebook.
I alluded above to why we want a checksum included. Dataset files can change over time, and anyone downloading a file should be able to verify that it hasn't been tampered with. Because we already use the Ethereum blockchain as a public ledger for our checksums, we don't require this field, but we felt it might be a good addition to your schema.
Hello

Two weeks ago we had a DataCite strategy brainstorming session.

Here is a bit more information about one idea that I was perhaps not able to explain clearly enough
in that meeting.
The idea, or vision, is that in the future DataCite should support machine-actionable PID metadata.
I do not know whether that is exactly the right name for it, but the basic idea is given below.

Why ?

The current metadata schema is mostly defined for humans, not for machines.

Vision:

The metadata schema should support recreating a dataset
a) with the help of the source datasets and
b) with a workflow description.

What would be needed ?

In the future, I would like to be able to define a direct link to the source dataset in my DOI metadata;
a link to the landing page is not enough.

In addition, I would like to define a processing workflow. This workflow could perhaps be
in the same dataset, but I also need a direct link to that workflow file.

Then, the difficult part.
Given a direct link to the workflow file and the workflow attributes, I should be able
to select or find an environment, or recreate one, that can run my workflow.

This approach would greatly enhance data provenance.
The difficult part:
I understand that this is not easy because, in the optimal scenario,
services should in principle respond in the same way as they responded before.
This is perhaps almost possible in some cases in cloud environments, but it is
a rather demanding requirement for all generic services.

I hope that this mail was able to clarify what I meant by "machine actionable PIDs".
Media types are recorded and indexed, but not expressed by the current version of the DataCite schema. What are the plans for media type support in the future?

```
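To make the harvester and checksum comments above a bit more concrete, here is a minimal Python sketch of how an aggregator might consume such a field. The record shape is purely an assumption for illustration: the `contentUrls` key and its `url`, `mediaType` and `sha256` entries are hypothetical names, not part of any published DataCite schema, and the example URLs and DOI are placeholders.

```python
from __future__ import annotations

import hashlib

# Hypothetical record shape: "contentUrls" and its keys are illustrative
# assumptions, not fields defined by the DataCite schema.
EXAMPLE_RECORD = {
    "doi": "10.1234/example",
    "contentUrls": [
        {
            "url": "https://repo.example.org/files/boundaries.geojson",
            "mediaType": "application/geo+json",
            "sha256": hashlib.sha256(b"demo file body").hexdigest(),
        },
        {
            "url": "https://repo.example.org/files/readme.txt",
            "mediaType": "text/plain",
            "sha256": hashlib.sha256(b"readme").hexdigest(),
        },
    ],
}


def select_files(record: dict, media_types: set[str]) -> list[dict]:
    """Return the content entries whose media type is in media_types."""
    return [
        entry
        for entry in record.get("contentUrls", [])
        if entry.get("mediaType") in media_types
    ]


def checksum_matches(data: bytes, expected_sha256: str) -> bool:
    """Compare the SHA-256 of downloaded bytes with the recorded digest."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.strip().lower()
```

A harvester could call `select_files(EXAMPLE_RECORD, {"application/geo+json"})` to pick out the geospatial files, fetch each `url` with whatever HTTP client it already uses, and pass the downloaded bytes to `checksum_matches` before extracting bounding boxes.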
Best regards

Kristian Garza | Product Designer | DataCite
Support Desk | Support Site | PID Forum
A: DataCite e.V. -- Welfengarten 1B, 30167 Hannover, Germany

Ted Habermann

Feb 15, 2022, 10:50:29 AM
to DataCite Metadata
This change disrupts the idea of a landing page, which was created at the beginning of DOIs to let the provider directly control what users see when they resolve the DOI (if I remember correctly) and to provide more information about download options. With this change, the landing page would still exist but a direct download would also be available? If we decide that is a good idea, would adding a relationType='content' accomplish the same goal?
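As a rough illustration of the relationType idea, the sketch below shows how a consumer might pick direct-download URLs out of a record's related identifiers. `relatedIdentifier`, `relatedIdentifierType` and `relationType` are existing DataCite properties, but the value `content` is only the suggestion made here and is not in the current controlled list; the URLs and DOI are placeholders.

```python
from __future__ import annotations

# The relationType value "content" is hypothetical: it is the value
# suggested in this thread, not an existing DataCite relation type.
PROPOSED_RELATION = "content"

related_identifiers = [
    {
        "relatedIdentifier": "https://repo.example.org/files/data.csv",
        "relatedIdentifierType": "URL",
        "relationType": "content",
    },
    {
        "relatedIdentifier": "10.1234/paper",
        "relatedIdentifierType": "DOI",
        "relationType": "IsSupplementTo",
    },
]


def direct_download_urls(related: list[dict]) -> list[str]:
    """Return URLs flagged with the proposed 'content' relation type."""
    return [
        item["relatedIdentifier"]
        for item in related
        if item.get("relatedIdentifierType") == "URL"
        and item.get("relationType", "").lower() == PROPOSED_RELATION
    ]
```

Under this approach the landing page stays the DOI's resolution target, and the direct link is just one more related identifier that machines can filter for.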

Ted

Ted Habermann

Mar 10, 2022, 12:06:15 PM
to DataCite Metadata
I worked on some international metadata standards that included distribution information (content URLs). This UML diagram illustrates the solution used in ISO 19115-1. It may be helpful in the discussion of how these metadata might be included in the DataCite metadata schema.

Ted
MD_Distribution.png
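
For readers who cannot open the attachment, the nesting used for online distribution in ISO 19115-1 can be sketched roughly as Python dataclasses. This is a heavily simplified subset written for illustration, not a transcription of the diagram: only the path MD_Distribution, transferOptions, onLine, linkage is modelled, the Python class and field names are adaptations of the ISO names, and the example URL is a placeholder.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CIOnlineResource:
    """Simplified stand-in for ISO CI_OnlineResource."""
    linkage: str                    # the URL itself
    protocol: Optional[str] = None
    name: Optional[str] = None
    function: Optional[str] = None  # e.g. "download" or "information"


@dataclass
class MDDigitalTransferOptions:
    """Simplified stand-in for ISO MD_DigitalTransferOptions."""
    on_line: list[CIOnlineResource] = field(default_factory=list)
    transfer_size: Optional[float] = None


@dataclass
class MDDistribution:
    """Simplified stand-in for ISO MD_Distribution."""
    transfer_options: list[MDDigitalTransferOptions] = field(default_factory=list)


def online_linkages(dist: MDDistribution) -> list[str]:
    """Collect every online linkage across all transfer options."""
    return [res.linkage for opt in dist.transfer_options for res in opt.on_line]


example = MDDistribution(
    transfer_options=[
        MDDigitalTransferOptions(
            on_line=[
                CIOnlineResource(
                    linkage="https://repo.example.org/files/data.nc",
                    protocol="HTTPS",
                    function="download",
                )
            ]
        )
    ]
)
```

The point of the nesting is that each content URL carries its own protocol and function, so a record can list several download options rather than one flat URL.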
