Dear Inna,
Here is a brief explanation of the challenges we are addressing
and the tasks we are handling in our small initiative group of
developers working on decentralised persistent identification
of digital metadata, data and documents. The work on this
initiative started towards the end of the ExPaNDS project of EOSC,
dedicated to FAIR and open data principles in the European
scientific ecosystem.
During a discussion about the identification and categorisation of
data (electronic libraries, sample catalogues, investigation
registries, articles, etc.), held in October 2023 during the RDA
International Data Week with Christopher Hill and Erik van
Winkle from DeSci Labs, we formulated the following list of
issues and challenges to address:
- The categorisation of any kind of data or metadata depends on
registries and persistent identifiers (PIDs).
- The most widely used registry for any kind of data is currently DOI.
- The minting model for the persistent identifier provided by
DOI does not depend on the addressed data.
- The DOI registry effectively owns the links it provides to users
and is responsible for making changes to them.
- The storage model the DOI ecosystem uses to categorise data is
federated, but since it depends on the control exercised by the
operating organisation, it is effectively centralised.
- Minting a PID and storing the records for a given DOI is a
social contract in which the data owner (researcher) yields
ownership of their metadata to the registry operator.
- As the PID does not depend on the data it addresses and it
cannot be regenerated on the resolver's side, the centralised
minting model of DOI appears vulnerable to rotten (unavailable)
links, link expiration, data mangling, and even censorship and
malicious intrusions on the federated nodes.
- There is no web of trust in the DOI ecosystem, as users have
no way to disclose their identities and to verify the published
data for other users.
- A centralised PID system is a single point of failure,
vulnerable to infrastructure outages.
- The cost of deploying centralised or federated PID
infrastructures grows exponentially with scale, so scalability
is inherently limited.
After an initial analysis of these challenges, a first technology
stack was formulated. It comprises open-source, free-to-use
technologies that allow us to:
- Exchange data securely between all participants based on
unique cryptographic keys, as systems like BitTorrent do.
- Identify shared data using content-dependent technologies and
hash functions, so that identifiers can be reproduced on the
resolver's side.
- Allow the participants (so-called Agents) of the system to
disclose and reclaim their identities by openly sharing their
public cryptographic keys.
- Avoid blockchain technologies, which are vulnerable to hard
forks and to monopolisation of access by blockchain operators.
- The initial technology stack includes:
  - Distributed Hash Tables (DHT) to openly share metadata and
  securely broadcast encrypted messages between the participants.
  - Decentralised storage networks like IPFS and Iroh as data
  sharing layers.
  - Git as a local versioning repository controller to store and
  share the data locally.
  - State-of-the-art customised software layers to ledger and
  store the openly shared metadata.
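To illustrate the content-dependent identification idea from the list above: because the identifier is derived from the data itself, any resolver can regenerate and verify it locally, with no central minting authority. The following is a minimal, hypothetical sketch (the names `content_pid` and `verify` and the `dpid:` prefix are invented for illustration; real systems such as IPFS use multihash-encoded CIDs):

```python
import hashlib

def content_pid(data: bytes, prefix: str = "dpid") -> str:
    """Derive a persistent identifier from the content itself.

    The identifier is a pure function of the data, so any resolver
    can recompute it independently and check that the data it
    received matches the identifier.
    """
    digest = hashlib.sha256(data).hexdigest()
    return f"{prefix}:sha256:{digest}"

def verify(pid: str, data: bytes) -> bool:
    """Check that the data matches its content-derived identifier."""
    _prefix, algo, digest = pid.split(":")
    return hashlib.new(algo, data).hexdigest() == digest

record = b"example dataset, version 1"
pid = content_pid(record)
assert verify(pid, record)           # reproducible on the resolver's side
assert not verify(pid, b"tampered")  # any change breaks the identifier
```

Note how this inverts the DOI social contract: instead of the registry owning an opaque link, the identifier is bound to the content and can be re-minted by anyone holding the data.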
The main purposes and targets of the initiative group are to
provide a PID ecosystem that should be:
- Easy: It must be easy to create, find and retrieve
entities stored on the network in a programmatic and secure
fashion
- Versionable and modular: research objects must be easy
to update and fork. Versionability must be a first-class
property of the research object
- Provenance: For every operation on the network, an
immutable log of "who, what, and when" needs to be preserved and
made accessible
- Programmable access: Not all data should or can be
openly accessible. The network must allow for both open data and
restricted access data
- Permissionless: Anyone must be able to create a
research object on the network, fork a research object, or
enrich a research object with metadata such as machine-readable
semantics, community discussions, or attestations
- Open source and decentralised: The network must allow
for decentralisation to create resiliency, and its underlying
code base must be open source to enable collaborative improvements
and stewardship
- Reproducible and verifiable: The protocol must allow
for linkage of research artefacts such as code, models, data and
publications into connected entities that enable verifiability
and minimise fragmentation
The group created a blog post for RDA (now unavailable due to
website reconstruction; see the Google Docs version here) that
explains the challenges of creating such an ecosystem. The group
now also relies on these two main documentation drafts:
We are now looking for interested organisations, stakeholders,
and collaborators to obtain any kind of support, possibilities
of legal recognition, or any other form of collaboration that
could help promote, support and accelerate the development. We
also have strong support from the Global South, because of the
problems people there face with centralised systems like DOI.
Please do not hesitate to contact me with any questions, remarks
or requests for clarification. I have also CC'd this letter to the
group's mailing list so that all the respective participants can
add their remarks too.
--
Best regards,
Andrey Vukolov
Scientific Computing Engineer, PLC Programmer,
ExPaNDS project (expands.eu) Data Stewardship Engineer
ELETTRA Sincrotrone Trieste (elettra.eu)
Strada Statale 14 - 163.5km, AREA Science Park,
34149 Basovizza, Trieste, Italy
Tel.: +39 348 888 4453