Dear Inna,
Here is a brief explanation of the challenges we are addressing
and the tasks we are handling in our small initiative group of
developers working on decentralised persistent identification
of digital metadata, data and documents. The work on this
initiative started towards the end of the ExPaNDS project of EOSC,
dedicated to FAIR and open data principles in the European
scientific ecosystem.
During a discussion about the identification and categorisation of
data (electronic libraries, sample catalogues, investigation
registries, articles, etc.), held in October 2023 during the RDA
International Data Week with Christopher Hill and Erik van
Winkle from DeSci Labs, we formulated the following list of
issues and challenges to address:
- The categorisation of any kind of data or metadata depends on
registries and persistent identifiers (PIDs).
- The most widely used registry for any kind of data is currently DOI.
- The minting model for the persistent identifier provided by
DOI does not depend on the addressed data.
- The DOI registry effectively owns the links it provides to users
and is responsible for making changes to them.
- The storage model the DOI ecosystem uses to categorise data is
federated, but since it depends on the control exercised by the
operating organisation, it is effectively centralised.
- Minting a PID and storing the records for a given DOI is a
social contract in which the data owner (researcher) yields
ownership of their metadata to the registry operator.
- As the PID does not depend on the data it addresses and it
cannot be regenerated on the resolver's side, the centralised
minting model of DOI appears vulnerable to rotten (unavailable)
links, link expiration, data mangling, and even censorship and
malicious intrusions on the federated nodes.
- There is no web of trust in the DOI ecosystem, as users have
no way to disclose their identities and to verify the published
data for other users.
- A centralised PID system is a single point of failure,
vulnerable to infrastructure outages.
- The cost of deploying centralised or federated PID
infrastructures grows exponentially with scale, so scalability
is inherently limited.
After an initial analysis of these challenges, a first technology
stack was formulated. It comprises open-source, free-to-use
technologies that allow us to:
- Exchange data securely between all participants based on
unique cryptographic keys, as systems like BitTorrent do.
- Identify shared data using content-dependent technologies and
hash functions, so that identifiers can be reproduced on the
resolver's side.
- Allow the participants (so-called Agents) of the system to
disclose and reclaim their identities by openly sharing their
public cryptographic keys.
- Avoid blockchain technologies, which are vulnerable to hard
forks and to monopolisation of access by blockchain operators.
- The initial technology stack includes:
  - Distributed Hash Tables (DHT) to openly share metadata and
  securely broadcast encrypted messages between the participants.
  - Decentralised storage networks like IPFS and Iroh as data
  sharing layers.
  - Git as a local versioning repository controller to store and
  share the data locally.
  - State-of-the-art customised software layers to ledger and
  store the openly shared metadata.
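To illustrate the content-dependent identification idea from the list above: because the identifier is derived from the data itself, any resolver can regenerate and verify it locally, with no central minting authority. The following is a minimal, hypothetical sketch (the names `content_pid` and `verify` and the `dpid:` prefix are invented for illustration; real systems such as IPFS use multihash-encoded CIDs):

```python
import hashlib

def content_pid(data: bytes, prefix: str = "dpid") -> str:
    """Derive a persistent identifier from the content itself.

    The identifier is a pure function of the data, so any resolver
    can recompute it independently and check that the data it
    received matches the identifier.
    """
    digest = hashlib.sha256(data).hexdigest()
    return f"{prefix}:sha256:{digest}"

def verify(pid: str, data: bytes) -> bool:
    """Check that the data matches its content-derived identifier."""
    _prefix, algo, digest = pid.split(":")
    return hashlib.new(algo, data).hexdigest() == digest

record = b"example dataset, version 1"
pid = content_pid(record)
assert verify(pid, record)           # reproducible on the resolver's side
assert not verify(pid, b"tampered")  # any change breaks the identifier
```

Note how this inverts the DOI social contract: instead of the registry owning an opaque link, the identifier is bound to the content and can be re-minted by anyone holding the data.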
The main purposes and targets of the initiative group are to
provide a PID ecosystem that should be:
- Easy: It must be easy to create, find and retrieve
entities stored on the network in a programmatic and secure
fashion
- Versionable and modular: research objects must be easy
to update and fork. Versionability must be a first-class
property of the research object
- Provenance: For every operation on the network, an
immutable log of "who, what, and when" needs to be preserved and
made accessible
- Programmable access: Not all data should or can be
openly accessible. The network must allow for both open data and
restricted access data
- Permissionless: Anyone must be able to create a
research object on the network, fork a research object, or
enrich a research object with metadata such as machine-readable
semantics, community discussions, or attestations
- Open source and decentralised: The network must allow
for decentralisation to create resiliency, and its underlying
code base must be open source to enable collaborative improvements
and stewardship
- Reproducible and verifiable: The protocol must allow
for linkage of research artefacts such as code, models, data and
publications into connected entities that enable verifiability
and minimise fragmentation
The group created a blog post for RDA (now unavailable due to
website reconstruction; see the Google Docs version here) that
explains the challenges of creating such an ecosystem. The group
now also relies on these two main documentation drafts:
We are now looking for interested organisations, stakeholders,
and collaborators to obtain any kind of support, possibilities
of legal recognition, or any other form of collaboration that
could help promote, support and accelerate the development. We
also have strong support from the Global South, because of the
problems people there face with centralised systems like DOI.
Please do not hesitate to contact me with any questions, remarks
or requests for clarification. I have also CC'd this letter to the
group's mailing list so that all the respective participants can
add their remarks too.
--
Best regards,
Andrey Vukolov
Scientific Computing Engineer, PLC Programmer,
ExPaNDS project (expands.eu) Data Stewardship Engineer
ELETTRA Sincrotrone Trieste (elettra.eu)
Strada Statale 14 - 163.5km, AREA Science Park,
34149 Basovizza, Trieste, Italy
Tel.: +39 348 888 4453