Long-term sustainability of ARKs and federation


Andreas Kohlbecker

Apr 17, 2023, 12:12:20 PM
to ARKs
Dear ARK community,

After reading all the documentation on the benefits of using ARKs over DOIs, one big question remains:

As far as I understand, "over 90% of the ARKs in the world are published without using n2t.net in the URL hostname." This consequently means that these 90% of ARKs are registered and resolved by servers that are run and maintained by members of the ARK Alliance community, or simply by institutions that have adopted ARKs. How can this open ARK infrastructure be sustainable? With DOIs, at least the metadata of a dead object will still be available after the institutional server has died. What about ARKs in this case? I guess that once an institution can no longer maintain the resolver, the metadata will be lost as well.

Are there any plans for establishing a federation between all the resolvers in the ARK Alliance community to make the whole PID system more sustainable?

All the best,
Andreas

Stephen Richard

Apr 18, 2023, 2:27:29 PM
to ARKs
Important question. My understanding is that the registration metadata for all ARKs is maintained by n2t.net. It is just the who and when of identifier registration -- not much, but something.
steve

John Kunze

Apr 18, 2023, 6:54:09 PM
to arks-...@googlegroups.com
Hi Andreas,

I might not completely understand the question, but I'd restate the premise a bit. Instead of

90% of ARKs are registered and resolved by servers that are run and maintained by members of the ARK Alliance community or simply by institutions that have adopted ARKs.

I'd say 

   100% of ARKs and 100% of DOIs are registered and resolved by servers that are run and maintained by members of the ARK Alliance and DOI communities, respectively. All those identifiers rely critically on thousands of institutional web servers that have adopted ARKs and DOIs, respectively, since those servers collectively host primary content for their communities.

That being said, primary content access would appear to be equally vulnerable to institutional failure independent of whether access goes through the ARK or the DOI infrastructure. So in regard to the main PID function of providing long term access, the ARK and DOI infrastructures could be seen as comparable.

Separate from long term access to primary content is long term access to secondary content (namely, metadata). I agree completely in the importance of maintaining a redundant copy of the metadata outside the institution hosting the primary content. 

For that reason, from the start N2T served as both a resolver and an external metadata store. Unfortunately, it was only available to a handful of ARK organizations, such as EZID users and the Internet Archive. The last I checked, there were about 60 million ARKs (records) in N2T and 79% of them had metadata. Some of the metadata is rich and some is minimal (who, what, when). 

Resource constraints made it hard to implement an accounting system (logins, passwords, etc.) for N2T that would permit its more widespread use as a redundant metadata store for ARK organizations. However, the California Digital Library has been refreshing the technology behind N2T, and we will soon have a better understanding of what its capabilities will be. For example, there may be a solution to the accounting (access control) problem, which would allow all ARK organizations to maintain their NAAN registry entries, and perhaps also to deposit metadata in external storage.

Either way, having at least one external copy of the metadata is a goal that I strongly support. A second copy would be even better. Around 8 years ago we (N2T) entered into discussion with Crossref about their storing a copy of ARK metadata and N2T storing a copy of their DOI metadata. As part of a trial they loaded all their DOIs into N2T and we demonstrated that N2T could do resolution and content negotiation in what might be a kind of hot failover situation. Although that discussion didn't go further, it shows the interest both parties had in scalable, redundant, and collaborative infrastructure. 

I'd say there is real interest in improving ARK metadata redundancy, and given that storage prices keep falling, we may not be far from supplying the missing piece of sustainability that's concerned with federated metadata. Thank you for bringing up this issue. I'll make sure to add it to the agenda in the ongoing discussions we are having in the ARK Alliance Advisory Group.

Best,

-John


Andreas Kohlbecker

Apr 20, 2023, 10:53:50 AM
to ARKs
Hi John,

My question was primarily focused on the long-term sustainability of what you call secondary content, that is, metadata.

It is promising for the future of the ARK system that the latest N2T software may include enhancements that extend its capabilities in a way that opens the system to more ARK organizations, perhaps enabling them to deposit metadata in external storage.

The pilot study conducted with Crossref is really interesting, not least because it demonstrated that it is in principle possible for N2T to serve as a failover system for the DOI infrastructure.
IMHO, for better redundancy of the ARK system itself, the approach of mutual ARK/DOI failover systems would only be a second-level backup. The primary metadata redundancy should rather be established within the ARK community.

You mentioned falling storage prices as a factor that may influence the formation of a federated metadata redundancy. If the average metadata record size were 1 KB, the 8.2 billion ARKs that exist by now would easily fit into a 10 TB store. Given that my assumptions are not too far from reality, this factor is no longer a blocker.
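As a quick back-of-envelope check (the 1 KB average record size is my assumption, not an official figure):

```python
# Back-of-envelope storage estimate for federated ARK metadata.
# Assumptions (not official figures): ~8.2 billion ARKs, ~1 KB of metadata each.
n_arks = 8.2e9          # total ARKs cited above
avg_record_kb = 1.0     # assumed average metadata record size, in KB
total_tb = n_arks * avg_record_kb / 1e9  # KB -> TB (decimal units)
print(f"~{total_tb:.1f} TB")  # ~8.2 TB, comfortably under a 10 TB store
```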

I would be interested in the outcome of the discussion on this topic in the ARK Alliance Advisory Group.

Best,
Andreas

Donny Winston

Apr 20, 2023, 12:11:17 PM
to ARKs
One position on ensuring long-term metadata availability is the "Available data" bullet of <https://openscholarlyinfrastructure.org/#insurance>, i.e. "Underlying data should be made easily available via periodic data dumps."

Crossref has adopted this position and, for its allocation of DOIs and stewardship of the associated metadata, has so far provided three annual dumps available via torrent (see the Crossref blog, and search for "Crossref" on academictorrents.com). Their most recent dump, from April 2022, contained 134M records and is 160 GB.
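Incidentally, those dump figures imply an average record size of roughly a kilobyte (assuming decimal units and taking the published dump size at face value):

```python
# Implied average record size from the April 2022 Crossref dump figures above.
records = 134e6   # ~134M DOI records
dump_gb = 160     # ~160 GB dump size
avg_kb = dump_gb * 1e6 / records  # GB -> KB (decimal units assumed)
print(f"~{avg_kb:.1f} KB per record")  # ~1.2 KB
```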

DataCite has not yet provided a similarly clear dump of their DOI holdings, but someone has taken an interest in doing this for them, posting the dumps to archive.org, e.g. https://archive.org/details/datacite_dump_20221118 is the latest there.

This, of course, is still fragmented across DOI holdings, i.e. one needs to gather such dumps from each DOI provider. This is perhaps a practically sustainable situation for the DOI system because the various providers are known and relatively few in number (compared to the ARK system). For the ARK community, I can see clear value in the voluntary consolidation of, e.g., CC0-licensable metadata across NAAs into a shared store on a periodic (e.g. quarterly or annual) basis. So yes, I too am interested in ongoing discussion on this topic.

Best,
Donny

P.S. Another potential "leg" of redundancy is to use Amazon's current Open Data program (https://aws.amazon.com/opendata/) as e.g. the OpenAlex effort does for dumps (https://docs.openalex.org/download-all-data/download-to-your-machine). I stress "leg" here because by no means am I suggesting any singular dependence whatsoever on this large corporation's current offering of free hosting.

--
Donny Winston, PhD (he/him/his)
Polyneme LLC
New York, NY

Tallman, Nathan

Apr 21, 2023, 10:55:43 AM
to ARKs
Hi all,

It's been interesting following this discussion. I'm glad that Donny is pointing to POSI. Might a lightweight approach to storing additional copies of the N2T metadata be public GitHub and GitLab repositories that could be updated monthly, or on some other periodic basis, through a simple git commit and push? If additional preservation is desired, the Internet Archive or OSF might be a good choice.
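A rough sketch of what that periodic snapshot step could look like (all names, paths, and the export format are illustrative; this uses a throwaway local repo so the steps can be tried anywhere, whereas a real setup would use a pre-cloned public repository and finish with a `git push`):

```shell
# Illustrative sketch: commit a periodic metadata snapshot to a git repo.
# All names/paths are placeholders; a real setup would push to a public remote.
set -eu
repo=$(mktemp -d)                                # stand-in for a pre-cloned public repo
cd "$repo"
git init -q .
git config user.email "naan-admin@example.org"   # placeholder identity
git config user.name  "NAAN Admin"
# Stand-in for a real N2T metadata export (e.g. newline-delimited JSON):
echo '{"ark": "ark:/12345/x1", "when": "2023-04"}' > metadata.ndjson
git add metadata.ndjson
git commit -qm "Metadata snapshot $(date +%Y-%m)"
```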

Thanks, 

Nathan 

  

--  

Nathan Tallman (he/him) 

nt...@psu.edu 
