Encoding periods in qualifiers

Gabriel Müller

Dec 3, 2024, 6:00:49 AM
to ARKs
Dear all,

I have a question about the design of the "qualifier" part of ARKs. A colleague and I are thinking about the best way to assign ARKs to code repositories, and found that the URLs used by GitHub/GitLab align with the ARK standard for the most part. If we assign an opaque name for the repo itself, we can (almost) reuse the existing URLs to give us a legal ARK for every file in every tagged version. For example, we could redirect n2t.net/ark:/15737/p654p8z5cgjv to https://github.com/je4/FairService, and n2t.net/ark:/15737/p654p8z5cgjv/v2.0.13/pkg/ark/git/gitplugin.go to https://github.com/je4/FairService/blob/v2.0.13/pkg/ark/git/gitplugin.go. The resolver does need a few simple replacement rules, such as inserting the additional "blob/" in the example.
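
To make the replacement rule concrete, here is a minimal Go sketch of such a resolver step. It is only an illustration under stated assumptions: the lookup table `repoTargets` and the function `resolve` are hypothetical names, and the sketch assumes that the first qualifier component is always a release tag and that the rest is a file path.

```go
package main

import (
	"fmt"
	"strings"
)

// repoTargets is a hypothetical lookup table from an opaque ARK name
// (NAAN/name) to the repository it identifies.
var repoTargets = map[string]string{
	"15737/p654p8z5cgjv": "https://github.com/je4/FairService",
}

// resolve maps an ARK to a target URL. Assumption: the qualifier starts
// with a release tag, so "blob/" is inserted in front of the qualifier.
func resolve(ark string) (string, bool) {
	rest := strings.TrimPrefix(ark, "ark:")
	rest = strings.TrimPrefix(rest, "/")
	parts := strings.SplitN(rest, "/", 3) // NAAN, name, qualifier
	if len(parts) < 2 {
		return "", false
	}
	base, ok := repoTargets[parts[0]+"/"+parts[1]]
	if !ok {
		return "", false
	}
	if len(parts) == 2 || parts[2] == "" {
		return base, true // no qualifier: the repository itself
	}
	return base + "/blob/" + parts[2], true
}

func main() {
	for _, ark := range []string{
		"ark:/15737/p654p8z5cgjv",
		"ark:/15737/p654p8z5cgjv/v2.0.13/pkg/ark/git/gitplugin.go",
	} {
		target, ok := resolve(ark)
		fmt.Println(ark, "->", target, ok)
	}
}
```

The sketch passes the qualifier through unchanged, periods included. ARKs that point at a bare tag or a directory would presumably need rules of their own.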

There is an obvious syntactic problem, however: Both the release tags and the file names often contain periods. If I understand the ARK scheme correctly, though a Name Mapping Authority is free to publish their ARKs with qualifiers or not, a period anywhere after the "ark:" part *must* indicate an object variant (I'm reading the 2023 draft here). That is clearly not the case in my example above: "v2.0.13" is one single tag, not a variant of a variant. Likewise with the filename at the end: There is no "/gitplugin" object of which "/gitplugin.go" is a variant.

Does anybody have suggestions on how we could design our persistent URLs to integrate these types of examples? One could %-encode the periods, but web browsers tend to be 'helpful' and decode them: wikipedia%2Eorg is treated as wikipedia.org. If we encode our periods in this way in our published ARKs, we have no guarantee that they will still be encoded when the user interacts with them. Still, at least we could make sure that when an ARK is visible on a website to humans, all non-semantic periods are %-encoded.
The alternative to %-encoding would be to replace all periods, but the question is with what: Most other characters that are legal in a URL can also appear in our target URLs (e.g. a release tag called "2_0_13").
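
The collision risk with a substitute character is easy to demonstrate. The sketch below uses a hypothetical `replacePeriods` rule with '_' as the substitute, chosen only because it matches the "2_0_13" example; any other character that can occur in a tag or file name would behave the same way.

```go
package main

import (
	"fmt"
	"strings"
)

// replacePeriods is a hypothetical substitution rule that swaps '.' for '_'.
func replacePeriods(s string) string {
	return strings.ReplaceAll(s, ".", "_")
}

func main() {
	a := replacePeriods("2.0.13") // tag written with periods
	b := replacePeriods("2_0_13") // a different tag, written with underscores
	fmt.Println(a, b, a == b)     // both tags end up with the same ARK form
}
```

Because the two distinct tags map to the same string, the resolver could not reverse the substitution without extra information.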

We would welcome any suggestions or feedback.

Best
Gabriel Müller

John Kunze

Dec 8, 2024, 3:20:02 AM
to arks-...@googlegroups.com
> There is an obvious syntactic problem, however: Both the release tags and the file names often contain periods. If I understand the ARK scheme correctly, though a Name Mapping Authority is free to publish their ARKs with qualifiers or not, a period anywhere after the "ark:" part *must* indicate an object variant (I'm reading the 2023 draft here). That is clearly not the case in my example above: "v2.0.13" is one single tag, not a variant of a variant. Likewise with the filename at the end: There is no "/gitplugin" object of which "/gitplugin.go" is a variant.

Hi Gabriel,

This is a good question, and it highlights several issues. PID schemes make tradeoffs to try to be usable today and in the future. To be compatible with current internet (i.e., web) practice, ARKs tried to align with common conventions around periods near the ends of URLs (e.g., .pdf). But to keep parsing rules simple, and to accommodate providers who prefer to organize some variation (e.g., language: .en, .fr, .es) earlier in the path, periods were to be treated the same way wherever they appeared in the ARK, including in non-suffix positions such as

ark:/12345/x54b92/foo.en/bar.pdf
ark:/12345/x54b92/foo.fr/bar.pdf
ark:/12345/x54b92/foo.de/bar.pdf

OTOH, we never actually heard demand for the non-suffix position (which doesn't mean there's no latent demand). And the ARK spec has avoided using examples of periods in non-suffix positions in order to make it a little easier to change the rule if implementation experience suggests it should be changed (so that periods have special meaning only in the final path component).

What if that change were proposed to and approved by the Technical working group? First, it would take care of the "v2.0.13" in n2t.net/ark:/15737/p654p8z5cgjv/v2.0.13/pkg/ark/git/gitplugin.go. As for the "/gitplugin.go" implying the existence of a "/gitplugin" object, the latter (with no .suffix part, and even if there's no such file) would plausibly be mapped to the default variant (or the only variant), namely, the Go source file.
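
As a rough illustration of what the adjusted rule might mean for a parser, here is a Go sketch. It is one possible reading, not spec text: the function `splitQualifier` is a hypothetical name, and it assumes that every period in the final path component introduces a variant suffix while periods in earlier components are ordinary characters.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQualifier applies one reading of the proposed rule: periods are
// significant only in the final path component. It returns the qualifier
// with the variant suffixes removed, plus the suffixes themselves.
func splitQualifier(qualifier string) (object string, variants []string) {
	i := strings.LastIndex(qualifier, "/") // -1 if there is no slash
	dir, last := qualifier[:i+1], qualifier[i+1:]
	fields := strings.Split(last, ".")
	return dir + fields[0], fields[1:]
}

func main() {
	for _, q := range []string{
		"v2.0.13/pkg/ark/git/gitplugin.go",
		"v2.0.13/pkg/ark/git/gitplugin",
		"foo.en/bar.pdf",
	} {
		object, variants := splitQualifier(q)
		fmt.Printf("%q -> object %q, variants %q\n", q, object, variants)
	}
}
```

Under this reading the tag "v2.0.13" is left intact whenever a path follows it. An ARK that ends at the tag itself would still have its periods in the final component, so that case may need separate thought.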

In line with your observation about browsers trying to be "helpful", long experience suggests it's good to avoid %-encoding in URLs if possible. If it became necessary, I'd be inclined to agree that mapping '.' to '_' is preferable. But before doing either, would you be interested in an adjustment to the ARK spec along the lines in the previous paragraph?


Gabriel Müller

Dec 13, 2024, 6:14:31 AM
to ARKs
Hi John

Thank you for the recommendations and for the offer to propose a change to the spec on account of our issue. If I understand you correctly, the new rule would be that periods only signify an object variant when they occur to the right of the last slash. That seems like a very good idea to me! I feel that we'll need to think a bit more about what exactly this would mean for our implementation before we can be sure, however. That will probably take us until sometime in January. I will message again when we have any news.

Dave Vieglais

Dec 13, 2024, 11:32:08 AM
to arks-...@googlegroups.com

Hi John,

Reviewing the actual use of periods in ARKs via resolver service logs, it is apparent that there are many instances where periods appear in various locations (and more than once) in ARK identifiers. Hence it would seem that such provisions in the spec are advisory.

Dave Vieglais
