Persistent URLs

Declan Fleming

Mar 28, 2013, 1:28:42 PM

to digital-...@googlegroups.com

Hi - I'm puzzling over how to best set up persistent URLs for our digital objects. I thought I'd solved this ten years ago by staking a claim on the URL real estate at "libraries.ucsd.edu/digital/UNIQUEID". But now we're having a rebranding to "library.ucsd.edu" so my root is changing.

Now, I know that Cool URLs don't break, and we can do the Apache magic to make sure that all of the old "libraries" links will keep working, but what I'm thinking about is whether I want the digital object's persistent, citable URLs to have a root that could be rebranded again in another decade of change. Should I pick a new root that is possibly less changeable? Can I avoid at least SOME Apache config table maintenance in the future, or do I just deal with it and shut up?

How did your repo choose a root URL? Have others been through a change in persistent URL roots?

Thanks!
Declan

Matt Jones

Mar 28, 2013, 3:02:53 PM

to digital-...@googlegroups.com

Despite the theory surrounding Cool URLs, it is well established empirically that URLs are simply not persistent. See, for example, Figure 1 from this paper (http://xldb.di.fc.ul.pt/daniel/docs/papers/gomes06urlPersistence.pdf), which shows that only about 15% of URLs in a large sample reach a lifetime of 1000 days. It's abysmal. And other papers back it up (e.g., Science (doi:10.1126/science.1088234) and Computer (doi:10.1109/2.901164)).

I'm not really trying to rehash this ground, because it has been extensively discussed, and there are partisans on both sides (you can tell which side I am on). But I think there is a reason that DOIs have been so successful in publishing and citing journal articles -- it's because a DOI can still resolve an object's location even when the web URL for an object changes over the years. DOIs don't conflate an object's identity with its address, whereas URLs do, and this is what is causing you pain right now as your addresses change. Also, there is a well-established social contract for DOIs: if you mint one, you are responsible for the persistence of the resource, and readers bring different expectations of persistence to a DOI than to a URL.

Although in theory this could be true for URLs as well using various redirection techniques, in practice it has not been, and so people are rightfully skeptical that URLs will continue to resolve over time. URLs are super handy as links, and I fully acknowledge that the tool support for resolving DOIs and other types of indirect identifiers is insufficient. But I still think DOIs and other indirect identifiers that are not HTTP URLs make the best persistent handles for objects.

So, have you considered minting a DOI or an ARK or similar identifier for your resources, and using that as their persistent identifier, and allowing the URL address to move as your web server layout evolves?
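A minimal sketch of that indirection, assuming a toy in-memory table rather than any real DOI or ARK service. The `Resolver` class, the identifier, and both example.org URLs are made up for illustration; the point is only that the identifier stays fixed while the address behind it changes.

```python
class Resolver:
    """Toy resolver: the identifier is the key, the current address is the value."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Record the first known address for an identifier."""
        self._locations[identifier] = url

    def move(self, identifier, new_url):
        """Point an existing identifier at a new address."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier):
        """Return the current address for an identifier."""
        return self._locations[identifier]


resolver = Resolver()
# Hypothetical identifier and addresses, not real records.
resolver.register("ark:/99999/example1", "https://old.example.org/digital/example1")
resolver.move("ark:/99999/example1", "https://new.example.org/objects/example1")
current = resolver.resolve("ark:/99999/example1")
```

Anything that cited the identifier keeps working after the move; only the table entry had to change.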

Matt


Gabriel Farrell

Mar 28, 2013, 3:40:48 PM

to digital-...@googlegroups.com

I applaud the move from libraries.ucsd.edu to library.ucsd.edu. Whoever pluralized that to begin with should be fired.

I agree with Matt that DOIs and ARKs are great for providing another path to the resource. You're still going to have to deal with URL changes, though, and deal with them you must. Make the redirects, monitor traffic, and in 5 or 10 years put some annoying "deprecated" redirect in, then 5 years later drop them. All of this is preferable to "handle" URLs or similar efforts around generalized domains perceived as permanent.

Yes, URLs change, but the one thing university libraries know how to do well is change slowly.

Mark Baker

Mar 28, 2013, 4:32:08 PM

to digital-...@googlegroups.com

Hi Matt,

On Thu, Mar 28, 2013 at 3:02 PM, Matt Jones <jo...@nceas.ucsb.edu> wrote:

> And other papers back it up (e.g., Science
> (doi:10.1126/science.1088234) and Computer (doi:10.1109/2.901164)).

So those DOIs are persistent, right? How about these URLs?

http://dx.doi.org/10.1126/science.1088234
http://dx.doi.org/10.1109/2.901164

> I'm not
> really trying to rehash this ground, because it has been extensively
> discussed, and there are partisans on both sides (you can tell which side I
> am on). But I think there is a reason that DOIs have been so successful in
> publishing and citing journal articles -- it's because a DOI can still
> resolve an object's location even when the web URL for an object changes
> over the years. DOIs don't conflate an object's identity with its address,
> whereas URLs do,

In fact, they don't. No part of "http://dx.doi.org" includes an
address, only a name ("dx.doi.org") that can resolve to different (IP)
addresses over time.

IME, if you want to mint persistent identifiers, make sure the
governance of the entire naming hierarchy (e.g. ucsd.edu) is committed
to it. If you can't, consider an organization that performs similar
functions to DOI, but one that isn't afraid to declare that their URLs
are persistent (like PURL). Because what would you rather disseminate
of those two sets of identifiers above; the doi URN that you have to
cut and paste and take to doi.org (or was that doi.net or doi.com, I
can never remember), or the http URL that "just works" when you click
on it?
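The two forms differ only by a resolver prefix. A minimal sketch of the conversion, assuming the dx.doi.org resolver used in the URLs above; the function name `doi_to_url` is made up for illustration.

```python
from urllib.parse import quote


def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Turn a 'doi:...' citation into a clickable resolver URL."""
    name = doi.strip()
    if name.lower().startswith("doi:"):
        name = name[len("doi:"):]
    # Percent-encode anything unsafe in a URL path, but keep the "/" separator.
    return resolver + quote(name, safe="/")


links = [doi_to_url(d) for d in ("doi:10.1126/science.1088234", "doi:10.1109/2.901164")]
```

The resolver hostname is a parameter here, so the same identifiers can be re-published under a different resolver name without being re-minted.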

Persistence has absolutely nothing to do with what form your
identifiers take, and everything to do with governance and,
ultimately, money (to maintain the infrastructure necessary to keep
your identifiers usable).

Mark.

Declan Fleming

Mar 28, 2013, 5:06:51 PM

to digital-...@googlegroups.com

Matt, Gabe, Mark - thanks for all that thought!

Matt - in actuality, our object IDs are ARKs. https://libraries.ucsd.edu/ark:/20775/bb04105815 is the real name of something, and if browsers were ark: or doi: aware, we could just use that part of the name. But they aren't and I'm still on the hook to get someone to the object for the next few hundred years. Note that I'm saying "persistent" URLs, not "permanent". Nothing is permanent, I'm just trying to think through a strategy that minimizes the overhead in maintaining persistence. [I'm also debating getting that "ark:" out of the URL. 2 colons are NOT fun to parse in a URL.]
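On the parsing point: a standard URL parser treats the second colon as ordinary path text, so the ARK can be pulled out of the path afterwards. A minimal sketch, assuming a numeric NAAN like the 20775 above; `extract_ark` is a hypothetical helper, and the pattern is not a full ARK validator.

```python
import re
from urllib.parse import urlsplit

# Assumption: the path looks like /ark:/<digits>/<name>, as in the example URL.
ARK_PATH = re.compile(r"^/ark:/?(?P<naan>\d+)/(?P<name>.+)$")


def extract_ark(url):
    """Return (naan, name) if the URL path embeds an ARK, otherwise None."""
    path = urlsplit(url).path  # scheme, host, query and fragment are split off here
    match = ARK_PATH.match(path)
    if match is None:
        return None
    return match.group("naan"), match.group("name")


ark = extract_ark("https://libraries.ucsd.edu/ark:/20775/bb04105815")
```

Because the host is discarded before matching, the same function gives the same answer whether the link says "libraries" or "library".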

I like your citation above showing that only about 15% of URLs last 1000 days. Isn't it our kind of institutions that should be in that minority who struggle to succeed in providing persistence? You are correct that DOIs have a good track record, but they still need a URL to get to them, as Mark points out. I think I want a DOI-like persistence for our repo, hence the thinking about some hostname that we can control.

Mark - I hear you in terms of governance... but we have a "let a thousand flowers bloom" mentality here that resists central control. It makes some things really cool, and others a nightmare. I live in fear of the place being renamed "ucsandiego.com". I wonder how the Univ. of Illinois dealt with this when they went from "uiuc.edu" to "illinois.edu".

Gabe - you should have seen some of the tortured English when "Libraries" was used as a singular noun. ;)

Declan

Kevin Hawkins

Mar 28, 2013, 7:18:15 PM

to digital-...@googlegroups.com

On 3/28/13 5:06 PM, Declan Fleming wrote:
> I
> think I want a DOI-like persistence for our repo, hence the thinking
> about some hostname that we can control.

The University of Michigan Library uses the CNRI Handle System, which,
like OCLC's PURL server and DOIs, is designed for persistence. So our
institutional repository, publications hosted by the Library under the
MPublishing brand, and even objects in HathiTrust all use the
university's prefix. While there's human-readable institutional
branding in the URL, having an opaque URL gets us around the problem of
institutional rebranding that Declan has run into.

--Kevin

Randy

Mar 29, 2013, 9:17:55 AM

to digital-...@googlegroups.com

At Harvard we have gone the route of having an institutionally supported URL - nrs.harvard.edu - that we have so far maintained for around 15 years and hope to maintain indefinitely. More info at http://hul.harvard.edu/ois/systems/nrs_ams/

No doubt long after I am gone, this URL will have to be remapped to something else...

Randy Stern
Director, Systems Development
Harvard University IT, Library Technology Services

Tom Creighton

Mar 29, 2013, 7:18:38 PM

to digital-...@googlegroups.com

You can use DOI or ARK as others have suggested. But ultimately you have to expect to do "Apache Magic" to handle a change in location. This is more commonly called an HTTP redirect, and DOI resolution takes a slightly different approach. You can't come up with a once-for-all-time resource locator unless you can completely control the location. But a redirect is not necessarily bad. You probably know about PURL. If not, check out purl.org. The downside here is one or both of:
1) The URL is not of your own choosing - it includes purl.org
2) The redirect is always required since the model is to use a URL that is highly unlikely to change (such as one based on purl.org) that always redirects to the current location.

ARK is a bit nicer in my mind because
1) The URL is always of your choosing, within the constraints of ARK syntax.
2) Redirection is only necessary if the artifact moves in such a way that you can't fix with a change to DNS binding.

For example, look at this URL from my institution. It is not based on ARK, but we have created our own ARK-like model. And we might decide to actually move to ARK which means URLs with pal in them will redirect to the ARK with the exact same identifier, possibly changing only in terms of the subnamespaces delimited by period (as in delimiting with - or /).

https://familysearch.org/pal:/MM9.1.1/KZX1-6KK

But as it is, we can configure traffic management so that depending on the request headers set for a given URL, the actual resource retrieved can be much different. And if we decide to put a subdomain on the domain name part, that's a simple redirect operation.

Hope this helps.

tc


Jason Ronallo

Mar 30, 2013, 2:34:27 PM

to digital-...@googlegroups.com

OK, so we know folks are generally bad about maintaining their own
URLs. Are we certain that folks are any better at updating their DOIs
or handles? Does it matter whether a resolver is maintained internally
or by an external organization? All this takes human intervention, so
it seems reasonable to think that there will always be some breakage
somewhere.

I wonder if, since we know that humans aren't good at maintaining
persistent URLs, we ought to think of another way to get what
we need.

So in the original question you're being forced to change your URLs.
You're afraid that maintaining Apache config will not last. But if you
do change your URLs you'll do the right thing and return the
appropriate header that the resource has been permanently moved. If
the Apache config is forgotten about then the old links are broken. So
how do we maintain these old links without someone having to remember?

This seems like a case for LAMs crawling other LAMs. Let's crawl each
other and create a store of URLs and their headers. This could then
allow us to know when a resource has changed URLs. Or there could be a
central crawler which keeps this data and distributes it around. A
search interface could allow folks to find information on URLs that
used to exist.

As long as folks just do the right thing in a Web way for their
immediate needs (proper headers and all), then we could have a good
fallback with the crawl data. While this is messier than the tidy
world of everyone maintaining Cool URIs or remembering to update
resolvers for their opaque URLs, I wonder whether, if we just accepted
that there's bound to be breakage, we might think about how
to deal with that issue instead. I think keeping a record of crawl
data would be one way to allow the researchers of the future to find
the resources they're citing and linking today. While it'd be better
if we regularly crawled each other and kept the full crawl data (the
HTML and all the linked resources), maybe just keeping the headers
would allow us to provide some services and to scale up better.
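A minimal sketch of the headers-only idea, assuming each crawl record keeps just the status code and the Location header. The record layout, the `current_location` helper, and the http scheme on the example URLs are assumptions for illustration, not an existing crawler's format.

```python
def current_location(crawl_records, url, max_hops=10):
    """Follow recorded permanent redirects from url to the last known location."""
    seen = {url}
    for _ in range(max_hops):
        record = crawl_records.get(url)
        if record is None or record["status"] not in (301, 308):
            return url  # no recorded move: this is the latest location we know of
        url = record["location"]
        if url in seen:
            raise ValueError("redirect loop in crawl records")
        seen.add(url)
    raise ValueError("too many redirect hops")


# Hypothetical records, as a crawler might have stored them while the redirect was live.
records = {
    "http://libraries.ucsd.edu/digital/UNIQUEID": {
        "status": 301,
        "location": "http://library.ucsd.edu/digital/UNIQUEID",
    },
    "http://library.ucsd.edu/digital/UNIQUEID": {"status": 200, "location": None},
}
found = current_location(records, "http://libraries.ucsd.edu/digital/UNIQUEID")
```

The lookup still works after the original server stops sending the redirect, as long as some crawl recorded it first.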

Efforts like the Common Crawl [1] may one day help us to do this for a
reasonable cost too.

I don't know if it would be feasible or actually solve enough of a
problem, but I just wonder if there's not another way out of this
problem somewhere in here. I'd be interested in your thoughts.

Jason

[1] http://commoncrawl.org/

John A. Kunze

Mar 30, 2013, 7:05:58 PM

to digital-...@googlegroups.com

I think there are several very interesting themes woven into this thread,
and it's nice to see Common Crawl brought into the discussion.

Right now I'll only focus on Declan's original question about picking a
stable URL "root". I happen to know that Declan is very identifier-savvy,
and figured that advice about other schemes (ironically, all URL-based!)
wouldn't be germane. Assuming URL-space, how best to stabilize the root?

For starters, Declan has my sympathy. Rebranding of URLs is a conscious
decision to put large numbers of URLs at risk. Most organizations that
we all work for decide to do this every so often, and we have to do
our best to mitigate the fallout. Many of us are struggling to raise
awareness in our organizations about the importance of picking a URL
root string (host and possibly initial path part) that will be stable
(eg, not subject to political pressure). It sounds like Harvard, with
nrs.harvard.edu, has done a pretty reasonable job with this.

A multi-pronged approach might work. I'd push within the organization
for a host name brand that's as neutral as possible. "library.ucsd.edu"
doesn't seem so bad, but what were the political pressures that made the
old name imperative then and the new name imperative now? Organizations
tend to oscillate between centralized and de-centralized structuring
models at some periodicity, so of course the pendulum could swing back
again. The theoretical extreme of pure neutrality (ie, no brand at all)
can be achieved via an opaque root, but that goes against the natural
organizational urge to get its brand "out there".

Sometimes a third-party brand helps with neutrality -- eg, dx.doi.org,
purl.org, or n2t.net. However, neutrality works against organizations
that need to push their brand, so some of them will trade future
redirection table maintenance in order to publish URLs, ARKs, Handles,
etc. bearing their hostname (eg, local Handle resolver). In the DOI
world, this need is commonly expressed by inserting the organizational
brand into the DOI itself (ie, into the path part of the embedding URL),
with subsequent serious challenges to its long-term maintenance.

It sounds like you (Declan) don't have a choice about the beginning part
of your new root, which will be library.ucsd.edu. My guess is you'll
want to avoid colliding with other names in the server document root and
so you might just keep everything under the "digital/UNIQUEID" pattern.
If so, and if libraries.ucsd.edu will be a CNAME for library.ucsd.edu,
you might not have any Apache rewrite rules to do. OTOH, if you're
having second thoughts about "digital" (it's so '90s :-), just make it
something like "d" for brevity, opacity, and collision protection, and do

Alias /digital/ /usr/local/apache/htdocs/d/

Since the change is a deliberate break with the past, technology only goes
so far to mitigate things. So, eg, at least part of the approach will be
preparing users with advance notice of the change in the published URLs
starting at your cutover date. You could add an Apache rule making the
legacy URLs accessible via whatever new root pattern you settle on.
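A minimal sketch of the mapping such a rule would express, written in Python so it can be tested before any server config is touched. It assumes the host change described in this thread plus the hypothetical move from "/digital/" to "/d/"; `rewrite_legacy` is a made-up helper name.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "libraries.ucsd.edu"
NEW_HOST = "library.ucsd.edu"
OLD_PREFIX = "/digital/"
NEW_PREFIX = "/d/"  # assumption: the shorter root suggested above


def rewrite_legacy(url):
    """Return the new URL for a legacy link, or None if the URL is not legacy."""
    parts = urlsplit(url)
    if parts.hostname != OLD_HOST:
        return None
    path = parts.path
    if path.startswith(OLD_PREFIX):
        path = NEW_PREFIX + path[len(OLD_PREFIX):]
    # Keep the scheme, query string and fragment; swap only host and path prefix.
    return urlunsplit((parts.scheme, NEW_HOST, path, parts.query, parts.fragment))


target = rewrite_legacy("http://libraries.ucsd.edu/digital/UNIQUEID")
```

Running a list of known legacy links through a function like this is one way to check the redirect table before the cutover date.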

-John
