I think there are several very interesting themes woven into this thread,
and it's nice to see Common Crawl brought into the discussion.
Right now I'll only focus on Declan's original question about picking a
stable URL "root". I happen to know that Declan is very identifier-savvy,
and figured that advice about other schemes (ironically, all URL-based!)
wouldn't be germane. Assuming URL-space, how best to stabilize the root?
For starters, Declan has my sympathy. Rebranding of URLs is a conscious
decision to put large numbers of URLs at risk. Most organizations that
we all work for decide to do this every so often, and we have to do
our best to mitigate the fallout. Many of us are struggling to raise
awareness in our organizations about the importance of picking a URL
root string (host and possibly initial path part) that will be stable
(eg, not subject to political pressure). It sounds like Harvard, with
nrs.harvard.edu, has done a pretty reasonable job with this.
A multi-pronged approach might work. I'd push within the organization
for a host name brand that's as neutral as possible. "library.ucsd.edu"
doesn't seem so bad, but what were the political pressures that made the
old name imperative then and the new name imperative now? Organizations
tend to oscillate between centralized and de-centralized structuring
models at some periodicity, so of course the pendulum could swing back
again. The theoretical extreme of pure neutrality (ie, no brand at all)
can be achieved via an opaque root, but that goes against the natural
organizational urge to get its brand "out there".
Sometimes a third-party brand helps with neutrality -- eg, dx.doi.org,
purl.org, or n2t.net. However, neutrality works against organizations
that need to push their brand, so some of them will accept the burden of
future redirection table maintenance in order to publish URLs, ARKs, Handles,
etc. bearing their own hostname (eg, via a local Handle resolver). In the DOI
world, this need is commonly expressed by inserting the organizational
brand into the DOI itself (ie, into the path part of the embedding URL),
with subsequent serious challenges to its long-term maintenance.
It sounds like you (Declan) don't have a choice about the beginning part
of your new root, which will be library.ucsd.edu. My guess is you'll
want to avoid colliding with other names in the server document root and
so you might just keep everything under the "digital/UNIQUEID" pattern.
If so, and if libraries.ucsd.edu will be a CNAME for library.ucsd.edu,
you might not have any Apache rewrite rules to do. OTOH, if you're
having second thoughts about "digital" (it's so '90s :-), just make it
something like "d" for brevity, opacity, and collision protection, and do
Alias /digital/ /usr/local/apache/htdocs/d/
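
If it helps, here is a rough sketch of how the CNAME and the Alias might sit
together in one virtual host. It assumes the document root is
/usr/local/apache/htdocs (inferred from the Alias path above), that the server
listens on port 80, and that both host names are answered by the same Apache
instance. None of that is confirmed, so adjust to your setup.

```apache
# Sketch only: port, document root, and vhost layout are assumptions.
<VirtualHost *:80>
    # New canonical name, with the old name answering as well (via the CNAME).
    ServerName  library.ucsd.edu
    ServerAlias libraries.ucsd.edu

    DocumentRoot "/usr/local/apache/htdocs"

    # Content lives under /d/; legacy /digital/ paths map to the same files.
    Alias "/digital/" "/usr/local/apache/htdocs/d/"
</VirtualHost>
```

Arranged this way, /d/UNIQUEID and /digital/UNIQUEID should return the same
content under either host name, with no redirect involved.
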
Since the change is a deliberate break with the past, technology only goes
so far to mitigate things. So, eg, at least part of the approach will be
preparing users with advance notice of the change in the published URLs
starting at your cutover date. You could also add an Apache rule so that the
legacy URLs keep working by mapping them onto whatever new root pattern you
settle on.
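
One way such a rule could look, using RedirectMatch from mod_alias. The "/d/"
target and the plain http scheme are placeholders of mine, standing in for
whatever new root pattern you actually choose.

```apache
# Sketch only: "/d/" stands in for whatever new root pattern is chosen.
# Sends any legacy /digital/... request to the same path under the new root,
# with a 301 so that clients and crawlers learn the new location.
RedirectMatch permanent "^/digital/(.*)$" "http://library.ucsd.edu/d/$1"
```

This is an alternative to serving the old path with Alias, not a companion to
it: mod_alias processes redirects before aliases, so with both in place the
redirect wins.
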
-John