In theory, the domain name space owner owns the domain name space and therefore all URIs in it. Except insolvency, nothing prevents the domain name owner from keeping the name. And in theory the URI space under your domain name is totally under your control, so you can make it as stable as you like. Pretty much the only good reason for a document to disappear from the Web is that the company which owned the domain name went out of business or can no longer afford to keep the server running. Then why are there so many dangling links in the world? Part of it is just lack of forethought. Here are some reasons you hear out there:
That I can sympathize with - the W3C went through a period like that, when we had to carefully sift archival material for confidentiality before making the archives public. The solution is forethought - make sure you capture with every document its acceptable distribution, its creation date and ideally its expiry date. Keep this metadata.
This is one of the lamest excuses. A lot of people don't know that servers such as Apache give you a lot of control over a flexible relationship between the URI of an object and where a file which represents it actually is in a file system. Think of the URI space as an abstract space, perfectly organized. Then, make a mapping onto whatever reality you actually use to implement it. Then, tell your server. You can even write bits of your server to make it just right.
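The idea of an abstract URI space mapped onto a messy reality can be sketched in a few lines. This is only an illustration, not how Apache or any particular server does it: the `MOUNTS` table, the directory names and the `uri_to_file` function are all hypothetical.

```python
import posixpath

# Hypothetical rules: public URI prefix -> directory where the files live today.
# Only this table changes when files move; the public URIs do not.
MOUNTS = [
    ("/1998/", "/disk2/old-site/archive98/"),
    ("/", "/var/www/current/"),
]


def uri_to_file(uri_path, mounts=MOUNTS):
    """Map a public URI path onto the current file layout, or return None."""
    # Collapse "." and ".." segments so a request cannot escape the mapped tree.
    clean = posixpath.normpath("/" + uri_path.lstrip("/"))
    # Try the longest prefix first so specific rules win over general ones.
    for prefix, directory in sorted(mounts, key=lambda m: len(m[0]), reverse=True):
        if (clean + "/").startswith(prefix):
            return directory + clean[len(prefix):]
    return None
```

If the 1998 archive is later moved to another disk, only the first entry in the table needs to change, and every `/1998/...` URI keeps working.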
There is a crazy notion that pages produced by scripts have to be located in a "cgibin" or "cgi" area. This is exposing the mechanism of how you run your server. You change the mechanism (even keeping the content the same) and whoops - all your URIs change.
the main page for starting to look for documents, is clearly not going to be something to trust to being there in a few years. "cgi-bin" and "oldbrowse" and ".pl" all point to bits of how-we-do-it-now. By contrast, if you use the page to find a document, you get first an equally bad
Looking at this one, the "pubs/1998" header is going to give any future archive service a good clue that the old 1998 document classification scheme is in progress. Though in 2098 the document numbers might look different, I can imagine this URI still being valid, and the NSF or whatever carries on the archive not being at all embarrassed about it.
This is probably one of the worst side-effects of the URN discussions. Some seem to think that because there is research about namespaces which will be more persistent, they can be as lax about dangling links as they like, as "URNs will fix all that". If you are one of these folks, then allow me to disillusion you.
Most URN schemes I have seen look something like an authority ID followed by either a date and a string you choose, or just a string you choose. This looks very like an HTTP URI. In other words, if you think your organization will be capable of creating URNs which will last, then prove it by doing it now and using them for your HTTP URIs. There is nothing about HTTP which makes your URIs unstable. It is your organization. Make a database which maps document URN to current filename, and let the web server use that to actually retrieve files.
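A database of this kind need not be elaborate. The sketch below uses Python's built-in `sqlite3` module; the table layout, the function names and the example identifiers are assumptions made for illustration, not a description of any existing server.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) the catalog that maps persistent URIs to filenames."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "uri TEXT PRIMARY KEY, filename TEXT NOT NULL)"
    )
    return db


def register(db, uri, filename):
    """Record where the document for `uri` lives right now."""
    # INSERT OR REPLACE updates the location while the URI itself stays fixed.
    db.execute(
        "INSERT OR REPLACE INTO documents (uri, filename) VALUES (?, ?)",
        (uri, filename),
    )
    db.commit()


def lookup(db, uri):
    """Return the current filename for `uri`, or None if it is unknown."""
    row = db.execute(
        "SELECT filename FROM documents WHERE uri = ?", (uri,)
    ).fetchone()
    return row[0] if row else None
```

A server handler would call `lookup` on each request and serve the file it returns. Reorganizing the file system then means calling `register` again with the new filename, not issuing a new URI.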
Now here is one I can sympathize with. I agree entirely. What you need to do is to have the web server look up a persistent URI in an instant and return the file, wherever your current crazy file system has it stored away at the moment. You would like to be able to store the URI in the file as a check, and constantly keep the database in tune with actuality. You'd like to store the relationships between different versions and translations of the same document, and you'd like to keep an independent record of the checksum to provide a guard against file corruption by accidental error. And web servers just don't come out of the box with these features. When you want to create a new document, your editor asks you for a URI instead of telling you.
Too bad. But we'll get there. At W3C we use Jigedit functionality (the Jigsaw server used for editing), which does track versions, and we are experimenting with document creation scripts. If you make tools, servers and clients, take note!
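The independent checksum record mentioned above is one of the easier features to approximate by hand. The following is a minimal sketch, assuming SHA-256 as the checksum and a plain dictionary as the record; both choices, and the function names, are illustrative only.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a hex digest of the document's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def make_record(uri: str, data: bytes) -> dict:
    """Build the record kept apart from the file: its URI and its checksum."""
    return {"uri": uri, "sha256": checksum(data)}


def verify(record: dict, data: bytes) -> bool:
    """Check that the bytes on disk still match what was recorded."""
    return checksum(data) == record["sha256"]
```

The record would be stored separately from the file, so that accidental corruption of the file cannot also damage the checksum used to detect it.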
When you change a URI on your server, you can never completely tell who will have links to the old URI. They might have made links from regular web pages. They might have bookmarked your page. They might have scrawled the URI in the margin of a letter to a friend.
URIs change when there is some information in them which changes. It is critical how you design them. (What, design a URI? I have to design URIs? Yes, you have to think about it.) Designing mostly means leaving information out.
The creation date of the document - the date the URI is issued - is one thing which will not change. It is very useful for separating requests which use a new system from those which use an old system. That is one thing with which it is good to start a URI. If a document is in any way dated, even though it will be of interest for generations, then the date is a good starter.
is the latest "Money daily" column in "Money" magazine. The main reason for not needing the date in this URI is that there is no reason for the persistence of the URI to outlast the magazine. The concept of "today's Money" vanishes if Money goes out of production. If you want to link to the content, you would link to it where it appears separately in the archives as
(Looks good. Assumes that "money" will mean the same thing throughout the life of pathfinder.com. There is a duplication of "98" and an ".html" you don't need but otherwise this looks like a strong URI).
I'll go into this danger in more detail as it is one of the more difficult things to avoid. Typically, topics end up in URIs when you classify your documents according to a breakdown of the work you are doing. That breakdown will change. Names for areas will change. At W3C we wanted to change "MarkUp" to "Markup" and then to "HTML" to reflect the actual content of the section. Also, beware that this is often a flat name space. In 100 years are you sure you won't want to reuse anything? We wanted to reuse "History" and "Stylesheets" for example in our short life.
This is a tempting way of organizing a web site - and indeed a tempting way of organizing anything, including the whole web. It is a great medium term solution but has serious drawbacks in the long term.
Part of the reason for this lies in the philosophy of meaning. Every term in the language is a potential clustering subject, and each person can have a different idea of what it means. Because the relationships between subjects are web-like rather than tree-like, even people who agree on a web may pick a different tree representation. These are my (oft repeated) general comments on the dangers of hierarchical classification as a general solution.
A reason for using a topic area as part of the URI is that responsibility for sub-parts of a URI space is typically delegated, and then you need a name for the organizational body - the subdivision or group or whatever - which has responsibility for that sub-space. This is binding your URIs to the organizational structure. It is typically safe only when protected by a date further up the URI (to the left of it): 1998/pics can be taken to mean for your server "what we meant in 1998 by pics", rather than "what in 1998 we did with what we now refer to as pics."
Remember that this applies not only to the "path" part of a URI but to the server name. If you have separate servers for some of your stuff, remember that that division will be impossible to change without destroying many, many links. Some classic "look what software we are using today" domain names are "cgi.pathfinder.com", "secure", "lists.w3.org". They are made to make administration of the servers easier. Whether it represents divisions in your company, or document status, or access level, or security level, be very, very careful before using more than one domain name for more than one type of document. Remember that you can hide many web servers inside one apparent web server using redirection and proxying.
Oh, and do think about your domain name. If your name is not soap, will you want to be referred to as "soap.com" even when you have switched your product line to something else? (With apologies to whoever owns soap.com at the moment).
Keeping URIs so that they will still be around in 2, 20 or 200 or even 2000 years is clearly not as simple as it sounds. However, all over the Web, webmasters are making decisions which will make it really difficult for themselves in the future. Often, this is because they are using tools whose task is seen as presenting the best site in the moment, and no one has evaluated what will happen to the links when things change. The message here is, however, that many, many things can change and your URIs can and should stay the same. They only can if you think about how you design them.
If you are using, for example, Apache, you can set it up to do content negotiation. You keep the file extension (such as .png) on the file (e.g. mydog.png), but refer to the web resource without it. Apache then checks the directory for all files with that name and any extension, and it can also pick the best one out of a set (e.g. GIF and PNG). (You do not have to put different types of file in different directories, in fact the content negotiation won't work if you do.)
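The selection step can be pictured with a small sketch. This is not Apache's actual algorithm, which also weighs quality values, languages and encodings. It only shows the principle of matching an extensionless resource name against the files present; the `EXT_TYPES` table and the `negotiate` function are hypothetical.

```python
# Hypothetical table of known extensions and their media types.
EXT_TYPES = {
    "png": "image/png",
    "gif": "image/gif",
    "html": "text/html",
}


def negotiate(resource, filenames, accepted_types):
    """Pick the file for `resource` whose media type the client prefers most.

    `accepted_types` is the client's list of media types, most preferred first.
    """
    candidates = {}
    for name in filenames:
        stem, dot, ext = name.rpartition(".")
        # Only files named "<resource>.<known extension>" are variants of it.
        if dot and stem == resource and ext in EXT_TYPES:
            candidates[EXT_TYPES[ext]] = name
    for media_type in accepted_types:
        if media_type in candidates:
            return candidates[media_type]
    return None
```

With `mydog.png` and `mydog.gif` in the same directory, a request for `mydog` from a client that lists `image/gif` first would get the GIF, while a client that accepts only `image/png` would get the PNG. The URI `mydog` stays the same either way.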
During 1999, I found a page documenting school closings due to snow. An alternative to waiting for them to scroll past the bottom of the TV screen! I put a pointer to it from my home page. Come the first big storm of 2000, and I check the page. It says,
One of the smarts which came with a growing dependency on the web was that applications could have built-in links back to the manufacturer's web site. This has been used and abused to a great extent, but - you do have to keep the URL the same. Just the other day I tried a link from Microsoft's Netmeeting 2/something client under a menu "Help/Microsoft on the Web/Free stuff" and got an Error 404 - not found response from the server. They have probably fixed it by now...