Sounds like you're thinking of categorical (social science, survey,
etc.) data? I know there have been some efforts to build repositories
with some successes and some crashes and burns (the promising but now
defunct Google Research Data project is a good example there).
I'm sure others here know more than I do about this.
best, Joe
--
Joseph Lorenzo Hall
ACCURATE Postdoctoral Research Associate
UC Berkeley School of Information
Princeton Center for Information Technology Policy
http://josephhall.org/
Hello Group,
My name is Alexis Madrigal and I'm a science reporter with Wired.com.
I'm working on a story about specific areas where the Obama
administration could make scientific data from the USDA, Minerals
Management Service, DOE, NIH, and other agencies more available and/or
accessible. We're talking not just about having it online, but also in
--
Rick
cell: 703-201-9129
web: http://www.rickmurphy.org
blog: http://phaneron.rickmurphy.org
> (As to why they aren't running a redirector on the current
> whitehouse.gov site that accesses 43.archive.whitehouse.gov for
> 404s .... well, nobody asked me. :))
>
> Carl
Carl,
My response to that is a little more "techy" and lower-level than is
customary on this list, but I hope you and others like it anyway.
Every time a static page on whitehouse.gov or a similar site is updated,
the service at that host should generate:
1) A stable URL for that version of that page, valid for some
time period (I'd say 30 years but even 30 days could work).
The site MUST NOT ever re-use these stable URLs: it
can take down the content after some period of time but
must not re-use the URL.
2) Meta-data on that page (e.g., using something like RDFa)
that includes checksums and perhaps a signature on the payload.
3) An RSS item announcing the publication and its stable URL.
4) Optionally: versioning meta-data relating it to previous
publications (e.g., "THIS replaces THAT" or "THIS combines
THAT and THAT OTHER THING"). Other optional meta-data such
as authorship.
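To make that concrete, here is a rough sketch of what steps (1)..(3)
might look like for a single page update. This is only an illustration
in Python, not anything currently deployed; the URL layout, the
checksum-in-description convention, and the function name are all made
up for the example, not a spec.

import hashlib
from datetime import datetime, timezone
from email.utils import format_datetime

def publish_version(site, path, html_payload):
    """Produce a stable versioned URL, checksum meta-data, and an RSS item."""
    now = datetime.now(timezone.utc)
    version = now.strftime("%Y%m%dT%H%M%SZ")

    # (1) A stable URL for this version of this page; never re-used.
    stable_url = f"https://{site}/versions/{version}{path}"

    # (2) A checksum for the payload; in practice this would be embedded
    #     in the page itself as RDFa meta-data and perhaps signed.
    sha256 = hashlib.sha256(html_payload.encode("utf-8")).hexdigest()

    # (3) An RSS item announcing the publication and its stable URL.
    rss_item = f"""<item>
  <title>Updated: {path}</title>
  <link>{stable_url}</link>
  <guid isPermaLink="true">{stable_url}</guid>
  <pubDate>{format_datetime(now)}</pubDate>
  <description>sha256:{sha256}</description>
</item>"""
    return stable_url, sha256, rss_item

Step (4), the versioning meta-data, would just be further elements in
the same RSS item or RDFa block.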
Given those technically simple steps, it is no longer necessary to
archive those sites by spidering and hoping for meaningful snapshots.
An archivist can simply read off the RSS feed and collect the relevant
page snapshots from their stable URLs, using spidering mainly to
validate the archive.
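A corresponding archivist-side sketch, again purely illustrative and
assuming the feed and checksum conventions from the publisher sketch
above (the feed URL is hypothetical):

import hashlib
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urlparse

FEED_URL = "https://example.gov/updates.rss"   # hypothetical feed location
ARCHIVE_ROOT = Path("archive")

def archive_feed(feed_url=FEED_URL):
    # Read the RSS feed rather than spidering the whole site.
    with urllib.request.urlopen(feed_url) as resp:
        tree = ET.parse(resp)
    for item in tree.iterfind(".//item"):
        stable_url = item.findtext("link")
        expected = item.findtext("description", "")   # e.g. "sha256:<hex>"

        with urllib.request.urlopen(stable_url) as resp:
            payload = resp.read()

        # Validate against the published checksum where one is present.
        digest = "sha256:" + hashlib.sha256(payload).hexdigest()
        if expected.startswith("sha256:") and digest != expected:
            print("checksum mismatch, skipping:", stable_url)
            continue

        # Save under the stable URL's own path so the archive can later be
        # served from a new site with the same relative structure.
        local = ARCHIVE_ROOT / urlparse(stable_url).path.lstrip("/")
        local.parent.mkdir(parents=True, exist_ok=True)
        local.write_bytes(payload)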
An archivist can save the documents by using their stable URLs as
relative URLs on a new site.
A "redirector" is then a generic thing. A single redirector can be
applied to *any* site that constructs such an archive.
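One way such a generic redirector might look, sketched as a tiny WSGI
app. The archive host name is carried over from Carl's example; in a
real deployment the live site would hand requests to this only for
paths it no longer serves, i.e. on what would otherwise be 404s.

from wsgiref.simple_server import make_server

ARCHIVE_HOST = "https://43.archive.whitehouse.gov"   # any archive host works

def redirector(environ, start_response):
    """Redirect any request to the same path on the configured archive."""
    target = ARCHIVE_HOST + environ.get("PATH_INFO", "/")
    start_response("302 Found", [("Location", target)])
    return [b""]

if __name__ == "__main__":
    make_server("", 8000, redirector).serve_forever()

Nothing in it is specific to whitehouse.gov; point ARCHIVE_HOST at any
site that constructs such an archive and it works the same way.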
One valuable contribution of taking these steps, earlier rather than
later, is arguably this:
The resulting form of archive is easy for non-technical people to
understand. Anyone who can understand the concept of the Federal
Register can understand this form of archive. This form of archiving
not only creates an accurate record of how these sites change over time,
it also reifies, in a human-friendly way, the form and function of an
archive.
Government communications to the public should be idealized as a kind of
"journaling" / "write-once" database / file-system with meta-data
sensitively designed for archival needs. Making that ideal real for
the web sites is well within reach, roughly along the lines of what I
described in (1)..(4) above.
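The "write-once" part of that ideal is small enough to sketch as well;
a purely illustrative publish step that refuses to overwrite an
existing stable path (the function name and layout are made up):

from pathlib import Path

def write_once(journal_root, stable_path, payload):
    """Append-only store: a stable path may be written exactly once."""
    target = Path(journal_root) / stable_path.lstrip("/")
    if target.exists():
        raise FileExistsError(f"stable path already published: {stable_path}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)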
As a technical matter, what I've described can likely be implemented in
a layered fashion without the need to substantially disrupt the content
management systems currently used.
Regards,
-t