The idea of movage echoes the architecture of the Oxford archive: every
layer/column of the storage platform is very loosely coupled to the other
parts, as we plan for parts to be superseded and for content to migrate
naturally and as needed - note 'needed' rather than 'planned' migration. Our
working assumption is that things need to be kept 'in motion' - checked,
characterised, always ready to be moved.
We are aiming to preserve access to materials by storing a canonical
version, and to maintain a dissemination version (or more as required).
The canonical version is as similar to the original as possible - archiving
the explicit knowledge (TIFF images of pages, PDFs, LaTeX, audio files)
alongside the implicit (file format characterisation, who/what/where the
image was taken, text/data-mined metadata, etc.).
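To make that a little more concrete, here is a rough sketch (in Python; the
function names, the sidecar filename convention and the record fields are all
my own invention, not anything we've standardised) of capturing the implicit
knowledge as a plain JSON record stored next to each canonical file - a
checksum for fixity, a guessed MIME type standing in for proper format
characterisation, and room for whatever gets mined later:

```python
import hashlib
import json
import mimetypes
from pathlib import Path


def characterise(path: Path) -> dict:
    """Build an illustrative sidecar record for a canonical file.

    A real archive would call a proper characterisation tool here;
    this just captures the obvious, cheaply computable facts.
    """
    data = path.read_bytes()
    mime, _ = mimetypes.guess_type(path.name)
    return {
        "filename": path.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "mime_type": mime or "application/octet-stream",
        # implicit knowledge gathered later (EXIF, OCR text, etc.)
        "derived_metadata": {},
    }


def write_sidecar(path: Path) -> Path:
    """Write the record as <file>.meta.json beside the canonical file."""
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps(characterise(path), indent=2))
    return sidecar
```

The point of keeping it as loose JSON next to the file, rather than bound
into a package format, comes back to the loose-coupling argument above.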
The dissemination version(s) are user-driven - they can be as simple as a
verbatim document viewer, a "page turning app" or whatever, but they can be
more useful - OCRed text from a scanned page, rendered to the user with key
sentences and keywords drawn in a larger font, or direct links to eJournals
inserted on top of where citations are referenced in a journal article.
The SPIDER project (
http://imageweb.zoo.ox.ac.uk/wiki/index.php/Spider_Project)
shows what can be achieved by hand, and IMO there is good scope for
automated improvements to paginated content.
As this is still early days, I am not sure whether or not to keep
dissemination copies. If they are referenced, are we obligated to maintain
them? Or should we take the OS versioning strategy and keep the last known
good versions of the numbered releases - only keeping the older
disseminations if a substantial change was made for a new one. [A point
release (0.3 -> 0.4) indicates that something was fixed or that the viewing
capabilities were extended, but a full increment (1.2 -> 2.0) indicates that
the new dissemination is substantially different from the last.]
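That retention rule is simple enough to state as code - a minimal sketch,
assuming "MAJOR.MINOR" version strings (the function name is mine):

```python
def keep_old_dissemination(old: str, new: str) -> bool:
    """Retention rule sketched above: keep the superseded dissemination
    only when the new one is a full increment (major version bump),
    i.e. substantially different. After a mere point release, the old
    copy can be discarded.

    Versions are "MAJOR.MINOR" strings, e.g. "1.2".
    """
    old_major = int(old.split(".")[0])
    new_major = int(new.split(".")[0])
    return new_major > old_major


# keep_old_dissemination("0.3", "0.4") -> False  (point release, discard)
# keep_old_dissemination("1.2", "2.0") -> True   (full increment, keep)
```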
In some cases, the dissemination copies can be made on the fly and cached
(TIFF/JPEG 2000 -> lower-res JPGs/PNGs), as the ratio of total imagery
stored to the amount that will ever be used is very top-heavy :)
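The generate-on-demand-and-cache pattern might look something like the
following sketch. The actual TIFF/JP2 -> JPG conversion is stubbed out as a
`convert` callable, since that step would depend on whichever imaging
library you use; only the cache logic is shown, and all the names are mine:

```python
from pathlib import Path
from typing import Callable


def cached_derivative(master: Path, cache_dir: Path,
                      convert: Callable[[bytes], bytes],
                      suffix: str = ".jpg") -> Path:
    """Return a cached dissemination copy, generating it on first request.

    `convert` stands in for the real TIFF/JP2 -> low-res JPG/PNG step.
    Because most masters are never requested at all, nothing is derived
    until someone actually asks for it.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    derived = cache_dir / (master.stem + suffix)
    if not derived.exists():
        derived.write_bytes(convert(master.read_bytes()))
    return derived
```

A second request for the same master hits the cache and skips the
(expensive) conversion entirely.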
In other cases, where the main use is serendipitous reuse, such as text-
and data-mining, the benefit comes from allowing users immediate access to
this body of derived information, rather than generating it on demand.
Apologies for the inarticulate post, but hey, it's Xmas, and I really just
wanted to add that preservation is an extremely active process, exactly as
was stated in the previous email - your system has to be ready to change at
the drop of a hat.
Oh and one last thing - PREMIS and standardising METS profiles for
interchange of digital items is all well and good, but the elephant in the
room is the legal and financial issues - legal depts will disallow transfers
based on fear of infringement, and accountants will want to charge someone
somehow. So, my yuletide message is: standardise for yourselves, write down
all the implicit information, and try not to tie too much information into
the packaging. My acid test is to pass the files for an item, in a zip
archive, to a colleague/peer - someone with little idea of what you are
dealing with or the standards involved - and if they can work out what all
the bits are with the help of Google, then you've been successful.
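For what it's worth, that acid-test package needs nothing beyond the
standard library - a rough sketch (function name and README layout are my
own, not any standard) that zips an item's files alongside a plain-text
note carrying the implicit information in prose:

```python
import zipfile
from pathlib import Path


def package_item(files: list[Path], zip_path: Path, notes: str) -> None:
    """Zip up an item's files together with a plain-text README.

    The README carries the implicit information in ordinary prose, so a
    peer who knows nothing of our standards can still work out what the
    bits are.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        lines = [notes, "", "Contents:"]
        for f in files:
            zf.write(f, arcname=f.name)
            lines.append(f"  {f.name} ({f.stat().st_size} bytes)")
        zf.writestr("README.txt", "\n".join(lines))
```

If the recipient can make sense of the result with nothing but the README
and Google, the packaging has passed the test.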
Merry Xmas all!
Ben O'Steen
2008/12/23 Ed Summers <ed.su...@gmail.com>: