movage

Ed Summers

Dec 23, 2008, 9:18:16 AM
to Digital Curation
We recently had a barcamp style event [1] here at the Library of
Congress where Ryan McKinley [2] (Solr developer/hacker
extraordinaire) showed up. Some of us working on digital preservation/
curation/repository stuff at LC got talking with him over beers, not
about search, but about what it is our group is trying to do at LC.
Somewhere in the conversation he mentioned that a friend of his had
written recently about something he called 'movage' [3].

"""
The only way to archive digital information is to keep it moving. I
call this movage instead of storage. Proper movage means transferring
the material to current platforms on a regular basis -- that is,
before the old platform completely dies, and it becomes hard to do.
This movic rhythm of refreshing content should be as smooth as a
respiratory cycle -- in, out, in, out. Copy, move, copy, move.
"""

It seems counter-intuitive that moving data could lead to better
preservation. The common wisdom I've heard is that the more we touch stuff,
the more likely we are to corrupt it and thwart preservation, so as a
consequence we should try not to touch it. In some ways I guess Kevin's term
'movage' conflates moving with format migration... so maybe these ideas
aren't really opposed?
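
To make the 'copy, move, copy, move' cycle concrete, here is a minimal
sketch in Python (purely illustrative - not anything LC actually runs) of a
move that checks a SHA-256 fixity value before and after the copy, so the
movage step itself can't silently corrupt the bits:

    import hashlib
    import shutil
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Compute the SHA-256 digest of a file, reading in chunks."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def move_with_fixity(src: Path, dst: Path) -> str:
        """Copy src to dst, check the digests match, then retire the original."""
        before = sha256(src)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        after = sha256(dst)
        if before != after:
            raise IOError(f"fixity mismatch while moving {src} -> {dst}")
        src.unlink()  # only remove the old copy once the new one is verified
        return after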

The really fascinating thing about this piece is actually the
comments. One that stood out was Jim Thomas' analogy between 'movage'
and preservation of life forms in gene-banks:

"""
Here's a thought: Your distinction of storage and movage echoes a live
debate in the world of genetic conservation. Institutional attempts to
conserve genetic diversity (eg by the gene-banks of the CGIAR) often
revolve around so called ex-situ conservation - basically storage of
seeds in large freezers. Farmers have argued that the only way to
conserve seeds is to use and replicate them in the field every year
(in situ conservation) - movage if you like.

Sure enough a worrying proportion of seeds in gene banks lose their
viability and won’t plant out after a few years. One reason may be
because the environment itself is changing (just as the computer/
software environment is changing for digital media) but also because
of physical degradation (just like those CDs).
"""

However, it seems that more than 'movage', Jim is really talking here about
'usage'. I'm kind of new to the digital preservation/curation arena, and I
was wondering if anyone has written before about the connection between
digital preservation and usage. By usage I don't mean format migration, but
actual people using the bits that we are attempting to preserve.

//Ed

[1] http://barcamp.org/SearchCampDC
[2] http://www.squid-labs.com/people/ryan.html
[3] http://www.kk.org/thetechnium/archives/2008/12/movage.php

Ben O'Steen

Dec 23, 2008, 10:20:01 AM
to digital-...@googlegroups.com
The idea of movage echoes the kind of architecture for the Oxford archive;
every layer/column of the storage platform is very loosely coupled to the
other parts, as we plan for parts to be superseded and for content to
migrate naturally and as needed - note 'needed' rather than 'planned'
migration. It's our guess that things need to be kept 'in motion' - checked,
characterised, always ready to be moved.
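
A rough sketch of what 'kept in motion' could look like (illustrative only;
the manifest.json convention and layout here are invented, not our actual
store): a periodic audit that recomputes each file's checksum against the
value recorded at ingest and notes a lightweight characterisation, so
anything that drifts gets flagged for migration rather than discovered years
later:

    import hashlib
    import json
    import mimetypes
    from pathlib import Path

    def audit_item(item_dir: Path) -> list[str]:
        """Recheck every file in an item against its recorded manifest.

        Assumes a manifest.json of {relative_path: sha256} written at ingest
        (a made-up convention for this sketch). Returns the paths that failed.
        """
        manifest = json.loads((item_dir / "manifest.json").read_text())
        failures = []
        for rel_path, expected in manifest.items():
            path = item_dir / rel_path
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            mime, _ = mimetypes.guess_type(path.name)  # lightweight characterisation
            if digest != expected:
                failures.append(rel_path)
            status = "OK" if digest == expected else "CHANGED"
            print(f"{rel_path}: {mime or 'unknown'} {status}")
        return failures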

We are aiming to preserve access to materials by storing a canonical
version, and to maintain a dissemination version (or more as required).

The canonical version is as similar to the original as possible - archiving
the explicit knowledge (TIFF images of pages, PDFs, LaTeX, audio files)
alongside the implicit (file format characterisation, who/what/where the
image was taken, text/data-mined metadata, etc.).
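
By way of illustration only (the file and key names are invented, not our
actual layout), the implicit knowledge might simply travel as a sidecar next
to the canonical files:

    import json
    from pathlib import Path

    def write_sidecar(item_dir: Path, canonical_files: list[str], implicit: dict) -> Path:
        """Write the implicit knowledge as a plain sidecar next to the canonical files.

        The key names (characterisation, provenance, derived) are invented for
        this sketch; the point is only that the implicit knowledge travels with
        the item.
        """
        sidecar = {
            "canonical_files": canonical_files,
            "characterisation": implicit.get("characterisation", {}),  # e.g. format, resolution
            "provenance": implicit.get("provenance", {}),              # who/what/where
            "derived": implicit.get("derived", {}),                    # text/data-mined metadata
        }
        out = item_dir / "implicit.json"
        out.write_text(json.dumps(sidecar, indent=2))
        return out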

The dissemination version(s) are user-driven - they can be as simple as a
verbatim document viewer, a "page turning app" or whatever, but they can be
more useful: OCR'd text from a scanned page, rendered to the user with key
sentences and keywords drawn in a larger font, or direct links to eJournals
inserted on top of where citations are referenced in a journal article. The
SPIDER project (http://imageweb.zoo.ox.ac.uk/wiki/index.php/Spider_Project)
shows what can be achieved by hand, and IMO there is good scope for
automated improvements to paginated content.

As this is still early days, I am not sure whether or not to keep
dissemination copies. If they are referenced, are we obligated to maintain
them? Or should we take the OS versioning strategy and keep the last known
good versions of the numbered releases - only keeping an older
dissemination if a substantial change was made for the new one? [A point
release (0.3 -> 0.4) indicates that something was fixed or an addition was
made to the viewing capabilities, but a full increment (1.2 -> 2.0)
indicates that the new dissemination is substantially different from the
last.]
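
The retention rule I have in mind would be something like this sketch (the
version scheme and names are illustrative only):

    from collections import defaultdict

    def versions_to_keep(versions: list[str]) -> set[str]:
        """Keep only the latest point release within each major version.

        Versions are '<major>.<minor>' strings; a major bump means the
        dissemination is substantially different, so each major line is kept.
        """
        latest = defaultdict(int)
        for v in versions:
            major, minor = (int(part) for part in v.split("."))
            latest[major] = max(latest[major], minor)
        return {f"{major}.{minor}" for major, minor in latest.items()}

    # e.g. versions_to_keep(["0.3", "0.4", "1.2", "2.0"]) -> {"0.4", "1.2", "2.0"}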

In some cases, the dissemination copies can be made on the fly and cached
(TIFF/JPEG2000 -> lower-res JPEGs/PNGs), as the ratio of total imagery
stored to the amount that will ever be used is very top-heavy :)
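
For instance, a low-res dissemination copy could be made lazily and cached
along these lines (a sketch only; it assumes the Pillow imaging library can
decode the master format, and the cache path is invented):

    from pathlib import Path
    from PIL import Image  # Pillow, assumed available for this sketch

    CACHE = Path("/var/cache/dissemination")  # hypothetical cache location

    def dissemination_copy(master: Path, max_px: int = 1024) -> Path:
        """Return a cached low-res PNG for a master image, making it on first use."""
        out = CACHE / f"{master.stem}_{max_px}.png"
        if not out.exists():
            CACHE.mkdir(parents=True, exist_ok=True)
            with Image.open(master) as img:
                img.thumbnail((max_px, max_px))  # shrinks in place, keeps aspect ratio
                img.save(out, "PNG")
        return out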

In other cases, where the main use is serendipitous reuse, such as text-
and data-mining, the benefit comes from allowing users immediate access to
this body of derived information, rather than generating it on demand.

Apologies for the inarticulate post, but hey, it's Xmas, and I really just
wanted to add that preservation is an extremely active undertaking, exactly
as was stated in the previous email - your system has to be ready to change
at the drop of a hat.

Oh, and one last thing - PREMIS and standardising METS profiles for
interchange of digital items is all well and good, but the elephant in the
room is the legal and financial issues - legal departments will disallow
transfers based on fear of infringement, and accountants will want to charge
someone somehow. So my yuletide message is: standardise for yourselves,
write down all the implicit information, and try not to tie up too much
information in the packaging. My acid test is to pass the files for an item,
in a zip archive, to a colleague/peer - someone with little idea of what you
are dealing with or the standards involved. If they can work out what all
the bits are with the help of Google, then you've been successful.
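
As a purely hypothetical illustration of that acid test, packaging an item
need be no more complicated than the plain files plus a human-readable
manifest of names, sizes and checksums:

    import hashlib
    import json
    import zipfile
    from pathlib import Path

    def package_item(item_dir: Path, out_zip: Path, description: str) -> None:
        """Zip an item's files with a plain manifest of names, sizes, and checksums."""
        manifest = {"description": description, "files": []}
        with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in sorted(item_dir.rglob("*")):
                if path.is_file():
                    data = path.read_bytes()
                    rel = str(path.relative_to(item_dir))
                    manifest["files"].append({
                        "name": rel,
                        "bytes": len(data),
                        "sha256": hashlib.sha256(data).hexdigest(),
                    })
                    zf.writestr(rel, data)
            zf.writestr("manifest.json", json.dumps(manifest, indent=2))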

Merry Xmas all!

Ben O'Steen



Brian Vargas

Jan 8, 2009, 3:33:53 PM
to digital-...@googlegroups.com

Ed,

As our friend (and sometimes nemesis) David Brunton has observed, it
seems that the best preserved materials are the ones that are regularly
accessed. Not only are people more likely to notice if something is
wrong, but the usage leads to natural distribution of the content as
others obtain it and store it for their own purposes. The sad state of the
so-called "dark archive" we have today leaves the content not only out of
reach for access or migration, but also leaves us unable even to say whether
the bits are still there!

I think there is an important distinction to be made between moving data
and transforming data. Moving data can be done with fairly simple tools and
a strong degree of confidence that the content remains intact. Transformation,
on the other hand, requires much more forethought and investment, because of
the difficulty and cost of semantically and qualitatively analyzing all of
the content (given its vast quantity).
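
To sketch what 'fairly simple tools' can mean (illustrative only, not a
description of any system we actually run): confirming a moved copy is
intact can be as little as comparing digests of the two directory trees:

    import hashlib
    from pathlib import Path

    def tree_digests(root: Path) -> dict[str, str]:
        """Map each file's relative path to its SHA-256 digest."""
        return {
            str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*")) if p.is_file()
        }

    def moved_intact(original: Path, copy: Path) -> bool:
        """True if the copied tree has exactly the same files with the same content."""
        return tree_digests(original) == tree_digests(copy)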

Again, though, both movement and transformation - and even development
and analysis of those transformations - require access to the data.

Brian