OAI-ORE + BagIt Kludge

Jerome

Aug 31, 2009, 11:43:27 AM

to Digital Curation

Howdy,

At Brian V.'s suggestion, I thought I'd pass along a kludge solution to
a problem we've been experiencing in combining OAI-ORE with BagIt to
see if people had thoughts on it. This is part of the work we're
doing on the Preserving Virtual Worlds project, which is focused on
preservation of computer games and interactive fiction. Stanford and
UIUC will be working on actually storing content, and we've been working
on means for shipping content between the two systems, with BagIt
serving as our overall packaging wrapper and OAI-ORE being used to
identify subportions of the content (e.g., these files are the game,
these files over here are the emulator you need to run the game on a
modern system) and to indicate the relationships between the content
files (e.g., this file is representation information you're going to
need to understand that file over there).

The problem in a nutshell: OAI-ORE requires the use of protocol-based
URIs to identify content. That makes referencing the content files
*within* a BagIt package a real pain when you're shipping content
between institutions. Our solution: use the fetch.txt file to record
the URI for files as they are referenced in the OAI-ORE documents
included in the BagIt package (this is a URI for where the original
content can be found on the web) and the name of the same files as
they live in the BagIt package. Note that we are doing this even if
we actually have the content files in BagIt already; you don't
actually have to fetch the files to complete the Bag. The fetch.txt
file in this case really just serves as a mapping mechanism between
the URIs in the OAI-ORE files and the file names used inside the Bag.
As far as I can tell, this doesn't actually violate the BagIt spec.
But I would appreciate feedback on whether this seems like a
reasonable solution to the problem.
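
A minimal sketch of the mapping idea, for illustration only. It assumes the usual fetch.txt line layout of URL, length (or "-"), and path within the bag; the example.org URLs, the file names, and the function name parse_fetch are made up and are not taken from the project.

```python
from typing import Dict


def parse_fetch(text: str) -> Dict[str, str]:
    """Map each URL listed in fetch.txt to its file path inside the bag."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line holds: URL, length (or "-"), path. The path may contain
        # spaces, so split at most twice.
        url, _length, path = line.split(None, 2)
        mapping[url] = path
    return mapping


# Hypothetical fetch.txt content (URLs and paths are invented).
example = (
    "http://example.org/games/adventure.zip 1048576 data/game/adventure.zip\n"
    "http://example.org/emulators/emu.tar.gz - data/emulator/emu.tar.gz\n"
)
uri_to_file = parse_fetch(example)
```

With a lookup table like this, a URI that appears in one of the OAI-ORE documents can be resolved to the file already sitting in the Bag, without fetching anything.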

Ed Summers

Sep 3, 2009, 5:03:36 PM

to Digital Curation

Seems more than reasonable to me. I guess the only thing that is
missing is the URI for the aggregation itself? Did you have any
thoughts about that?

//Ed

Jerome

Sep 16, 2009, 2:49:08 PM

to Digital Curation

URIs for the aggregation we're inventing out of whole cloth. We're
constructing the aggregation so we assign our own URIs.

I have since my first post thought of one little problem with our
approach. Our fetch.txt file does, in its own way, constitute a set
of assertions, that particular URIs represent resources that
correspond to particular files contained within our Bag. While that
is true at the time we construct the Bag, we all know that the content
found at a particular URI can change, and there's no time/date stamp
associated with my link between a URI and a file in our Bag to say
when exactly that URI corresponded with this resource.

I don't know if this is a huge problem for us; grabbing the content
and preserving it is our first concern. But if at some point 10 years
from now someone asks why the file in our Bag doesn't match the file
sitting at that URI on the web, or when exactly things changed, we
won't really have any information to hand them. Archive.org's URIs
contain date stamps within themselves, but the URIs I'll be using
don't, for the most part.

Brian Vargas

Sep 17, 2009, 3:52:53 PM

to digital-...@googlegroups.com

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

Jerome,

> I don't know if this is a huge problem for us; grabbing the content
> and preserving it is our first concern. But if at some point 10 years
> from now someone asks why the file in our Bag doesn't match the file
> sitting at that URI on the web, or when exactly things changed, we
> won't really have any information to hand them. Archive.org's URIs
> contain date stamps within themselves, but the URI's I'll be using
> don't, for the most part.

I see this as a limitation directly imposed by the originally transient
nature of the fetch.txt. Section 4.1 is pretty clear that the fetch.txt
has two purposes:

1) To assist in the transfer of an unwieldy bag; and
2) To permit the transfer of a bag where the pieces are located in
different locations, such as often occurs on a distributed file system.

In both cases, it's a transfer-time artifact. I feel like the confusion
results from trying to apply new semantics (long-term identifiers) to
the already-established semantics of the fetch.txt. As you point out,
how do you know when you've switched meanings - especially in ten years?

Perhaps a better approach is to leverage some as-yet nonexistent bag
extension mechanism to store the URIs? This would be the second
scenario I've heard so far that would potentially benefit from such a
mechanism (forward error correction being the other).

Brian

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: What is this? http://pgp.ardvaark.net
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEAREDAAYFAkqyk5UACgkQ3YdPnMKx1eMvrQCeOiZSSyBmXgeDJG2pN/k9eSrU
bAsAnjz5nP4JU0+MrWU0VEcGcZXNFMOf
=lOa5
-----END PGP SIGNATURE-----

Ed Summers

Sep 22, 2009, 12:42:49 PM

to Digital Curation

On Sep 16, 2:49 pm, Jerome <jmcdo...@uiuc.edu> wrote:
> URI's for the aggregation we're inventing out of whole cloth. We're
> constructing the aggregation so we assign our own URIs.
>
> I have since my first post thought of one little problem with our
> approach. Our fetch.txt file does, in its own way, constitute a set
> of assertions, that particular URIs represent resources that
> correspond to particular files contained within our Bag. While that
> is true at the time we construct the Bag, we all know that the content
> found at a particular URI can change, and there's no time/date stamp
> associated with my link between a URI and a file in our Bag to say
> when exactly that URI corresponded with this resource.

Is it feasible to think of the last modification time of the fetch.txt
as providing this additional context of when the assertions in
fetch.txt were made?
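
As a rough sketch of that idea, the modification time can be read straight off the file. The function name fetch_assertion_time and the bag_dir argument are hypothetical; the sketch only assumes fetch.txt sits at the top level of the bag directory.

```python
import os
from datetime import datetime, timezone


def fetch_assertion_time(bag_dir: str) -> datetime:
    """Return the last modification time of fetch.txt as a UTC datetime."""
    path = os.path.join(bag_dir, "fetch.txt")
    return datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
```

One caveat: a file's modification time often does not survive copying or repackaging, so this only helps if the transfer tools preserve it.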

One could imagine an auditing process that walked through the
fetch.txt files in bags, checking for link rot, and updating URIs
based on 301, 302 HTTP status codes, then updating the fetch.txt
appropriately. One could also imagine an audit log getting layered
into the bag ... perhaps as some additional information in the
bag-info.txt, or something more machine readable in a new file?
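
Here is a hedged sketch of what such an audit pass might look like. Everything in it is an assumption for illustration: the names audit and Checker, the log line format, and the choice to pass the HTTP check in as a function (so the logic can be tried without touching the network). It is not an existing BagIt tool.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A checker takes a URL and returns (HTTP status, Location header or None).
Checker = Callable[[str], Tuple[int, Optional[str]]]


def audit(mapping: Dict[str, str], check: Checker) -> Tuple[Dict[str, str], List[str]]:
    """Walk a URL -> bag path mapping, follow 301/302 redirects, log changes."""
    updated: Dict[str, str] = {}
    log: List[str] = []
    for url, path in mapping.items():
        status, location = check(url)
        if status in (301, 302) and location:
            # The resource moved: record the new URI and note the change.
            log.append(f"{path}: {url} -> {location} ({status})")
            updated[location] = path
        else:
            if status >= 400:
                # Possible link rot: keep the old URI but flag it.
                log.append(f"{path}: {url} returned {status}")
            updated[url] = path
    return updated, log
```

The returned mapping could be written back out as a new fetch.txt, and the log lines appended to whatever audit file ends up in the bag.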

//Ed

Ed Summers

Sep 25, 2009, 1:11:23 AM

to Digital Curation

Another possibility to look at in the same family of "micro"
specifications from the California Digital Library would be dflat [1]
and its use of checkm [2] for manifest files. Specifically there is
the TargetFileOrURL portion of the manifest:

"""
TargetFileOrURL is a secondary location for the content that
applications would use as necessary. For instance, a transfer tool
that also renames files could use this token as the destination name.
"""

It's not entirely clear to me if checkm manifests replace bagit
manifests in the context of dflat. Perhaps someone from CDL on the
list has a better idea of that.

//Ed

[1] http://www.cdlib.org/inside/diglib/dflat/dflatspec.pdf
[2] http://www.cdlib.org/inside/diglib/checkm/checkmspec.html

Jim Tuttle

Sep 22, 2009, 2:57:33 PM

to digital-...@googlegroups.com

On 09/16/2009 02:49 PM, Jerome wrote:
>
> I have since my first post thought of one little problem with our
> approach. Our fetch.txt file does, in its own way, constitute a set
> of assertions, that particular URIs represent resources that
> correspond to particular files contained within our Bag. While that
> is true at the time we construct the Bag, we all know that the content
> found at a particular URI can change, and there's no time/date stamp
> associated with my link between a URI and a file in our Bag to say
> when exactly that URI corresponded with this resource.
>
> I don't know if this is a huge problem for us; grabbing the content
> and preserving it is our first concern. But if at some point 10 years
> from now someone asks why the file in our Bag doesn't match the file
> sitting at that URI on the web, or when exactly things changed, we
> won't really have any information to hand them. Archive.org's URIs
> contain date stamps within themselves, but the URI's I'll be using
> don't, for the most part.

This describes how NC State uses BagIt. I don't normally archive
data in web-accessible locations so when I want to transfer data I point
Apache at directories in my storage array and put those URIs in my fetch
file. Directly following notification that the receiving institution
has validated the data, I discontinue access to the objects.