bag versioning


David Brunton

Mar 18, 2010, 7:39:49 AM
to digital-...@googlegroups.com
Hi all,

I wrote up a post about versioning bags, riffing pretty liberally on
CDL's ReDD spec. Some of the major changes to that spec:

1. reverse-deltas/ is a directory at the same level as the data/
   directory in a bag
2. Delta directories are named by timestamp rather than version number
3. Each reverse delta also acts like a bag (so changes are checksummed)
4. Namaste is removed
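
A minimal sketch of the layout these changes imply, assuming a reverse
delta holds the previous copy of each changed file. The helper names
(md5_of, write_manifest, save_reverse_delta), the timestamp format, and
the use of MD5 manifests are illustrative assumptions, not details taken
from the post.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def write_manifest(bag_root: Path) -> None:
    """Write manifest-md5.txt covering every file under bag_root/data."""
    lines = []
    for f in sorted((bag_root / "data").rglob("*")):
        if f.is_file():
            rel = f.relative_to(bag_root).as_posix()
            lines.append(f"{md5_of(f)}  {rel}")
    (bag_root / "manifest-md5.txt").write_text("\n".join(lines) + "\n")


def save_reverse_delta(bag_root: Path, rel_path: str, stamp: str) -> Path:
    """Preserve the current data/<rel_path> before it is overwritten.

    The old copy goes to reverse-deltas/<stamp>/data/<rel_path>, and the
    delta directory gets its own bagit.txt and manifest, so it can be
    checked like a bag. The stamp format is an assumption.
    """
    delta_root = bag_root / "reverse-deltas" / stamp
    target = delta_root / "data" / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes((bag_root / "data" / rel_path).read_bytes())
    (delta_root / "bagit.txt").write_text(
        "BagIt-Version: 0.96\nTag-File-Character-Encoding: UTF-8\n"
    )
    write_manifest(delta_root)
    return delta_root
```

To change a file under this sketch, call save_reverse_delta() first,
overwrite the file in data/, then call write_manifest() on the top-level
bag. The top-level manifest only ever lists data/, so reverse-deltas/
stays outside it.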

The post is here:

http://davidbrunton.org/2010/03/versioning-bags_18.html

I'm curious, in particular, whether anyone sees this as a perversion of
the BagIt spec. It's a lot of other stuff to stick in that top-level bag
directory, but by my reckoning doing so also gains a great deal of
simplicity.

-db.

Brian Vargas

Mar 19, 2010, 10:44:25 AM
to digital-...@googlegroups.com
I think it's a cool idea, and would have very real uses. I especially
like the nice backwards compatibility with existing bags and tools.

Brian

Ed Summers

Mar 19, 2010, 5:38:30 PM
to digital-...@googlegroups.com
Thanks for sending this David. I agree w/ Brian ... especially with
the fact that we have real use cases here at LC for versioning bags
*now*. Most recently we've been copying the entire bag to a new bag,
and relating the bags together with file naming conventions, e.g.:

bag_v01
bag_v02

But this is very wasteful of storage when we're just making a small
change to a many-gigabyte or terabyte bag.
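
To illustrate how small such a change can be relative to the whole bag,
here is a rough sketch that compares the manifests of two versions and
reports only the paths whose checksums differ. It assumes the usual
"checksum, whitespace, path" manifest line format; parse_manifest and
changed_paths are hypothetical helper names.

```python
def parse_manifest(text: str) -> dict[str, str]:
    """Map each payload path to its checksum from manifest text."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            checksum, path = line.strip().split(maxsplit=1)
            entries[path] = checksum
    return entries


def changed_paths(old_manifest: str, new_manifest: str) -> list[str]:
    """Return paths that were added, removed, or modified between versions."""
    old = parse_manifest(old_manifest)
    new = parse_manifest(new_manifest)
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Only the files named by changed_paths() would need to be stored again;
everything else is identical between bag_v01 and bag_v02.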

I'd be interested to know if there's anything in the BagIt spec as it
stands that would prevent such an approach.

//Ed

Brian Vargas

Mar 19, 2010, 5:42:29 PM
to digital-...@googlegroups.com
I think the only thing we have right now is the implicit restriction of
directories in the top-level bag. We're planning on making that an
explicit restriction with the next 0.96 release, just for clarification,
but I am thinking we should change it for 0.97. There are too many cool
use cases we're inhibiting with that restriction.

Brian

Ton Luong

Mar 19, 2010, 8:18:06 PM
to digital-...@googlegroups.com
I had similar thoughts when I first went through the BagIt spec, but I lean toward using an existing open source versioning system instead. I wrote some thoughts down here:

- http://tonluong.com/projects/digital-packaging-format

BagIt has a lot of good points. My thought was to extend it by loosely coupling it with other great ideas like JSON, distributed versioning (Git/Mercurial), parity files, and CouchDB (a document-based database that never overwrites committed data), reaping the benefits of this software while keeping everything simple and easy to integrate.

At the same time, each piece is optional; each item provides added value without adding complexity.

The base format (would be similar to the BagIt spec with some changes):
========================================================
- manifest/metadata files (in JSON)


Distributed Versioning (Git/Mercurial) (Optional - add if versioning is a requirement)
=================================================================
- wide array of transport options (SSH, HTTP/HTTPS, local file system, rsync)
- all the benefits of versioning (checksums, revisions, etc.)


Parity Files (PAR) (Optional - add if data recovery is a requirement)
===================================================
- provide different data recovery levels


CouchDB (Optional)
================
- maps one-to-one with JSON manifest/metadata
- http://couchdb.apache.org/docs/overview.html


This feels like it will definitely break the BagIt spec, so I am thinking of a different packaging spec that resembles BagIt at its core and, by design, provides additional functionality for various use cases; personally, I have a need for versioning and data recovery. I would like to hear your thoughts before moving in that direction.


Best,
Ton


David Brunton

Mar 23, 2010, 6:43:13 AM
to digital-...@googlegroups.com
On Fri, Mar 19, 2010 at 5:42 PM, Brian Vargas <br...@ardvaark.net> wrote:
> I think the only thing we have right now is the implicit restriction of
> directories in the top-level bag.  We're planning on making that an
> explicit restriction with the next 0.96 release, just for clarification,
> but I am thinking we should change it for 0.97.  There are too many cool
> use cases we're inhibiting with that restriction.

+1

I agree there are too many cool use cases to restrict the bag to only
including data/. If the bag spec were changed to allow directories
other than data/, it might be a good idea to add something normative
that says "when you put stuff in directories other than data/, that
stuff is not part of the bag, and bag tools are only required to grab
stuff they find in the manifest."

What I'm proposing is that yes, we would allow other directories, but
no, they are not "part of the bag" (other than being conveniently
located in the same parent directory).

Does that make sense to anyone else? My general idea there is that
other specs could take advantage of the same structure (e.g. enclose
the information that is of curatorial interest in data/, but use the
surrounding directory for other important stuff) as BagIt without
needing to encumber the BagIt spec.
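
One way to picture "bag tools are only required to grab stuff they find
in the manifest" is the sketch below, which copies a bag's top-level tag
files and manifest-listed payload and leaves any other directories
behind. The names manifest_paths and grab_bag are hypothetical, and the
behaviour is an assumption about how a tool might apply such a rule, not
something the BagIt spec defines.

```python
import shutil
from pathlib import Path


def manifest_paths(bag_root: Path) -> list[str]:
    """Collect every path listed in the bag's manifest-*.txt files."""
    paths = set()
    for manifest in bag_root.glob("manifest-*.txt"):
        for line in manifest.read_text().splitlines():
            if line.strip():
                _checksum, rel = line.strip().split(maxsplit=1)
                paths.add(rel)
    return sorted(paths)


def grab_bag(src: Path, dst: Path) -> list[str]:
    """Copy top-level files and manifest-listed payload from src to dst.

    Directories that the manifest does not mention (for example a
    reverse-deltas/ directory) are deliberately not copied.
    """
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in src.iterdir():
        if entry.is_file():  # bagit.txt, manifests, other tag files
            shutil.copy2(entry, dst / entry.name)
            copied.append(entry.name)
    for rel in manifest_paths(src):
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src / rel, target)
        copied.append(rel)
    return sorted(copied)
```

Under this reading, a tool that knows nothing about versioning still
transfers a complete, valid bag; a versioning-aware tool can fetch the
extra directories separately.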

-db.

Ed Summers

Mar 23, 2010, 2:49:04 PM
to digital-...@googlegroups.com
On Tue, Mar 23, 2010 at 6:43 AM, David Brunton <dbru...@gmail.com> wrote:
> What I'm proposing is that yes, we would allow other directories, but
> no, they are not "part of the bag" (other than being conveniently
> located in the same parent directory).
>
> Does that make sense to anyone else?  My general idea there is that
> other specs could take advantage of the same structure (e.g. enclose
> the information that is of curatorial interest in data/, but use the
> surrounding directory for other important stuff) as BagIt without
> needing to encumber the BagIt spec.

Yes, that makes sense to me. I think it would be nice if a Bag
comprised only the files and directories that the BagIt spec
enumerates; additional files and directories would not be considered
part of the Bag. We should bear in mind, though, that additional files
and directories can still invalidate the Bag, as when additional files
are dropped into the data directory.
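
The distinction can be sketched as a check that flags stray files under
data/ while ignoring everything outside it. This assumes a validator
treats any payload file missing from the manifests as an error;
unlisted_payload_files is a hypothetical helper, not part of any
existing BagIt tool.

```python
from pathlib import Path


def unlisted_payload_files(bag_root: Path) -> list[str]:
    """Return files under data/ that no manifest-*.txt mentions.

    Files in other top-level directories are not examined at all, so
    they can never make this check fail.
    """
    listed = set()
    for manifest in bag_root.glob("manifest-*.txt"):
        for line in manifest.read_text().splitlines():
            if line.strip():
                listed.add(line.strip().split(maxsplit=1)[1])
    on_disk = {
        f.relative_to(bag_root).as_posix()
        for f in (bag_root / "data").rglob("*")
        if f.is_file()
    }
    return sorted(on_disk - listed)
```

A non-empty result means something was dropped into data/ after the
manifest was written; administrative files placed beside data/ leave the
result empty.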

Generally I'm curious to know if anyone thinks we're straying by
considering the Bag as a unit for digital preservation. Some of us at
LC have this mental shorthand of thinking of Bags as not only useful
for helping bits travel in space (e.g. CDL -> LC) but also in time
(03/23/2010 -> 03/23/2020). So the ability to layer administrative
stuff into the bag without making it an invalid bag becomes
important.

I was also wondering if the California Digital Library folks have
moved away from talking about Bags and towards talking about D-flats
[1] for a similar reason. It seems like much of the functionality of
Bags has been subsumed by D-flat [2] and Checkm [3] in that each version
directory in a D-flat is effectively a Bag.

It seems like we here at LC are arriving at our notion of a D-flat,
but don't want it to interfere with our notion of a Bag.

//Ed

[1] http://www.cdlib.org/services/uc3/curation/storage.html
[2] https://confluence.ucop.edu/display/Curation/D-flat
[3] https://confluence.ucop.edu/display/Curation/Checkm
