BagIt Versioning


Eisenhauer, Stephen

Aug 16, 2013, 5:15:21 PM
to digital-...@googlegroups.com
Hello all,

Some colleagues and I are looking at a way of structuring BagIt bags that represent a set of changes (a diff) to another existing bag.  The basic idea is that our archival units are all treated as immutable, and any modifications to what's already in the archive are to be treated as forward deltas.

There are plenty of ways of implementing a scheme for this, and I've done some light scripting so far to experiment with the process of generating/applying these "diff" bags, but I wanted to get some outside input and recommendations to see how other BagIt enthusiasts would go about this.  (Who knows; maybe we'll have invented a new RFC by the time this thread is finished!)

What we've come up with so far:
  • New versions of a bag are represented by additional bags ("diff bags") containing only the changes
  • Files are atomic. That is, we won't try to generate deltas for individual files; we'll just replace the whole file.
  • The diff bags should be named in a consistent way, e.g. TITLE_changes001/, TITLE_changes002/
  • The diff bag's payload ('data' directory) will correspond to the root of the original bag
  • The diff bag's payload will contain only the new and changed files, and their directory structure should be preserved for easy merging
  • Some special bag-info.txt tags should be invented to signal that the bag is a "diff bag", to enumerate deleted files, and possibly to carry additional metadata related to the versioning.  Perhaps some of this should be offloaded to a completely new tag file to avoid abusing the purpose of bag-info.txt too much.
I've attached a work-in-progress proof-of-concept for a Python script that demonstrates the use cases we're sketching out. Please excuse any bugs, omissions, and bad design patterns therein for the time being.  Have a look at the first 30 lines of the file in your favorite editor for a good description, or invoke the script with -h for usage help.
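
For readers without the attachment, here is a minimal sketch of the scheme in the list above. It is not the attached bag-diff.py. The function names and the {path: checksum} manifest representation are assumptions made for illustration, and the list of deleted files is passed in directly because the tag that would carry it has not been defined yet.

```python
import shutil
from pathlib import Path


def diff_manifests(old, new):
    """Compare two {path: checksum} manifests (hypothetical representation).

    Returns (changed, deleted): paths that are new or have a different
    checksum in `new`, and paths that exist only in `old`.
    """
    changed = sorted(p for p, digest in new.items() if old.get(p) != digest)
    deleted = sorted(p for p in old if p not in new)
    return changed, deleted


def apply_diff(base_root, diff_payload, deleted, out_root):
    """Build a merged copy in out_root; the original bag is never modified.

    base_root    -- root directory of the original bag
    diff_payload -- the diff bag's data/ directory, which mirrors base_root
    deleted      -- paths (relative to base_root) removed by this version
    out_root     -- destination directory; must not exist yet
    """
    base_root, diff_payload, out_root = map(Path, (base_root, diff_payload, out_root))
    shutil.copytree(base_root, out_root)
    # Files are atomic: each new or changed file replaces the old one whole.
    shutil.copytree(diff_payload, out_root, dirs_exist_ok=True)
    for rel in deleted:
        (out_root / rel).unlink(missing_ok=True)
```

Applying TITLE_changes001/, TITLE_changes002/ and so on in order, each on top of the previous result, would rebuild the latest version. This sketch needs Python 3.8 or later, and it does not regenerate or validate manifests after merging.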

Any thoughts, opinions, suggestions, interest?

Stephen Eisenhauer
Programmer for Strategic Projects
Libraries, University of North Texas

bag-diff.py

David Brunton

Aug 19, 2013, 12:58:15 PM
to digital-...@googlegroups.com, stephen.e...@unt.edu
Hi Stephen,

I hadn't been following, but got this forwarded by a colleague who knew I had done some thinking about versions and variants a while back that was similar to what you described here.  I've included the salient points from my writeup below in this message. Please feel free to use it or ignore it at your own sole discretion; I have not implemented it.

Best,
David.

This writeup presumes that directories other than "data/" are allowed at the top level of a bag.  Either both "version/" and "variant/" are allowed, or any directory name is allowed.

Variants

Inside the variant/ subdirectory is a collection of named bags that receive special treatment. Each named variant must be a valid bag. The manifest is a full manifest of some variant of this bag, which may not successfully validate against the files in the bag's main payload.

This special manifest may include zero or more files from the bag's main payload, as well as zero or more files from the named variant's payload. File paths from the main payload are specified relative to the top level of the bag, and file paths from the variant are specified relative to the named variant's own top level. Name collisions always resolve variant first, main payload second, and failure third.

Files shared between multiple variants should be included in the main payload, rather than multiple times in multiple variants. There is no other functionality for sharing payload between variants.
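
The resolution order for variants (variant first, main payload second, failure third) could be sketched as follows. This is only an illustration of the rule as written, since the writeup was never implemented; the function name and the choice of FileNotFoundError for the failure case are assumptions.

```python
from pathlib import Path


def resolve_variant_file(bag_root, variant_name, manifest_path):
    """Locate a file listed in a named variant's manifest.

    manifest_path is the path exactly as it appears in the manifest,
    e.g. "data/page1.tif" (an illustrative name).
    """
    bag_root = Path(bag_root)
    candidates = [
        # 1. Relative to the named variant's own top level.
        bag_root / "variant" / variant_name / manifest_path,
        # 2. Relative to the top level of the enclosing bag (main payload).
        bag_root / manifest_path,
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    # 3. Failure: the manifest names a file that exists in neither place.
    raise FileNotFoundError(
        f"{manifest_path} not found in variant {variant_name!r} or main payload"
    )
```

Validating a variant would then mean resolving every path in its manifest this way and checking each checksum against the file that was found.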

Versions

The version/ subdirectory behaves in a way related to the variant/ subdirectory. Each subdirectory contains a partial bag, which must include a full manifest of the complete version (not just of the partial contents it actually holds). However, instead of named variants, the names of the subdirectories in version/ must be timestamps of the form YYYYMMDDHHMMSS, and instead of being entirely independent of one another, they are sequentially chained.

As such, the precedence for a manifest in version/ is to look for the file first in the current directory, then temporally sequentially through each remaining version/ subdirectory payload, and finally in the bag's main payload. The manifest in a version should be the valid manifest for the completely reassembled version.
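
A sketch of that lookup chain, under two assumptions the writeup does not spell out: that "temporally sequentially" means walking backwards from the requested version through earlier ones, newest first, and that manifest paths are relative to each version's own top level, as they are for variants. The function name is hypothetical.

```python
import re
from pathlib import Path

# Version directory names must be timestamps of the form YYYYMMDDHHMMSS.
TIMESTAMP = re.compile(r"\d{14}")


def resolve_version_file(bag_root, version, manifest_path):
    """Locate a file listed in the manifest of version/<version>/."""
    bag_root = Path(bag_root)
    version_dir = bag_root / "version"
    names = sorted(
        d.name for d in version_dir.iterdir()
        if d.is_dir() and TIMESTAMP.fullmatch(d.name)
    )
    # Fixed-width timestamps sort chronologically as plain strings.
    # Assumption: search the requested version, then earlier ones, newest first.
    chain = [name for name in reversed(names) if name <= version]
    for name in chain:
        candidate = version_dir / name / manifest_path
        if candidate.is_file():
            return candidate
    # Finally, fall back to the bag's main payload.
    candidate = bag_root / manifest_path
    if candidate.is_file():
        return candidate
    raise FileNotFoundError(f"{manifest_path} not found for version {version}")
```

Reassembling a complete version would amount to resolving every entry in that version's manifest and verifying each checksum.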

Paul Clough

Aug 22, 2013, 7:50:08 PM
to digital-...@googlegroups.com, stephen.e...@unt.edu
Hi Stephen -

Thanks for sharing your thoughts and code. I don't have any specific comments or suggestions, but a general one: have you seen git-annex? See: <http://git-annex.branchable.com>

From the "how it works" page:

Git's man page calls it "a stupid content tracker". With git-annex, git is instead "a stupid filename and metadata" tracker. The contents of large files are not stored in git, only the names of the files and some other metadata remain there.

It might be nice to piggyback on an existing project with similar goals. Apologies if you've already seen/evaluated it ...

- Paul

Michael J. Giarlo

Aug 22, 2013, 9:42:58 PM
to digital-...@googlegroups.com, stephen eisenhauer
Hi,

FWIW, Richard Anderson at Stanford has done some work analyzing various options for version control in this context, including git. His Code4Lib Journal publication is here:

http://journal.code4lib.org/articles/8482

His earlier work discusses git-annex in particular (somewhere in here):

https://lib.stanford.edu/stanford-digital-repository/investigations-storage-and-versioning-digital-objects

-Mike