Digital Object Storage and Versioning

Richard Anderson

Oct 5, 2011, 9:19:58 AM

to Digital Curation

On behalf of the Stanford Digital Repository (SDR), I have been
investigating open source storage solutions that include the ability
to efficiently and securely preserve multiple-version digital objects
which contain large binary files. This research follows up on
previous discussions on the Digital-Curation mailing list and at
recent CurateCamps. I have posted a report containing an overview
and summary of my findings, along with 3 detailed reports on my
analysis of Git, Boar, and CDL Micro-services. You can view them on
the SDR blog at:

https://lib.stanford.edu/stanford-digital-repository/investigations-storage-and-versioning-digital-objects

I welcome your feedback, but will likely be delayed in making follow-
up responses as I am leaving in a couple of days for a 2-week vacation.

Erik Hetzner

Oct 5, 2011, 12:51:50 PM

to digital-...@googlegroups.com, Richard Anderson

At Wed, 5 Oct 2011 06:19:58 -0700,
Richard Anderson wrote:

Thanks for sharing this, Richard. This is a great summary of the
available tools.

We are working at CDL on addressing the issue of a version update
requiring a complete resubmission.

I realized looking through these docs that our work at CDL combines
the “working directory” and the repository, which differentiates it
from boar, git, hg, etc. This makes it a little more space efficient,
but a little more complicated to do simple things.

best, Erik

Richard Anderson

Oct 6, 2011, 8:48:02 AM

to digital-...@googlegroups.com

At Wednesday, October 05, 2011 10:51 AM
Erik Hetzner wrote:

> At Wed, 5 Oct 2011 06:19:58 -0700,
> Richard Anderson wrote:
> >
> > ... I have posted a report containing an overview
> > and summary of my findings, along with 3 detailed reports on my
> > analysis of Git, Boar, and CDL Micro-services. You can view them on
> > the SDR blog at:
> > https://lib.stanford.edu/stanford-digital-repository/investigations-storage-and-versioning-digital-objects
>
> ...
> This is a great summary of the available tools.

Thank you, Erik.

I am interested in learning of any other open source tools that I might have
overlooked. I intentionally avoided looking at commercial version control
software, although I did find quite a few references to Perforce by
developers working in the computer game arena. It apparently handles large
binary files well, but is not so well liked for other aspects of its
behavior.

I also avoided looking at commercial software in the Document Management,
Content Management, Digital Asset Management, etc. categories, such as
Documentum, ContentDM, Ex Libris. In 2007 the MIDESS Project
(http://ludos.leeds.ac.uk/midess/) included some of these products in its
review along with DSpace and Fedora. Can anyone refer me to any more recent
comparative evaluations of the market leaders in this field?

> We are working at CDL on addressing the issue of a version update
> requiring a complete resubmission.

In reviewing your software code, I could see that the Dflat code is sent a
manifest, which it parses and uses to pull the individual files from a
submission location to the new temporary version's "full" directory. You
have probably realized that some of your delta logic could be applied at
that point, so that files with no change of name or checksum could be merely
transferred from the previous version instead of across the network.
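
As a rough illustration of that idea (a sketch only, not the Dflat or
Merritt code; the function name and the path-to-checksum manifest layout
are hypothetical), the comparison might look something like this:

```python
def plan_transfers(previous, submitted):
    """Split a submitted manifest into two lists of paths.

    Both arguments are assumed to be dicts mapping a relative file path
    to its checksum (a hypothetical, simplified manifest layout).
    Returns (reuse, fetch): paths whose name and checksum are unchanged
    and can be copied from the previous version, and paths that are new
    or changed and must be pulled from the submission location.
    """
    reuse, fetch = [], []
    for path, checksum in sorted(submitted.items()):
        if previous.get(path) == checksum:
            reuse.append(path)   # same name, same checksum
        else:
            fetch.append(path)   # new file, or content changed
    return reuse, fetch


# Made-up example manifests for two successive versions of one object.
previous = {
    "data/page1.tif": "aa11",
    "data/page2.tif": "bb22",
    "meta.xml": "cc33",
}
submitted = {
    "data/page1.tif": "aa11",
    "data/page2.tif": "ff99",   # changed
    "data/page3.tif": "dd44",   # added
    "meta.xml": "cc33",
}
reuse, fetch = plan_transfers(previous, submitted)
```

Here only page2.tif and page3.tif would need to cross the network; the
other two files could come from the previous version on local storage.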

> I realized looking through these docs that our work at CDL combines
> the “working directory” and the repository, which differentiates it
> from boar, git, hg, etc. This makes it a little more space efficient,
> but a little more complicated to do simple things.

Yeh, I was not quite clear on the source location of the files being
submitted. We are using BagIt bags to transfer our submission packages from
the workflow that assembles the files and processes the metadata to the
workflow that adds the object to preservation storage.

--Richard

Richard N Anderson
Software Engineer
Digital Library Systems and Services
Stanford University Libraries & Academic Information Systems

peterVG

Oct 6, 2011, 11:48:56 AM

to Digital Curation

Thanks so much for sharing this excellent research Richard. While it
does pour some cold water on the "let's just use Git to solve our
versioning problems" chatter, it presents a very practical baseline of
requirements and design options/considerations to move forward. Please
do update the list as your design work continues.

Cheers,

--peter

On Oct 5, 9:19 am, Richard Anderson <rnand...@stanford.edu> wrote:
> On behalf of the Stanford Digital Repository (SDR), I have been
> investigating open source storage solutions that include the ability
> to efficiently and securely preserve multiple-version digital objects
> which contain large binary files. This research follows up on
> previous discussions on the Digital-Curation mailing list and at
> recent CurateCamps. I have posted a report containing an overview
> and summary of my findings, along with 3 detailed reports on my
> analysis of Git, Boar, and CDL Micro-services. You can view them on
> the SDR blog at:
>
> https://lib.stanford.edu/stanford-digital-repository/investigations-s...

Erik Hetzner

Oct 7, 2011, 1:22:50 PM

to digital-...@googlegroups.com, Richard Anderson

At Thu, 6 Oct 2011 06:48:02 -0600,
Richard Anderson wrote:
>
> > We are working at CDL on addressing the issue of a version update
> > requiring a complete resubmission.
>
> In reviewing your software code, I could see that the Dflat code is sent a
> manifest, which it parses and uses to pull the individual files from a
> submission location to the new temporary version's "full" directory. You
> have probably realized that some of your delta logic could be applied at
> that point, so that files with no change of name or checksum could be merely
> transferred from the previous version instead of across the network.

This issue will be addressed in the ingest stage, before proceeding to
storage. I didn’t write any of this code, so I am not too sure how it
is all done.

> > I realized looking through these docs that our work at CDL combines
> > the “working directory” and the repository, which differentiates it
> > from boar, git, hg, etc. This makes it a little more space efficient,
> > but a little more complicated to do simple things.
>
> Yeh, I was not quite clear on the source location of the files being
> submitted. We are using BagIt bags to transfer our submission packages from
> the workflow that assembles the files and processes the metadata to the
> workflow that adds the object to preservation storage.

The source location can be a zip, remote URL, etc. This is all handled
in the ingest part of Merritt.

But I think I was unclear. What I was remarking on was something that I
had not realized before. In most source control, and apparently in
boar as well, there is a distinction between the “working copy” and
the repository. Users make changes to the working copy, and the
permanent copy is in the repository. In centralized systems (cvs, svn)
this distinction is clear, but in distributed source control it gets a
little mixed up, because the repository is in the same directory as
the working copy (in a sub-directory, e.g. .git, or .hg). This is very
useful for source control because users need to make changes to the
code, and we wouldn’t want them operating on the repository directly.
But it is not as space efficient as it could be, because the working
directory is redundant. (This is why distributed source control also
allows the bare repository, without a working copy.)

But in Dflat these are mixed up; the latest “copy” is stored under
vXXX, while diffs to earlier copies are stored under v001...vXXX-1.
There is no distinction between the working copy & the repository.
This is more space efficient, but it might complicate things. Anyhow,
it was just something I realized recently.
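
To make the layout concrete, here is a toy model of "latest copy in full,
earlier versions as reverse diffs". It is only a sketch under simplifying
assumptions: the class name, the in-memory dicts, and the delta format are
all hypothetical and are not Dflat's actual on-disk format.

```python
class ReverseDeltaStore:
    """Keep the newest version in full and older versions as reverse deltas."""

    def __init__(self):
        self.latest = {}   # full copy of the newest version: path -> content
        self.deltas = []   # deltas[i] turns version i+2 back into version i+1
        self.count = 0     # number of versions stored so far

    def add_version(self, files):
        """Store a new full version; demote the old latest to a reverse delta."""
        if self.count > 0:
            old = self.latest
            self.deltas.append({
                # files to put back (changed or deleted in the new version)
                "restore": {p: c for p, c in old.items() if files.get(p) != c},
                # files that did not exist in the older version
                "remove": [p for p in files if p not in old],
            })
        self.latest = dict(files)
        self.count += 1

    def get_version(self, number):
        """Rebuild version `number` (1-based) by walking back from the latest."""
        if not 1 <= number <= self.count:
            raise ValueError("no such version: %r" % number)
        files = dict(self.latest)
        for delta in reversed(self.deltas[number - 1:]):
            for path in delta["remove"]:
                del files[path]
            files.update(delta["restore"])
        return files


# Made-up example: three versions of one small object.
v1 = {"a.txt": "1", "b.txt": "2"}
v2 = {"a.txt": "1", "b.txt": "3", "c.txt": "4"}
v3 = {"a.txt": "1", "c.txt": "5"}
store = ReverseDeltaStore()
for version in (v1, v2, v3):
    store.add_version(version)
```

Reading the latest version costs nothing extra, while older versions have
to be reconstructed, and there is no separate working copy to edit. That
is the trade-off between space efficiency and simplicity described above.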

best, Erik

Richard Anderson

Jan 18, 2012, 3:48:00 PM

to digital-...@googlegroups.com

This past October I posted a link to some reports I had written evaluating
Git, Boar, and CDL Micro-services against the functional requirements of the
Stanford Digital Repository. Since that time I have expanded the scope of
my investigations to include a new design (called 'Moab') that Stanford is
working on, which is an attempt to cherry-pick the best aspects of all three
designs.

In an effort to make my presentation more digestible from a narrative point
of view, I have created an MS Word rendition, and generated a PDF from that
copy:
Digital Object Storage and Versioning.doc
Digital Object Storage and Versioning.pdf
The original PowerPoint slides, containing essentially the same content as
bullet points, are also available:
Digital-Object-Versioning-Design-Options.ppt

Fresh copies of all 3 files can be downloaded from:
https://lib.stanford.edu/stanford-digital-repository/investigations-storage-and-versioning-digital-objects

I welcome your feedback, as this is still a work in progress. Implementation
of portions of this design is underway, but is by no means finalized.
When I have some code worth sharing I will make a follow-up posting.

I am extremely grateful to Michael Giarlo, who reviewed a draft of my
presentation and made many useful suggestions.
