Thanks for sharing this, Richard. This is a great summary of the
available tools.
We are working at CDL on addressing the issue of a version update
requiring a complete resubmission.
I realized looking through these docs that our work at CDL combines
the “working directory” and the repository, which differentiates it
from boar, git, hg, etc. This makes it a little more space efficient,
but a little more complicated to do simple things.
best, Erik
> At Wed, 5 Oct 2011 06:19:58 -0700,
> Richard Anderson wrote:
> >
> > ... I have posted a report containing an overview
> > and summary of my findings, along with 3 detailed reports on my
> > analysis of Git, Boar, and CDL Micro-services. You can view them on
> > the SDR blog at:
> > https://lib.stanford.edu/stanford-digital-repository/investigations-storage-and-versioning-digital-objects
>
> ...
> This is a great summary of the available tools.
Thank you, Erik.
I am interested in learning of any other open source tools that I might have
overlooked. I intentionally avoided looking at commercial version control
software, although I did find quite a few references to Perforce, by
developers working in the computer game arena. It apparently handles large
binary files well, but is not so well liked for other aspects of its
behavior.
I also avoided looking at commercial software in the Document Management,
Content Management, Digital Asset Management, etc. categories, such as
Documentum, ContentDM, Ex Libris. In 2007 the MIDESS Project
(http://ludos.leeds.ac.uk/midess/) included some of these products in its
review along with DSpace and Fedora. Can anyone refer me to any more recent
comparative evaluations of the market leaders in this field?
> We are working at CDL on addressing the issue of a version update
> requiring a complete resubmission.
In reviewing your software code, I could see that the Dflat code is sent a
manifest, which it parses and uses to pull the individual files from a
submission location to the new temporary version's "full" directory. You
have probably realized that some of your delta logic could be applied at
that point, so that files with no change of name or checksum could be merely
transferred from the previous version instead of across the network.
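The idea above can be sketched as a planning step that runs before any file is pulled. This is a rough illustration only, not Dflat's actual code: it assumes a manifest can be reduced to a mapping of file name to checksum, and the name plan_transfer is hypothetical.

```python
def plan_transfer(new_manifest, prev_manifest):
    """Split the files of a new version into two lists.

    Both arguments are assumed (hypothetically) to be dicts mapping
    file name -> checksum. Files whose name and checksum both match the
    previous version can be copied locally; everything else must be
    fetched from the submission location.
    """
    reuse, fetch = [], []
    for name, checksum in sorted(new_manifest.items()):
        if prev_manifest.get(name) == checksum:
            reuse.append(name)   # unchanged: copy from the previous version
        else:
            fetch.append(name)   # new or modified: pull across the network
    return reuse, fetch


if __name__ == "__main__":
    prev = {"a.txt": "111", "b.txt": "999"}
    new = {"a.txt": "111", "b.txt": "222", "c.txt": "333"}
    reuse, fetch = plan_transfer(new, prev)
    print("copy locally:", reuse)
    print("fetch remotely:", fetch)
```

A renamed file with an unchanged checksum would still be fetched under this sketch; catching that case would need a lookup by checksum rather than by name.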
> I realized looking through these docs that our work at CDL combines
> the “working directory” and the repository, which differentiates it
> from boar, git, hg, etc. This makes it a little more space efficient,
> but a little more complicated to do simple things.
Yeh, I was not quite clear on the source location of the files being
submitted. We are using BagIt bags to transfer our submission packages from
the workflow that assembles the files and processes the metadata to the
workflow that adds the object to preservation storage.
--Richard
Richard N Anderson
Software Engineer
Digital Library Systems and Services
Stanford University Libraries & Academic Information Systems
This issue will be addressed in the ingest stage, before proceeding to
storage. I didn’t write any of this code, so I am not too sure how it
is all done.
> > I realized looking through these docs that our work at CDL combines
> > the “working directory” and the repository, which differentiates it
> > from boar, git, hg, etc. This makes it a little more space efficient,
> > but a little more complicated to do simple things.
>
> Yeh, I was not quite clear on the source location of the files being
> submitted. We are using Bagit bags to transfer our submission packages from
> the workflow that assembles the files and processes the metadata to the
> workflow that adds the object to preservation storage.
The source location can be a zip, remote URL, etc. This is all handled
in the ingest part of Merritt.
But I think I was unclear. What I was remarking on was something that I
had not realized before. In most source control, and apparently in
boar as well, there is a distinction between the “working copy” and
the repository. Users make changes to the working copy, and the
permanent copy is in the repository. In centralized systems (cvs, svn)
this distinction is clear, but in distributed source control it gets a
little mixed up, because the repository is in the same directory as
the working copy (in a sub-directory, e.g. .git, or .hg). This is very
useful for source control because users need to make changes to the
code, and we wouldn’t want them operating on the repository directly.
But it is not as space efficient as it could be, because the working
directory is redundant. (This is why distributed source control also
allows the bare repository, without a working copy.)
But in Dflat these are mixed up; the latest “copy” is stored under
vXXX, while diffs to earlier copies are stored under v001...vXXX-1.
There is no distinction between the working copy & the repository.
This is more space efficient, but it might complicate things. Anyhow,
it was just something I realized recently.
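The layout described above (latest version kept in full, earlier versions kept only as diffs) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Dflat's on-disk format: a version is assumed to be a dict of file name to content, and ReverseDeltaStore and its helper functions are hypothetical names.

```python
def make_reverse_delta(older, newer):
    """Record what must change in `newer` to get back to `older`."""
    restore = {n: c for n, c in older.items() if newer.get(n) != c}
    remove = sorted(n for n in newer if n not in older)
    return {"restore": restore, "remove": remove}


def apply_reverse_delta(newer, delta):
    """Rebuild the older version from the newer one plus a delta."""
    older = dict(newer)
    for name in delta["remove"]:
        del older[name]
    older.update(delta["restore"])
    return older


class ReverseDeltaStore:
    """Keeps only the latest version in full; earlier ones as deltas."""

    def __init__(self):
        self.deltas = []    # deltas[i] rebuilds version i+1 from version i+2
        self.latest = None  # full copy of the newest version

    def add_version(self, files):
        if self.latest is not None:
            # The old full copy is replaced by a delta against the new one.
            self.deltas.append(make_reverse_delta(self.latest, files))
        self.latest = dict(files)

    def get_version(self, number):
        total = len(self.deltas) + 1
        if self.latest is None or not 1 <= number <= total:
            raise KeyError(number)
        files = dict(self.latest)
        # Walk backwards from the latest version to the requested one.
        for delta in reversed(self.deltas[number - 1:]):
            files = apply_reverse_delta(files, delta)
        return files
```

Reading the latest version costs nothing extra, while reading version 1 means applying every delta in turn, which is one way in which simple things become a little more complicated.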
best, Erik
In an effort to make my presentation more digestible from a narrative point
of view, I have created an MS Word rendition, and generated a PDF from that
copy:
Digital Object Storage and Versioning.doc
Digital Object Storage and Versioning.pdf
The original PowerPoint slides, containing essentially the same content as
bullet points, are also available:
Digital-Object-Versioning-Design-Options.ppt
Fresh copies of all 3 files can be downloaded from:
https://lib.stanford.edu/stanford-digital-repository/investigations-storage-and-versioning-digital-objects
I welcome your feedback as this is still a work in progress. Implementation
of portions of this design is underway, but is by no means finalized.
When I have some code worth sharing I will make a follow-up posting.
I am extremely grateful to Michael Giarlo, who reviewed a draft of my
presentation and made many useful suggestions.