We're using bags to store "digital objects" on disk. In addition to the
usual manifests, tag files, and data directory, we're also stuffing a
couple of other files and directories into the root of each bag
(alongside the data directory): one for serialized metadata, one for
holding a JHOVE2 characterization, and one for version control (a .git
directory).
We're currently using Ed Summers's Python-based bag library and when we
call the validate() method, it barfs on any such dirs or files above the
data directory. I've poked around the spec a bit and didn't find
anything obvious governing this behavior, so I'm wondering: are these
files and directories strictly disallowed in the spec, or is this left
open for tool implementers to decide?
I'd like to be able to "overload" bags, if possible, though I wouldn't
be surprised to hear criticism that we're using the spec in ways for
which it wasn't intended. :)
-Mike
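A rough sketch of the layout being described and the validation call that currently rejects it; this assumes the bagit-python names Bag, validate(), and BagValidationError, and the extra file and directory names below are examples only, not Mike's actual ones:

# Layout in question: extra items sitting alongside data/ in the bag root.
#
#   mybag/
#     bagit.txt
#     manifest-md5.txt
#     data/                 <- payload
#     metadata.xml          <- example name for the serialized metadata
#     jhove2/               <- example name for the JHOVE2 characterization
#     .git/                 <- version control
import bagit

bag = bagit.Bag("mybag")
try:
    bag.validate()                        # currently objects to the extras
except bagit.BagValidationError as e:
    print("validation failed:", e)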
"The tags consist of one or more files named "manifest-
_algorithm_.txt", a file named "bagit.txt", and zero or more
additional files."
Kevin
"Note that tag files (including tag manifest files) can be added to or
removed from a bag without impacting the completeness or validity of
the bag as long as the tag files do not appear in a tag manifest."
So it seems to me that the bag should still be considered valid as
long as the serializations aren't listed in a tag manifest. But maybe
the fact that they aren't .txt files, encoded the same way, is an
issue?
I've only used the Ruby gem, but it ignores most anything that isn't
in data or in the manifest. That might be a shortcoming rather than
proper behavior, though?
Eby
Maybe I'll start by loosening up the bagit.validate method? Would that
work for you Mike?
//Ed
On Tue, Feb 22, 2011 at 11:39 AM, Kevin S. Clarke <kscl...@gmail.com> wrote:
http://oxfordrepo.blogspot.com/2008/12/archive-file-resiliences.html
Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library
On Tue, Feb 22, 2011 at 1:20 PM, Robert R. Downs
<rdo...@ciesin.columbia.edu> wrote:
> Please share any justification for choosing a particular format to serialize
> bags for preservation purposes. Pointers to any articles, reports, or
> descriptions of criteria used for choosing one format, either zip, gzip, or
> tar, instead of the others would be appreciated.
>
> Thanks,
>
> Bob Downs
-Mike
+1
> Maybe I'll start by loosening up the bagit.validate method? Would that
> work for you Mike?
Thinking aloud here, there are only three errors we'd care about in our
use case:
1. Checksum in manifest does not match checksum of payload file
2. Entry in manifest does not exist in payload
3. File in payload that does not have an entry in the manifest
Allowing extraneous files and dirs in the root bag directory would be
great, Ed!
-Mike
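A minimal sketch of checking just those three conditions against an md5 payload manifest, in plain Python rather than any particular bag library; anything else sitting in the bag root is simply ignored:

import hashlib
import os

def check_payload(bag_dir):
    errors = []

    # read the payload manifest: "<checksum>  <path>" per line
    manifest = {}
    with open(os.path.join(bag_dir, "manifest-md5.txt")) as f:
        for line in f:
            if not line.strip():
                continue
            checksum, path = line.strip().split(None, 1)
            manifest[path] = checksum

    # 1 & 2: every manifest entry exists and its checksum matches
    for path, expected in manifest.items():
        full = os.path.join(bag_dir, path)
        if not os.path.isfile(full):
            errors.append("in manifest but missing from payload: %s" % path)
            continue
        md5 = hashlib.md5()
        with open(full, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        if md5.hexdigest() != expected:
            errors.append("checksum mismatch: %s" % path)

    # 3: every payload file has a manifest entry
    for root, _, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), bag_dir)
            if rel.replace(os.sep, "/") not in manifest:
                errors.append("in payload but not in manifest: %s" % rel)

    return errors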
It's an interesting topic, with a number of angles... For instance, if
one were to concatenate a bag into a single file and compress the
resulting file, is there any reason to think that running a checksum
on the resulting file isn't enough for at least a preservation sanity
check?

It also begs the question about which costs are highest over the long
term: disk costs for storing, RAM and CPU costs for compressing and
uncompressing, or something else entirely?

My intuition is that whatever is sitting on disk should have some
checksums on disk, too.

-db.
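A sketch of that sanity check as described: serialize the whole bag to one compressed file and keep a checksum of the result alongside it (names are illustrative, and hashlib.file_digest needs Python 3.11+):

import hashlib
import tarfile

with tarfile.open("mybag.tar.gz", "w:gz") as tar:
    tar.add("mybag")                      # the whole bag, data/ and all

with open("mybag.tar.gz", "rb") as f:
    digest = hashlib.file_digest(f, "sha256")
print(digest.hexdigest())                 # store this next to the tarball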
Brian
On Feb 22, 2011, at 12:50, Ed Summers <e...@pobox.com> wrote:
> On Tue, Feb 22, 2011 at 2:45 PM, Brian Vargas <br...@ardvaark.net> wrote:
>> Couple that with some error correcting codes (like par2) and it seems likely
>> you could achieve greater reliability and recoverability in the same (or
>> less) disk space.
>
> Uncouple it from error correcting codes (like par2), and compression
> arguably increases the likelihood of data loss:
>
> The process of compressing (encoding) a file has
> profound consequences for attempts to mitigate against
> loss. A consequence of removal of redundancy is that
> the remaining data is all very significant – because a
> compression process is entirely an attempt to eliminate
> insignificant data. If one byte of the resultant file is then
> damaged, that byte is then very likely to be
> involved in computations (the decoding or
> decompressing process) that will affect many other
> bytes. Encoding a file severely affects the ability to use
> corrupted data as a method of reducing the impact of
> error. [1]
>
> //Ed
>
> [1] http://www.bl.uk/ipres2008/presentations_day1/21_Wright.pdf
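One quick way to see the effect the quoted passage describes, as a sketch: flip a single byte somewhere in a gzip stream and try to get the data back (depending on where the flip lands you get an exception, or output that fails the trailing CRC check):

import gzip
import os
import zlib

original = os.urandom(1024) + b"some record text\n" * 10000   # arbitrary data
damaged = bytearray(gzip.compress(original))
damaged[len(damaged) // 2] ^= 0xFF        # corrupt one byte mid-stream

try:
    recovered = gzip.decompress(bytes(damaged))
    print("recovered intact:", recovered == original)
except (OSError, EOFError, zlib.error) as e:
    print("decompression failed:", e)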
And if you use a self-healing filesystem (like ZFS, HDFS, etc.), it
could be argued that the ECC work is already being done for you, I
guess.
It definitely would make for an interesting comparison. How would you
run the experiment if you were to do it?
//Ed
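One way the comparison could be run, as a very rough sketch: flip one random byte per trial and measure how much of the original payload is still recoverable, for an uncompressed copy versus a gzip-compressed one (the byte-for-byte comparison is a crude proxy, but it shows the blast radius):

import gzip
import os
import random

payload = os.urandom(512) + b"sample record\n" * 5000

def flip_one_byte(blob):
    data = bytearray(blob)
    data[random.randrange(len(data))] ^= 0xFF
    return bytes(data)

def recovered_uncompressed():
    damaged = flip_one_byte(payload)
    return sum(a == b for a, b in zip(payload, damaged)) / len(payload)

def recovered_compressed():
    damaged = flip_one_byte(gzip.compress(payload))
    try:
        out = gzip.decompress(damaged)
    except Exception:
        return 0.0                        # stream unreadable past the damage
    return sum(a == b for a, b in zip(payload, out)) / len(payload)

trials = 100
print("uncompressed:", sum(recovered_uncompressed() for _ in range(trials)) / trials)
print("compressed:  ", sum(recovered_compressed() for _ in range(trials)) / trials)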
Those seem like reasonable suggestions to me, Brian.
I can't think of any good rationale in favor of either SHOULD vs. MAY,
so I'm ambivalent -- I'm sure others on this list will have an opinion,
though. :)
-Mike
Just a suggestion for using the keywords in IETF RFC 2119: Key words
for use in RFCs to Indicate Requirement Levels
( http://www.ietf.org/rfc/rfc2119.txt ). This standard clearly
distinguishes between SHOULD and MAY.
(And maybe this comment is irrelevant, I have not been following all
the discussion in detail).
- WLA
Specifically, I would expect a 10 byte file to be 10x more likely to
encounter an uncorrelated bit error than a 1 byte file. But if one is
just looking at a pool of storage as a blob, maybe it doesn't matter
if it's one ten byte file or ten one byte files.
Anyone have some intuition about that?
-db.
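A rough check of that intuition, assuming independent bit errors with a made-up per-bit error probability p: the chance of at least one error in a file is 1 - (1 - p)^bits, which is roughly p * bits while that product is small, so the 10x scaling holds.

import math

p = 1e-15                                     # hypothetical per-bit error rate
for size_bytes in (1, 10, 10_000_000):
    bits = 8 * size_bytes
    prob = -math.expm1(bits * math.log1p(-p))   # 1 - (1 - p)**bits, computed stably
    print("%10d bytes: %.3e" % (size_bytes, prob))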
I'd vote for the spec not being silent. Saying that a transfer tool is
only required to transfer the tag files and files listed in the
manifest seems like a good way to encourage people to explicitly
list anything they consider content.
Chris
Use case:
I would like to be able to lock a bag by placing a lock file
in the bag root during validation and post-validation
processing.
In addition to the three errors listed by Mike, I find
the Oxum very handy for initial payload integrity checks,
and tagmanifests give me that extra special data-integrity
feeling ;-) .
All best-
Joe
--
Joseph Pawletko
j...@nyu.edu
Software Systems Architect
Digital Library Technology Services (DLTS)
Bobst Library, New York University
(212) 992-9999
AIM: sn4jgp
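A sketch of the Oxum check Joe mentions: Payload-Oxum in bag-info.txt is "<octet count>.<stream count>" for everything under data/, so comparing it against a quick walk of the payload is a cheap first pass before full checksum validation:

import os

def payload_oxum(bag_dir):
    octets = 0
    count = 0
    for root, _, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octets += os.path.getsize(os.path.join(root, name))
            count += 1
    return "%d.%d" % (octets, count)

print(payload_oxum("mybag"))   # compare with the Payload-Oxum line in bag-info.txt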
http://tools.ietf.org/html/draft-kunze-bagit-05#section-2.1.2
"The base directory MUST contain exactly one sub-directory, called the
payload directory. The payload directory MUST be named "data".
There MUST NOT be any other sub-directories under the base directory."
I'm not a fan of this either and would like to see the restriction removed. Arguments for keeping it in the latest version of the spec that I recall included "it was implied earlier, now it's explicit, and that'll force the issue - if implementors don't want it, they'll say that" and "somebody's going to go and create a separate 'metadata' directory because they think 'data' and 'metadata' should be separate, and that will mean the metadata doesn't have a checksum manifest, or will be handled differently, and the goal is to handle all the data the same way."
Those aren't exact quotes.
The first concern seems to have proven accurate - sounds like several people might want to see the restriction removed.
-Dan (last tweaker of the spec)
I'd also be interested in the question of how file size correlates to failure rate, and whether it even matters for digital preservation purposes.
Hi,
I think that your files must be highly similar, which is why you are
seeing such high compression levels.
As a similar experiment, we have many ARC [1] files lying around. ARC
files consist of concatenated gzip streams. This allows random access
to the documents within the file while still storing them in a
compressed format. (You can seek to the beginning of a stream, then
start reading until the gzip stream ends.)
As an experiment, I uncompressed one ARC file, then gzipped it
normally. I would expect this to give somewhat similar results to your
comparison of individually compressed files v. tarred then compressed
files. But it does not:
100,575,814  original ARC file
100,131,159  gzipped as one stream
 99,657,521  bzip2ed as one stream
Note, however, that the documents the ARC file contains are highly
heterogeneous, so the results you get will depend greatly on the
content that you have.
best, Erik
1. http://crawler.archive.org/articles/developer_manual/arcs.html
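A toy sketch of the ARC-style trick Erik describes (not the ARC format itself): gzip each record individually, concatenate the members into one file, and later seek straight to a recorded offset and decompress just that member:

import gzip
import zlib

records = [b"record one\n", b"record two\n", b"record three\n"]

offsets = []
with open("concatenated.gz", "wb") as out:
    for rec in records:
        offsets.append(out.tell())
        out.write(gzip.compress(rec))     # each record is its own gzip member

with open("concatenated.gz", "rb") as f:
    f.seek(offsets[1])                    # jump straight to the second record
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)   # expect a gzip wrapper
    print(d.decompress(f.read()))         # stops at the end of that member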
I remain unconvinced by this argument.
To begin with, it assumes some things that I have not seen evidence
for. One, that we are dealing with file formats in which a single bit
error can be silently ignored. (This is, I suppose, largely true of
XML; but what about, e.g. TIFF files?) Two, we assume bit errors are
independent (the model seems to be a single bit flipping). All the
evidence that we have suggests otherwise. (See, e.g., [1, 2, 3]).
Furthermore, if we are using checksums to verify data, we have no way
(at least with the cryptographic checksums that are most popular) of
knowing how many errors there were. A single bit error is
indistinguishable from an error in which every single bit is changed.
Any error at all, without error correcting coding, will mean that the
checksummed bytestream should be considered completely unreliable,
whether it is compressed or not.
Even without error correcting codes, I don’t think the arguments for
storing only uncompressed data as a matter of policy are strong at
all.
When we take error correcting codes into account, not compressing your
data as a policy in order to keep a higher level of redundancy seems
like the worst way to increase the redundancy of the data. Smart
people have figured out how to make codes which can reliably correct
limited errors in bytestreams. Why not use them?
As an example, take an ARC file [4]. Its original, uncompressed size
(in bytes) is:
130,710,084
Let’s compress it, then use par2 to generate some redundant data which
can actually be used to correct errors:
100,399,393  compressed
  6,158,320  par2 recovery files
-----------
106,557,713  total
That is still significantly smaller than the uncompressed file.
If I understand the par2 man page correctly (somewhat doubtful) this
means that we can have up to 100 errors in the original file,
comprising ~5MB of errors. All of this for less space than we were
using for the uncompressed data! And this for data which does not
compress very well (presumably there are a lot of JPEG and similar
files in that ARC file).
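Just to make the arithmetic explicit (the byte counts are the ones quoted above):

uncompressed = 130_710_084
compressed = 100_399_393
par2 = 6_158_320

total = compressed + par2                 # 106,557,713 bytes
saved = uncompressed - total
print("saved vs. storing uncompressed: %d bytes (%.1f%%)"
      % (saved, 100.0 * saved / uncompressed))
print("par2 overhead on the compressed file: %.1f%%" % (100.0 * par2 / compressed))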
I see no reason to store, as a matter of policy, uncompressed files on
our disks. In fact, I think we should be more aggressive about
compressing files.
best, Erik
1. Lakshmi N. Bairavasundaram et al., “An analysis of data corruption
in the storage stack,” 6th USENIX Conference on File and Storage
Technologies, San Jose, Ca., 2008.
2. Jim Gray and Catharine van Ingen, Empirical Measurements of Disk
Failure Rates and Error Rates, Microsoft Research Technical Report
(Microsoft Research, January 25, 2007),
http://arxiv.org/abs/cs/0701166.
3. Lakshmi N. Bairavasundaram et al., “An Analysis of Latent Sector
Errors in Disk Drives,” SIGMETRICS’07, San Diego, California, 2007,
http://www.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf.
4. http://crawler.archive.org/articles/developer_manual/arcs.html
I concur with Erik. JPEG compression is typically 10:1 or better.
What will yield the greatest reliability: one copy of an uncompressed
image, or 10 separate copies of a compressed image?
-Greg
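A back-of-the-envelope way to frame that question: if each stored copy is independently lost with probability p over some period (p is made up here, and real failures are rarely independent, as the papers Erik cites show), then the single uncompressed copy is lost with probability p while all ten compressed copies are lost only with probability p^10.

p = 0.01                                   # hypothetical per-copy loss probability
print("single uncompressed copy lost:", p)
print("all ten compressed copies lost:", p ** 10)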
Hi Simon,
Thanks for the detailed response. To be perfectly honest, I just
wanted to raise the point that the composition of your data will have
a huge effect on the amount that your data can be compressed,
including the relative gain of compressing a tar file of the contents
as compared with compressing each file by itself.
best, Erik