Unexpected files/dirs in BagIt

Michael J. Giarlo

Feb 22, 2011, 11:24:41 AM
to digital-...@googlegroups.com
Hi,

We're using bags to store "digital objects" on disk. In addition to the
usual manifests, tag files, and data directory, we're also stuffing a
couple other files and directories in the root of each bag (above the
data directory) for serialization of metadata, for holding a JHOVE2
characterization, and for version control (a .git directory).

We're currently using Ed Summers's Python-based bag library and when we
call the validate() method, it barfs on any such dirs or files above the
data directory. I've poked around the spec a bit and didn't find
anything obvious governing this behavior, so I'm wondering: are these
files and directories strictly disallowed in the spec, or is this left
open for tool implementers to decide?

I'd like to be able to "overload" bags, if possible, though I wouldn't
be surprised to hear criticism that we're using the spec in ways for
which it wasn't intended. :)

-Mike

Kevin S. Clarke

Feb 22, 2011, 11:39:21 AM
to digital-...@googlegroups.com
I took this (perhaps incorrectly?) from the "BagIt directory layout"
section to mean that other files were permitted:

"The tags consist of one or more files named "manifest-
_algorithm_.txt", a file named "bagit.txt", and zero or more
additional files."

Kevin

Ryan Eby

Feb 22, 2011, 12:20:21 PM
to digital-...@googlegroups.com
My reading of it is that all files above the data dir need to be .txt
tag files, which would mean the examples given by mjgiarlo would be
invalid. However, the presence or absence of any of those files shouldn't
affect validity or completeness:

"Note that tag files (including tag manifest files) can be added to or
removed from a bag without impacting the completeness or validity of
the bag as long as the tag files do not appear in a tag manifest."

So it seems to me that the bag should still be considered valid as
long as the serializations aren't in a tag manifest. But maybe the fact
that they aren't .txt files, encoded the same way, is an issue?

I've only used the Ruby gem, but it ignores most anything that isn't in
data or in the manifest. That might be a shortcoming rather than proper
behavior, though.

Eby

Ed Summers

Feb 22, 2011, 1:21:18 PM
to digital-...@googlegroups.com, Kevin S. Clarke
Personally, I would like to see the BagIt spec adjusted to make it
clear that other files and directories are allowed at the same level
as the payload directory. I know there was some discussion about this
in the past, which I never quite understood.

Maybe I'll start by loosening up the bagit.validate method? Would that
work for you Mike?

//Ed

On Tue, Feb 22, 2011 at 11:39 AM, Kevin S. Clarke <kscl...@gmail.com> wrote:

David Brunton

Feb 22, 2011, 1:30:46 PM
to digital-...@googlegroups.com
Ed already knows this, but I'm with him on this one.  I would like to see a directory structure like this:

bag-1/
  bag-info.txt
  bagit.txt
  data/
    file-1
    file-2
  .git/
    crazy-git-stuff-1
    crazy-git-stuff-2

be treated just like any other bag, with the caveat that any given utility for moving bags around may or may not promise to deal with those files gracefully.  So, for instance, if I wrote a bag transferring tool that looked in the manifest to decide what to transfer, it might not transfer the .git/ directory.  Or it might.  I would be in favor of leaving that out of the spec.

This would have the advantage of being able to e.g. git clone bag-1, and it would act just like a valid bag and just like a valid git repo (or hg or svn or darcs or whatever).
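
A toy sketch in Python of that manifest-driven transfer idea (the function name, the tag-file patterns, and the manifest parsing are hypothetical, not any existing tool):

import fnmatch
import os
import shutil

TAG_PATTERNS = ("bagit.txt", "bag-info.txt", "manifest-*.txt", "tagmanifest-*.txt")

def transfer_bag(src, dst, algorithm="md5"):
    """Copy the standard tag files plus everything the payload manifest lists.
    Anything else in the bag root (.git/, lock files, ...) is simply not copied."""
    to_copy = [name for name in os.listdir(src)
               if any(fnmatch.fnmatch(name, pat) for pat in TAG_PATTERNS)]
    with open(os.path.join(src, "manifest-%s.txt" % algorithm)) as f:
        to_copy += [line.split(None, 1)[1].strip() for line in f if line.strip()]
    for rel in to_copy:
        target = os.path.join(dst, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(src, rel), target)

A tool written this way would leave the .git/ directory behind, exactly as described above; whether that is the right behavior is the open question.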

Just my $0.02.

-db.

Mark A. Matienzo

Feb 22, 2011, 1:52:14 PM
to digital-...@googlegroups.com
Ben O'Steen wrote a blog post on the resiliency of tar, zip, and tar/gzip
a few years ago:

http://oxfordrepo.blogspot.com/2008/12/archive-file-resiliences.html

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library

On Tue, Feb 22, 2011 at 1:20 PM, Robert R. Downs
<rdo...@ciesin.columbia.edu> wrote:
> Please share any justification for choosing a particular format to serialize
> bags for preservation purposes. Pointers to any articles, reports, or
> descriptions of criteria used for choosing one format, either zip, gzip, or
> tar, instead of the others would be appreciated.
>
> Thanks,
>
> Bob Downs

Michael J. Giarlo

Feb 22, 2011, 2:18:57 PM
to digital-...@googlegroups.com
I'll raise a related question: has anyone thought through the
preservation costs and benefits of compressing a bag vs. leaving it laid
out on disk? One benefit of leaving the bag uncompressed is you can
check manifests without having to uncompress and recompress the bag.

-Mike

Michael J. Giarlo

Feb 22, 2011, 2:24:07 PM
to digital-...@googlegroups.com
On 02/22/2011 01:21 PM, Ed Summers wrote:
> Personally, I would like to see the BagIt spec adjusted to make it
> clear that other files and directories are allowed at the same level
> as the payload directory.

+1

> Maybe I'll start by loosening up the bagit.validate method? Would that
> work for you Mike?

Thinking aloud here, there are only three errors we'd care about in our
use case (rough sketch below):

1. Checksum in manifest does not match checksum of payload file
2. Entry in manifest does not exist in payload
3. File in payload does not have an entry in the manifest
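
A rough standalone sketch of those three checks in Python (not the actual bagit.py API; it assumes an md5 manifest whose lines are "checksum<whitespace>path", and it simply ignores anything in the bag root that isn't listed):

import hashlib
import os

def check_payload(bag_dir, algorithm="md5"):
    """Report only the three payload errors listed above; extra files and
    directories in the bag root (.git/, serialized metadata, ...) are ignored."""
    errors = []
    manifest = {}
    with open(os.path.join(bag_dir, "manifest-%s.txt" % algorithm)) as f:
        for line in f:
            if not line.strip():
                continue
            checksum, path = line.strip().split(None, 1)
            manifest[path] = checksum

    # Errors 1 and 2: every manifest entry exists and matches its checksum.
    for path, expected in manifest.items():
        full = os.path.join(bag_dir, path)
        if not os.path.isfile(full):
            errors.append("missing from payload: " + path)
            continue
        h = hashlib.new(algorithm)
        with open(full, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        if h.hexdigest() != expected:
            errors.append("checksum mismatch: " + path)

    # Error 3: every payload file appears in the manifest.
    for root, _, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), bag_dir)
            if rel.replace(os.sep, "/") not in manifest:
                errors.append("not in manifest: " + rel)

    return errors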

Allowing extraneous files and dirs in the root bag directory would be
great, Ed!

-Mike

Brian Vargas

Feb 22, 2011, 2:45:02 PM
to digital-...@googlegroups.com, digital-...@googlegroups.com
Assuming your data is actually amenable to compression...

Checksums of compressed data have the nice side effect of requiring fewer overall resources (IO, CPU, RAM) to check validity, at the tradeoff of additional resources up front to compress the data, compute the new checksums, and verify valid compression. So for static bags that will be written once and verified indefinitely into the future, you would be certain to pass the point of positive return eventually.

Couple that with some error correcting codes (like par2) and it seems likely you could achieve greater reliability and recoverability in the same (or less) disk space.

Brian


On Feb 22, 2011, at 12:25, David Brunton <dbru...@gmail.com> wrote:

It's an interesting topic, with a number of angles...

For instance, if one were to concatenate a bag into a single file, and compress the resulting file, is there any reason to think that running a checksum on the resulting file isn't enough for at least a preservation sanity check?

It also begs the question about which costs are highest over the long-term: disk costs for storing, ram and cpu costs for compressing and uncompressing, or something else entirely?

My intuition is that whatever is sitting on disk should have some checksums on disk, too.

-db.

Brian Vargas

Feb 22, 2011, 2:55:37 PM
to digital-...@googlegroups.com
I remember vociferous debates on the issue of top-level files, but for the life of me I don't recall now what the salient points were. In my current state of mind, I would think it reasonable for implementors to simply ignore what they can't handle at the "protocol" layer (i.e. the top-level bag directory). What if we add a new section, "2.2.5 Other Files and Directories", which defines that implementations SHOULD (or maybe MAY?) ignore any files or folders not defined in the spec, except for verification if they appear in a tag manifest? We would also need to add some corresponding verbiage in Section 2.

Brian

Brian Vargas

Feb 22, 2011, 3:07:43 PM
to digital-...@googlegroups.com, digital-...@googlegroups.com
Indeed. That was the kind of thinking that led to the fully uncompressed TIFFs for NDNP, a choice made correctly. But ECCs change the equation. I think there is a research paper there: compare the reliability of compression combined with ECCs versus bare files versus no-ECC compression.

Brian


On Feb 22, 2011, at 12:50, Ed Summers <e...@pobox.com> wrote:

> On Tue, Feb 22, 2011 at 2:45 PM, Brian Vargas <br...@ardvaark.net> wrote:
>> Couple that with some error correcting codes (like par2) and it seems likely
>> you could achieve greater reliability and recoverability in the same (or
>> less) disk space.
>

> Uncouple it from error correcting codes (like par2) then compression
> arguably increases the likelihood of data loss:
>
> The process of compressing (encoding) a file has
> profound consequences for attempts to mitigate against
> loss. A consequence of removal of redundancy is that
> the remaining data is all very significant – because a
> compression process is entirely an attempt to eliminate
> insignificant data. If one byte of the resultant file is then
> damaged, that byte is then very likely to be used
> in computations (the decoding or
> decompressing process) that will affect many other
> bytes. Encoding a file severely affects the ability to use
> corrupted data as a method of reducing the impact of
> error. [1]
>
> //Ed
>
> [1] http://www.bl.uk/ipres2008/presentations_day1/21_Wright.pdf

Ed Summers

Feb 22, 2011, 3:20:49 PM
to digital-...@googlegroups.com
On Tue, Feb 22, 2011 at 3:07 PM, Brian Vargas <br...@ardvaark.net> wrote:
> Indeed. That was the kind of thinking that lead to the fully uncompressed TIFFs for NDNP, a choice made correctly. But ECCs change the equation. I think there is a research paper there: Compare the reliability of compression combined with ECCs versus bare files versus no-ECC compression.

And if you use a self-healing filesystem (ZFS, HDFS, etc.), it could be
argued that the ECC work is already being done for you, I guess.

It definitely would make for an interesting comparison. How would you
run the experiment if you were to do it?

//Ed

Michael J. Giarlo

Feb 22, 2011, 3:28:35 PM
to digital-...@googlegroups.com
On 02/22/2011 02:55 PM, Brian Vargas wrote:
> I remember vociferous debates on the issue of top-level files, but
> for the life of me I don't recall now what the salient points were.
> In my current state of mind, I would think it reasonable for
> implementors to simply ignore what they can't handle at the
> "protocol" layer (i.e. the top-level bag directory). What if we add
> a new section 2.2.5 Other Files and Directories which defines that
> implementations SHOULD (or maybe MAY?) ignore any files or folders
> not defined in the spec, except for verification if they appear in a
> tag manifest? We would also need to add some corresponding verbage in
> Section 2.
>

Those seem like reasonable suggestions to me, Brian.

I can't think of any good rationale in favor of either SHOULD or MAY,
so I'm ambivalent -- I'm sure others on this list will have an opinion,
though. :)

-Mike

Ed Summers

Feb 22, 2011, 2:50:53 PM
to digital-...@googlegroups.com
On Tue, Feb 22, 2011 at 2:45 PM, Brian Vargas <br...@ardvaark.net> wrote:
> Couple that with some error correcting codes (like par2) and it seems likely
> you could achieve greater reliability and recoverability in the same (or
> less) disk space.

Uncouple it from error correcting codes (like par2) then compression
arguably increases the likelihood of data loss:

[...]

//Ed

Bill Anderson

Feb 22, 2011, 4:27:20 PM
to digital-...@googlegroups.com
On Tue, Feb 22, 2011 at 2:28 PM, Michael J. Giarlo <mic...@psu.edu> wrote:
> On 02/22/2011 02:55 PM, Brian Vargas wrote:
>> [ ... ]

>
> I can't think of any good rationale in favor of either SHOULD vs. MAY, so
> I'm ambivalent -- I'm sure others on this list will have an opinion, though.
> :)
>
> -Mike
>

Just a suggestion: use the keywords as defined in IETF RFC 2119, "Key words
for use in RFCs to Indicate Requirement Levels"
( http://www.ietf.org/rfc/rfc2119.txt ). This standard clearly
distinguishes between SHOULD and MAY.

(And maybe this comment is irrelevant; I have not been following all
the discussion in detail.)

- WLA

David Brunton

Feb 22, 2011, 3:48:16 PM
to digital-...@googlegroups.com
I'd also be interested in the question of how file size correlates to
failure rate, and whether it even matters for digital preservation
purposes.

Specifically, I would expect a 10 byte file to be 10x more likely to
encounter an uncorrelated bit error than a 1 byte file. But if one is
just looking at a pool of storage as a blob, maybe it doesn't matter
if it's one ten byte file or ten one byte files.

Anyone have some intuition about that?

-db.

Chris Adams

Feb 22, 2011, 3:01:18 PM
to digital-...@googlegroups.com
On Tue, Feb 22, 2011 at 1:30 PM, David Brunton <dbru...@gmail.com> wrote:
> gracefully.  So, for instance, if I wrote a bag transferring tool that
> looked in the manifest to decide what to transfer, it might not transfer the
> .git/ directory.  Or it might.  I would be in favor of leaving that out of
> the spec.

I'd vote for the spec not being silent. Saying that a transfer tool is
only required to transfer the tag files and files listed in the
manifest seems like a good way to encourage people to explicitly
list anything they consider content.

Chris

David Brunton

Feb 22, 2011, 2:25:48 PM
to digital-...@googlegroups.com
It's an interesting topic, with a number of angles...

For instance, if one were to concatenate a bag into a single file, and compress the resulting file, is there any reason to think that running a checksum on the resulting file isn't enough for at least a preservation sanity check?

It also begs the question about which costs are highest over the long-term: disk costs for storing, ram and cpu costs for compressing and uncompressing, or something else entirely?

My intuition is that whatever is sitting on disk should have some checksums on disk, too.

-db.

j.g. pawletko

Feb 22, 2011, 5:43:06 PM
to digital-...@googlegroups.com
+1

Use case:
I would like to be able to lock a bag by placing a lock
in the bag root during validation and post-validation
processing.

In addition to the three errors listed by Mike, I find
the Oxum very handy for initial payload integrity checks,
and tagmanifests give me that extra special data-integrity
feeling ;-) .

All best-
Joe

--
Joseph Pawletko
j...@nyu.edu

Software Systems Architect
Digital Library Technology Services (DLTS)
Bobst Library, New York University
(212) 992-9999
AIM: sn4jgp


Dan Chudnov

Feb 22, 2011, 9:36:51 PM
to digital-...@googlegroups.com
Sorry if I missed somebody pointing this out already, but to be clear, the latest version of the spec explicitly disallows additional subdirectories under the base directory:

http://tools.ietf.org/html/draft-kunze-bagit-05#section-2.1.2

"The base directory MUST contain exactly one sub-directory, called the
payload directory. The payload directory MUST be named "data".
There MUST NOT be any other sub-directories under the base directory."


I'm not a fan of this either and would like to see the restriction removed. Arguments for keeping it in the latest version of the spec that I recall included "it was implied earlier, now it's explicit, and that'll force the issue - if implementors don't want it, they'll say that" and "somebody's going to go and create a separate 'metadata' directory because they think 'data' and 'metadata' should be separate, and that will mean the metadata doesn't have a checksum manifest, or will be handled differently, and the goal is to handle all the data the same way."

Those aren't exact quotes.

The first concern seems to have proven accurate - sounds like several people might want to see the restriction removed.

-Dan (last tweaker of the spec)

Simon Spero

Feb 22, 2011, 10:47:13 PM
to digital-...@googlegroups.com, David Brunton
On Tue, Feb 22, 2011 at 3:48 PM, David Brunton <dbru...@gmail.com> wrote:
I'd also be interested in the question of how file size correlates to failure rate, and whether it even matters for digital preservation purposes.

Specifically, I would expect a 10 byte file to be 10x more likely to encounter an uncorrelated bit error than a 1 byte file.  But if one is just looking at a pool of storage as a blob, maybe it doesn't matter if it's one ten byte file or ten one byte files.

This isn't quite right, but it's close for low error rates: if the byte error rate is p, then the error rate for n bytes is p_n = 1 - (1 - p)^n (assuming independence).

This is similar to estimating the survivability of dispersed, redundant replicas.

If there are r replicas, each corrupt with probability p_n, then the probability of at least one replica being OK is 1 - p_n^r. So, if there were a 50% chance of a given replica being corrupt, then the chance of at least one of three replicas being OK is 1 - 0.5^3, or 7/8.
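
A quick check of that arithmetic in Python (the 1e-12 per-byte error rate in the last line is just an assumed figure for illustration):

def p_object_corrupt(p, n):
    """Probability that an n-byte object suffers at least one byte error,
    given per-byte error probability p and independent errors."""
    return 1 - (1 - p) ** n

def p_some_replica_ok(p_corrupt, r):
    """Probability that at least one of r independent replicas is intact."""
    return 1 - p_corrupt ** r

print(p_some_replica_ok(0.5, 3))           # 0.875, i.e. 7/8 as above
print(p_object_corrupt(1e-12, 392579177))  # a ~392 MB object: roughly 4e-4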

This gives us an alternative way of looking at the original question.

First, some baseline numbers:

As a sample data set, I'll use a bundle of files from the NARA ARC; specifically, ArchivalDescriptionsPart10.tar (because it's the smallest). The data files are all XML documents.

There are 63,665 files, containing 392,579,177 bytes. Bagging overhead can be ignored for the moment.

Under Mac OS X, the following on-disk sizes were observed (the second figure is the size relative to the bzip2-compressed tar file):

524M  (21.0x)  ArchivalDescriptionsPart10           (unpacked, uncompressed files)
421M  (16.9x)  ArchivalDescriptionsPart10.tar       (tarred, uncompressed file)
251M  (10.0x)  ArchivalDescriptionsPart10-bzed      (unpacked, bzip2-compressed files)
159M   (6.4x)  ArchivalDescriptionsPart10-bzed.tar  (tarred bzip2-compressed files)
 25M   (1.0x)  ArchivalDescriptionsPart10.tar.bz2   (bzip2-compressed tar file)

For the same storage cost as one uncompressed tar file, you could have sixteen copies of the bzipped archive. 

One problem with using the compressed tar file is that it can only be accessed sequentially. That means that in order to access any individual item in the bag, you have to uncompress the bag until you reach that item, potentially right at the very bottom.

The tar of compressed files doesn't have this problem, but is over six times the size, as the compression dictionary is not shared between files.

It is possible to use zlib with a pre-configured dictionary to allow for better compression whilst still retaining the ability to serve files with random access.  This would be a handy thing to stick in a bag.
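
A minimal sketch of zlib's preset-dictionary support in Python (the dictionary contents here are made-up XML boilerplate; a real dictionary would be built from the corpus and would have to travel with the bag so readers can decompress):

import zlib

# Hypothetical shared dictionary: byte strings common across the file set.
SHARED_DICT = (b'<?xml version="1.0" encoding="UTF-8"?>'
               b'<record><title></title><date></date></record>')

def compress_with_dict(data):
    c = zlib.compressobj(level=9, zdict=SHARED_DICT)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob):
    # The decompressor must be given the same dictionary.
    d = zlib.decompressobj(zdict=SHARED_DICT)
    return d.decompress(blob) + d.flush()

doc = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<record><title>Example</title><date>2011-02-22</date></record>')
packed = compress_with_dict(doc)
assert decompress_with_dict(packed) == doc
print(len(doc), "->", len(packed))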

Even without this approach, it is still better to store compressed files in a bag with checksums taken over the compressed files (only the bagger needs to decompress and checksum to make sure that the compression process didn't introduce errors).  

[One of the performance mistakes iRODS makes is forcing complete tar files to be unpacked to a separate caching system, instead of saving a seek pointer and returning the data directly from the archive.]

Simon  

Simon Spero

Feb 24, 2011, 10:06:03 PM
to digital-...@googlegroups.com
On Tue, Feb 22, 2011 at 10:47 PM, Simon Spero <s...@unc.edu> wrote:
[ Some rough numbers using Part10 of the NARA ARC records as a data set]

Not a retraction of my previous numbers, but I wanted to take a look at the size distributions of the files, and noticed some serious outliers. Some of these were records for RecordGroups that have huge numbers of duplicate physical mailing addresses. These give unrealistically high compression ratios (99%+).

There is also one 1.5M Series record that contains a complete copy of Homer's Odyssey translated into l33t speak. Really. http://arcweb.archives.gov/arc/action/ExternalIdSearch?id=513053 , choose the 'Archived copies' tab, then select 'view container list'.

I think I might use a different data set to generate metrics, or I may just trim the upper percentiles of sizes; I was looking for a collection with a large number of small XML files, as that setup is the best case for packaged bags, and is a difficult case for compression algorithms, as by the time they have a decent compression dictionary, the file is already over.

Storing a file at the end of a tar-bag with offsets to the start of each entry allows rapid access to individual entries without breaking backwards compatibility; this is no mere bag-o-tells.
 
Simon

Ferran Jorba

Feb 25, 2011, 6:06:13 AM
to Digital Curation
Hi,

[...]
> It is possible to use zlib with a pre-configured dictionary to allow for
> better compression whilst still retaining the ability to serve files with
> random access.  This would be a handy thing to stick in a bag.

dictzip(1) from the dict (http://dict.org) project implements this.
From the man page:

   dictzip compresses files using the gzip(1) algorithm (LZ77) in a
   manner which is completely compatible with the gzip file format. An
   extension to the gzip file format (Extra Field, described in 2.3.1.1
   of RFC 1952) allows extra data to be stored in the header of a
   compressed file. Programs like gzip and zcat will ignore this extra
   data. However, dictd(8), the DICT protocol dictionary server, will
   make use of this data to perform pseudo-random access on the file.
   Files in the dictzip format should end in ".dz" so that they may be
   distinguished from common gzip files that do not contain the special
   header information.

Debian packages it independently of the dictd server:

http://packages.debian.org/dictzip

Ferran

Erik Hetzner

Feb 25, 2011, 1:56:43 PM
to digital-...@googlegroups.com
At Tue, 22 Feb 2011 22:47:13 -0500,
Simon Spero wrote:
>
> […]

>
> This gives us an alternative way of looking at the original question..
>
> First, some base-line numbers:
>
> As a sample data set, I'll use a bundle of files from the NARA ARC;
> specifically, ArchivalDescriptionsPart10.tar (because it's the smallest).
> The data files are all XML documents.
>
> There are 63,665 files, containing 392,579,177 bytes . bagging overhead can
> be ignored for the moment.
>
> Under MacOS X, the following on disk sizes were observed:
>
> 524M ArchivalDescriptionsPart10
> 21.0 (unpacked, uncompressed files)
> 421M ArchivalDescriptionsPart10.tar
> 16.9 (tarred, uncompressed file)
> 251M ArchivalDescriptionsPart10-bzed
> 10.0 (unpacked, bzip2 compressed files)
> 159M ArchivalDescriptionsPart10-bzed.tar
> 6.4 (tarred bzip2 compressed files)
> 25M ArchivalDescriptionsPart10.tar.bz2
> 1.0 (bzip2 compressed tar file)
>
> […]

Hi,

I think that your files must be highly similar, which is why you are
seeing such high compression levels.

As a similar experiment, we have many ARC [1] files lying around. ARC
files consist of concatenated gzip streams. This allows random access
to the documents within the file while still storing them in a
compressed format. (You can seek to the beginning of a stream, then
start reading until the gzip stream ends.)
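
The same trick is easy to sketch in Python: write independently gzipped members into one stream, remember each member's offset, and decompress a single member on demand (record contents and helper names here are made up):

import gzip
import io
import zlib

records = [b"record one ...", b"record two ...", b"record three ..."]

# Concatenate independently gzipped members, remembering where each starts.
buf = io.BytesIO()
offsets = []
for rec in records:
    offsets.append(buf.tell())
    buf.write(gzip.compress(rec))

def read_member(data, offset):
    """Decompress just the gzip member starting at the given offset."""
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect a gzip header
    out = bytearray()
    while not d.eof:
        chunk = data[offset:offset + 4096]
        if not chunk:
            break
        out += d.decompress(chunk)
        offset += len(chunk)
    return bytes(out)

print(read_member(buf.getvalue(), offsets[1]))  # b'record two ...'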

As an experiment, I uncompressed one ARC file, then gzipped it
normally. I would expect this to give somewhat similar results to your
comparison of individually compressed files v. tarred then compressed
files. But it does not:

100,575,814 original ARC file
100,131,159 gzipped as one stream
 99,657,521 bzip2ed as one stream

Note, however, that the documents the ARC file contains are highly
heterogeneous. So the results you get should depend greatly on the
content that you have.

best, Erik

1. http://crawler.archive.org/articles/developer_manual/arcs.html

Erik Hetzner

Feb 25, 2011, 2:42:47 PM
to digital-...@googlegroups.com
At Tue, 22 Feb 2011 14:50:53 -0500, Ed Summers wrote:

I remain unconvinced by this argument.

To begin with, it assumes some things that I have not seen evidence
for. One, that we are dealing with file formats in which a single bit
error can be silently ignored. (This is, I suppose, largely true of
XML; but what about, e.g. TIFF files?) Two, we assume bit errors are
independent (the model seems to be a single bit flipping). All the
evidence that we have suggests otherwise. (See, e.g., [1, 2, 3]).

Furthermore, if we are using checksums to verify data, we have no way
(at least with the cryptographic checksums that are most popular) of
knowing how many errors there were. A single bit error is
indistinguishable from an error in which every single bit is changed.
Any error at all, without error correcting coding, will mean that the
checksummed bytestream should be considered completely unreliable,
whether it is compressed or not.
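
A quick illustration of that point with Python's hashlib:

import hashlib

original = b"some payload bytes " * 1000
one_bit_flipped = bytearray(original)
one_bit_flipped[0] ^= 0x01                 # a single flipped bit
completely_garbled = bytes(len(original))  # every byte zeroed out

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(bytes(one_bit_flipped)).hexdigest())  # completely different digest
print(hashlib.md5(completely_garbled).hexdigest())      # also completely different
# The digest only says "changed"; it says nothing about how much changed.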

Even without error correcting codes, I don't think the arguments for
storing only uncompressed data as a matter of policy are strong at
all.

When we take error correcting codes into account, not compressing your
data as a policy in order to keep a higher level of redundancy seems
like the worst way to increase the redundancy of the data. Smart
people have figured out how to make codes which can reliably correct
limited errors in bytestreams. Why not use them?

As an example, take an ARC file [4]. Its original, uncompressed size
(in bytes) is:

130710084

Let’s compress it, then use par2 to generate some redundant data which
can actually be used to correct errors:

100399393 compressed
6158320 par2 recovery files
---------
106557713

That is still significantly smaller than the uncompressed file.

If I understand the par2 man page correctly (somewhat doubtful) this
means that we can have up to 100 errors in the original file,
comprising ~5MB of errors. All of this for less space that we were
using on uncompressed data! And this for data which does not compress
very well (presumably there are a lot of JPEG and similar files in
that ARC file)

I see no reason to store, as a matter of policy, uncompressed files on
our disks. In fact, I think we should be more aggressive about
compressing files.

best, Erik

1. Lakshmi N. Bairavasundaram et al., “An analysis of data corruption
in the storage stack,” 6th USENIX Conference on File and Storage
Technologies, San Jose, Ca., 2008.

2. Jim Gray and Catharine van Ingen, Empirical Measurements of Disk
Failure Rates and Error Rates, Microsoft Research Technical Report
(Microsoft Research, January 25, 2007),
http://arxiv.org/abs/cs/0701166.

3. Lakshmi N. Bairavasundaram et al., “An Analysis of Latent Sector
Errors in Disk Drives,” SIGMETRICS’07, San Diego, California, 2007,
http://www.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf.

4. http://crawler.archive.org/articles/developer_manual/arcs.html

Greg Janée

Feb 25, 2011, 4:10:24 PM
to digital-...@googlegroups.com
Erik Hetzner wrote:
> When we take error correcting codes into account, not compressing your
> data as a policy in order to keep a higher level of redundancy seems
> like the worst way to increase the redundancy of the data. Smart
> people have figured out how to make codes which can reliably correct
> limited errors in bytestreams. Why not use them?

I concur with Erik. JPEG compression is typically 10:1 or better.
What will yield the greatest reliability: one copy of an uncompressed
image, or 10 separate copies of a compressed image?

-Greg

Simon Spero

Feb 25, 2011, 6:08:18 PM
to digital-...@googlegroups.com, Erik Hetzner
On Fri, Feb 25, 2011 at 1:56 PM, Erik Hetzner <erik.h...@ucop.edu> wrote:
As a similar experiment, we have many ARC [1] files lying around. ARC files consist of concatenated gzip streams. This allows random access to the documents within the file while still storing them in a
compressed format. (You can seek to the beginning of a stream, then start reading until the gzip stream ends.)

As an experiment, I uncompressed one ARC file, then gzipped it normally. I would expect this to give somewhat similar results to your comparison of individually compressed files v. tarred then compressed
files. But it does not:

100,575,814 original ARC file
100,131,159 gzipped as one stream
 99,657,521 bzip2ed as one stream

Note, however, the documents that the ARC file contains is highly heterogeneous. So the results you get should depend greatly on the content that you have.

There are possible explanations for these results, which may bear further examination: 

1) The individual files are not highly compressible.
    Useful metric: uncompressed size; file types; mean + sd of compression ratios for individual files.

2) The files are individually large enough to populate the compression dictionary efficiently.
    Useful metric: distribution statistics for file sizes.

3) The files are dissimilar.
    Useful metric: (n-gram cosine similarity for a random sampling of pairs of sequential files?)

The similar sizes for the bzip2 and gzipped files would seem to indicate relatively incompressible data.

Just the Series records from the NARA Archival Catalog showed a compression ratio for individual files of around 2.85 (n=60,757, sd=0.36) (I bootstrapped to avoid the outliers).

These are XML encoded records, and XML is usually extremely redundant, and that redundancy is often common amongst files of a similar type.

Many XML records can be relatively short; the sample set I used has a mean length of 4,994 (ignoring outliers with sizes > 3sd above the mean for the full data set).

Simon



Erik Hetzner

Feb 25, 2011, 7:23:21 PM
to digital-...@googlegroups.com
At Fri, 25 Feb 2011 18:08:18 -0500,

Simon Spero wrote:
> There are possible explanations for these results, which may bear further
> examination:
>
> 1) The individual files are not highly compressible
> Useful metric: uncompressed size; file types ; mean + sd of
> compression ratios for individual files
>
> 2) The files are individually large enough to achieve populate compression
> dictionary efficiently.
> Useful metric: distribution statistics for file sizes
>
> 3) The files are dissimilar
> Useful metric: ( n-gramed cosine for some a random sampling of pairs
> of sequential files?)
>
> The similar sizes for the bzip2 and gzipped files would seem to indicate
> relatively incompressible data.
>
> For just the Series records from the NARA Archival Catalog records showed a
> compression ratio for individual files of around 2.85 (n= 60,757, sd=0.36)
> (I bootstrapped to avoid the outliers).
>
> These are XML encoded records, and XML is usually extremely redundant, and
> that redundancy is often common amongst files of a similar type.
>
> Many XML records can be relatively short; the sample set I used has a mean
> length of 4,994 (ignoring outliers with sizes > 3sd above the mean for the
> full data set).

Hi Simon,

Thanks for the detailed response. To be perfectly honest, I just
wanted to raise the point that the composition of your data will have
a huge effect on the amount that your data can be compressed,
including the relative gain of compressing a tar file of the contents
as compared with compressing each file by itself.

best, Erik

Simon Spero

Feb 25, 2011, 9:33:44 PM
to digital-...@googlegroups.com, Erik Hetzner
On Fri, Feb 25, 2011 at 7:23 PM, Erik Hetzner <erik.h...@ucop.edu> wrote:
Thanks for the detailed response. To be perfectly honest, I just wanted to raise the point that the composition of your data will have a huge effect on the amount that your data can be compressed, including the relative gain of compressing a tar file of the contents as compared with compressing each file by itself.

Right - one of the things I was trying to point out in my original message is that some of the loss in compression can be reduced by pre-seeding the compression dictionary; this is built in to zlib, though I'm not sure if it is part of the libbzip2 API.

On a related note, compressed streams that support quasi-random access reset their dictionary at seekable sync points (bzip2 resets every 100,000 bytes per level of compression - e.g. every 900,000 bytes at bzip2 -9).

Simon