Zipping preservation TIFF files


Bernadette Houghton

Sep 23, 2013, 10:03:06 PM
to digital-...@googlegroups.com
What is best practice for storing archival copies of TIFF files? More specifically, is it acceptable to store a zipped TIFF as an archival copy?

As we are talking about thousands of TIFF files, we are looking to save space, hence the query.

TIA

Bernadette Houghton

Nathan Tallman

Sep 25, 2013, 1:36:07 PM
to digital-...@googlegroups.com
Hi Bernadette,

I think people are generally using uncompressed TIFF. I know of a few who use TIFF with LZW compression, which is lossless but not terribly efficient. ZIP compression is also lossless, so a TIFF that has been zipped may be okay. LZW would be more software-friendly, though, as it retains the TIFF extension and most image editing software can handle it. The zipped TIFF would probably need to be unzipped to be viewed.
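
For anyone experimenting with the LZW route, something like this (an untested sketch using the Pillow library; the filename is made up) will re-save a TIFF with lossless LZW compression while keeping the .tif extension:

    # Untested sketch (Pillow library; filename is hypothetical): re-save a TIFF
    # with lossless LZW compression, keeping the .tif extension.
    from PIL import Image

    with Image.open("scan_0001.tif") as img:
        img.save("scan_0001_lzw.tif", compression="tiff_lzw")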

There's also JPEG2000, which can use lossless compression. It's probably got the biggest following after uncompressed TIFF and it's what my institution is starting to adopt.

Best,
Nathan



Kevin Hawkins

Sep 25, 2013, 2:31:28 PM
to digital-...@googlegroups.com
I agree with Nathan that we should be careful to distinguish between
using compression in an individual TIFF file and compressing a set of
one or more files (such as TIFFs) using a standard such as ZIP.

There are various ways of compressing TIFF files themselves, including
LZW and CCITT Group 4 compression, both of which are lossless, and some
lossy compression schemes.

As for compressing sets of files outside of the standard itself, there
are also various standards for this, including ZIP. As he says, the ZIP
standard itself uses lossless compression, which is why once you unzip
something, the file is identical to what went into the ZIP rather than a
degraded version.
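
If you want to convince yourself of that, a quick checksum comparison before zipping and after unzipping will show the round trip is bit-for-bit identical (an untested sketch; filenames are made up):

    # Untested sketch (filenames are hypothetical): verify that a ZIP round trip
    # returns the file bit-for-bit by comparing checksums.
    import hashlib
    import zipfile

    def sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    before = sha256("image.tif")

    with zipfile.ZipFile("image.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("image.tif")

    with zipfile.ZipFile("image.zip") as zf:
        zf.extract("image.tif", path="unzipped")

    assert sha256("unzipped/image.tif") == before   # identical, not degraded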

--Kevin

Kevin Hawkins

Sep 25, 2013, 2:33:07 PM
to digital-...@googlegroups.com
Sorry, when I wrote:

As for compressing sets of files outside of the standard itself,

I meant:

As for compressing one or more files regardless of file format,

--K.

adam brin

Sep 25, 2013, 3:43:49 PM
to digital-...@googlegroups.com
I would add that there are definite challenges with each option mentioned; it'd be nice if there were more community guidance, discussion, and/or documentation to help with decisions. I can add some "data" to the discussion: we're in the process of evaluating which methods to use, weighing cost against archival-requirements benefit. To do so, we evaluated 29,000 JPG image files to see what the costs and implications are:
  • Uncompressed TIFFs are great, but they take up a TON of space; in our research, up to 20x the size of the original JPG.
  • LZW-compressed TIFFs, while nice because LZW compression is lossless, do add complexity to the file format; and while some have expressed concerns about the compression ratio, we've found they're about 7.5x larger than the original JPG.
  • JPEG2000 has the best compression ratio (even when lossless), at about 1.5x the original JPG, but there is a lot of discussion of the challenges of JPEG2000, from software setup to unencumbered licenses, etc. (a quick Google Scholar search will illustrate some of these).
  • Compressing files after the fact, e.g. using BZIP, is a bit better than LZW compression, but not in every case (about 6.5x larger).
  • We also looked at PNG, which is on average 8x larger than the JPG, but we ultimately decided it's not useful in most cases because it does not support all of the formats that JPG does.
We've not made a decision about what we're going to do, but budgeting for 20x the space per file is quite a bit. One final note about the ratios quoted: they're roughly both the "mean" and "median" values. The maximums are quite a bit higher; for example, TIFF's max is 134x the size of the JPG.
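
If anyone wants to run a similar comparison on their own files, the gist of it can be scripted in a few lines (an untested, Pillow-based sketch; the filename is made up and the ratios will obviously depend on your source images):

    # Untested sketch (Pillow): compare the on-disk size of a source JPG against
    # a few lossless re-encodings. Filenames are hypothetical.
    import os
    from PIL import Image

    src = "photo.jpg"
    orig_size = os.path.getsize(src)

    targets = [
        ("uncompressed.tif", {}),                    # Pillow writes TIFF uncompressed by default
        ("lzw.tif", {"compression": "tiff_lzw"}),    # lossless LZW-compressed TIFF
        ("lossless.png", {}),                        # PNG is always lossless
    ]

    with Image.open(src) as img:
        for name, opts in targets:
            img.save(name, **opts)
            print(f"{name}: {os.path.getsize(name) / orig_size:.1f}x the original JPG")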





Erway,Ricky

Sep 25, 2013, 5:04:55 PM
to digital-...@googlegroups.com

Hi Adam,

 

What is your “original JPG”? If it’s a compressed JPEG, why would you further compress it? If it’s uncompressed, then is it already a JPEG2000?

 

Ricky

 

Ricky Erway

Senior Program Officer

OCLC Research

San Mateo, CA USA

+1 (650) 287-2125

erw...@oclc.org



Nathan Tallman

Sep 25, 2013, 5:30:31 PM
to digital-...@googlegroups.com

On Wed, Sep 25, 2013 at 5:04 PM, Erway,Ricky <erw...@oclc.org> wrote:
If it’s a compressed JPEG, why would you further compress it?

I can think of one reason to convert a JPEG to JPEG2000: to "freeze" it in place and prevent further compression. No chance of someone re-saving and lossily re-compressing the JPEG.

Nathan

Peter Murray

Sep 25, 2013, 4:23:53 PM
to digital-...@googlegroups.com
I would also add that minor corruption of an LZW-compressed file likely means the loss of the entire image, whereas minor corruption of a JPEG2000 file would result in imperceptible-to-the-eye differences in the output.


Peter
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
Peter....@lyrasis.org
+1 678-235-2955
800.999.8558 x2955

adam brin

Sep 25, 2013, 7:47:27 PM
to digital-...@googlegroups.com
Hi Ricky,
In our case "original jpg" means that a user has provided a compressed JPG when submitting files. If most archives are not converting these files as they're in an open format (regardless of lossy-ness) that's useful to know. We've been mainly following the guidance of the LOC (digitalpreservation.gov ) and others, which show a strong preference for not using the JPG format as an archival format. Are most folks now using JPEG2k?

- adam

Erway,Ricky

Sep 25, 2013, 8:28:11 PM
to digital-...@googlegroups.com
If they are compressed JPEGs, the damage has been done and is non-reversible, so reformatting as uncompressed TIFFs just buys added file size. I imagine the LC guidance is expressing a strong preference against creating JPEGs, but if you already have them, I doubt they would recommend converting to TIFF. That said, if you otherwise have JPEG2000 or TIFFs and you want everything to be the same, well then I guess it might make sense to reformat them.

Nick Krabbenhoeft

Sep 26, 2013, 4:50:58 PM
to digital-...@googlegroups.com
Once a file has been compressed with a lossy algorithm, the information is lost forever. Converting a lossy compressed image to a lossless format is like converting a text written in the Latin alphabet into the dots and dashes of Morse code: you'll have exactly the same information, it will just take up a lot more space. The LC recommends using TIFF or JPEG2000 when you are creating images, for instance when digitizing newspaper collections.

Nick Krabbenhoeft

Sep 26, 2013, 4:56:36 PM
to digital-...@googlegroups.com
Returning to the original question, this is the same problem with compressing groups of files with zip and other compression schemes. If the bitstream is corrupted, the corruption will damage not just one file but every file downstream from the original corruption. The tradeoff with file compression is that you decrease file size while increasing the total potential damage by any single corruption.

-Nick

Edward M. Corrado

Sep 26, 2013, 6:09:07 PM
to digital-...@googlegroups.com
On Wed, Sep 25, 2013 at 7:47 PM, adam brin <ad...@brin.org> wrote:
Hi Ricky,
  In our case "original jpg" means that a user has provided a compressed JPG when submitting files.  If most archives are not converting these files as they're in an open format (regardless of lossy-ness) that's useful to know. We've been mainly following the guidance of the LOC (digitalpreservation.gov ) and others, which show a strong preference for not using the JPG format as an archival format.  Are most folks now using JPEG2k?


I can't say what most people do in the situation where the original files they get are JPG files, but we are keeping them as JPG.  The reasons include:

1) Every time you convert a file, you risk losing some information (even if you are going to a "better" format).
2) The converted file will be bigger, thus more space is needed (not to mention that we would probably preserve both the original and the converted, so that is even more space).
3) Although this could be somewhat automated, we would need staff time to compare the files to make sure the conversion worked.
4) I don't believe the JPG format is at any more risk of obsolescence than TIFF is.

That said, when we create images (via scanning or other methods) we are still using uncompressed TIFF. As part of our ongoing digital preservation activities, we are planning on reevaluating this soon and will consider JPEG2000, compressed TIFF, and other formats. At this point I would be surprised if we changed our default scanning format for images from uncompressed TIFF, but I am going into the evaluation with an open mind.

Edward

Erik Hetzner

Sep 27, 2013, 12:55:17 PM
to digital-...@googlegroups.com, Nick Krabbenhoeft
At Thu, 26 Sep 2013 13:56:36 -0700,
Nick Krabbenhoeft wrote:
>
> Returning to the original question, this is the same problem with
> compressing groups of files with zip and other compression schemes. If the
> bitstream is corrupted, the corruption will damage not just one file but
> every file downstream from the original corruption. The tradeoff with file
> compression is that you decrease file size while increasing the total
> potential damage by any single corruption.

Hi Nick

I don’t think this is always true. I just did an experiment where I
zipped two random files in a zip file, then wrote a null byte to a
random byte in the zip file. One file was damaged, but the other was
fine.

I have attached a python program that you can use to overwrite a
random byte in a test.zip file with a null byte and see the results
for yourself.

We discussed this earlier; I still think that if we are concerned with
bit flipping there are much better methods to achieve redundancy than
storing data uncompressed.

https://groups.google.com/d/msg/digital-curation/lcuYD5-sXEU/DJyPGLChh7QJ

best, Erik

test.py
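
(The attachment itself isn't reproduced in this archive; below is a rough sketch of what such a script might look like. Names and structure are assumptions, not the actual attachment.)

    # Rough sketch (not the actual attachment): overwrite one random byte of
    # test.zip with a null byte, then check each member for corruption.
    import random
    import zipfile

    ZIP_PATH = "test.zip"   # assumed to already contain two or more files

    with open(ZIP_PATH, "r+b") as f:
        f.seek(0, 2)                       # seek to the end to get the file size
        offset = random.randrange(f.tell())
        f.seek(offset)
        f.write(b"\x00")
    print(f"wrote a null byte at offset {offset}")

    with zipfile.ZipFile(ZIP_PATH) as zf:
        for name in zf.namelist():
            try:
                with zf.open(name) as member:
                    member.read()          # the stored CRC is checked on read
                print(f"{name}: OK")
            except Exception as exc:       # e.g. zipfile.BadZipFile on a bad CRC
                print(f"{name}: damaged ({exc})")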

Chris Adams

Sep 27, 2013, 2:01:09 PM
to digital-...@googlegroups.com
On Fri, Sep 27, 2013 at 12:55 PM, Erik Hetzner <erik.h...@ucop.edu> wrote:
> We discussed this earlier; I still think that if we are concerned with
> bit flipping there are much better methods to achieve redundancy than
> storing data uncompressed.

I agree with Erik on this: it's really just a question of using
appropriate tools for two separate tasks. If you're worried about data
loss, some sort of error-correcting code will produce predictable
results, whereas storing uncompressed data is simply making a large
gamble about the likely nature and number of bit errors. I've read
enough post-mortems over the years to be uncomfortable with any
prediction that failures will follow one of the few patterns where
uncompressed files offer a real robustness benefit.

I strongly prefer to solve these problems separately, as there is no
one format which satisfies everyone's needs. Given the widespread
availability of highly reliable storage-level solutions to this
problem, it doesn't make sense to prioritize bit-level errors over
either cost or usability when other risks are going to require you to
address the lower-level risks in any case. If you have a redundancy
and verification system, trust it and let curatorial concerns dictate
the storage formats used.

Chris

Mark A. Matienzo

Sep 27, 2013, 2:12:23 PM
to digital-...@googlegroups.com, Nick Krabbenhoeft
With regard to archive/compressed-file resilience, this too was a past thread on the digital-curation list. Ben O'Steen wrote up a post a while back about some of his findings: <http://oxfordrepo.blogspot.com/2008/12/archive-file-resiliences.html>

Mark

--
Mark A. Matienzo <ma...@matienzo.org>
Digital Archivist, Manuscripts and Archives, Yale University Library
Technical Architect, ArchivesSpace



Nick Krabbenhoeft

Sep 27, 2013, 2:41:21 PM
to digital-...@googlegroups.com
I have to admit my reluctance to compress multiple files together is based on bad experiences with zips and tars in the past, not on an understanding of the specs. Thanks for the advice/links.

Chris, what kind of error-correcting codes are you talking about? I'm not very familiar with them. Are you recommending coding systems built into storage hardware, or ones that can be placed into an existing workflow?

Nick

L Snider

Sep 27, 2013, 3:13:43 PM
to digital-...@googlegroups.com
Hi Mark,

Thanks for posting this; it was useful information for me because we currently use tar and jar... Anyone know whether there is similar information about jars (I think moonshine every time I write that word!)?

Cheers

Lisa

-- 
Lisa Snider
Electronic Records Archivist
Harry Ransom Center
The University of Texas at Austin
P.O. Box 7219
Austin, Texas 78713-7219
P: 512-232-4616
www.hrc.utexas.edu

Erik Hetzner

Sep 27, 2013, 3:26:23 PM
to digital-...@googlegroups.com, L Snider
At Fri, 27 Sep 2013 14:13:43 -0500,
L Snider wrote:
>
> Hi Mark,
>
> Thanks for posting this, it was useful information for me because we
> currently use tar and jar...Anyone know whether there is similar
> information about jars (I think moonshine every time I write that word!)?

Hi Lisa,

jars are just zips with some extra restrictions, so what Ben wrote in
his fine article about zips should apply:

http://oxfordrepo.blogspot.com/2008/12/archive-file-resiliences.html

best, Erik

L Snider

Sep 27, 2013, 3:37:08 PM
to Erik Hetzner, digital-...@googlegroups.com
Hi Erik,

I always thought they differed in minor ways (I haven't dealt directly with jars for a long time, though). That does answer my question, thanks!

Cheers

Lisa





L Snider

Sep 27, 2013, 3:57:24 PM
to Erik Hetzner, digital-...@googlegroups.com
I knew I remembered something about jar/zip differences:
http://geekexplains.blogspot.com/2008/08/zip-vs-jar-when-to-use-what-which-one.html

It was the manifest file that I was thinking about, and how that would be impacted by change. I would gather that the checksum would pick up any change to it...

We need another name other than .jar; now I am thinking jam, so it must be time for a break!

Cheers

Lisa


Chris Adams

Sep 27, 2013, 8:20:03 PM
to digital-...@googlegroups.com
On Fri, Sep 27, 2013 at 2:41 PM, Nick Krabbenhoeft <ni...@metaarchive.org> wrote:
> Chris, what kind of error correcting codes are you talking about? I'm not
> very familiar with them. Are you recommending coding systems built into
> storage hardware or ones that can be placed into an existing workflow.

There's a long history of academic research into ways to encode
information such that you store n chunks of data and can provably
reconstruct the original file from a subset of the chunks – see
https://en.wikipedia.org/wiki/Erasure_code and
https://en.wikipedia.org/wiki/Forward_error_correction – which has the
nice property of giving you good error recovery without needing to
make two full copies.

Erik mentioned the par2 tool (https://en.wikipedia.org/wiki/Parchive)
and there are other implementations (e.g. zfec) which fit into the
classic command-line pattern: create an archive and then use erasure
coding to ensure that you can always reconstruct the archive file.
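
To make the idea concrete, here is a toy illustration of the principle (this is not how par2 or zfec work internally, just the simplest XOR-parity case): store three chunks, and any two of them are enough to rebuild the data.

    # Toy illustration of erasure coding (simple XOR parity, not par2/zfec):
    # two data chunks plus one parity chunk; any single lost chunk can be rebuilt.
    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    data = b"archival payload"                 # 16 bytes, splits evenly in two
    half = len(data) // 2
    chunk1, chunk2 = data[:half], data[half:]
    parity = xor_bytes(chunk1, chunk2)

    # Simulate losing chunk1 and rebuilding it from the surviving pieces.
    rebuilt = xor_bytes(chunk2, parity)
    assert rebuilt + chunk2 == data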

There are filesystem implementors incorporating this feature into
distributed filesystems – e.g. the Tahoe-LAFS project created zfec –
because erasure codes also have the great property of making it easier
to spread data across as many nodes as possible without requiring any
one node to store a full copy, and of recovering from failed nodes (or
even data centers) faster. I know that the Microsoft Azure cloud
storage team has been doing original work in this area, and others no
doubt are as well.

Tying this together: if your goal is to support distribution across
heterogeneous systems and you cannot depend on any one storage system
being resilient against corruption, you might want to pick an archive
tool to build this workflow around. On the other hand, if you're using
a filesystem or cloud provider which provides a suitably high
guarantee against bit loss (e.g. every object stored on ZFS, btrfs,
Amazon's S3, OpenStack's Swift, etc. has built-in strong fixity
checks), you would see little benefit from the extra precautions and
could focus your efforts on threats like human error or malice, lack
of geographic redundancy, or software bugs, which apply to
everything.

Chris

Andrea Goethals

Sep 28, 2013, 12:08:44 PM
to digital-...@googlegroups.com
Another thing to consider that I haven't seen discussed in this thread is whether or not you will need to access these files. I understand that these are archival copies, and we usually only talk about accessing deliverable copies, but there are reasons you may need to access archival copies as well. By 'access' I am talking broadly, to also include processing for format identification / validation / metadata extraction, spot-checking over time, use as the input into format migrations, etc. In the repository I manage, some kinds of archival content are stored in various container formats (zip, gzip, ARCs), and whenever we need to access them for any reason there is a bit of a barrier to get to the content, because tools don't natively expect to have to go through the unpacking step. In practice this means that this content isn't as easily preservable as the content that isn't stored in these container formats.

Andrea

Andrea Goethals
Harvard Library

Simon Spero

Sep 28, 2013, 2:20:06 PM
to digital-...@googlegroups.com

Gzip recovery tool: https://github.com/arenn/gzrt

It suggests using GNU cpio to extract data from recovered tgz streams, as it will skip unrecovered bytes.

An FEC scheme worth looking at: RaptorQ, which has some nice theoretical properties.

Access vs. compression & archiving:

Zip files do allow for random access; however, all compression is done on a per-file basis, which means that for big collections of small files, the compression algorithms are barely getting started before the file is over. This really becomes significant if the files have a lot of commonality (e.g. individual XML or JSON records, or emails stored in individual files).
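
That per-member random access is easy to exploit from code, too; a single file can be pulled out of a large zip without unpacking the rest (a minimal sketch; archive and member names are made up):

    # Minimal sketch (names are hypothetical): because ZIP compresses each member
    # separately, one member can be read by name without unpacking the archive.
    import zipfile

    with zipfile.ZipFile("newspaper_batch.zip") as zf:
        with zf.open("issue_042/page_004.tif") as member:
            page = member.read()               # decompresses only this member
    print(f"read {len(page)} bytes")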

Bundling small files together will still save space, as partially used filesystem blocks are a more significant part of the overall size for small files. Also, SAM-FS (and other near-line systems that keep part of the file data around) is happier when it has fewer files to migrate.

Compressed streams only allow random access to the start of compressed blocks, which can be somewhat large. If you record the offset of the desired file, you can seek to the start of the compressed block, and decode sequentially from there. Given how much more expensive seeks are relative to sequential reads (on HDD at least), this may not be too big a penalty.
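
In Python terms, a simplified variant of that offset trick might look like this (an untested sketch with made-up filenames; it writes each file as its own gzip member so that each recorded offset is exactly a member boundary):

    # Untested sketch (filenames are hypothetical): bundle files as concatenated
    # gzip members, record each member's start offset, and later decompress just
    # one member by seeking to its offset.
    import gzip
    import zlib

    offsets = {}
    with open("bundle.gz", "wb") as out:
        for path in ["record_001.xml", "record_002.xml"]:
            offsets[path] = out.tell()
            with open(path, "rb") as f:
                out.write(gzip.compress(f.read()))

    # Random access: seek to the recorded offset and decode only that member.
    with open("bundle.gz", "rb") as f:
        f.seek(offsets["record_002.xml"])
        d = zlib.decompressobj(wbits=31)          # 31 = expect a gzip wrapper
        data = d.decompress(f.read())             # stops at the end of this member
    print(f"recovered {len(data)} bytes")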
