
Archiving


J

Jul 17, 2008, 5:01:33 PM

Two beginner questions about archiving on Unix:

1. How reliable are tar and gzip compared to commercial archiving
products?

2. Assuming space is not an issue, is there any benefit in storing
compressed data? In other words, would my text, pdf, audio files
"keep" any better being tar'd and gzip'd first?

Thanks in advance.

Todd H.

Jul 17, 2008, 5:45:05 PM
J <skyli...@yahoo.com> writes:

> Two beginner questions about archiving on Unix:
>
> 1. How reliable are tar and gzip compared to commercial archiving
> products?

tar and gzip are probably better in terms of reliability, as they're
open source, widely available on a large number of bootable
distributions, and extremely mature.

But there are issues other than reliability... Commercial products
primarily bring convenience, which makes it more likely that backups
actually get performed. If you take that out of the equation, tar,
gzip, rsync and cron can be combined into a wonderful backup solution
for a *nix or *nix-like environment.
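
As a rough illustration of the tar-plus-gzip part of such a setup, here
is a minimal Python sketch using the standard tarfile module. The
function names and the dated file naming scheme are assumptions made up
for this example, not anything from the post; a scheduler such as cron
could run a script like this.

```python
import tarfile
import time
from pathlib import Path


def make_backup(source_dir, dest_dir):
    """Write a dated, gzip-compressed tar archive of source_dir into dest_dir."""
    source = Path(source_dir).resolve()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{source.name}-{stamp}.tar.gz"
    # Mode "w:gz" makes tarfile write the tar stream through gzip.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)
    return archive


def list_backup(archive):
    """Return the member names in an archive, as a quick sanity check."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

Listing the archive right after writing it is a cheap way to confirm
that it can at least be read back.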

> 2. Assuming space is not an issue, is there any benefit in storing
> compressed data? In other words, would my text, pdf, audio files
> "keep" any better being tar'd and gzip'd first?

They'd keep better copied in their original hierarchy, as they'd be
more easily accessible, and there'd be no processing time associated
with accessing the archived data. But gzip is lossless
compression and barring media issues corrupting the archive it's just
a space tradeoff.
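
To make "lossless" concrete: a small Python sketch that compresses some
bytes with the standard gzip module, decompresses them, and compares
checksums. The helper name and sample data are invented for the
example.

```python
import gzip
import hashlib


def roundtrip_is_lossless(data: bytes) -> bool:
    """Compress and decompress data, then compare SHA-256 digests."""
    restored = gzip.decompress(gzip.compress(data))
    return hashlib.sha256(restored).digest() == hashlib.sha256(data).digest()


sample = b"The same bytes come back out, whatever they are.\n" * 100
print(roundtrip_is_lossless(sample))
```

The same check works for binary data such as PDFs or audio files, since
gzip treats its input as an opaque byte stream.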

You don't mention any security aspects... if you're concerned about
maintaining the privacy of your data on backup media, you might
consider tar, gzip (or the more aggressive/intensive bzip2) and then
gpg encrypting the final archive.

--
Todd H.
http://www.toddh.net/

jpd

Jul 17, 2008, 5:48:46 PM
Begin <e52449f7-fdf0-4ea0...@34g2000hsf.googlegroups.com>

On Thu, 17 Jul 2008 14:01:33 -0700 (PDT), J <skyli...@yahoo.com> wrote:
>
> Two beginner questions about archiving on Unix:
>
> 1. How reliable are tar and gzip compared to commercial archiving
> products?

tar -- Tape Archiver, method for storing files on a tape, does what it
does pretty well modulo its many versions and file format variations.
Compare star, `bsd tar', pax, cpio, gnu tar, and so on and so forth.
Not all are compatible.

gzip is a stream compression tool like compress and does not do
archiving. Since tar does not do compression, the two are often
combined. See also other such tools like compress (older than gzip) and
bzip2 (newer than gzip).
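
The division of labour described above can be sketched in Python by
running the two stages separately: tarfile builds an uncompressed tar
stream, and gzip then compresses that stream without knowing anything
about the files inside it. The file names and contents are made up for
the example.

```python
import gzip
import io
import tarfile


def make_tar(files):
    """Pack a {name: bytes} mapping into an uncompressed tar stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


files = {"notes.txt": b"hello " * 500, "todo.txt": b"archive things\n" * 200}
tar_bytes = make_tar(files)            # archiving only, no compression
tgz_bytes = gzip.compress(tar_bytes)   # compression only, no notion of files

print(len(tar_bytes), len(tgz_bytes))

# Reading reverses the stages: gunzip first, then untar.
with tarfile.open(fileobj=io.BytesIO(gzip.decompress(tgz_bytes)), mode="r") as tar:
    names = tar.getnames()
print(names)
```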

Please elaborate: what do _you_ mean with reliability? How do you
expect commercial and non-commercial software to differ?


> 2. Assuming space is not an issue, is there any benefit in storing
> compressed data?

Given that the point of compression is reducing space requirements,
not directly, no.


> In other words, would my text, pdf, audio files "keep" any better
> being tar'd and gzip'd first?

As noted, tar does not itself provide compression. Any tar that does,
does so by invoking some stream compressor over its output stream.

Both tar and gzip trade complexity for something (combining files into
one bytestream and compression respectively) which introduces various
risks. The need for specialised tools to extract them again, for one,
and inability to restore the original files from the stream due to
corruption for another. This is more finicky for compression tools, so
uncompressed tape archives stand a somewhat better chance of recovery
beyond a bad spot (e.g. due to media failure).
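
A small experiment along these lines, as a hedged Python sketch: damage
one byte in the middle of some uncompressed data and in the middle of
its gzip'd form, then compare the outcome. The sample data and helper
names are invented. The sketch only reports whether the gzip stream
still decodes to the original, and says nothing about how much a
dedicated recovery tool might salvage.

```python
import gzip
import zlib


def flip_byte(data: bytes, pos: int) -> bytes:
    """Return a copy of data with every bit of the byte at pos inverted."""
    damaged = bytearray(data)
    damaged[pos] ^= 0xFF
    return bytes(damaged)


def try_gunzip(blob: bytes):
    """Return the decompressed bytes, or None if gzip rejects the stream."""
    try:
        return gzip.decompress(blob)
    except (OSError, EOFError, zlib.error):
        return None


original = b"".join(b"record %05d: the quick brown fox\n" % i for i in range(2000))
compressed = gzip.compress(original)

# Uncompressed: one bad byte on the medium is one bad byte in the data.
raw_damaged = flip_byte(original, len(original) // 2)
raw_bad_bytes = sum(a != b for a, b in zip(original, raw_damaged))

# Compressed: the same kind of damage in the middle of the gzip stream.
gz_result = try_gunzip(flip_byte(compressed, len(compressed) // 2))

print("uncompressed: bytes damaged =", raw_bad_bytes)
print("gzip: recovered intact =", gz_result == original)
```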

Then again, if you store your files for 30 years and after 30 years it
turns out you have no programs and/or hardware left to read them....

So you'll need to spell out more clearly what you mean with `keep better'.


--
j p d (at) d s b (dot) t u d e l f t (dot) n l .
This message was originally posted on Usenet in plain text.
Any other representation, additions, or changes do not have my
consent and may be a violation of international copyright law.

jellybean stonerfish

Jul 17, 2008, 10:08:24 PM
On Thu, 17 Jul 2008 21:48:46 +0000, jpd wrote:

> Both tar and gzip trade complexity for something (combining files into
> one bytestream and compression respectively) which introduces various
> risks. The need for specialised tools to extract them again, for one,
> and inability to restore the original files from the stream due to
> corruption for another. This is more finicky for compression tools, so
> uncompressed tape archives stand a somewhat better chance of recovery
> beyond a bad spot (e.g. due to media failure).

Hmmmm... I wonder if gnu tar could be fiddled with to add an option to
compress each file first, then wrap the compressed/encrypted files in the
tar archive. This could mean losing only one file, instead of the whole
archive if there is a flipped bit.

stonerfish

jpd

Jul 18, 2008, 5:28:34 AM
On Fri, 18 Jul 2008 02:08:24 GMT,
jellybean stonerfish <stone...@geocities.com> wrote:
>
> Hmmmm... I wonder if gnu tar could be fiddled with to add an option to
> compress each file first, then wrap the compressed/encrypted files in the
> tar archive. This could mean losing only one file, instead of the whole
> archive if there is a flipped bit.

Personally I'd take a different tar (perhaps `bsd tar', or `star') if
I wanted to fiddle, as gnu tar for all its ubiquity (hah[1]) is
a source of interesting compatibility problems itself. But you don't
really need to touch the source: simply gzip'ing the files first, then
tar'ing the results, would work, though it'd be inconvenient without at
least some wrapper shell script.
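
Such a wrapper might look roughly like the following, sketched in
Python rather than shell. The function names are hypothetical, and for
brevity the sketch stores only each file's base name, so directory
structure is not preserved.

```python
import gzip
import io
import tarfile
from pathlib import Path


def tar_of_gzipped(paths, archive_path):
    """Gzip each file on its own, then store the .gz members in a plain tar."""
    with tarfile.open(archive_path, "w") as tar:   # plain tar, no ":gz"
        for path in map(Path, paths):
            packed = gzip.compress(path.read_bytes())
            info = tarfile.TarInfo(name=path.name + ".gz")
            info.size = len(packed)
            info.mtime = int(path.stat().st_mtime)
            tar.addfile(info, io.BytesIO(packed))


def extract_member(archive_path, name):
    """Read one member back and gunzip it independently of the others."""
    with tarfile.open(archive_path, "r") as tar:
        return gzip.decompress(tar.extractfile(name + ".gz").read())
```

Because each member is its own gzip stream, damage inside one member
should leave the others decodable, at the cost of somewhat worse
overall compression than gzip'ing the whole tar stream.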

OTOH I wouldn't go there in the first place. Such a solution exists
already, and is called 'zip', or 'arj', or 'lha', or 'rar', or even any
of a bunch of obscure ones (anybody remember 'uc2', or 'hap/pah'?).
I recall that zip stores its file index at the end of the archive, so
that may not be the best choice, though it does come with some
index-reconstructing utility (at least the DOS utility pkzip did).
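
The layout can be inspected with Python's standard zipfile module. This
sketch builds a small archive in memory and unpacks the 22-byte "end of
central directory" record, which sits at the very end of the file when
there is no archive comment and no zip64 extension. The member names
and contents are made up.

```python
import io
import struct
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "alpha " * 200)
    zf.writestr("b.txt", "beta " * 200)
    # Offset of each member's local header; members are compressed one by one.
    member_offsets = [info.header_offset for info in zf.infolist()]
data = buf.getvalue()

# End-of-central-directory record: signature, disk numbers, entry counts,
# central directory size and offset, comment length.
(sig, _disk, _cd_disk, _entries_here, entries_total,
 cd_size, cd_offset, comment_len) = struct.unpack("<4s4H2LH", data[-22:])

print("members start at:", member_offsets)
print("index starts at :", cd_offset, "of", len(data), "bytes")
```

The index offset is larger than every member offset, which matches the
recollection above: lose the tail of the file and the index goes with
it, although the per-member local headers remain.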

In this context (compress first, then archive), adding redundancy is an
interesting option. One example is the `parchive' software (apparently
there is also a fork called `par2'), which lets you create `parity
files' from which missing chunks of the original can be reconstructed,
up to some (choosable) limit. Integrating this into an archiver is a
fairly obvious goal, and it seems rar has an (ISTR not very efficient)
implementation of this.
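
The idea behind parity files can be shown with a deliberately
simplified Python sketch: one XOR parity block that can rebuild any
single missing data block. This is not how par2 works internally (it
uses Reed-Solomon codes and can recover several blocks); the block size
and data here are arbitrary choices for the example.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together into one block."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def split_blocks(data: bytes, size: int):
    """Cut data into size-byte blocks, zero-padding the last one."""
    padded = data + b"\0" * (-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


data = b"compressed archive bytes would go here"
blocks = split_blocks(data, 8)
parity = xor_blocks(blocks)          # stored alongside the data

# Pretend block 2 was lost to a bad sector; rebuild it from the rest.
survivors = blocks[:2] + blocks[3:]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == blocks[2])
```

With a single parity block, two lost data blocks are unrecoverable,
which is why real tools let you choose how much redundancy to generate.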


[1] People in the gnu world like to believe that and advocate it saying
all the cool kids are doing it, but that doesn't make it strictly true.
Or maybe there just are a lot of uncool people around. Who knows?

J

Jul 18, 2008, 9:54:15 AM

Thanks to all for the informative responses.

> Please elaborate: what do _you_ mean with reliability? How do you
> expect commercial and non-commercial software to differ?

I work with commercial software routinely, and I'm often surprised at
how many people view Unix as some antiquated, useless form of
technology. I discovered BSD about 10 years ago and I've yet to find
any task (with maybe the exception of audio/video edit and mix
operations) which doesn't perform better on this platform. I also
despise the one-size-fits-all cookie-cutter "improvements" that
continually pour out of new releases of popular commercial products.
Thank you very much, but I'd like to choose which letters get
capitalized in my sentences. So I guess my expectations are pretty
high with tools such as tar and gzip. I've done a lot more unpacking
than packing up though, and before I institute some kind of automatic
data backup process at home, I'm trying to get more familiar with the
available tools.

> So you'll need to spell out more clearly what you mean with `keep better'.

I didn't know if data which *wasn't* archived and compressed had a
higher susceptibility to data corruption. Maybe this sounds like a
silly question, since one might generally assume that corruption would
more likely be introduced by a process rather than no process at all.
But I wondered about this because often I receive things (PDFs, DAT
files, images, etc.) compressed when there is very little size
difference between the compressed version and the original. But I
take it that the answer is "no", and again I appreciate the responses.

jpd

Jul 18, 2008, 10:23:12 AM
Begin <80a1b3a8-b36e-45e1...@j33g2000pri.googlegroups.com>

On Fri, 18 Jul 2008 06:54:15 -0700 (PDT), J <skyli...@yahoo.com> wrote:
> I work with commercial software routinely, and I'm often surprised at
> how many people view Unix as some antiquated, useless form of
> technology.

Probably a (perceived) case of ``the evil you know'', showing the
success of marketing to the uninitiated. ``Unix'' does come with a steep
learning curve, and the usual sales tactic is ``requires no training''
-- even if that turns out to be a bald-faced lie.


> I discovered BSD about 10 years ago and I've yet to find any task
> (with maybe the exception of audio/video edit and mix operations)
> which doesn't perform better on this platform.

I'm told that a certain package on a certain platform (and for a very
limited time, I supported a few people using that combination) does
very well in that regard. The platform involved is perhaps the premier
graphics-oriented one in widespread use, and nowadays builds on a solid
BSD-family based foundation.

Of course, it has always been a bit of a niche if not cult product and
this may taint it in the eyes of the abovementioned uninitiated.


> I also despise the one-size-fits-all cookie-cutter "improvements" that
> continually pour out of new releases of popular commercial products.

It's what the frantic search for ``innovation'' drives one to, without
something substantial to add[1].


> [...], I'm trying to get more familiar with the available tools.

Well, look around. There are many. :-)

For example, I'm using FreeBSD, and its collection of ported
software has a whole category called ``archivers'' (see for example
freshports.org). Or you could troll freshmeat for interesting tidbits,
if you don't mind stumbling over loads of alpha-quality software.

But if you want more useful comments, you may want to try and enumerate
the things you're trying to achieve, or the directions you're looking in.


>> So you'll need to spell out more clearly what you mean with `keep better'.
>
> I didn't know if data which *wasn't* archived and compressed had a
> higher susceptibility to data corruption.

Well, if you're interested, find an introductory text on information
theory. The trick compression relies on is finding creative ways to
re-code the data such that the information contained is ``packed more
densely'', but that also means that bitrot has a bigger impact.

Also, the wrong algorithm can easily cause data expansion.


> Maybe this sounds like a silly question, since one might generally
> assume that corruption would more likely be introduced by a process
> rather than no process at all.

Storing the data itself is a process already. In fact, filesystems tend
to be fairly non-trivial things, and then there's the controllers and
the actual storage, and whatnot. I don't think there's anything you can
do with information stored as bits and bytes which isn't itself some
``process'', and often stacks and stacks of layers of them.

We tend to focus on but a few layers at a time and ignore the rest,
but sometimes you have to consider at least that there are many of them.


> But I wondered about this because often I receive things (PDFs, DAT
> files, images, etc.) compressed when there is very little size
> difference between the compressed version and the original.

I don't know what you mean with ``DAT files'', as to me that is a
generic indication of data files without information as to what is
stored or what format it is in except for ``likely something binary''.

PDF-the-format does have compression support, so compressing the file
again is not likely to yield more. In fact, it can cause an expansion of
the data. That doesn't stop uninformed people from going ahead with it
anyway.

``Adding more compression is always better, isn't it?'' -- Clearly, no.
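
The expansion is easy to observe. In this Python sketch, random bytes
from os.urandom stand in for data that is already compressed (an
assumption for the example; real PDFs will vary), and are compared with
repetitive text.

```python
import gzip
import os

text = b"plain text compresses well because it repeats itself\n" * 2000
noise = os.urandom(100_000)   # stand-in for already-compressed content

text_gz = gzip.compress(text)
noise_gz = gzip.compress(noise)

print("text:        ", len(text), "->", len(text_gz))
print("random bytes:", len(noise), "->", len(noise_gz))
```

The random input comes out slightly larger than it went in, because
gzip still has to add its header, trailer, and block framing even when
it finds nothing to squeeze.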


[1] Which, I might add, does not automatically equate to ``technical
advances'' aka inventions. But mere marketing hype is not enough.
