counter-intuitive gzip compression ratio

Buxte Hude

unread,

Nov 2, 2009, 5:28:46 PM11/2/09

to

hello all,

I have an interesting case of counter-intuitive packing ratios with gzip
that I would be interesting in knowing if anyone else has observed or
would have an idea about the origin - or if it is just a random
epiphenomenon.

I have this 4.8 GB hdd with a single 340 MB partition (for historical and
emulation reasons - don't ask). That 340 MB partition was filled up to
about 155 MB. For archival purposes, I copied the partition to a file
and also the entire hdd to another file.

Intending to compress both files and knowing that lots of zeroes
compress better :), before capturing the data, I had first mounted the
partition in order to fill all available space with a zero-filled file
that I then deleted, and also zero-filled the unpartitioned space by
creating a partition in the available space, zeroing the device, and
deleting the partition.

So I then dumped both the single partition and the entire disk, using
dd, to a 64-bit filesystem of course. So there I got my 340 MB file, and
my 4.8 GB file that has inside the partition table, the contents of the
340 MB file and then 4.5 GB of zeroes. I then gzipped both files.

Suprise: the 4.8 GB file compresses down to 77 MB, whereas the 340 MB
file compresses to 125 MB. Also lzma and xz get the 340 MB file down to
about 90 MB.

The 75 MB file is correct, as evidenced by inflating it back and taking
an md5sum and comparing to the md5sum of the original hdd.

How is that possible? How adding vast amounts of low-entropy crap can
improve a compression ratio??

Tom St Denis

unread,

Nov 2, 2009, 9:29:19 PM11/2/09

to

It's late and I might misunderstand, is the 340MB file from a
partition that is resized, zeroed, then resized back to 340? Or did
you

1. copy the 340MB out
2. then inside the 4.8G disk resize/zero/resize

Because if you copied it out first made the slack space on the 340
isn't zeroed?

Tom

Mark Adler

unread,

Nov 2, 2009, 10:34:17 PM11/2/09

to

On 2009-11-02 14:28:46 -0800, Buxte Hude <buxt...@buxtehude.no> said:
> So there I got my 340 MB file, and
> my 4.8 GB file that has inside the partition table, the contents of the
> 340 MB file and then 4.5 GB of zeroes. I then gzipped both files.
>
> Suprise: the 4.8 GB file compresses down to 77 MB, whereas the 340 MB
> file compresses to 125 MB.

...
> How is that possible?

It's not. gzip compression is relatively local, so it effectively
doesn't know anything about all those zeros far away. Adding the 4.5
GB of zeros should have added about 4 MB to the compressed file.

Have you verified your assertion, that the 340 MB file is exactly
contained in the 4.8 GB file?

Mark

Niels Fröhling

unread,

Nov 4, 2009, 12:56:07 AM11/4/09

to

Buxte Hude wrote:

> How is that possible? How adding vast amounts of low-entropy crap can
> improve a compression ratio??

gzip moves a context-window over your data. If you have a large enough
zero-area the entire context will be filled with zeroes (in practice this is
almost like a context-flush). If your defragmenter had something to say on your
old disk, we may assume that you got some related contents clustered and
seperated from other content by voids.
As such the compressor won't try to apply learned statistics from cluster-1 to
cluster-2 and so on (because the statistics get flushed/cleared by the zero areas).
While the compacted partition-data causes the compressor to apply a much
denser and probably much less related statistical model.

You can try to verify this in that you have a file/directory-based compressor
run on the partitions content (like rar with "solid" option enabled). It should
compress even be better.

Ciao
Niels

jules Gilbert

unread,

Nov 13, 2009, 10:13:42 PM11/13/09

to

On Nov 2, 10:34 pm, Mark Adler <mad...@alumni.caltech.edu> wrote:

gzip has some unusual properties -- well, unusual for a compressor.
For instance, it produces many more one's than zero's. And that's
just plain wrong.

--jg

Mark Adler

unread,

Nov 14, 2009, 2:59:42 AM11/14/09

to

On 2009-11-13 19:13:42 -0800, jules Gilbert <jules....@gmail.com> said:
> gzip has some unusual properties -- well, unusual for a compressor.
> For instance, it produces many more one's than zero's.

An interesting assertion. However it's wrong. From a large gzip file I got:

140008610 ones out of 280097016 bits

That one was the other way around, with more zeros than ones. But not
many more. Considering a random distribution, one standard deviation
from exactly one-half is about the square root of the expected number
of ones, which is ~12,000. So the number of ones is about 3.4 sigma
low from what is expected.

3.4 sigma does seem a little unlikely, so there might be a very slight
tendency to produce more ones than zeros, which might be the case in
the Huffman code descriptors at the start of each dynamic block.

Or it might just be chance.

Mark

Mark Adler

unread,

Nov 14, 2009, 3:00:48 AM11/14/09

to

On 2009-11-13 23:59:42 -0800, Mark Adler <mad...@alumni.caltech.edu> said:
> 3.4 sigma does seem a little unlikely, so there might be a very slight
> tendency to produce more ones than zeros,

Oops -- I meant more zeros than ones.

Mark

stan

unread,

Nov 14, 2009, 5:24:21 PM11/14/09

to

gzip is actual code I can examine and use. Where is your's? You would
have much more credibility if you weren't the perpetual RSN guy. You
are well aware that the reaction to you around here is at best
sceptical and runs to outright total disbelief. Your thoughts about
proven code we all use daily can't be realistically considered by
rational minds.

jules Gilbert

unread,

Dec 16, 2009, 9:45:55 PM12/16/09

to

First. both of you are completely wrong in your assertions, and in
more ways than I can quickly enumerate. So I'm just going to stick to
the topic -- here goes, years ago I both used gzip as my in-house
compressor and also developed compressor's around it, programs that
would compress it's output, at least one time, other programs did this
for several cycles.

And in the course of this work I learned quite a bit about the
statistical composition of the typical gzip output file.

I repeat, *all* of the files I examined, (using mechanical means, that
would be in the hundreds or perhaps thousands,) *every* gzip file
examined had many more 1's than 0's. (Which is not a good thing if
you're trying to represent information in minimal space.)

Now Mark's remark, (which I am copying below,) is mathematically
accurate but it has very little to do with the subject at hand.
Unfortunately for Mark I have certain knowledge I choose not to
disclose at present so Mark will have to remain ignorant. But yes
Mark, your math logic is correct, you've just omitted something
important.

Mark, just write yourself a little counter program and turn it loose
in a directory. Of course if you're going to do this, find an older
CD-ROM distribution. Test every gzip file! You'll see.

One of the first re-compressor's I wrote (it only compressed about
half the files it looked at -- not a great selling point,) was
designed to read gzip files exclusively. So I know what I'm talking
about Mark.

Today my program's XOR whatever input they get with a pseudo-random
number generator.

glen herrmannsfeldt

unread,

Dec 17, 2009, 12:54:47 AM12/17/09

to

jules Gilbert <jules....@gmail.com> wrote:
(snip)

> First. both of you are completely wrong in your assertions, and in
> more ways than I can quickly enumerate. So I'm just going to stick to
> the topic -- here goes, years ago I both used gzip as my in-house
> compressor and also developed compressor's around it, programs that
> would compress it's output, at least one time, other programs did this
> for several cycles.

> And in the course of this work I learned quite a bit about the
> statistical composition of the typical gzip output file.

> I repeat, *all* of the files I examined, (using mechanical means, that
> would be in the hundreds or perhaps thousands,) *every* gzip file
> examined had many more 1's than 0's. (Which is not a good thing if
> you're trying to represent information in minimal space.)

Well, LZW usually only writes in whole bits. That is, the code word
written at any step is a whole number of bits, often between 9 and 16
bits long. It would be more optimal, but slower, to divide up the
bits better. The average code word might be 14 or 15 bits, and on
average 0.5 bits wasted per code word, so about 3%.

Arithmetic coders might be more efficient in bit usage, but are
also significantly slower.

(snip)

-- glen

biject

unread,

Dec 17, 2009, 10:39:46 AM12/17/09

to

On Dec 16, 10:54 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu>
wrote:

> jules Gilbert <jules.sto...@gmail.com> wrote:
>
> (snip)
>
> > First. both of you are completely wrong in your assertions, and in
> > more ways than I can quickly enumerate. So I'm just going to stick to
> > the topic -- here goes, years ago I both used gzip as my in-house
> > compressor and also developed compressor's around it, programs that
> > would compress it's output, at least one time, other programs did this
> > for several cycles.
> > And in the course of this work I learned quite a bit about the
> > statistical composition of the typical gzip output file.
> > I repeat, *all* of the files I examined, (using mechanical means, that
> > would be in the hundreds or perhaps thousands,) *every* gzip file
> > examined had many more 1's than 0's. (Which is not a good thing if
> > you're trying to represent information in minimal space.)
>

If a file is lopsided as to have either a large excess of
ones or zeroes. Then arb2x which is my bijective adaptive
coder based on only the current total of ones and zeroes
would compress the file nicely.

David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

Mark Adler

unread,

Dec 23, 2009, 4:20:16 PM12/23/09

to

On 2009-11-13 23:59:42 -0800, Mark Adler <mad...@alumni.caltech.edu> said:

> Considering a random distribution, one standard deviation from exactly

> one-half is about the square root of the expected number of ones, ...

On 2009-12-16 18:45:55 -0800, jules Gilbert <jules....@gmail.com> said:
> Now Mark's remark, (which I am copying below,) is mathematically

> accurate ...

Despite Mr. Gilbert's valuable endorsement, on review my math was off
by a factor of the square root of two.

I was recalling the case that for k occurences in n trials, that for k
<< n, and large enough k, one standard deviation on k is about the
square root of k.

However in this case k ~ n / 2, and so k is not << n. The general
formula in C notation is:

sqrt(k * (1 - k / n))

For k << n, this reduces to sqrt(k). However for k ~ n/2, this becomes:

sqrt(k / 2)

So for those looking at a distribution of ones and zeros and wondering
if it matches a probability of a half for each, you should use sqrt(k /
2) for one standard deviation, not sqrt(k).

I apologize for the interruption. You may now return to your regularly
scheduled program of discussions of the compression of random data.

Mark

biject

unread,

Dec 23, 2009, 9:02:17 PM12/23/09

to

On Dec 23, 2:20 pm, Mark Adler <mad...@alumni.caltech.edu> wrote:
> On 2009-11-13 23:59:42 -0800, Mark Adler <mad...@alumni.caltech.edu> said:
>
> > Considering a random distribution, one standard deviation from exactly
> > one-half is about the square root of the expected number of ones, ...
>

> On 2009-12-16 18:45:55 -0800, jules Gilbert <jules.sto...@gmail.com> said:
>
> > Now Mark's remark, (which I am copying below,) is mathematically
> > accurate ...
>
> Despite Mr. Gilbert's valuable endorsement, on review my math was off
> by a factor of the square root of two.
>
> I was recalling the case that for k occurences in n trials, that for k
> << n, and large enough k, one standard deviation on k is about the
> square root of k.
>
> However in this case k ~ n / 2, and so k is not << n. The general
> formula in C notation is:
>
> sqrt(k * (1 - k / n))
>
> For k << n, this reduces to sqrt(k). However for k ~ n/2, this becomes:
>
> sqrt(k / 2)
>

To add more clarity into the formula sqrt(k * (1 - k / n))

one should realize that except at k = n/2 the following
could be used. At k = n/2 they all reduce to the same thing
I use the ">" since I am looking at all values except at
k = n/2

sqrt(n)/2 > sqrt(k * (1 - k / n)) > sqrt(k / 2)

there the value of sqrt(k / 2) will be always to small
except at the exact point where k = n/2 if n is odd
then it always returns to a value that is to small.