compressor blind spots

tromp

unread,

Apr 27, 2010, 1:17:16 PM4/27/10

to

I have a file of roughly 6 million strings each up to 16 letters over
the alphabet {0,1,2,3,4,5,6}
I sorted the file reverse-alphabetically. It starts

1101101000000100
121101000000100
131101000000100
141101000000100
151101000000100
161101000000100
11011010000100
1101111010000100

Now when I bzip2 this file it becomes around 16MB:
16308503 data0.bz2

I tried applying a simple transformation which keeps only the changed
prefix of succesive strings:
(the changing prefix is always non-empty, so I use one less '<' than
its length)

1101101000000100
<<12
<13
<14
<15
<16
<<<<<<<1101101
<<11011

I didn't expect bzip2 to do much better since it should have already
noticed the repetitive tails
of strings, but imagine my shock when seeing the result:

16308503 data0.bz2
1322712 data1.bz2

That's over 12 times smaller!
With gzip the difference is smaller, but still a factor of 5.5.

Is this a well known blind spot of compressors?
Are there any compressors that recognize all the repetitiveness in the
original file?

regards,
-John

Tom St Denis

unread,

Apr 27, 2010, 1:30:55 PM4/27/10

to

On Apr 27, 1:17 pm, tromp <john.tr...@gmail.com> wrote:
> Is this a well known blind spot of compressors?
> Are there any compressors that recognize all the repetitiveness in the
> original file?
>
> regards,
> -John

That's pretty easy. Consider a cryptographic prng. I can describe a
terabyte of output by a 16 byte seed. But given a TB file bzip2 [for
instance] wouldn't compress it a lick.

In your case it's just a pre-transform, that's not unheard of. I mean
that's all the BWT is anyways. It sorts data so RLE+entropy coders
can work on it.

Same argument applies to natural images. Take a photo and apply BZIP2
to the raw BGR values. Now apply a lossless binDCT to it. Magic, it
compresses better even though it's the same data!

Tom

Chi

unread,

May 1, 2010, 12:00:24 PM5/1/10

to

On Apr 27, 7:17 pm, tromp <john.tr...@gmail.com> wrote:
> I have a file of roughly 6 million strings each up to 16 letters over
> the alphabet {0,1,2,3,4,5,6}
> I sorted the file reverse-alphabetically. It starts
>
> 1101101000000100
> 121101000000100
> 131101000000100
> 141101000000100
> 151101000000100
> 161101000000100

How is that number is possible? I thought it is binary!? 11011...
1211... 1311...

Ernst

unread,

May 26, 2010, 10:29:43 PM5/26/10

to

On Tue, 27 Apr 2010 10:17:16 -0700, tromp wrote:

> I have a file of roughly 6 million strings each up to 16 letters over
> the alphabet {0,1,2,3,4,5,6}

What size is the Alphabet that is 16 over?

I am not understanding that sentence.

> I sorted the file reverse-alphabetically. It starts
>
> 11 01101000000100
> 12 1101000000100
> 13 1101000000100
> 14 1101000000100
> 15 1101000000100
> 16 1101000000100
> 11 011010000100
> 11 01111010000100
>
> Now when I bzip2 this file it becomes around 16MB:
> 16308503 data0.bz2
>
> I tried applying a simple transformation which keeps only the changed

> prefix of successive strings:

Interesting.. Prefix?

> (the changing prefix is always non-empty, so I use one less '<' than its
> length)

What does non-empty mean?

> 1101101000000100
> <<12
> <13
> <14
> <15
> <16
> <<<<<<<1101101
> <<11011
>
> I didn't expect bzip2 to do much better since it should have already
> noticed the repetitive tails
> of strings, but imagine my shock when seeing the result:

Well.. Someone pointed out to me that just because I can see pattern
doesn't mean statistically there is pattern.

Maybe 2what you have done allows Bzip to see. Congratulations.

>
> 16308503 data0.bz2
> 1322712 data1.bz2
>
> That's over 12 times smaller!
> With gzip the difference is smaller, but still a factor of 5.5.

> Is this a well known blind spot of compressors? Are there any
> compressors that recognize all the repetitiveness in the original file?
>
> regards,
> -John

Good work.. I admit I am not seeing this in a proper framework. But I
get the idea..

--
-Ernst

-*-Math:Symbolism:Logic ~ Three_Amigos-*-