1101101000000100
121101000000100
131101000000100
141101000000100
151101000000100
161101000000100
11011010000100
1101111010000100
Now when I bzip2 this file it becomes around 16MB:
16308503 data0.bz2
I tried applying a simple transformation which keeps only the changed
prefix of succesive strings:
(the changing prefix is always non-empty, so I use one less '<' than
its length)
1101101000000100
<<12
<13
<14
<15
<16
<<<<<<<1101101
<<11011
I didn't expect bzip2 to do much better since it should have already
noticed the repetitive tails
of strings, but imagine my shock when seeing the result:
16308503 data0.bz2
1322712 data1.bz2
That's over 12 times smaller!
With gzip the difference is smaller, but still a factor of 5.5.
Is this a well known blind spot of compressors?
Are there any compressors that recognize all the repetitiveness in the
original file?
regards,
-John
That's pretty easy. Consider a cryptographic prng. I can describe a
terabyte of output by a 16 byte seed. But given a TB file bzip2 [for
instance] wouldn't compress it a lick.
In your case it's just a pre-transform, that's not unheard of. I mean
that's all the BWT is anyways. It sorts data so RLE+entropy coders
can work on it.
Same argument applies to natural images. Take a photo and apply BZIP2
to the raw BGR values. Now apply a lossless binDCT to it. Magic, it
compresses better even though it's the same data!
Tom
How is that number is possible? I thought it is binary!? 11011...
1211... 1311...
> I have a file of roughly 6 million strings each up to 16 letters over
> the alphabet {0,1,2,3,4,5,6}
What size is the Alphabet that is 16 over?
I am not understanding that sentence.
> I sorted the file reverse-alphabetically. It starts
>
> 11 01101000000100
> 12 1101000000100
> 13 1101000000100
> 14 1101000000100
> 15 1101000000100
> 16 1101000000100
> 11 011010000100
> 11 01111010000100
>
> Now when I bzip2 this file it becomes around 16MB:
> 16308503 data0.bz2
>
> I tried applying a simple transformation which keeps only the changed
> prefix of successive strings:
Interesting.. Prefix?
> (the changing prefix is always non-empty, so I use one less '<' than its
> length)
What does non-empty mean?
> 1101101000000100
> <<12
> <13
> <14
> <15
> <16
> <<<<<<<1101101
> <<11011
>
> I didn't expect bzip2 to do much better since it should have already
> noticed the repetitive tails
> of strings, but imagine my shock when seeing the result:
Well.. Someone pointed out to me that just because I can see pattern
doesn't mean statistically there is pattern.
Maybe 2what you have done allows Bzip to see. Congratulations.
>
> 16308503 data0.bz2
> 1322712 data1.bz2
>
> That's over 12 times smaller!
> With gzip the difference is smaller, but still a factor of 5.5.
> Is this a well known blind spot of compressors? Are there any
> compressors that recognize all the repetitiveness in the original file?
>
> regards,
> -John
Good work.. I admit I am not seeing this in a proper framework. But I
get the idea..
--
-Ernst
-*-Math:Symbolism:Logic ~ Three_Amigos-*-