strange compression ratios


Vitor Oliveira

Jul 2, 2012, 2:57:36 PM
to lz...@googlegroups.com, António Pina
Hi,

I've been using zlib to store some very large files generated by an application of mine. 
Since I need to process data quickly and compression was taking a long time, I switched to LZ4, and the compression speed improved amazingly - but that is surely no surprise! :-)

But my data is kind of peculiar: large binary log files that can grow to hundreds of GB but that, after a first compression, can be compressed again and shrink quite a bit more.
With zlib (default mode) on a 3.5GB test file, the first pass reduces it to 54MB and a second pass takes it down to 17MB! With LZ4, the same file goes to 56MB on the first pass and 9.5MB on the second; a third pass leaves it at just over 9MB. With LZ4HC, the file goes to 44MB on the first pass, then 2MB, then 1MB, and settles at just over 750K after 5 rounds!

Although space and IO time for such massive files are a great concern, the problem I was addressing was faster compression.
With the default settings, zlib makes my program spend 265ms per step, while LZ4HC takes 130ms (zlib with fast compression takes 110ms, but compresses much worse). LZ4, on the other hand, takes 6ms. Since most of the data goes away on the first pass, I ended up running LZ4 for the first pass followed by two rounds of LZ4HC, which takes 16ms and produces a 4MB file.
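
In case anyone wants to play with the idea, here is a rough sketch of the multi-pass scheme in C against the LZ4 library. The function names follow the current lz4.h/lz4hc.h headers and may differ in older releases; the input is only a synthetic stand-in for my 24-byte records, and the framing you would need to decompress again is left out.

/* Multi-pass sketch: one fast LZ4 pass, then two LZ4HC passes on the
   already-compressed output.  Build with e.g.:  cc multipass.c -llz4  */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "lz4.h"
#include "lz4hc.h"

/* One compression pass.  'hc' selects LZ4HC instead of fast LZ4.
   Returns a freshly allocated buffer and stores the compressed size in *dstSize. */
static char* one_pass(const char* src, int srcSize, int hc, int* dstSize)
{
    int bound = LZ4_compressBound(srcSize);
    char* dst = malloc(bound);
    if (dst == NULL) { *dstSize = 0; return NULL; }
    *dstSize = hc ? LZ4_compress_HC(src, dst, srcSize, bound, LZ4HC_CLEVEL_DEFAULT)
                  : LZ4_compress_default(src, dst, srcSize, bound);
    return dst;
}

int main(void)
{
    /* Synthetic stand-in for the log: 1,000,000 records of 24 bytes,
       with two 64-bit fields that repeat heavily and one constant field. */
    const int nbRecords = 1000000;
    const int recSize   = 24;
    int srcSize = nbRecords * recSize;
    char* src = malloc(srcSize);
    if (src == NULL) return 1;
    for (int i = 0; i < nbRecords; i++) {
        uint64_t rec[3] = { (uint64_t)(i / 1000),        /* slowly varying    */
                            (uint64_t)(i % 16),          /* highly repetitive */
                            0x1122334455667788ULL };     /* constant          */
        memcpy(src + i * recSize, rec, recSize);
    }

    /* Pass 1: fast LZ4.  Passes 2 and 3: LZ4HC on the previous pass's output. */
    int s1, s2, s3;
    char* p1 = one_pass(src, srcSize, 0, &s1);
    if (p1 == NULL || s1 <= 0) return 1;
    char* p2 = one_pass(p1, s1, 1, &s2);
    if (p2 == NULL || s2 <= 0) return 1;
    char* p3 = one_pass(p2, s2, 1, &s3);
    if (p3 == NULL || s3 <= 0) return 1;

    printf("original       : %d bytes\n", srcSize);
    printf("pass 1 (LZ4)   : %d bytes\n", s1);
    printf("pass 2 (LZ4HC) : %d bytes\n", s2);
    printf("pass 3 (LZ4HC) : %d bytes\n", s3);

    free(p3); free(p2); free(p1); free(src);
    return 0;
}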

I also tried some text files, and the results are not as good, although two passes, one with LZ4 and one with LZ4HC, produced a file about the same size as a single LZ4HC pass in about 1/5 of the time.

This has me quite puzzled. I believe that the regular structure of those files may hold some redundancy that is out of reach until the file is compressed. The file also has 24-byte records in which two 64-bit values vary from record to record but repeat a lot along the way.

I know my data is not a "good example", but does this make sense to anyone? 
This is also a heads-up: maybe someone else has data as "special" as mine and can get files to 1/6th the size in 1/16th the time of zlib!!

Regards and thanks again for LZ4,
Vitor

PS: I really did check the integrity of the files, and they do carry useful information.

Yann Collet

Jul 2, 2012, 11:52:43 PM
to lz...@googlegroups.com
Hi Vitor

Yes, it does make sense.
Compressing highly redundant data such as log files several times is expected to improve the ratio.

The fundamental reason is that these highly repetitive byte sequences, with very small and regular differences,
produce repetitive compressed sequences, which can therefore be compressed further.

You made a pretty thorough investigation, and the results you share with us are both very interesting and better than expected.

It could be that LZ4's main weakness, its byte-oriented compressed structure,
turns out to be an advantage in this scenario:
the next rounds still get byte-aligned input to work on.
In contrast, zlib is bit-oriented, due to its Huffman back end, which results in an unaligned compressed structure,
and therefore less potential for further compression rounds.
Stronger compression algorithms use range encoding, which destroys any bit/byte correlation, so I wouldn't expect any gain from trying to compress their output further.
At least, that's the main idea that comes to mind.
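
A quick way to put this alignment argument to the test is to compress the same redundant input twice with zlib and twice with LZ4, and compare how much the second pass still gains in each case. A minimal sketch (the input is again a synthetic record stream, not the actual log file; link with -lz -llz4):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>
#include "lz4.h"

/* One zlib pass at the default level; returns the compressed size, or 0 on error. */
static uLong zlib_pass(const Bytef* src, uLong srcSize, Bytef** dst)
{
    uLongf dstLen = compressBound(srcSize);
    *dst = malloc(dstLen);
    if (*dst == NULL) return 0;
    if (compress2(*dst, &dstLen, src, srcSize, Z_DEFAULT_COMPRESSION) != Z_OK) return 0;
    return dstLen;
}

/* One fast LZ4 pass; returns the compressed size, or 0 on error. */
static int lz4_pass(const char* src, int srcSize, char** dst)
{
    int bound = LZ4_compressBound(srcSize);
    *dst = malloc(bound);
    if (*dst == NULL) return 0;
    return LZ4_compress_default(src, *dst, srcSize, bound);
}

int main(void)
{
    /* Synthetic redundant input: 1,000,000 records of 24 bytes each. */
    const int nbRecords = 1000000, recSize = 24;
    int srcSize = nbRecords * recSize;
    char* src = malloc(srcSize);
    if (src == NULL) return 1;
    for (int i = 0; i < nbRecords; i++) {
        uint64_t rec[3] = { (uint64_t)(i / 1000), (uint64_t)(i % 16), 42 };
        memcpy(src + i * recSize, rec, recSize);
    }

    /* zlib -> zlib */
    Bytef *z1 = NULL, *z2 = NULL;
    uLong zs1 = zlib_pass((const Bytef*)src, (uLong)srcSize, &z1);
    uLong zs2 = zs1 ? zlib_pass(z1, zs1, &z2) : 0;

    /* LZ4 -> LZ4 */
    char *l1 = NULL, *l2 = NULL;
    int ls1 = lz4_pass(src, srcSize, &l1);
    int ls2 = ls1 > 0 ? lz4_pass(l1, ls1, &l2) : 0;

    printf("zlib: pass 1 = %lu bytes, pass 2 = %lu bytes\n", (unsigned long)zs1, (unsigned long)zs2);
    printf("lz4 : pass 1 = %d bytes,  pass 2 = %d bytes\n", ls1, ls2);

    free(z2); free(z1); free(l2); free(l1); free(src);
    return 0;
}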

I was not expecting such results to transfer beyond log files,
but your experiment seems to show that even for standard text files there is an advantage in multi-pass compression.
This is a very interesting and unexpected outcome.
Maybe I should start looking into it too.

Best Regards



Christiane Ruetten

Jul 3, 2012, 4:55:28 AM
to lz...@googlegroups.com, António Pina
Hi all,

I suspect that the real reason for this is the blocking, which makes the algorithms blind to redundancies beyond the block size. In highly structured data like log files, however, redundancy persists over the whole data set.

The first round takes care of that limitation. Since the compression ratio there is about 64:1, the first pass produces highly repetitive compressed blocks at 1/64th of the original size. The second round therefore lets the compressor see about 64 blocks' worth of the original data at once, and it will happily get rid of the redundancies there.

Best,
Christiane

Yann Collet

Jul 3, 2012, 12:48:02 PM
to lz...@googlegroups.com
Correct, this is an important part of the explanation.

Also, let's underline the main idea here:
a fast, byte-aligned compressor can be used as a kind of preprocessor for highly redundant input,
producing a much smaller but still compressible data stream,
which can then be compressed with a stronger algorithm as a "final pass".
The stronger algorithm works on much less data, and therefore completes its job much faster than it would on the initial raw data stream.

From a practical perspective, this method seems to save a lot of CPU power (with Vitor's measurements as an example), which is the main benefit to keep in mind.



Yann Collet

Jul 4, 2012, 6:37:21 PM
to lz...@googlegroups.com, António Pina
Hi Vitor

I have finally found a tiny bit of time to test your multi-pass compression methodology on some available sample files.
And I must say the results are not as good as yours.

First, on text files (standard English),
I do not get any benefit from a second pass,
not even when LZ4HC follows LZ4.
That means two passes (LZ4 + LZ4HC) do not get even close to a single LZ4HC pass.

Second, on log files:
in that case, yes, there are some benefits to using multiple passes,
but they are not as large as yours.
Typically, the second pass provides an additional 10% to 30% compression improvement.
Not bad, but not as stellar as your results.

So, my current guess is that your log file structure may be especially suitable for LZ4.
I don't know exactly why, but my first guess is that your log file probably has an aligned structure.
That would make the delta codes extremely similar, and therefore very suitable for a second compression pass.


Regards

Yann

Vitor Oliveira

Jul 5, 2012, 3:11:30 AM
to lz...@googlegroups.com, António Pina
Hi Yann,

The text files I used for my tests were based on the same kind of data I was using in the other files, so they were also large and highly redundant.
As for my log files, they are indeed aligned binary files. They are composed of many sections with different types of data, each section being a variable-sized table of fixed-size records; the largest structure by far is a table of up to 1,000,000 records of 24 bytes each. The sections are repeated many times.

I'm certain that some fortunate alignment must be appearing in my data, but it would be nice if this applied to more cases. The information content of this file doesn't seem to be much lower than that of many other log files.
The block size and the byte alignment are certainly the essential components here, and it seems to me that, for the technique to be more general, the block size must match the redundancy in the data well across the various rounds of compression.

Regards,
Vitor

PS: If anyone wants to try, I can send the 3.5GB test log file by e-mail on request (vitor.s.p.oliveira [-] gmail.com).
It will take me about... 750K!! :-)

Alexander Novoselov

Feb 2, 2013, 3:43:39 AM
to lz...@googlegroups.com, António Pina
For my log files:
- best speed: c0+c0 (5-10 times faster than c1+c1)
- best compression: c1+c1 (5-15 times smaller than c0+c0)
Further passes give no significant improvement, but if the third step is a simple zip (deflate):
- c0+c0+zip: speed is about +50% over the first c0 step alone (still good), and the size is 10-50% smaller than c0+c0
- c1+c1+zip: speed is about +2% over the first c1 step alone, and the size is about 10% smaller than c1+c1
And you can use zip as a container :)

Yann Collet

Feb 4, 2013, 1:01:20 PM
to lz...@googlegroups.com, António Pina
Hi Alexander

This is quite normal and expected.

Thanks to Vitor's investigations, I suggest the following method to provide the best speed/ratio trade-off:

1) First pass: use c0, for speed.
2) All subsequent intermediate passes: use c1, for iterative compression.
Speed matters less here, because the first pass (c0) has already reduced the amount of data to compress by a sizable margin.
3) Last pass: use a "scrambling" compressor, such as gzip or 7zip. It will compress better.
The main point is that you can't compress any further after such a compressor, because its output doesn't respect byte boundaries.

In your use case, I would therefore recommend:
- c0+c1+zip
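
For reference, a rough sketch of that c0+c1+zip chain in C, reading c0 as the fast LZ4 mode, c1 as LZ4HC, and "zip" as a raw deflate pass through zlib's compress2. The file is loaded whole, so this assumes it fits within LZ4's int-sized block limit; a real tool would work block by block and store the intermediate sizes needed for decompression. Link with -llz4 -lz.

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>
#include "lz4.h"
#include "lz4hc.h"

int main(int argc, char** argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    /* Load the whole file into memory. */
    FILE* f = fopen(argv[1], "rb");
    if (f == NULL) return 1;
    fseek(f, 0, SEEK_END);
    long fileSize = ftell(f);
    fseek(f, 0, SEEK_SET);
    char* raw = malloc((size_t)fileSize);
    if (raw == NULL || fread(raw, 1, (size_t)fileSize, f) != (size_t)fileSize) return 1;
    fclose(f);

    /* Stage 1 ("c0"): fast LZ4. */
    int bound1 = LZ4_compressBound((int)fileSize);
    char* s1 = malloc((size_t)bound1);
    if (s1 == NULL) return 1;
    int size1 = LZ4_compress_default(raw, s1, (int)fileSize, bound1);
    if (size1 <= 0) return 1;

    /* Stage 2 ("c1"): LZ4HC on the LZ4 output. */
    int bound2 = LZ4_compressBound(size1);
    char* s2 = malloc((size_t)bound2);
    if (s2 == NULL) return 1;
    int size2 = LZ4_compress_HC(s1, s2, size1, bound2, LZ4HC_CLEVEL_DEFAULT);
    if (size2 <= 0) return 1;

    /* Stage 3 ("zip"): a final deflate pass via zlib; after this stage the
       output is no longer byte-aligned, so further passes would not help. */
    uLongf size3 = compressBound((uLong)size2);
    Bytef* s3 = malloc(size3);
    if (s3 == NULL) return 1;
    if (compress2(s3, &size3, (const Bytef*)s2, (uLong)size2, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;

    printf("raw              : %ld bytes\n", fileSize);
    printf("after c0 (LZ4)   : %d bytes\n", size1);
    printf("after c1 (LZ4HC) : %d bytes\n", size2);
    printf("after zip        : %lu bytes\n", (unsigned long)size3);

    free(s3); free(s2); free(s1); free(raw);
    return 0;
}

Decompression simply runs the chain in reverse (inflate, then two LZ4 decompressions), which is why the intermediate sizes have to be stored somewhere alongside the output.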

Regards

Yann