strange compression ratios

Vitor Oliveira Jul 2, 2012 11:57 AM
Posted in group: LZ4c

I've been using zlib to store some very large files generated by an application of mine. 
Since I need to process data quickly and compressing it was taking a long time, I switched to LZ4 and the compression speed improved amazingly - but that surely is no surprise! :-)

But my data is kind of peculiar: large binary log files that can grow to 100s of GB, but that, after a first compression, can be compressed again and shrink quite a bit more.
Using zlib (default mode) on a 3.5GB test file, the first pass reduces it to 54MB and the second pass to 17MB! With LZ4, the same file goes to 56MB on the first pass and to 9.5MB on the second pass; a third pass leaves it at just over 9MB. With LZ4HC, the file goes to 44MB on the first pass, then 2MB, then 1MB, and stays at just over 750K after 5 rounds!

Although space and IO time for such massive files are a great concern, the problem I was addressing was faster compression.
With the default settings zlib makes my program spend 265ms per step, while LZ4HC takes 130ms (zlib with fast compression takes 110ms, but compression is much worse). LZ4, on the other hand, takes 6ms. Since most of the data goes away on the first pass, I ended up running LZ4 on the first pass and then doing two rounds of LZ4HC, which took 16ms in total and produced a 4MB file.
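A minimal sketch of that kind of mixed pipeline (one fast pass followed by stronger passes), with size and time reported per stage. It assumes Python and uses zlib levels 1 and 9 purely as stand-ins for LZ4 and LZ4HC, since zlib ships with the standard library; `run_pipeline`, `fast`, `strong`, and the sample data are hypothetical names for illustration, not part of LZ4.

```python
import time
import zlib


def run_pipeline(data, stages):
    """Apply each (name, compress_fn) stage in order.

    Returns the final blob and a list of (name, output_size, elapsed_ms).
    """
    report = []
    blob = data
    for name, compress_fn in stages:
        start = time.perf_counter()
        blob = compress_fn(blob)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        report.append((name, len(blob), elapsed_ms))
    return blob, report


# Assumption: zlib level 1 stands in for the fast codec (LZ4),
# zlib level 9 stands in for the slow, stronger codec (LZ4HC).
def fast(blob):
    return zlib.compress(blob, 1)


def strong(blob):
    return zlib.compress(blob, 9)


if __name__ == "__main__":
    # Hypothetical input: a 24-byte pattern repeated many times.
    sample = b"0123456789abcdefghijklmn" * 100_000
    blob, report = run_pipeline(
        sample, [("fast", fast), ("strong", strong), ("strong", strong)]
    )
    for name, size, ms in report:
        print(f"{name:>6}: {size} bytes, {ms:.2f} ms")
```

The later stages only see the already small output of the first stage, which is why the expensive codec adds little time in such a chain. To restore the data, the decompressors have to be applied in reverse order, once per stage.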

I also tried with some text files and the results are not as good, although two rounds, one with LZ4 and one with LZ4HC, ended up producing a file about the same size as a single LZ4HC pass in 1/5 of the time that single pass took.

This has me quite puzzled. I believe that the regular structure of those files may have some redundancy that is out of reach until the file is compressed. The file also has 24-byte records where two 64-bit values vary from line to line but repeat a lot along the way.
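One way to probe that hypothesis without the real logs is to generate synthetic 24-byte records and watch the size after each pass. The sketch below is only an assumption about the layout: two 64-bit values drawn from a small pool plus a hypothetical 64-bit counter to fill the remaining 8 bytes. It uses zlib from the Python standard library rather than LZ4, so the numbers will not match the ones above; `make_records`, `multi_pass`, and `unwind` are illustrative names.

```python
import random
import struct
import zlib


def make_records(count, pool_size=8, seed=0):
    """Build `count` 24-byte records: a counter plus two 64-bit values
    picked from a small pool, so the values repeat often (assumed layout)."""
    rng = random.Random(seed)
    pool = [rng.getrandbits(64) for _ in range(pool_size)]
    out = bytearray()
    for i in range(count):
        out += struct.pack("<QQQ", i, rng.choice(pool), rng.choice(pool))
    return bytes(out)


def multi_pass(data, passes, level=6):
    """Compress repeatedly; stages[0] is the input, stages[k] is pass k."""
    stages = [data]
    for _ in range(passes):
        stages.append(zlib.compress(stages[-1], level))
    return stages


def unwind(blob, passes):
    """Undo `passes` rounds of compression."""
    for _ in range(passes):
        blob = zlib.decompress(blob)
    return blob


if __name__ == "__main__":
    stages = multi_pass(make_records(200_000), passes=3)
    for k, stage in enumerate(stages):
        print(f"pass {k}: {len(stage)} bytes")
    # Integrity check: unwinding all passes must give back the input.
    assert unwind(stages[-1], 3) == stages[0]
```

If the second pass still shrinks the output noticeably, that would be consistent with the first pass emitting a long, regular stream of near-identical match tokens, which is itself compressible.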

I know my data is not a "good example", but does this make sense to anyone? 
This is also a heads-up, maybe someone else has data as "special" as mine and can get files to 1/6th the size in 1/16th the time of zlib!!

Regards and thanks again for LZ4,

PS: I really did check integrity of the files and they do carry useful information.