Compression level with python-LZ4


Basj

unread,
Mar 15, 2014, 5:48:19 AM3/15/14
to lz...@googlegroups.com
Hello,

Do you know how to change the compression level with Python-LZ4 (https://github.com/steeve/python-lz4)?

When using:

import lz4
lz4.compress(data)

it seems it's not yet possible to change the compression level (I see that levels -1 to -9 are possible here: http://fastcompression.blogspot.fr/p/lz4.html).
Any idea how to do this?


Best regards,
basj

Yann Collet

unread,
Mar 15, 2014, 6:16:47 AM3/15/14
to lz...@googlegroups.com
Hi Basj


Steeve Morin's Python version is based on LZ4 r91.
The ability to select a compression level is relatively new, and only available since r113.
So Steeve would need to upgrade his port to get this capability.


Regards

Yann

Francesc Alted

unread,
Mar 15, 2014, 7:03:16 AM3/15/14
to lz...@googlegroups.com
There is also Blosc:

https://github.com/Blosc/c-blosc
https://github.com/Blosc/python-blosc

which include support for the recent LZ4 r113 and for different compression levels:

In []: import numpy as np

In []: w = np.fromfile('call_being_recorded.wav', dtype='i2')

In []: import lz4

In []: wlz4 = lz4.compress(w)

In []: import blosc

In []: wblosc1 = blosc.pack_array(w, clevel=1, cname='lz4')

In []: wblosc5 = blosc.pack_array(w, clevel=5, cname='lz4')

In []: wblosc9 = blosc.pack_array(w, clevel=9, cname='lz4')

In []: len(w.tostring()), len(wlz4), len(wblosc1), len(wblosc5), len(wblosc9)
Out[]: (433598, 350521, 305565, 303147, 302138)

In this case, Blosc can compress more than stock LZ4 because it is meant
for binary data (like the 16-bit .wav file in this case).

And it is pretty fast too:

In []: %timeit lz4.compress(w)
1000 loops, best of 3: 1.01 ms per loop

In []: %timeit blosc.pack_array(w, clevel=5, cname='lz4')
1000 loops, best of 3: 1.64 ms per loop

although for maximum speed, the compress_ptr function is best:

In []: %timeit blosc.compress_ptr(w.__array_interface__['data'][0], len(w), 2, clevel=5, cname='lz4')
1000 loops, best of 3: 880 µs per loop

Blosc also comes with native support for other compressors, like
'blosclz', 'lz4hc', 'snappy' and 'zlib', so you can choose which one
adapts better to your use case.

Francesc



--
Francesc Alted

Basj

unread,
Mar 15, 2014, 9:55:32 AM3/15/14
to lz...@googlegroups.com
Hello Francesc,

That's amazing! How is it possible that
blosc.pack_array(w, cname='lz4')
does better than
lz4.compress(w)
? Isn't it the same algorithm?

Indeed, here we get approx. 14% better compression with blosc.pack_array(w, cname='lz4') than with lz4.compress:

> In []: len(w.tostring()), len(wlz4), len(wblosc1), len(wblosc5),
> len(wblosc9)
> Out[]: (433598, 350521, 305565, 303147, 302138)

As it is the same algorithm, can we improve python-lz4 directly, so that this module (import lz4) can also benefit from the approx. 14% better compression available in blosc?

More generally, blosc is really impressive, as is LZ4!
Such a shame I didn't discover these wonderful tools sooner...

Best regards, basj

PS: I noticed a small issue in blosc, *only* when using lz4 with the clevel=... flag, and *only* with big files like 1 GB:
error: Error -1 while compressing data
Have you experienced it too?
If I can reproduce the error more reliably, I'll post this issue on GitHub.

Valentin Haenel

unread,
Mar 15, 2014, 3:37:48 PM3/15/14
to lz...@googlegroups.com

Hi,

I would like to chime in at this point, I am the author of Bloscpack:

https://pypi.python.org/pypi/bloscpack

Bloscpack is a serialization format for and a command line interface to Blosc, which also has a Python API for serializing Numpy arrays.

Anyway, to offer an answer to your question: the reason why Blosc may achieve better performance, both in terms of ratio and speed, is twofold.

The speed may come from the fact that Blosc can operate in a multi-threaded mode. I believe that LZ4 is also multi-thread capable (something I haven't looked into yet), but I didn't see that the Python bindings support this. I might be mistaken, though, since I didn't look very hard and know very little about LZ4's multi-thread support.

The second reason, responsible for the improved compression ratio, is a shuffle filter which reorders the bytes of multi-byte elements. This can effectively reduce the Lempel-Ziv complexity for certain kinds of data (but might increase it for other kinds).

For more information about the shuffle filter, check slide 17 of:

http://www.slideshare.net/PyData/blosc-py-data-2014#
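For intuition, here is a minimal pure-Python sketch of a byte-shuffle filter of the kind Blosc applies (the real implementation is optimized C; the function names here are illustrative, not Blosc's API):

```python
def shuffle(buf: bytes, typesize: int) -> bytes:
    """Group the i-th byte of every element together (byte-shuffle)."""
    n = len(buf) // typesize
    return bytes(buf[j * typesize + i] for i in range(typesize) for j in range(n))

def unshuffle(buf: bytes, typesize: int) -> bytes:
    """Invert shuffle(): restore the original element-interleaved byte order."""
    n = len(buf) // typesize
    out = bytearray(len(buf))
    for i in range(typesize):
        for j in range(n):
            out[j * typesize + i] = buf[i * n + j]
    return bytes(out)

# Two little-endian 16-bit elements: 0x0201 and 0x0403
data = b"\x01\x02\x03\x04"
print(shuffle(data, 2))                    # b'\x01\x03\x02\x04'
assert unshuffle(shuffle(data, 2), 2) == data
```

The idea is that slowly varying 16-bit samples (like audio) tend to have near-constant high bytes; after shuffling, those high bytes form long runs that an LZ-type compressor handles very well.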

So, it would seem that you might be able to add the multi-threading capabilities to the Python bindings, but for the shuffle filter you might have to resort to using Blosc.

Regarding the large-files issue: Blosc has a limitation of approx. 2 GB, and this is where Bloscpack comes in. Bloscpack internally uses Blosc-compressed chunks and so does not suffer from this limitation. Although if you see this only with files of approx. 1 GB, it might be something else and is definitely worth looking into.

Additionally, I should mention that Bloscpack allows you to use LZ4 from the command line by virtue of Blosc supporting it:

$ blpk compress --codec lz4 data.dat

Hope that helps,

V-

Basj

unread,
Mar 15, 2014, 4:41:50 PM3/15/14
to lz...@googlegroups.com
Hello Valentin,

Thanks for your answer and for these explanations.

So the *only* difference between
import lz4
lz4.compress(my_array)
and
blosc.pack_array(my_array, cname='lz4')
that could change the compression ratio is the shuffle filter? Or could there be another reason or small improvement responsible for the compression ratio?

Francesc, is this (the shuffle filter) what you were talking about when you said:
> In this case, Blosc can compress more than stock LZ4 because it is meant
> for binary data (like the 16-bit .wav file in this case).
?

Best regards.

Francesc Alted

unread,
Mar 16, 2014, 4:39:55 PM3/16/14
to lz...@googlegroups.com
On 3/15/14, 9:41 PM, Basj wrote:
> Hello Valentin,
>
> Thanks for your answer and for these explanations.
>
> So the *only* difference between
> import lz4
> lz4.compress(my_array)
> and
> blosc.pack_array(my_array, cname='lz4')
> that could change *the compression ratio* is the shuffle filter? Or
> could there be another reason / small improvement responsible for the
> compression ratio ?

Yes, the shuffle filter is the main reason for the improvement in the
compression ratio in this case. Also, Blosc chooses different block
sizes for different compression levels; smaller compression levels mean
smaller blocks, and hence the compressor has fewer opportunities to find
redundancies.
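The block-size effect can be demonstrated with any LZ-family codec; here is a small sketch using the stdlib's zlib as a stand-in (blosc itself is not required, and the 256-byte block size is an arbitrary choice for illustration):

```python
import zlib

# Highly redundant input: the same phrase repeated many times.
data = b"the quick brown fox jumps over the lazy dog. " * 500

# One big block: the compressor sees all the redundancy at once.
whole = len(zlib.compress(data, 6))

# Many small, independently compressed blocks: cross-block matches are
# impossible, and each block pays its own header overhead.
block = 256
split = sum(len(zlib.compress(data[i:i + block], 6))
            for i in range(0, len(data), block))

print(whole, split)  # the per-block total is substantially larger
```

The same trade-off applies to Blosc's internal blocks: bigger blocks give the compressor more history to match against, at the cost of speed and cache-friendliness.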

>
> Francesc, was it this (=the shuffle filter) that you were talking
> about when you said :
> > In this case, Blosc can compress more than stock LZ4 because it is meant
> > for binary data (like the 16-bit .wav file in this case).
> ?

Yeah, I was actually referring to shuffle. As Valentin said, it usually
works well for binary data, but in some situations it can lead to worse
compression ratios too, so it is always worth disabling it (it is
active by default, but you can deactivate it by setting the parameter
`shuffle=False` in the Python wrappers) and seeing the difference.
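To see the "try it both ways" advice in action without installing blosc, one can pair a toy byte-shuffle with the stdlib's zlib as a stand-in compressor (the `shuffle` function here is an illustrative sketch, not blosc's API):

```python
import struct
import zlib

def shuffle(buf: bytes, typesize: int) -> bytes:
    """Toy byte-shuffle: group the i-th byte of every element together."""
    n = len(buf) // typesize
    return bytes(buf[j * typesize + i] for i in range(typesize) for j in range(n))

# A smooth 16-bit ramp, a crude stand-in for .wav-like samples.
raw = struct.pack("<4096H", *range(4096))

plain = len(zlib.compress(raw, 6))
shuffled = len(zlib.compress(shuffle(raw, 2), 6))
print(plain, shuffled)  # shuffling wins on this smooth data
```

On data without multi-byte structure (e.g. already-compressed or random bytes), shuffling buys nothing and can slightly hurt, which is why toggling it and comparing is worthwhile.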

Cheers,

--
Francesc Alted

Basj

unread,
Mar 21, 2014, 5:23:01 AM3/21/14
to lz...@googlegroups.com
Hi Yann,

About compression levels, I see at http://fastcompression.blogspot.fr/p/lz4.html that:

Compression levels 0 to 2 translate into "fast compression" (the default).
Compression levels 3 to 9 translate into "high compression".

Just to be sure, does this mean there are really only two compression levels, normal and HC?

All the best,
basj

Yann Collet

unread,
Mar 21, 2014, 9:25:56 AM3/21/14
to lz...@googlegroups.com
Good point.

The current Windows binary is based on an older version of the LZ4 source code,
and therefore doesn't yet support intermediate compression levels. So it's either fast or full HC; no middle ground.

It will be updated in a future version, although I don't know when.