On Tue, Dec 15, 2009 at 1:07 AM, Jay Kreps <
jay....@gmail.com> wrote:
> Also, one thing LinkedIn does a lot of is serve read-only data that
> comes from Hadoop (described here:
>
> http://project-voldemort.com/blog/2009/06/). We are not currently
> using any compression on the voldemort side, but we are using the
> block compression built in to sequence files in hadoop. We have
> noticed that this compression is really effective even when we
> basically have just a bunch of integers, we see about a 25%
> compression ratio.
Absolutely.
> Is this kind of block compression fundamentally more effective because
> it works on data larger than a single value? If so, is there any kind
> of compression that allows decompression from an arbitrary position in
> the compressed file without decompressing the whole thing? The data
> format for the read-only data files is
I think it is important to compress more than a single value/row at a
time -- repetition is what mostly matters (and for LZF, the only thing
that matters, since there is no Huffman coding etc). But the block
size need not be astronomical.
In fact LZF uses a block size of at most 64k anyway (gzip has a
similar lower-level blocking structure too), and back references can
reach at most 8k into the past.
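Not LZF specifically, but the effect is easy to see even with the
JDK's Deflater. A quick made-up sketch (the values and counts are just
invented for illustration) that compresses a bunch of 4-byte ints one
at a time versus as a single block:

import java.nio.ByteBuffer;
import java.util.zip.Deflater;

public class BlockVsValueCompression {
    // Compress with DEFLATE and return just the compressed size.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // only counting, so buf is reused
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        int n = 10000;
        ByteBuffer block = ByteBuffer.allocate(n * 4);
        int perValue = 0;
        for (int i = 0; i < n; i++) {
            int v = 1000000 + (i % 500); // hypothetical ids with some repetition
            perValue += compressedSize(ByteBuffer.allocate(4).putInt(v).array());
            block.putInt(v);
        }
        int perBlock = compressedSize(block.array());
        System.out.println("per-value: " + perValue + " bytes, per-block: " + perBlock + " bytes");
    }
}

Per value you pay header overhead on every 4-byte entry and find no
repetition at all; per block the compressor sees the redundancy across
values, which is where block compression gets its ratio even on plain
integers.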
Now: it is definitely possible to handle sub-sections at a time,
ideally low-level blocks. That can be exposed via the Hadoop
compressor abstractions, but it is a rather involved process (I tried
reading through it and my head hurts now :) ).
But it does allow you to skip over large sections efficiently.
You can also build segmenting/framing above the compressor level, so
it is not strictly mandatory for the compressor to do that.
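To make "framing above the compressor" concrete, a minimal sketch of
one possible layout (my own made-up scheme, using the JDK Deflater
instead of LZF): split the input into fixed-size uncompressed chunks,
compress each chunk on its own, and keep a table of compressed-chunk
offsets that a reader can later use to seek straight to the chunk it
needs:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class ChunkedWriter {
    static final int CHUNK_SIZE = 64 * 1024; // uncompressed bytes per chunk

    // Returns the compressed offset of each chunk (plus the end offset);
    // this table would be stored alongside the data file or in the index.
    static long[] writeChunks(byte[] data, ByteArrayOutputStream out) {
        int chunkCount = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] chunkOffsets = new long[chunkCount + 1]; // chunkOffsets[0] == 0
        byte[] buf = new byte[CHUNK_SIZE + 1024];

        for (int i = 0; i < chunkCount; i++) {
            int from = i * CHUNK_SIZE;
            int len = Math.min(CHUNK_SIZE, data.length - from);

            Deflater deflater = new Deflater();
            deflater.setInput(data, from, len);
            deflater.finish();
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            deflater.end();
            chunkOffsets[i + 1] = out.size(); // running compressed offset
        }
        return chunkOffsets;
    }
}

The compressor itself does not need to know about any of this; the
chunk table is just extra metadata next to the data.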
> <size><value>
> <size><value>
> ...
> <size><value>
>
> The index has byte offset pointers into this file giving the location
> to read from. Is it possible to GZIP or LZF compress the whole file
> during the build of the data files and then decompress from the offset
> of the value when a GET request comes in without decompressing the
> whole file? Perhaps one could break the file into chunks of some fixed
> size X, and then decompress just the needed chunks (indeed, I think
> that is maybe how some of the compression algorithms work internally)?
... indeed, that is actually something I am planning on working on (in
the future, if I have time).
You will need to figure out a proper segmentation/framing structure to
support it, but it is possible.
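For the GET path with your <size><value> layout, here is how the read
side of that hypothetical chunk scheme could look (again just a sketch
with made-up details: the index is assumed to store the *uncompressed*
offset of each entry, and the size prefix is assumed to be a 4-byte
big-endian int):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class ChunkedReader {
    static final int CHUNK_SIZE = 64 * 1024;

    static byte[] get(RandomAccessFile file, long[] chunkOffsets, long uncompressedOffset)
            throws IOException, DataFormatException {
        int chunk = (int) (uncompressedOffset / CHUNK_SIZE);
        int within = (int) (uncompressedOffset % CHUNK_SIZE);

        // Read only this chunk's compressed bytes from disk.
        long start = chunkOffsets[chunk];
        byte[] compressed = new byte[(int) (chunkOffsets[chunk + 1] - start)];
        file.seek(start);
        file.readFully(compressed);

        // Inflate only this chunk; the output buffer holds a full chunk,
        // so a single inflate() call is enough here.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] chunkData = new byte[CHUNK_SIZE];
        inflater.inflate(chunkData);
        inflater.end();

        // <size><value> entry at the offset; assumes an entry never
        // straddles a chunk boundary (padding, or also reading the next
        // chunk, would be needed to handle that case).
        int size = ((chunkData[within] & 0xFF) << 24)
                 | ((chunkData[within + 1] & 0xFF) << 16)
                 | ((chunkData[within + 2] & 0xFF) << 8)
                 |  (chunkData[within + 3] & 0xFF);
        byte[] value = new byte[size];
        System.arraycopy(chunkData, within + 4, value, 0, size);
        return value;
    }
}

So the cost per GET is one seek plus inflating at most 64k, not the
whole file.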
And I think it would be very useful: if you can spend some CPU to
compress/decompress data, you effectively get a bigger fleet for the
added CPU overhead. But it needs balancing; it depends on how well the
data compresses, the size of entries, and so on.
> Suffice it to say I don't know much about compression, but somehow I
> suspect the read-only nature of the data means there is a way to
> compress it even more.
I think that is the key, indeed. When you start modifying compressed
data, things get complicated and slow.
For read-only (or read-by-key, bulk-update) cases compression works
much better.
-+ Tatu +-