LZF compression support.


ijuma

Nov 20, 2009, 2:09:14 AM
to project-voldemort
Hey guys,

Voldemort has supported compression for some time now, but GZIP is not
ideal for many scenarios. LZF is a compression algorithm optimised for
speed and more suited to Voldemort. Thanks to the H2 guys it was
trivial to add support for this, so I've done so:

http://code.google.com/p/project-voldemort/issues/detail?id=172

Let me know if you find any issues.

Best,
Ismael

Tatu Saloranta

Nov 24, 2009, 2:42:40 AM
to project-...@googlegroups.com
On Thu, Nov 19, 2009 at 11:09 PM, ijuma <ism...@juma.me.uk> wrote:
> Hey guys,
>
> Voldemort has supported compression for some time now, but GZIP is not
> ideal for many scenarios. LZF is a compression algorithm optimised for
> speed and more suited to Voldemort. Thanks to the H2 guys it was
> trivial to add support for this, so I've done so:

I had some time to play with the code, and found the results quite
encouraging. Using JSON data, I found that LZF can indeed be much
faster: roughly 2x faster compression compared to deflate (gzip), and maybe
30-50% faster decompression. But it is quite sensitive to the choice of
buffer and hash array sizes (a hash that is too large is bad, as it
needs to be cleared for each block), and it benefits a lot from buffer
reuse (which is true for gzip too). With the defaults, for example, LZF
compression was slightly slower than gzip, but after adjusting things it
became much faster.

As to compression ratio, numbers I saw for JSON documents of roughly
2k/20k/200k size were (in percentage compression):

* LZF: 66%/72%/72%
* Deflate: 81%/93%/97%

so compression ratio is not the strong point of LZF, but that was expected.
It can be slightly improved by the choice of hash table and compression block
sizes, but I don't think that's something one would usually optimize
for, given that gzip can still give better ratios if that's the goal.

Anyway: I thought this might be a useful data point.

-+ Tatu +-

ps. One thing I still haven't figured out is whether the 4-byte magic
cookie (header) being used is "standard" for LZF, or just one that the H2
project uses. For interoperability purposes it'd be good to know for
sure.

Tatu Saloranta

Nov 24, 2009, 1:35:07 PM
to project-...@googlegroups.com
On Mon, Nov 23, 2009 at 11:57 PM, ijuma <ism...@juma.me.uk> wrote:
> Hey Tatu,
>
> Thanks for the update.
>
> On Nov 24, 7:42 am, Tatu Saloranta <tsalora...@gmail.com> wrote:
>> With defaults, for example, LZF compress was slightly slower than gzip, but when adjusting things, it
>> became much faster.
>
> Any suggestions for changes we should make?

I have not looked at how easy this would be to do, but if the caller could
reuse the byte arrays being passed in, that would help with small/medium-sized
content (4k or less, say).

Another possibility would be to allow the caller to specify the hash size,
which defaults to 1 << 13 == 8k ints, i.e. 32kB.
Since this needs to be allocated or cleared for each block, it is rather
large for small content. At the very least, reducing it to 4k would seem
sensible.
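
To make this concrete, here is roughly what I have in mind -- the class and
method names below are made up for illustration (this is not the actual H2
API); the point is just that the hash table and output buffer get sized once
and then reused:

```java
import java.util.Arrays;

// Illustrative sketch only -- not the H2 API. The hash table and output
// buffer are allocated once (with a caller-chosen hash size) and reused,
// instead of being re-created for every block.
public class ReusableLZFCompressor {
    private final int[] hashTab;
    private byte[] outBuf = new byte[0];

    public ReusableLZFCompressor(int hashSize) {
        this.hashTab = new int[hashSize]; // e.g. 1 << 12 == 4k ints instead of 8k
    }

    public synchronized byte[] compress(byte[] data) {
        // LZF output can be slightly larger than the input in the worst case
        int maxLen = data.length + (data.length >> 4) + 64;
        if (outBuf.length < maxLen) {
            outBuf = new byte[maxLen];
        }
        Arrays.fill(hashTab, 0); // clear instead of re-allocating
        int len = compressBlock(data, data.length, outBuf, hashTab);
        return Arrays.copyOf(outBuf, len);
    }

    // Placeholder for the actual LZF inner loop (e.g. the logic in H2's CompressLZF).
    private int compressBlock(byte[] in, int inLen, byte[] out, int[] hashTab) {
        throw new UnsupportedOperationException("placeholder for the real codec");
    }
}
```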

A simple benchmark could help to test variations, and I think it
should also test the gzip alternative as a baseline. One could then just
give it a file or files to test, and see how different settings fare.
Perhaps that way one could manually find good default
settings; and users could also run it on their own data if they care
enough.
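
Something like this is what I have in mind -- the lzfCompress() call is just
a stand-in for whichever LZF entry point we end up using, and the deflate
side is plain JDK:

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.Deflater;

// Minimal benchmark sketch: times deflate (gzip's algorithm) as a baseline
// against an LZF call on the same input file. lzfCompress() is a stand-in.
public class CompressionBench {
    public static void main(String[] args) throws IOException {
        byte[] data = readFile(args[0]);
        int rounds = 1000;

        long start = System.nanoTime();
        int deflateLen = 0;
        for (int i = 0; i < rounds; i++) {
            deflateLen = deflate(data).length;
        }
        long deflateMs = (System.nanoTime() - start) / 1000000L;

        start = System.nanoTime();
        int lzfLen = 0;
        for (int i = 0; i < rounds; i++) {
            lzfLen = lzfCompress(data).length;
        }
        long lzfMs = (System.nanoTime() - start) / 1000000L;

        System.out.println("deflate: " + deflateLen + " bytes, " + deflateMs + " ms");
        System.out.println("lzf:     " + lzfLen + " bytes, " + lzfMs + " ms");
    }

    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Stand-in: plug in the LZF codec (e.g. H2's) here.
    static byte[] lzfCompress(byte[] data) {
        throw new UnsupportedOperationException("plug in the LZF codec");
    }

    static byte[] readFile(String name) throws IOException {
        FileInputStream in = new FileInputStream(name);
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            in.close();
        }
    }
}
```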

Also: one thing I noticed is that the decompression part does not actually
depend on any state in the compressor object. The method could thus be made
static (or copied to LZFInputStream), to avoid any allocations the
compressor does (I forget if it does those eagerly).
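
For what it's worth, from memory the LZF decode loop is roughly the following,
and nothing in it needs instance state (the real code in CompressLZF is the
reference of course, this is just a sketch):

```java
// Sketch of a static LZF expand, written from memory of the format; the point
// is only that it needs no compressor instance (and hence no allocations).
public final class LZFDecoder {
    private LZFDecoder() {
    }

    public static void expand(byte[] in, int inPos, int inLen, byte[] out, int outPos) {
        int inEnd = inPos + inLen;
        while (inPos < inEnd) {
            int ctrl = in[inPos++] & 0xff;
            if (ctrl < 32) {
                // literal run of (ctrl + 1) bytes copied straight to the output
                for (int i = 0; i <= ctrl; i++) {
                    out[outPos++] = in[inPos++];
                }
            } else {
                // back-reference into data already written to the output
                int len = ctrl >> 5;
                if (len == 7) {
                    len += in[inPos++] & 0xff;
                }
                int ref = outPos - ((ctrl & 0x1f) << 8) - (in[inPos++] & 0xff) - 1;
                len += 2;
                for (int i = 0; i < len; i++) {
                    out[outPos++] = out[ref++];
                }
            }
        }
    }
}
```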

>> ps. One thing I still haven't figured out is whether the 4-byte magic
>> cookie (header) being used is "standard" for LZF, or just one that H2
>> project uses. For interoperability purposes it'd be good to know for
>> sure.
>
> Good point.

I will build the LZF library on my Linux box and see what it uses, if
anything. I hope it does use something -- the main reason for H2 to use
their own marker would be if the core lib did not use anything (there is a
disclaimer saying the wrapper that comes with the lib is not really mature).

-+ Tatu +-

Tatu Saloranta

Nov 30, 2009, 2:30:52 PM
to project-...@googlegroups.com
On Tue, Nov 24, 2009 at 9:21 PM, ijuma <ism...@juma.me.uk> wrote:
> On Nov 25, 4:09 am, Tatu Saloranta <tsalora...@gmail.com> wrote:
>> I think it would be good idea to actually follow reference
>> implementation; and I can modify sources appropriately. I will try to
>> see if someone at h2 mailing list might know more as to why there is
>> this discrepancy.
>
> Thomas Mueller (the author) is usually very responsive, so that sounds
> like a good idea:
>
> http://groups.google.com/group/h2-database

I got a reply from Thomas; it sounds like the change was not intentional. So
if that's ok, I can look at changing the code slightly to produce a format
compatible with the command-line tool. That would make debugging
simpler, and make it easier for non-Java clients to do
compression/decompression if it is exposed to clients.
I will also try to see if there are things that can help limit the
overhead for small payloads.
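
For reference, my current understanding of the block framing the C
command-line tool uses is sketched below -- to be verified against the liblzf
sources before changing anything:

```java
import java.io.IOException;
import java.io.OutputStream;

// My understanding (unverified) of the block framing used by the C lzf tool:
//   'Z' 'V' 0x01  u16 compressedLength  u16 uncompressedLength  <compressed bytes>
//   'Z' 'V' 0x00  u16 length            <raw bytes>
// Lengths appear to be big-endian.
public class LZFFraming {
    public static void writeCompressedBlock(OutputStream out, byte[] compressed,
                                            int compLen, int uncompLen) throws IOException {
        out.write('Z');
        out.write('V');
        out.write(1);                      // 1 = compressed block
        out.write((compLen >> 8) & 0xff);
        out.write(compLen & 0xff);
        out.write((uncompLen >> 8) & 0xff);
        out.write(uncompLen & 0xff);
        out.write(compressed, 0, compLen);
    }
}
```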

-+ Tatu +-

Alex Feinberg

Nov 30, 2009, 4:24:50 PM
to project-...@googlegroups.com
>> Yes good point. How soon do you think next release will be?
> I think middle of the month, but Alex would know. Alex, can you please
> clarify?

The next release will be on the fifteenth of December. Starting with
November's (0.57), we are doing regular monthly releases on either the
fifteenth of the month or, if that falls on a weekend/holiday, the
closest Monday.

Thanks,
- Alex

Tatu Saloranta

Nov 30, 2009, 5:43:50 PM
to project-...@googlegroups.com
On Mon, Nov 30, 2009 at 12:41 PM, ijuma <ism...@juma.me.uk> wrote:
> On Nov 30, 8:03 pm, Tatu Saloranta <tsalora...@gmail.com> wrote:
...
> One question is how to include these changes. Right now, we include
> the compression files as a third-party dependency jar. Would we now
> include it in source form? The license (dual-licensed EPL/LGPL) may
> present an issue there.

It probably needs to be packaged separately -- that is, just have the ant
build produce a separate jar?
As a derivative work, it should be ok to apply the existing licenses to that re-build.
One possibility would be to move this out in the future as a separate
library (it would be good if other projects wanted to use it too).
But that would take a bit more time, so to get the change in, a local build
would seem the simplest way.

-+ Tatu +-

Tatu Saloranta

Nov 30, 2009, 10:32:35 PM
to project-...@googlegroups.com
On Mon, Nov 30, 2009 at 3:50 PM, ijuma <ism...@juma.me.uk> wrote:
> Hey Tatu,
>
> On Nov 30, 10:43 pm, Tatu Saloranta <tsalora...@gmail.com> wrote:
>> It probably needs to be packaged separately. That is, just make ant
>> build produce separate jar?
>> As derivative work, should be ok to apply existing licenses for that re-build.
>> One possibility would be to move this out in future as a separate
>> library (would be good if other projects would like to use it too).
>
> I've created a repository here:
>
> http://github.com/ijuma/h2-lzf
>
> Feel free to do the work in whichever way is easiest for you and I'll
> be happy to integrate it. I'll use that repository so that it's easy
> to point to the exact source version that is being used by Voldemort
> without adding source code with a different license to the main
> codebase.


Thank you. That makes sense, I will work from that then.

> By the way, did you see the following?
>
> http://code.google.com/p/h2database/source/detail?r=2104
>
> Not sure if that was coincidental or a result of a conversation
> between you and Thomas as it seems related to some of the points you
> mentioned earlier in the thread.

Interesting. Perhaps this is related; that would make sense. I had not
noticed this update, but I will have a look to make sure I start work
from up-to-date sources.

One question regarding compressors: looking at CompressionStrategy,
there is a comment saying that inflate can return the input array as is. That
makes sense for NopCompressionStrategy. But can it be used for
anything else? Basically, both the gzip and lzf codecs could conceivably
give up and return the input as is if they cannot compress it. But that
would require auto-detection on uncompressing, so I assume this is not
intended.

-+ Tatu +-

ijuma

Dec 1, 2009, 2:52:40 AM
to project-voldemort
On Dec 1, 3:32 am, Tatu Saloranta <tsalora...@gmail.com> wrote:
> One question regarding compressors: looking at CompressionStrategy,
> there is a comment saying that inflate can return the input array as is. That
> makes sense for NopCompressionStrategy.

That is the only class that takes advantage of that, yes.

> But can it be used for anything else? Basically both gzip and lzf codecs could conceivably
> give up and return input as is, if they can not compress it. But that
> would require auto-detection on uncompressing, so I assume this is not
> intended.

The idea was to give enough flexibility for these cases, but at the
moment, the compression algorithm has to do whatever is needed to
detect it and do the right thing. I would be open to introducing
utility classes that provide infrastructure for this, if desirable.

Best,
Ismael

Tatu Saloranta

Dec 1, 2009, 12:10:19 PM
to project-...@googlegroups.com
On Mon, Nov 30, 2009 at 11:52 PM, ijuma <ism...@juma.me.uk> wrote:
> On Dec 1, 3:32 am, Tatu Saloranta <tsalora...@gmail.com> wrote:
...
>
>> But can it be used for anything else? Basically both gzip and lzf codecs could conceivably
>> give up and return input as is, if they can not compress it. But that
>> would require auto-detection on uncompressing, so I assume this is not
>> intended.
>
> The idea was to give enough flexibility for these cases, but at the
> moment, the compression algorithm has to do whatever is needed to
> detect it and do the right thing. I would be open to introducing
> utility classes that provide infrastructure for this, if desirable.

I think I will assume for now that this should not be done (that is, I
will always use compression, as things work currently).
If dynamic choosing becomes desirable, it can be changed, but that will
require changes in multiple places (adding some kind of compression
type detection method to the lookup). This could also work for a simple
threshold scheme, where only payloads of N bytes or above are
compressed and smaller ones are left as is, on the assumption that smaller
data does not compress as well.
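
To show what I mean, here is a rough sketch of that threshold scheme with a
one-byte marker. It is not tied to the actual CompressionStrategy signatures,
and lzfCompress/lzfExpand are stand-ins for the real codec calls:

```java
// Sketch of the threshold idea only -- not the actual CompressionStrategy API.
// A single marker byte is prepended so that inflate can tell whether the
// payload was compressed at all.
public class ThresholdLZFStrategy {
    private static final byte RAW = 0;
    private static final byte COMPRESSED = 1;

    private final int threshold; // e.g. only compress payloads of 200 bytes or more

    public ThresholdLZFStrategy(int threshold) {
        this.threshold = threshold;
    }

    public byte[] deflate(byte[] data) {
        boolean compress = data.length >= threshold;
        byte[] body = compress ? lzfCompress(data) : data;
        byte[] result = new byte[body.length + 1];
        result[0] = compress ? COMPRESSED : RAW;
        System.arraycopy(body, 0, result, 1, body.length);
        return result;
    }

    public byte[] inflate(byte[] data) {
        byte[] body = new byte[data.length - 1];
        System.arraycopy(data, 1, body, 0, body.length);
        return data[0] == COMPRESSED ? lzfExpand(body) : body;
    }

    // Stand-ins for the actual LZF calls.
    private byte[] lzfCompress(byte[] data) {
        throw new UnsupportedOperationException("stand-in for the LZF codec");
    }

    private byte[] lzfExpand(byte[] data) {
        throw new UnsupportedOperationException("stand-in for the LZF codec");
    }
}
```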

-+ Tatu +-