On Tue, Dec 15, 2009 at 1:07 AM, Jay Kreps <
jay....@gmail.com> wrote:
> Also, one thing LinkedIn does a lot of is serve read-only data that
> comes from Hadoop (described here:
>
> http://project-voldemort.com/blog/2009/06/). We are not currently
> using any compression on the voldemort side, but we are using the
> block compression built in to sequence files in hadoop. We have
> noticed that this compression is really effective even when we
> basically have just a bunch of integers, we see about a 25%
> compression ratio.
Absolutely.
> Is this kind of block compression fundamentally more effective because
> it works on data larger than a single value? If so, is there any kind
> of compression that allows decompression from an arbitrary position in
> the compressed file without decompressing the whole thing? The data
> format for the read-only data files is
I think it is important to compress more than a single value/row at a
time -- repetition is what mostly matters (and for LZF, the only thing
that matters, since there is no Huffman coding etc). But the block
size need not be astronomical.
In fact LZF uses a block size of at most 64k anyway (gzip has a
similar lower-level blocking structure too), and back references can
reach at most 8k into the past.
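Not LZF specifically, but the effect is easy to see even with the
JDK's Deflater. A quick made-up sketch (the values and counts are just
invented for illustration) that compresses a bunch of 4-byte ints one
at a time versus as a single block:

import java.nio.ByteBuffer;
import java.util.zip.Deflater;

public class BlockVsValueCompression {
    // Compress with DEFLATE and return just the compressed size.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // only counting, so buf is reused
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        int n = 10000;
        ByteBuffer block = ByteBuffer.allocate(n * 4);
        int perValue = 0;
        for (int i = 0; i < n; i++) {
            int v = 1000000 + (i % 500); // hypothetical ids with some repetition
            perValue += compressedSize(ByteBuffer.allocate(4).putInt(v).array());
            block.putInt(v);
        }
        int perBlock = compressedSize(block.array());
        System.out.println("per-value: " + perValue + " bytes, per-block: " + perBlock + " bytes");
    }
}

Per value you pay header overhead on every 4-byte entry and find no
repetition at all; per block the compressor sees the redundancy across
values, which is where block compression gets its ratio even on plain
integers.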
Now: it is definitely possible to handle sub-sections at a time,
ideally low-level blocks. That can be exposed via the Hadoop
compressor abstractions, but it is a rather involved process (I tried
reading through it and my head hurts now :) ).
But it does allow you to skip over large sections efficiently.
You can also build segmenting/framing above the compressor level, so
it is not strictly mandatory for the compressor to do that.
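To make "framing above the compressor" concrete, a minimal sketch of
one possible layout (my own made-up scheme, using the JDK Deflater
instead of LZF): split the input into fixed-size uncompressed chunks,
compress each chunk on its own, and keep a table of compressed-chunk
offsets that a reader can later use to seek straight to the chunk it
needs:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class ChunkedWriter {
    static final int CHUNK_SIZE = 64 * 1024; // uncompressed bytes per chunk

    // Returns the compressed offset of each chunk (plus the end offset);
    // this table would be stored alongside the data file or in the index.
    static long[] writeChunks(byte[] data, ByteArrayOutputStream out) {
        int chunkCount = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] chunkOffsets = new long[chunkCount + 1]; // chunkOffsets[0] == 0
        byte[] buf = new byte[CHUNK_SIZE + 1024];

        for (int i = 0; i < chunkCount; i++) {
            int from = i * CHUNK_SIZE;
            int len = Math.min(CHUNK_SIZE, data.length - from);

            Deflater deflater = new Deflater();
            deflater.setInput(data, from, len);
            deflater.finish();
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            deflater.end();
            chunkOffsets[i + 1] = out.size(); // running compressed offset
        }
        return chunkOffsets;
    }
}

The compressor itself does not need to know about any of this; the
chunk table is just extra metadata next to the data.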
> <size><value>
> <size><value>
> ...
> <size><value>
>
> The index has byte offset pointers into this file giving the location
> to read from. Is it possible to GZIP or LZF compress the whole file
> during the build of the data files and then decompress from the offset
> of the value when a GET request comes in without decompressing the
> whole file? Perhaps one could break the file into chunks of some fixed
> size X, and then decompress just the needed chunks (indeed, I think
> that is maybe how some of the compression algorithms work internally)?
... indeed, that is actually something I am planning on working on (in
the future, if I have time).
You will need to figure out a proper segmentation/framing structure to
support it, but it is possible.
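For the GET path with your <size><value> layout, here is how the read
side of that hypothetical chunk scheme could look (again just a sketch
with made-up details: the index is assumed to store the *uncompressed*
offset of each entry, and the size prefix is assumed to be a 4-byte
big-endian int):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class ChunkedReader {
    static final int CHUNK_SIZE = 64 * 1024;

    static byte[] get(RandomAccessFile file, long[] chunkOffsets, long uncompressedOffset)
            throws IOException, DataFormatException {
        int chunk = (int) (uncompressedOffset / CHUNK_SIZE);
        int within = (int) (uncompressedOffset % CHUNK_SIZE);

        // Read only this chunk's compressed bytes from disk.
        long start = chunkOffsets[chunk];
        byte[] compressed = new byte[(int) (chunkOffsets[chunk + 1] - start)];
        file.seek(start);
        file.readFully(compressed);

        // Inflate only this chunk; the output buffer holds a full chunk,
        // so a single inflate() call is enough here.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] chunkData = new byte[CHUNK_SIZE];
        inflater.inflate(chunkData);
        inflater.end();

        // <size><value> entry at the offset; assumes an entry never
        // straddles a chunk boundary (padding, or also reading the next
        // chunk, would be needed to handle that case).
        int size = ((chunkData[within] & 0xFF) << 24)
                 | ((chunkData[within + 1] & 0xFF) << 16)
                 | ((chunkData[within + 2] & 0xFF) << 8)
                 |  (chunkData[within + 3] & 0xFF);
        byte[] value = new byte[size];
        System.arraycopy(chunkData, within + 4, value, 0, size);
        return value;
    }
}

So the cost per GET is one seek plus inflating at most 64k, not the
whole file.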
And I think it would be very useful: if you can spend some CPU to
compress/decompress data, you effectively get a bigger fleet for the
added CPU overhead. But it needs balancing; it depends on how well the
data compresses, the size of entries, and so on.
> Suffice it to say I don't know much about compression, but somehow I
> suspect the read-only nature of the data means there is a way to
> compress it even more.
I think that is the key, indeed. When you start modifying compressed
data, things get complicated and slow.
For read-only (or read-by-key, bulk-update) cases compression works
much better.
-+ Tatu +-