compression algorithms

Mateusz Berezecki

Mar 14, 2009, 7:02:35 PM

to hyperta...@googlegroups.com

I've been trying to figure out what kind of compression algorithm BMZ
is, without success. Could someone please give me some references or
pointers to literature (online is fine) that explains the BMZ
algorithm?

My second question is whether LZMA was considered for compression.
What were the original criteria for selecting the supported
compression algorithms?

Mateusz

Luke

Mar 14, 2009, 10:23:26 PM

to Hypertable Development

Have you seen the comment at the top of bmz.c?

/**
 * An effective/efficient block compressor for input containing long common
 * strings (e.g. web pages from a website)
 *
 * cf. Bentley & McIlroy, "Data Compression Using Long Common Strings", 1999
 * cf. BMDiff & Zippy mentioned in the Bigtable paper
 */

The B&M paper is available online if you search for it. By default,
BMZ is essentially the BM algorithm plus LZO, but the library is
flexible enough to allow other combinations.
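
For readers who want a feel for the BM step, here is a minimal sketch of the idea from the Bentley & McIlroy paper: fingerprint every block-aligned chunk of the input, slide a rolling fingerprint over every position, and when a window matches a stored block, verify it and extend the match forward. This is not the code in bmz.c. The names (bm_match, bm_first_match), the 8-byte block, the 4096-slot table and the hash base are all assumptions chosen for illustration, and the sketch only reports the first long repeat instead of producing an encoded stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative parameters only; not taken from bmz.c. */
enum { BLOCK = 8, TABLE_BITS = 12, TABLE_SIZE = 1 << TABLE_BITS };
#define HASH_BASE 257u

/* Hypothetical result type: buf[dst..dst+len) repeats buf[src..src+len). */
typedef struct { size_t src, dst, len; } bm_match;

/* Polynomial fingerprint of the BLOCK bytes at p (wraps modulo 2^32). */
static uint32_t fingerprint(const unsigned char *p) {
    uint32_t h = 0;
    for (size_t i = 0; i < BLOCK; i++) h = h * HASH_BASE + p[i];
    return h;
}

/* Return 1 and fill *m with the first long repeat found, else return 0. */
static int bm_first_match(const unsigned char *buf, size_t n, bm_match *m) {
    size_t table[TABLE_SIZE];      /* block start + 1; 0 means empty slot */
    uint32_t top = 1;              /* HASH_BASE^(BLOCK-1), for rolling */
    if (n < 2 * BLOCK) return 0;
    memset(table, 0, sizeof table);
    for (size_t i = 1; i < BLOCK; i++) top *= HASH_BASE;

    uint32_t h = fingerprint(buf);
    for (size_t i = 0; ; i++) {
        /* Invariant: h is the fingerprint of buf[i..i+BLOCK). */
        size_t slot = h & (TABLE_SIZE - 1);
        size_t cand = table[slot];
        if (cand != 0) {
            size_t src = cand - 1;
            /* Verify the candidate: different windows can share a slot. */
            if (src + BLOCK <= i && memcmp(buf + src, buf + i, BLOCK) == 0) {
                size_t len = BLOCK;
                while (i + len < n && buf[src + len] == buf[i + len]) len++;
                m->src = src; m->dst = i; m->len = len;
                return 1;
            }
        }
        /* Only block-aligned windows are stored, which keeps the table small. */
        if (i % BLOCK == 0) table[slot] = i + 1;
        if (i + BLOCK >= n) break;
        /* Roll the window: drop buf[i], append buf[i + BLOCK]. */
        h = (h - buf[i] * top) * HASH_BASE + buf[i + BLOCK];
    }
    return 0;
}
```

Calling bm_first_match on a buffer that contains the same 16-byte string twice should report the second copy as a reference to the first. A real encoder would emit that (src, len) reference, continue scanning after the match, and hand the shortened output to a second-stage compressor such as LZO.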

The main criterion is encode/decode throughput on typical commit log
and cellstore blocks (the default compressed block size is 64KB, which
is about 100-200KB raw). LZMA and bzip2 are considered too slow: LZMA
is much slower than bzip2, which is much slower than gzip, which in
turn is much slower than bmz and lzo. Their compression advantage is
also not that big for relatively small blocks, since both LZMA and
bzip2 benefit from large (many MBs) buffers. Of course, you're welcome
to experiment with other compression options (I hope our
BlockCompressionCodec API is easy enough for you to extend :)

My BM implementation is experimental (though it seems stable enough in
random tests) and hardly tuned, apart from avoiding modulo in the
Rabin-Karp hash table lookups. I think profiling and tuning would make
it a lot faster (it's already about 4-5x faster than gzip on various
inputs).
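
As an aside on the modulo point: one common way to avoid an integer division in a hash table lookup is to make the table size a power of two and mask the fingerprint. The sketch below shows that general trick only. Whether bmz.c does exactly this is an assumption on my part, and slot_mod, slot_mask and the 2^16 table size are hypothetical names and values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table size; the only requirement is that it is a power of two. */
#define SLOT_BITS  16u
#define SLOT_COUNT (1u << SLOT_BITS)
#define SLOT_MASK  (SLOT_COUNT - 1u)

/* Straightforward version: one integer division per lookup. */
static inline uint32_t slot_mod(uint32_t h) { return h % SLOT_COUNT; }

/* Masked version: for a power-of-two table this gives the same slot
 * with a single bitwise AND. */
static inline uint32_t slot_mask(uint32_t h) { return h & SLOT_MASK; }
```

The two functions agree for every 32-bit fingerprint as long as the table size is a power of two. The trade-off is that masking keeps only the low bits of the hash, so the fingerprint function has to mix its input well into those bits.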

__Luke
