Did you see the comments at the top of bmz.c?
/**
* An effective/efficient block compressor for input containing long common
* strings (e.g. web pages from a website)
*
* cf. Bentley & McIlroy, "Data Compression Using Long Common Strings", 1999
* cf. BMDiff & Zippy mentioned in the Bigtable paper
*/
The B&M paper is available online if you search for it. BMZ by default
is essentially the BM algorithm plus LZO, but the library is flexible
enough to allow other combinations.
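To illustrate what "BM plus something else" means structurally, here is a
minimal sketch of chaining two compression stages through a scratch buffer.
Everything in it (stage_fn, run_two_stages, copy_stage) is a hypothetical
name made up for this example and is not the actual bmz API; copy_stage is
only a placeholder standing in for a real codec.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stage signature (NOT the real bmz API): writes at most
 * out_cap bytes to out and returns the number written, or 0 on failure. */
typedef size_t (*stage_fn)(const unsigned char *in, size_t in_len,
                           unsigned char *out, size_t out_cap);

/* Run `first` (e.g. a long-range BM pass) and feed its output to `second`
 * (e.g. a short-range LZ pass). `scratch` holds the intermediate result. */
static size_t run_two_stages(stage_fn first, stage_fn second,
                             const unsigned char *in, size_t in_len,
                             unsigned char *scratch, size_t scratch_cap,
                             unsigned char *out, size_t out_cap)
{
    size_t mid_len = first(in, in_len, scratch, scratch_cap);
    if (mid_len == 0)
        return 0;                       /* first stage failed */
    return second(scratch, mid_len, out, out_cap);
}

/* Placeholder stage that just copies its input; stands in for a codec. */
static size_t copy_stage(const unsigned char *in, size_t in_len,
                         unsigned char *out, size_t out_cap)
{
    if (in_len > out_cap)
        return 0;                       /* output buffer too small */
    memcpy(out, in, in_len);
    return in_len;
}
```

Swapping the second stage for a different codec then only means passing a
different function pointer. Note that in this sketch a return value of 0
means failure, so empty input is not distinguishable from an error.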
The main criterion is encode/decode throughput on typical commit log
and cellstore blocks (default compressed block size is 64KB, about
100-200KB raw size). LZMA and bzip2 are considered too slow (LZMA is
much slower than bzip2, which is much slower than gzip, which is much
slower than bmz and lzo), and their compression advantage is not that
big for relatively small blocks, since both LZMA and bzip2 benefit from
large (many MBs) buffers. Of course, you're welcome to experiment
with other compression options (I hope our BlockCompressionCodec API
is easy enough for you to extend :)
My BM implementation is experimental in nature (though it seems stable
enough in random tests) and hardly tuned (except for avoiding modulo in
Rabin-Karp hash table lookups). I think profiling and tuning would make
it a lot faster (it's already about 4-5x faster than gzip on various
inputs).
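For readers wondering what "avoiding modulo" can look like, here is a
minimal sketch of one common way to do it: let the rolling hash wrap
naturally in 32-bit unsigned arithmetic and size the table as a power of
two, so the slot is a bit mask instead of a division. This is an
assumption about the general technique, not a copy of what bmz.c does; the
window size, base, table size, and all rk_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW     4u                   /* hypothetical window size in bytes */
#define BASE       257u                 /* hypothetical polynomial base */
#define TABLE_BITS 10u
#define TABLE_SIZE (1u << TABLE_BITS)   /* power of two on purpose */
#define TABLE_MASK (TABLE_SIZE - 1u)

static_assert((TABLE_SIZE & TABLE_MASK) == 0,
              "table size must be a power of two");

/* Hash of the WINDOW bytes at p; uint32_t wraps modulo 2^32 for free. */
static uint32_t rk_hash(const unsigned char *p)
{
    uint32_t h = 0;
    for (size_t i = 0; i < WINDOW; i++)
        h = h * BASE + p[i];
    return h;
}

/* BASE^(WINDOW-1), the weight of the byte that leaves the window. */
static uint32_t rk_top_power(void)
{
    uint32_t pw = 1;
    for (size_t i = 1; i < WINDOW; i++)
        pw *= BASE;
    return pw;
}

/* Slide the window by one byte: drop `out`, append `in`. */
static uint32_t rk_roll(uint32_t h, unsigned char out, unsigned char in,
                        uint32_t top)
{
    return (h - out * top) * BASE + in;
}

/* Table slot via a mask; equals h % TABLE_SIZE without the division. */
static uint32_t rk_slot(uint32_t h)
{
    return h & TABLE_MASK;
}
```

The trade-off is that the table size is restricted to powers of two and
the low bits of the hash must be reasonably well mixed; a prime modulus
avoids that concern but costs a division per lookup.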
__Luke