Protocol Buffers using Lzip

Jacob Rief

Dec 8, 2009, 7:17:29 AM
to Brian Olson, Kenton Varda, Protocol Buffers
Hello Brian, hello Kenton, hello list,
as an alternative to GzipInputStream and GzipOutputStream I have
written a compression and a decompression stream class that are
stackable onto Protocol Buffers streams. They are named LzipInputStream
and LzipOutputStream and use the Lempel-Ziv-Markov chain algorithm, as
implemented by LZIP:
http://www.nongnu.org/lzip/lzip.html
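
A minimal usage sketch (this assumes LzipIn/OutputStream mirror the
GzipIn/OutputStream interface, i.e. they wrap any ZeroCopy stream, and
"MyMessage" stands in for whatever generated message type you use):

#include <google/protobuf/io/zero_copy_stream_impl.h>
#include "lzip_stream.h"
#include "my_message.pb.h"   // placeholder generated header

using namespace google::protobuf::io;

void WriteCompressed(const MyMessage& msg, int fd) {
  FileOutputStream raw(fd);
  LzipOutputStream lzip(&raw);   // stacked on top of the raw stream
  msg.SerializeToZeroCopyStream(&lzip);
}  // assuming the destructors flush, as with GzipOutputStream

bool ReadCompressed(MyMessage* msg, int fd) {
  FileInputStream raw(fd);
  LzipInputStream lzip(&raw);
  return msg->ParseFromZeroCopyStream(&lzip);
}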

One advantage of using Lzip instead of Gzip is that Lzip supports
multi-member compression: one can jump into the stream at any position,
skip forward to the next synchronization boundary and start reading
from there.
At the default compression level, Lzip has a better compression ratio
at the cost of being slower than Gzip, but when Lzip is used with a low
compression level, its speed and output size are comparable to those of
Gzip.

I would like to donate these classes to the ProtoBuf software
repository. They will be released under an OSS license compatible with
both LZIP's and Google's. Could someone please review them and tell me
in what kind of repository I can publish them? Google's license
agreement contains a passage stating: "Neither the name of Google Inc.
nor the names of its contributors may be used to endorse or promote
products derived from this software without specific prior written
permission." Since I have to use the name "google" in the C++ namespace
of LzipIn/OutputStream, I hereby ask for permission to do so.

Comments are appreciated,
Jacob
lzip_stream.cc
lzip_stream.h

Kenton Varda

Dec 9, 2009, 3:21:43 PM
to Jacob Rief, Brian Olson, Protocol Buffers
Thanks for writing this code!  I'm sure someone will find it useful.

That said, I'm wary of adding random stuff to the protobuf library.  gzip made sense because practically everyone has zlib, but lzlib is less universal.  Also, if we have lzip support built-in, should we also support bzip, rar, etc.?

Also note that libprotobuf is already much larger (in terms of binary footprint) than it should be, so making it bigger is a tricky proposition.

And finally, on a non-technical note, adding any code to the protobuf distribution puts maintenance work on me, and I'm overloaded as it is.

Usually I recommend that people set up a googlecode project to host code like this, but lzip_stream may be a bit small to warrant that.  Maybe a project whose goal is to provide protobuf adaptors for many different compression formats?  Or a project for hosting random protobuf-related utility code in general?

Christopher Smith

Dec 10, 2009, 12:06:16 PM
to Kenton Varda, Jacob Rief, Brian Olson, Protocol Buffers
One compression algo that I thought would be particularly useful with PB's would be LZO. It lines up nicely with PB's goals of being fast and compact. Have you thought about allowing an integrated LZO stream?

--Chris


Jacob Rief

Dec 11, 2009, 5:20:53 PM
to Christopher Smith, Kenton Varda, Brian Olson, Protocol Buffers
Hello Chris,

2009/12/10 Christopher Smith <cbs...@gmail.com>:
> One compression algo that I thought would be particularly useful with PB's
> would be LZO. It lines up nicely with PB's goals of being fast and compact.
> Have you thought about allowing an integrated LZO stream?
>
> --Chris

My goal is to compress huge amounts (>5GB) of small serialized chunks
(~150...500 bytes) into a single stream, while still being able to
randomly access each part of it without having to decompress the whole
stream. GzipOutputStream (with level 5) reduces the size to about 40%
of the uncompressed binary stream, whereas my LzipOutputStream (with
level 5) reduces it to about 20%. The difficulty with gzip is finding
synchronization boundaries in the stream during decompression.
If your aim is to exchange small messages, say via RPC, then a fast but
less efficient algorithm is the right choice. If, however, you want to
store huge amounts of data permanently, your requirements may be
different.
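
(To make that write path concrete, here is a sketch, not the actual
lzip_stream code: many small records, length-prefixed, funnelled
through one compressed ZeroCopy stream. The stock GzipOutputStream is
used for concreteness; LzipOutputStream would be dropped in the same
way, and "Record" stands in for whatever message type is stored.)

#include <vector>
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/gzip_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>

using namespace google::protobuf::io;

void AppendRecords(const std::vector<Record>& records, int fd) {
  FileOutputStream raw(fd);
  GzipOutputStream::Options options;
  options.compression_level = 5;   // level 5, as in the numbers above
  GzipOutputStream gzip(&raw, options);
  CodedOutputStream coded(&gzip);
  for (size_t i = 0; i < records.size(); ++i) {
    coded.WriteVarint32(records[i].ByteSize());    // length prefix
    records[i].SerializeWithCachedSizes(&coded);   // ~150...500 byte payload
  }
}  // CodedOutputStream, gzip and raw flush in that order on destruction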

In my opinion, generic streaming classes such as
ZeroCopyIn/OutputStream should offer different compression algorithms
for different purposes. LZO has advantages when used for communication
of small to medium-sized chunks of data. LZMA, on the other hand, has
advantages if you have to store lots of data for the long term. GZIP is
somewhere in the middle. Unfortunately, Kenton has a different opinion
about adding too many compression streaming classes.

Today I studied the API of LZO. From what I have seen, I think one
could implement two LzoIn/OutputStream classes. LZO compression,
however, has a small drawback; let me explain why. The LZO API is not
intended to be used for streams: it always compresses and decompresses
a whole block. This behaviour differs from gzip and lzip, which are
designed to compress streams. A compression class has a fixed-size
buffer of typically 8 or 64kB. When this buffer is filled with data,
lzip and gzip digest the input and you can start filling the buffer
from the beginning again. The LZO compressor, on the other hand, has to
compress the whole buffer in one step. The next block then has to be
concatenated with the already compressed data, which means that during
decompression you have to pick these chunks apart again.
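
(For reference, a minimal sketch of LZO's block-oriented API; the
buffer sizing follows the usual LZO worst-case rule and is an
assumption here, not part of lzip_stream:)

#include <vector>
#include <lzo/lzo1x.h>

// Compress one buffer in a single step.  The caller must record
// out->size() somewhere (e.g. as a length prefix) so the decompressor
// can split the concatenated blocks apart again.
bool CompressBlock(const unsigned char* in, lzo_uint in_len,
                   std::vector<unsigned char>* out) {
  if (lzo_init() != LZO_E_OK) return false;
  std::vector<unsigned char> wrkmem(LZO1X_1_MEM_COMPRESS);
  out->resize(in_len + in_len / 16 + 64 + 3);   // worst-case expansion
  lzo_uint out_len = out->size();
  if (lzo1x_1_compress(in, in_len, &(*out)[0], &out_len, &wrkmem[0]) != LZO_E_OK)
    return false;
  out->resize(out_len);
  return true;
}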

If your intention is to compress chunks of data of, say, less than 64kB
each and then put them on the wire, then LZO is the right solution for
you. For my requirements, as you will now understand, LZO does not
really fit well.
If there is a strong interest in an alternative Protocol Buffer
compression stream, don't hesitate to contact me.

Jacob

Daniel Wright

Dec 11, 2009, 5:55:47 PM
to Jacob Rief, Christopher Smith, Kenton Varda, Brian Olson, Protocol Buffers
As I'm sure you can imagine, we store a lot of data in protocol buffer format at Google, so we often want to store very large files with many serialized protocol buffers.  The technique we use is to batch a bunch of records together, compress them and write the compressed block to the file (with checksumming and positioning markers).  Then if you want to read a specific record, you need to decompress one block.  That doesn't take much longer than the disk seek, so it's not a problem unless you have huge blocks.
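
(A rough sketch of that kind of block framing, not Google's actual
code: records are length-prefixed into a batch, the batch is compressed
as one block with zlib, and the block is written with a made-up sync
marker, its compressed length, and a CRC32, so a reader can seek to a
marker and decompress just that block.)

#include <stdint.h>
#include <cstdio>
#include <string>
#include <vector>
#include <zlib.h>

static const char kSyncMarker[8] = "BLKSYNC";  // hypothetical marker

void WriteBlock(const std::vector<std::string>& records, FILE* out) {
  // 1. Concatenate length-prefixed records into one uncompressed batch.
  std::string batch;
  for (size_t i = 0; i < records.size(); ++i) {
    uint32_t len = records[i].size();
    batch.append(reinterpret_cast<const char*>(&len), sizeof(len));
    batch.append(records[i]);
  }
  // 2. Compress the whole batch as one block.
  uLongf comp_len = compressBound(batch.size());
  std::vector<Bytef> comp(comp_len);
  if (compress(&comp[0], &comp_len,
               reinterpret_cast<const Bytef*>(batch.data()),
               batch.size()) != Z_OK) return;
  // 3. Frame it: sync marker + compressed length + CRC32 + payload.
  uint32_t crc = crc32(0L, &comp[0], comp_len);
  uint32_t len32 = comp_len;
  fwrite(kSyncMarker, 1, sizeof(kSyncMarker), out);
  fwrite(&len32, sizeof(len32), 1, out);
  fwrite(&crc, sizeof(crc), 1, out);
  fwrite(&comp[0], 1, comp_len, out);
}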

The code we use to do this and the file format is honestly a bit of a mess (it's grown slowly over many years), so it's not suitable to be open-sourced.  It certainly makes sense to have an open source library to do this, and it sounds similar to what your code is aiming at.  But I agree with Kenton that it should not be part of the Protocol Buffer library -- it should be a separate project.  It doesn't even need to be directly connected to Protocol Buffers -- you can use the same format for any kind of record.

Daniel


