After investigating gzip as a possible compression algorithm for our
messages, we want to try LZO. There's a Java implementation of LZO
being used by the Hadoop community. Gzip has around 120 ms of latency
for medium-sized messages, but we occasionally see messages that reach
9 MB, and gzip's performance on those is poor: around 4 seconds on the
machines we experimented on. That's unacceptable. LZO does better in
unit tests so far, but I haven't run a live experiment yet.
I'm curious whether anyone else has been experimenting with this, and
I'd also welcome criticism of how I'm using the compression library.
When I have a message to compress, I know what size of byte array
stream buffer to allocate, and then I call writeTo on it. Given a
message, is there anything I should do beyond that? writeTo should be
pretty performant, yes? When I measure its speed in unit tests, it is
pretty good. I would like to experiment with ways to tweak things as
much as possible. For decompression, the current idea is to make a
good enough guess at the typical compression ratio for the flavor of
data we are passing back and forth, and use that as a heuristic for
how much to allocate at the start.
It seems fairly obvious to do things the way I am doing them, but I do
not like to assume that I know what I'm doing at this point.
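A stdlib-only sketch of the sizing described above (the protobuf calls,
message.getSerializedSize() and message.writeTo(stream), are only noted in
comments; the 3.0 ratio below is an illustrative placeholder, not a measured
number):

```java
import java.io.ByteArrayOutputStream;

class BufferSizing {
    // Compression side: protobuf reports the exact serialized size up front
    // via message.getSerializedSize(), so the stream buffer can be allocated
    // once at the right size before message.writeTo(stream) runs.
    static ByteArrayOutputStream exactBuffer(int serializedSize) {
        return new ByteArrayOutputStream(serializedSize);
    }

    // Decompression side: guess the output size from the compressed length
    // and a typical compression ratio for this flavor of data, so the
    // buffer rarely needs to grow while decompressing.
    static int estimateDecompressedSize(int compressedSize, double typicalRatio) {
        return (int) (compressedSize * typicalRatio);
    }
}
```

If the guess is low, a ByteArrayOutputStream still grows on demand; the
estimate only saves the intermediate reallocations.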
--
sheila
I don't quite understand what you are doing. Are you allocating a
ByteArrayOutputStream, writing the message to it, then passing the
byte[] from the ByteArrayOutputStream to some LZO library? If that is
what you want, you could just call message.toByteArray(), which will
be faster.
I haven't tested this carefully, but my experience is that if you want
the absolute best performance while using the Java API:
* If you are writing to an OutputStream, you want to re-use a single
CodedOutputStream. It has an internal buffer, and allocating this
buffer multiple times seems to slow things down. You probably want
this option if you are writing many messages. It's typically pretty
easy to provide your own implementation of OutputStream if you need to
pass data to something else (e.g. LZO).
* If you have a byte[] array that is big enough, pass it in to
CodedOutputStream.newInstance() to avoid an extra copy.
* If you just want a byte[] array that is exactly the right size, just
call message.toByteArray().
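The "provide your own implementation of OutputStream" point can be sketched
like this; the class name and the byte counting are hypothetical, and the
ByteArrayOutputStream delegate stands in for whatever stream the LZO library
actually exposes:

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical pass-through stream: forwards everything it receives to a
// delegate (e.g. an LZO compressor stream) while counting the bytes written.
class ForwardingOutputStream extends OutputStream {
    private final OutputStream delegate;
    long bytesWritten = 0;

    ForwardingOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override public void write(int b) throws IOException {
        delegate.write(b);
        bytesWritten++;
    }

    @Override public void write(byte[] b, int off, int len) throws IOException {
        delegate.write(b, off, len); // bulk writes avoid per-byte overhead
        bytesWritten += len;
    }

    @Override public void flush() throws IOException { delegate.flush(); }
    @Override public void close() throws IOException { delegate.close(); }
}
```

An instance of this could then be handed to CodedOutputStream.newInstance(...)
and that CodedOutputStream re-used across messages, per the first bullet.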
Does the LZO library have an OutputStream API? This would allow you to
compress large protobuf messages as they are written out, rather than
needing to serialize the entire thing to a byte[] array, then compress
it. This could be "better," but as always you'll have to measure it.
Hope this helps,
Evan
--
Evan Jones
http://evanjones.ca/
I've got a servlet filter which wraps the HttpServletResponse. The
servlet response's output stream, wrapped in a stream from the LZO
library, compresses the data as it is written.
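That wrapping pattern can be sketched with JDK streams alone; GZIPOutputStream
stands in for the LZO stream here, since the exact hadoop-lzo class names
aren't shown in this thread:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressingWrapper {
    // Wrap the raw response stream in a compressing stream; everything
    // written through the wrapper is compressed on the fly.
    static OutputStream wrap(OutputStream raw) throws IOException {
        return new GZIPOutputStream(raw); // stand-in for the LZO stream
    }

    // Demonstration of the full path: compress on write, decompress on read.
    static byte[] roundTrip(byte[] payload) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        OutputStream out = wrap(raw);
        out.write(payload);  // in the servlet, message.writeTo(out) goes here
        out.close();         // finishes the compressed stream

        InputStream in =
            new GZIPInputStream(new ByteArrayInputStream(raw.toByteArray()));
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            result.write(buf, 0, n);
        }
        return result.toByteArray();
    }
}
```

In the filter, `raw` would be the HttpServletResponse's output stream and the
wrapped stream would replace it for the rest of the chain.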
> I haven't tested this carefully, but my experience is that if you want the
> absolute best performance while using the Java API:
>
> [helpful info]
> Does the LZO library have an OutputStream API? This would allow you to
Yes.
For the curious, I'm using code from http://github.com/kevinweil/hadoop-lzo
> Hope this helps,
>
> Evan
Thank you
--
sheila
Ah, so the best case is probably message.writeTo(servletOutputStream).
If you are writing multiple messages, you'll probably want to
explicitly create a single CodedOutputStream to write all of them.
If you experiment with this and find something different, I would be
interested to know.
Wow! Glad to hear this helped so much.
If you have a sequence of messages, you could try using a single
CodedOutputStream. Something like:
CodedOutputStream out = CodedOutputStream.newInstance(compressionStream);
for (Message msg : messages) {
    msg.writeTo(out);
}
out.flush();
This should be slightly faster than using:
msg.writeTo(compressionStream);
because it avoids re-allocating the CodedOutputStream (and its internal
buffer). It should be quite a bit better for small messages.
> Now I'm trying to figure out how I can speed up the decompression on
> the receiving side.
>
> What I have right now is:
> * Take the CompressionInputStream, convert it into a byte[]
> * Take the resulting byte[] and do .parseFrom(byte[])
>
> This seems to be a faster route than just
> doing .parseFrom(CompressionInputStream).
Interesting. The only reason I can think of that might make the byte[]
version faster is that maybe you use one big read, while
.parseFrom(InputStream) defaults to 4096-byte reads. You could try
editing the source to make BUFFER_SIZE in CodedInputStream bigger, if
you care.
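The read-everything-first approach from the quoted text can be sketched with
the stdlib; the final parse step needs the protobuf library, so it is only
noted after the block (the message type name there is hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamDrainer {
    // Drain the whole (decompressing) InputStream into one byte[] using a
    // large buffer, instead of letting the parser read 4096 bytes at a time.
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(64 * 1024);
        byte[] buf = new byte[64 * 1024]; // much bigger than the 4 KB default
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

The drained bytes would then go to something like
`MyMessage.parseFrom(drainedBytes)`, parsing from a single in-memory buffer.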
The only other thing I can think of: if you are reading a sequence of
many messages, you can again re-use a single CodedInputStream, although
this requires some work. Again, this will help for small messages but
probably not for large ones, and it is trickier than re-using a single
CodedOutputStream. If you are interested, I can send the details of
what I have used. Although, to be honest, I haven't tested it carefully
to see whether it is *actually* faster than doing the simple thing with
.parseFrom() and friends.