[protobuf] serialization size from 2.0.x to 2.3.x, also message design best practices

sheila miguez

Apr 27, 2010, 2:08:38 PM
to Protocol Buffers
Should I see smaller serialization sizes going from 2.0.x to 2.3? I
was hoping to, and I compiled a sample message to compare
serialization sizes between versions. The size was the same. The
sample message has a number of different data types.

I notice in the changelog that string serialization performance is
improved in 2.3.1. Any idea when it will be released? And, what sort
of improvements have been observed?

Some of our payloads are large, e.g. 3 MB, and some of the messages
are composed mainly of strings. Improvements in serialization size
would be nice. I'm also thinking we should determine best practices
for message designs that would lead to optimal packing. Will the
compiler mostly take care of that, or is it still worth working this
out myself? Have people already published best practices for message
design?

thanks

--
sheila

Kenton Varda

Apr 27, 2010, 3:04:12 PM
to sheila miguez, Protocol Buffers
No revision of protobufs is ever likely to change the serialized size of existing messages, because doing so would presumably break backwards compatibility.  A revision might introduce a new encoding mechanism that is more compact (like packed encoding did), but this is unusual, since there is not much room for improvement in the existing encoding.
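
To illustrate why packed encoding is more compact, here is a minimal Python sketch that hand-encodes a repeated integer field both ways, following the documented wire format. The field number 4 and the sample values are assumptions chosen for illustration; real code would use the generated protobuf classes rather than these hypothetical helpers.

```python
from typing import List


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, written as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_unpacked(field_number: int, values: List[int]) -> bytes:
    """Non-packed repeated field: every element carries its own key (wire type 0)."""
    return b"".join(encode_key(field_number, 0) + encode_varint(v) for v in values)


def encode_packed(field_number: int, values: List[int]) -> bytes:
    """Packed repeated field: one key (wire type 2), one length, then all values."""
    body = b"".join(encode_varint(v) for v in values)
    return encode_key(field_number, 2) + encode_varint(len(body)) + body


values = list(range(100))  # hypothetical sample data, each value fits in one byte
unpacked = encode_unpacked(4, values)
packed = encode_packed(4, values)
print(len(unpacked), len(packed))  # 200 bytes unpacked vs. 102 bytes packed
```

The saving comes only from dropping the per-element key, which is why it helps repeated numeric fields and does nothing for strings.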

The "optimizations" mentioned in the changelog are CPU speed or memory usage optimizations, not encoded size optimizations.

Note that protobufs only encode structure.  They do not do any compression.  You should apply compression separately on top of your data if you need it.  Note that this will add considerable CPU cost, so you must decide if it's a trade-off you want to make.
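
As a rough sketch of applying compression on top of the serialized bytes, the following Python example wraps an already serialized payload with the standard-library `zlib` module and measures both the size ratio and the time spent. The `serialized` value is a stand-in assumption for whatever bytes your message's serialize call returns; the function names are hypothetical.

```python
import time
import zlib


def compress_payload(serialized: bytes, level: int = 6) -> bytes:
    """Compress already-serialized message bytes; level trades CPU for size."""
    return zlib.compress(serialized, level)


def decompress_payload(compressed: bytes) -> bytes:
    """Recover the serialized bytes; parse them as a message afterwards."""
    return zlib.decompress(compressed)


# Stand-in for the bytes produced by serializing a real message (assumption).
serialized = ("customer-record;" * 50_000).encode("utf-8")

start = time.perf_counter()
compressed = compress_payload(serialized)
elapsed_ms = (time.perf_counter() - start) * 1000.0

# Ratio here means compressed size divided by original size.
ratio = len(compressed) / len(serialized)
print(f"{len(serialized)} -> {len(compressed)} bytes, "
      f"ratio {ratio:.3f}, {elapsed_ms:.1f} ms")
```

The timing line is what makes the CPU trade-off visible; numbers from a synthetic payload like this one say little about real messages, so measure with your own data.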

sheila miguez

Apr 27, 2010, 3:38:58 PM
to Kenton Varda, Protocol Buffers
On Tue, Apr 27, 2010 at 2:04 PM, Kenton Varda <ken...@google.com> wrote:

> Note that protobufs only encode structure.  They do not do any compression.
>  You should apply compression separately on top of your data if you need it.
>  Note that this will add considerable CPU cost, so you must decide if it's a
> trade-off you want to make.

As it turns out, I've been collecting metrics on compression latency
and compression ratios for our messages to decide whether it's worth
it.

In Tomcat, I've set compressionMinSize to around 100K. With gzip I get
a mean latency of around 140 ms and a compression ratio of around 0.18.
The variance is pretty big, though. I wasn't expecting a good
compression ratio for protobuf messages since they are decently packed
already, but was happy to see that result.

Anyway, I don't like the latency. On the other hand, for sufficiently
large payloads, compression seems to save more time across network
hops than it costs. I'm still tweaking variables and gathering data.
If I discover anything useful that would help anyone else here, I will
follow up.
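
The trade-off described above can be sketched as a back-of-the-envelope model in Python. Everything here is an assumption made for illustration: the 0.18 ratio is read as compressed size divided by original size, the 140 ms is treated as a fixed cost regardless of payload size, decompression time is ignored, and the 1 MB/s bandwidth is a hypothetical figure that does not come from the thread.

```python
def compression_pays_off(size_bytes: float, bandwidth_bytes_per_s: float,
                         compress_seconds: float, ratio: float) -> bool:
    """True if compressing first gets the payload across the wire sooner."""
    plain = size_bytes / bandwidth_bytes_per_s
    compressed = compress_seconds + ratio * size_bytes / bandwidth_bytes_per_s
    return compressed < plain


def break_even_bytes(bandwidth_bytes_per_s: float, compress_seconds: float,
                     ratio: float) -> float:
    """Payload size above which compression wins under this simple model."""
    return compress_seconds * bandwidth_bytes_per_s / (1.0 - ratio)


BANDWIDTH = 1_000_000.0   # hypothetical link speed in bytes per second
COMPRESS_SECONDS = 0.140  # mean gzip latency quoted above, assumed constant
RATIO = 0.18              # assumed to mean compressed / original

print(compression_pays_off(3_000_000, BANDWIDTH, COMPRESS_SECONDS, RATIO))  # True
print(compression_pays_off(100_000, BANDWIDTH, COMPRESS_SECONDS, RATIO))    # False
print(round(break_even_bytes(BANDWIDTH, COMPRESS_SECONDS, RATIO)))          # 170732
```

On a faster link the break-even size grows in proportion, which is one reason a threshold such as compressionMinSize needs tuning per deployment.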

I realize things could be improved via good API and model design
rather than by tweaking things like compression, but I don't
have the resources to redesign everything all at once (nor would I
want to).

Are there protobuf user groups, and if so, is there one in Chicago?

Evan Jones

Apr 27, 2010, 3:52:06 PM
to Protocol Buffers
On Apr 27, 2010, at 15:04 , Kenton Varda wrote:
> The "optimizations" mentioned in the changelog are CPU speed or
> memory usage optimizations, not encoded size optimizations.

Totally unrelated, but this reminds me that I think there may still be
one optimization possible with Java protocol buffers and strings. I'll
try to spend some time revisiting this at some point "soon."

Evan

--
Evan Jones
http://evanjones.ca/

Kenton Varda

Apr 27, 2010, 5:33:02 PM
to sheila miguez, Protocol Buffers
On Tue, Apr 27, 2010 at 12:38 PM, sheila miguez <she...@pobox.com> wrote:
> I wasn't expecting a good compression
> ratio for protobuf messages since they are decently packed already,
> but was happy to see that result.

Yep, Protobufs are a compact encoding, but compression can still work well depending on your data set.  For example, if you load your message with compressible strings or other repetitive data, the protobuf encoding itself is not going to actually compress them, so adding zlib on top will help.
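
To make this concrete, here is a small Python sketch that hand-encodes a string field the way the wire format stores it: a key, a length, and the raw UTF-8 bytes. The field number and the sample string are assumptions for illustration, and the helpers are hypothetical stand-ins for generated code. Repeating the same string 1000 times costs the full byte count 1000 times, and zlib then removes most of that redundancy.

```python
import zlib


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Length-delimited field (wire type 2): key, byte length, raw UTF-8."""
    data = text.encode("utf-8")
    key = encode_varint((field_number << 3) | 2)
    return key + encode_varint(len(data)) + data


value = "example-string-value"  # hypothetical repeated string
encoded = b"".join(encode_string_field(1, value) for _ in range(1000))

# The string bytes appear verbatim, once per occurrence: no deduplication.
occurrences = encoded.count(value.encode("utf-8"))
compressed = zlib.compress(encoded)
print(len(encoded), occurrences, len(compressed))
```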

Marc Gravell

Apr 27, 2010, 5:37:46 PM
to Kenton Varda, Protocol Buffers
In the case of repeated strings etc (excluding the "enum" case), I've been toying with whether something is possible by associating certain objects / values with unique identifiers on the wire. Potentially this would also allow "graph" (rather than "tree") serialization.

This is obviously well into the hazy area of "outside what protobuf offers, but possible to represent as valid protobuf fragments / messages", but I'd be interested in people's thoughts... worth investigating? Or silly?

Marc
--
Regards,

Marc

Adam Vartanian

Apr 27, 2010, 5:53:33 PM
to Marc Gravell, Kenton Varda, Protocol Buffers
> In the case of repeated strings etc (excluding the "enum" case), I've been
> toying with whether something is possible by associating certain objects / values
> with unique identifiers on the wire. Potentially this would also allow
> "graph" (rather than "tree") serialization.
> This is obviously well into the hazy area of "outside what protobuf offers,
> but possible to represent as valid protobuf fragments / messages", but I'd
> be interested in people's thoughts... worth investigating? Or silly?

We've done that before to save space. In the simplest case, you can
just put a repeated string field in your outermost message that's a
list of every unique string in the rest of the message, and then
anywhere else in the message that you would have put a string, you put
an int that says which string should go there.

There are downsides to that approach, though. The biggest one is that
you can't build your message up piece by piece, you have to build the
whole message object in one go, because all of them need to have input
into this one global data structure; similarly, you can't process the
message on the other side without carrying around the index everywhere
you're using any part of the message. It can save a lot of space if
you end up repeating the same strings over and over, though.

- Adam
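
The string-table approach described above can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the posters' actual code: `table` stands for a repeated string field in the outermost message, the integer rows stand for the int fields that replace strings elsewhere, and all names and sample records are hypothetical.

```python
from typing import Dict, List, Tuple


def intern_strings(records: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Replace every string with an index into one shared table of unique strings."""
    table: List[str] = []
    index_of: Dict[str, int] = {}
    encoded: List[List[int]] = []
    for record in records:
        row = []
        for text in record:
            if text not in index_of:
                index_of[text] = len(table)  # first sighting gets the next slot
                table.append(text)
            row.append(index_of[text])
        encoded.append(row)
    return table, encoded


def resolve_strings(table: List[str], encoded: List[List[int]]) -> List[List[str]]:
    """Reader side: the indices are meaningless without the table at hand."""
    return [[table[i] for i in row] for row in encoded]


records = [["red", "large"], ["red", "small"], ["blue", "large"]]
table, encoded = intern_strings(records)
print(table)    # ['red', 'large', 'small', 'blue']
print(encoded)  # [[0, 1], [0, 2], [3, 1]]
```

Both downsides show up directly in the sketch: `intern_strings` needs all records in one pass to build the shared table, and `resolve_strings` cannot interpret any row without it.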