Why are groups deprecated?


Jeremy Leader

unread,
Jul 10, 2008, 8:36:07 PM7/10/08
to Protocol Buffers
Is there a discussion anywhere of *why* groups are deprecated in favor
of embedded messages?

--
Jeremy Leader
jle...@oversee.net

Kenton Varda

unread,
Jul 10, 2008, 11:01:59 PM7/10/08
to Jeremy Leader, Protocol Buffers
On Thu, Jul 10, 2008 at 5:36 PM, Jeremy Leader <jle...@oversee.net> wrote:
Is there a discussion anywhere of *why* groups are deprecated in favor
of embedded messages?

Two reasons:
1) They're redundant, since you can always use nested messages instead (wire format details aside...).  The only reason they exist in the first place is that, at the time they were created, you could not embed one message inside another.  (Loooong time ago...)

2) They are syntactically awkward.  The syntax defines a type and a field in a single definition, with a single name.  Since a type and a field aren't allowed to have the same name, the field name is lower-cased to differentiate the two.  This is a really ugly hack, so we'd like to move away from it.  (Unfortunately, groups are still widely used internally, so we cannot simply stop implementing them.)
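
For readers unfamiliar with the old syntax, here is a minimal sketch of the two forms (message and field names here are illustrative, not from any real schema):

```proto
// Old group syntax: one definition introduces both a type ("Result")
// and a field (implicitly named "result", lower-cased) at tag 1.
message SearchResponseOld {
  repeated group Result = 1 {
    optional string url = 2;
  }
}

// The preferred equivalent: a nested message type plus an ordinary field.
message SearchResponseNew {
  message Result {
    optional string url = 2;
  }
  repeated Result result = 1;
}
```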

Marc Gravell

unread,
Jul 31, 2008, 4:02:25 AM7/31/08
to Protocol Buffers
I've been doing a bit of work on groups / nested sub-messages, and in
some cases there are distinct advantages to groups:

1: depending on the implementation, groups can yield a considerable
performance improvement, due to not having to calculate the length of
the inner message. For protobuf-net (which works on existing objects,
not the "builder" pattern that the Java/C++ etc. implementations use)
this more than doubles the speed; and as a trivial side benefit it
reduces the volume too, since a group terminator is [for low tag
numbers] a single byte, whereas a length could be 2, 3 bytes (or more
on a bad day). Of course, for high tag numbers this swings back the
other way... but in most cases it is likely that groups are part of
the "core" message, and so have a low tag.

2: it allows "firehose" streaming of data without having to buffer -
i.e. where you have a "read once" source of sub-messages (that might
be a few levels deep in the tree). With sub-messages you'd need to
consume the source to find the initial length, forcing you to buffer
all the actual messages - whereas with groups you can just spew them
out. For a "read once" source I'm referring to things like (in .NET)
IEnumerable<T> / IQueryable<T> (LINQ), which can [depending on the
use-case] be true object streams.
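
Point 1 above can be illustrated at the byte level. A small Python sketch (wire types per the published encoding spec; the payload bytes are made up for illustration):

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def tag(field_number: int, wire_type: int) -> bytes:
    """Encode a field key: (field_number << 3) | wire_type, as a varint."""
    return varint((field_number << 3) | wire_type)

# Wire types from the encoding spec
LEN, SGROUP, EGROUP = 2, 3, 4

payload = b"\x08\x96\x01"  # some already-encoded inner-message bytes

# Length-delimited: tag + length + payload (length must be known up front)
length_delimited = tag(1, LEN) + varint(len(payload)) + payload

# Group encoding: start tag + payload + end tag (no length needed)
group_encoded = tag(1, SGROUP) + payload + tag(1, EGROUP)
```

For a low field number both forms cost two overhead bytes on a small payload, but the length varint grows to 2, 3+ bytes once the payload reaches 128, 16384... bytes, while the end-group tag stays one byte.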

Some metrics for "1" are presented here: http://code.google.com/p/protobuf-net/

Basically, I'm just saying "don't write them off", even for new
usage... arguably the second point is more important than the first
(since either way it is still pretty quick), as it allows true
bufferless usage.

Marc

Marc Gravell

unread,
Jul 31, 2008, 4:04:08 AM7/31/08
to Protocol Buffers

Direct link to stats: http://code.google.com/p/protobuf-net/#Northwind

Jon Skeet

unread,
Jul 31, 2008, 5:20:45 AM7/31/08
to Protocol Buffers
On Jul 31, 9:04 am, Marc Gravell <marc.grav...@gmail.com> wrote:
> Direct link to stats: http://code.google.com/p/protobuf-net/#Northwind

Interesting. I should be able to produce similar stats soon - or you
could attempt to explore the ported API for yourself, of course :)
I think github has the latest code, and it should be enough to do this
sort of serialization/deserialization.

Jon

Marc Gravell

unread,
Jul 31, 2008, 5:54:40 AM7/31/08
to Protocol Buffers
> or you could attempt to explore the ported API for yourself,

I've been planning that as my next move... I just wanted to finish the
"groups" handling first (now committed, although I think I broke
the .proto emitter...), so probably later today ;-p

Marc

Torbjörn Gyllebring

unread,
Jul 31, 2008, 12:42:36 PM7/31/08
to Marc Gravell, Protocol Buffers

*dons helmet of robustness and wields hammer of parallel thinking*

Guessing wildly here, but here are some reasons for preferring string encoding and hence length-prefixed messages.

Robustness: knowing the length beforehand makes it possible to simply drop too-large sub-packets and detect malicious or malformed streams. Also, knowing the length beforehand gives some opportunities to tweak the submessage allocator or memory-pool usage. Furthermore, having a bounded submessage length could in theory enable parallel submessage parsing, making the parser more scalable, since each submessage could potentially be parsed in parallel; group notation makes this prohibitive, since finding the end of the submessage is basically as much work as parsing the complete lot. Filtering is also easier (or let's say faster) if the submessage length is known beforehand, since it enables us to simply skip un(wanted|needed) parts without ever hitting the parser logic.
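
The filtering/skipping asymmetry can be sketched in Python (helper names are made up; wire types per the encoding spec). Skipping a length-prefixed field is a single seek, while skipping a group means walking every nested tag:

```python
def read_varint(buf: bytes, pos: int):
    """Decode a varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def skip_length_delimited(buf: bytes, pos: int) -> int:
    """O(1) after the length is read: jump straight past the payload."""
    length, pos = read_varint(buf, pos)
    return pos + length

def skip_group(buf: bytes, pos: int, field_number: int) -> int:
    """O(n): must walk every nested field until the matching end-group tag."""
    while True:
        key, pos = read_varint(buf, pos)
        num, wire_type = key >> 3, key & 7
        if wire_type == 4 and num == field_number:  # END_GROUP
            return pos
        elif wire_type == 0:                        # varint
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:                        # fixed64
            pos += 8
        elif wire_type == 2:                        # length-delimited
            length, pos = read_varint(buf, pos)
            pos += length
        elif wire_type == 3:                        # nested START_GROUP
            pos = skip_group(buf, pos, num)
        elif wire_type == 5:                        # fixed32
            pos += 4
        else:
            raise ValueError("bad wire type")
```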

But yes, begin/end group makes streaming and "firehose" scenarios much more convenient.

Jeremy Leader

unread,
Jul 31, 2008, 1:32:44 PM7/31/08
to Torbjörn Gyllebring, Marc Gravell, Protocol Buffers
So it sounds like length-delimited (i.e. sub-messages) are more
efficient for reading, and tag-delimited (i.e. groups) are more
efficient for writing.

In general, it's probably better to optimize for reading (since a
message is written by one writer, but could be read by multiple recipients).

I could imagine cases where the writer is more resource-constrained than
the readers, so the implementer might prefer using a group instead of a
nested message. As Kenton mentioned, groups are syntactically
unattractive; perhaps it might be better if there were a field option
that could be applied to a sub-message field, indicating whether it
should be length-delimited or tag-delimited? I think it's possible to
make the parser smart enough that it could handle either format, so
changing the option from length-delimited to tag-delimited or vice-versa
would not break compatibility.

--
Jeremy Leader
jle...@oversee.net


Torbjörn Gyllebring

unread,
Jul 31, 2008, 1:52:34 PM7/31/08
to Jeremy Leader, Marc Gravell, Protocol Buffers

My impression is that groups are supported by every implementation used inside Google and by "all" (read: proto# and protobuf-net) alternative implementations in this thread. Basically, I think the worry here is the seriousness of "deprecated", since support is fairly trivial to add; maybe just putting the group/submessage choice into the hands of the user, with submessage as the default, is a pragmatic way to handle this.

Kenton Varda

unread,
Jul 31, 2008, 1:54:55 PM7/31/08
to Marc Gravell, Protocol Buffers
I'm surprised by your results.  I don't see how computing sizes ahead of time can be slower than actually serializing the data -- it should be much faster, because embedded strings do not need to be scanned to determine their size.

Also, note that if you are serializing to a byte array -- which is the usual case in our usage -- then it helps to allocate an array that is exactly the right size ahead of time, which means you have to compute all the sizes anyway.

Torbjörn Gyllebring

unread,
Jul 31, 2008, 2:03:59 PM7/31/08
to Kenton Varda, Marc Gravell, Protocol Buffers

The results are probably a by-product of using different approaches to building the message to be serialized. Marc's and my implementations both take any class lying around annotated with the proper attributes and serialize it to a byte stream. From what I've heard, the Google implementations use builder classes or generated classes that compute the size of the resulting message as a by-product of initialization/building them, basically amortizing the cost of length calculation at build time. Currently proto# trades memory efficiency for implementation ease by writing nested messages into new buffers and then copying them to their final destination, and I believe protobuf-net does on-the-fly length calculation; both options benefit from group encoding.

Kenton Varda

unread,
Jul 31, 2008, 2:13:19 PM7/31/08
to Torbjörn Gyllebring, Marc Gravell, Protocol Buffers
I see.  So messages n levels deep have their sizes recomputed n times during serialization?  That would be slow.  But this sounds like an implementation deficiency to me.  :)

BTW, an optimization that we have seen work well:  When writing an embedded message, assume that its size will be less than 128 and thus will fit in one byte.  Then, you can write the sub-message and, afterwards, go back and fill in the 1-byte hole you left with the size.  If it turns out the size was too big, you have to move the data over a byte, but most messages are small.

I guess you could also leave a 2-byte hole, and if it turns out you only needed 1 byte, you can just write an overlong varint.  Wastes a little space and makes your message non-canonical, though.
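
The 1-byte-hole trick can be sketched like this (a Python toy over a bytearray; `write_body` is any callback that appends the sub-message bytes -- both names are illustrative, not from any real API):

```python
def varint(n: int) -> bytes:
    """Minimal protobuf varint encoder."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def write_with_hole(out: bytearray, write_body) -> None:
    """Speculatively reserve one length byte, write the body, then patch
    the hole.  If the body turns out to be >= 128 bytes, shift it right to
    make room for the longer length varint (rare if most messages are small)."""
    hole = len(out)
    out.append(0)                     # 1-byte placeholder for the length
    write_body(out)
    size = len(out) - hole - 1
    encoded = varint(size)
    if len(encoded) == 1:
        out[hole] = encoded[0]        # common case: length fits in the hole
    else:
        out[hole + 1:hole + 1] = bytes(len(encoded) - 1)  # move body over
        out[hole:hole + len(encoded)] = encoded
```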

Marc Gravell

unread,
Jul 31, 2008, 3:50:17 PM7/31/08
to Protocol Buffers
> So it sounds like length-delimited (i.e. sub-messages) are more
> efficient for reading, and tag-delimited (i.e. groups) are more
> efficient for writing.

I believe length-prefixed is only more efficient (for reading) when
*skipping* data in unexpected fields. For expected fields the two
should be equivalent. At the moment my code is slightly *slower* when
reading streams, but that is an implementation niggle that I hope to
fix soon.

> As Kenton mentioned, groups are syntactically
> unattractive; perhaps it might be better if there were a field option
> that could be applied to a sub-message field, indicating whether it
> should be length-delimited or tag-delimited?

Actually, I misread the spec the first time; I was reading
it as (for example):

repeated group MessageType fieldName = 1;

Now, I'm not suggesting it needs changing, but if you think about
groups in that way there is no smell; it is, as you say, just a field
option.

> I think it's possible to
> make the parser smart enough that it could handle either format, so
> changing the option from length-delimited to tag-delimited or vice-versa
> would not break compatibility.

For info, protobuf-net already accepts (when reading) either format
for sub-messages, regardless of what was defined ;-p
(but it writes what was defined)

Marc

Marc Gravell

unread,
Jul 31, 2008, 3:56:03 PM7/31/08
to Protocol Buffers
> I see. So messages n levels deep have their sizes recomputed n times during
> serialization? That would be slow.

Not outrageously so, but enough to be irritating.

> But this sounds like an implementation deficiency to me. :)

I'll agree to "a consequence of an alternative approach". In contrast,
though, I could argue that the fully-buffered approach forces a large
serialization footprint for large payloads ;-p (although this is
already assumed, since the entire message needs to be in memory,
whereas protobuf-net can process data from an external object-source
on-the-fly).

> I guess you could also leave a 2-byte hole, and if it turns out you only
> needed 1 byte, you can just write an overlong varint.  Wastes a little space
> and makes your message non-canonical, though.

A neat trick, but at the moment protobuf-net writes directly to the
output stream (which is buffered internally); so without assuming
"seek" this isn't possible... maybe I'll have to consider always
writing to a buffer first... it isn't a big change, it is simply
something I wanted to avoid...

Marc

Kenton Varda

unread,
Jul 31, 2008, 5:25:39 PM7/31/08
to Marc Gravell, Protocol Buffers
I'm not recommending the "fully-buffered" approach.  Instead, you should compute all of the sizes of sub-messages in a first pass, then serialize as a second pass.  You can store the sizes either on the original objects if there is a way to do that, or store them into a vector which is then consumed while writing.
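
In outline, that two-pass scheme might look like this (a hypothetical `Node` type stands in for an arbitrary user object tree; all children are assumed to sit at field 1, and sizes are cached externally since we don't own the objects):

```python
class Node:
    """Stand-in for an arbitrary user object with nested children."""
    def __init__(self, payload: bytes, children=()):
        self.payload = payload          # pre-encoded scalar fields
        self.children = list(children)

def varint(n: int) -> bytes:
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def compute_sizes(node: Node, sizes: dict) -> int:
    """Pass 1: compute every node's encoded size bottom-up, memoized
    in a dict keyed by object identity."""
    size = len(node.payload)
    for child in node.children:
        body = compute_sizes(child, sizes)
        size += 1 + len(varint(body)) + body  # tag byte + length + body
    sizes[id(node)] = size
    return size

def serialize(node: Node, sizes: dict, out: bytearray) -> None:
    """Pass 2: write lengths straight from the cache -- no recomputation."""
    out += node.payload
    for child in node.children:
        out.append(0x0A)                      # field 1, wire type 2 (LEN)
        out += varint(sizes[id(child)])
        serialize(child, sizes, out)
```

Each nested size is computed exactly once, instead of once per enclosing level.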

gsxr

unread,
Jul 31, 2008, 5:54:53 PM7/31/08
to Protocol Buffers
Hi Kenton,

Still following your interesting group.

This is an observation not directly contributing to the topic of this
thread.

I recently reviewed the topics on this discussion group to try to gain
further insight into "protocol buffers" as a product and a technology.
There is real interest from the software community, and the
integrations achieved with other toolsets are impressive.

There are a significant number of discussion threads that (IMO) arise
from a common root cause; even this topic may be part of that sample.

i.e. the encoding of protocol buffers and the related parsing is not
self-terminating.

Perhaps that observation is of interest to you?

Cheers,
Scott


Alek Storm

unread,
Jul 31, 2008, 7:25:52 PM7/31/08
to Protocol Buffers
On Jul 31, 2:50 pm, Marc Gravell <marc.grav...@gmail.com> wrote:
> > I think it's possible to
> > make the parser smart enough that it could handle either format, so
> > changing the option from length-delimited to tag-delimited or vice-versa
> > would not break compatibility.
>
> For info, protobuf-net already accepts (when reading) either format
> for sub-messages, regardless of what was defined ;-p
> (but it writes what was defined)

I was about to suggest creating an option to specify whether to
serialize a message as tag-delimited or length-delimited, with the
deserializer accepting either, but it looks like Jeremy was way ahead
of me. However, I think it should be a message-level option, which
defaults to length-delimited and is overridable at the field level.
That way, you can use it in top-level messages, which solves the
common problem of users having to implement their own protocol to send
multiple messages over a wire. For top-level messages, it would have
an extra "none" value, which would generate no delimiters (current
behavior). "None" would be useful in precisely two cases: you're
storing exactly one message in a file, or you're implementing your
own, more complicated, wrapper around PB messages. Fields whose
message type is "none"-delimited, if not overridden at the field
level, would use length delimiters. For example:

message Foo {
  option delimiter = TAG;
  ...
}

message Bar {
  option delimiter = NONE;
  ...
}

message Baz {
  optional Foo f = 1;
  optional Bar b = 2 [delimiter = LENGTH];
}

Marc Gravell

unread,
Aug 1, 2008, 4:31:33 AM8/1/08
to Protocol Buffers
For info, I tweaked my code to use a fully-buffered approach*, and
this makes the numbers far more appealing (I'll update the stats on
the protobuf-net page in a moment...). Groups are still /slightly/
faster as it can avoid the need to build a buffer at all, but not so
noticeably now. I used a few tricks to avoid /nested/ buffers so there
will be at most 1 additional copy of the data, which helps. I also
fixed the issue with length-prefixed streams being slower to
deserialize.

All in all not a bad train journey ;-p

Marc

* = as a quick'n'dirty "see how it performs" hack; I might look at
making the buffer decision optional, and/or caching the lengths [the
problem there being that we aren't in control of the data-objects, so
there is nowhere handy to put the data, and any external lookup is
going to have a cost of its own...]

Torbjörn Gyllebring

unread,
Aug 5, 2008, 3:18:07 PM8/5/08
to Marc Gravell, Protocol Buffers

Just a small followup on this. I implemented group encoding for submessages, and the numbers for serialization, compared to my naive length-prefix approach, show a saving of 40%.

I'm guessing it's partly due to my stupid handling of nested messages:

* Write submessage to MemoryStream

* Write length to outer stream

* Copy submessage stream to outer stream.
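
In Python terms, a rough sketch of those same three steps (`encode_body` is any callback that writes the submessage fields; the varint helper is a stand-in for whatever length encoder the implementation uses):

```python
import io

def varint(n: int) -> bytes:
    """Minimal protobuf varint encoder."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def write_submessage(outer: io.BytesIO, encode_body) -> None:
    scratch = io.BytesIO()            # 1. write submessage to a scratch stream
    encode_body(scratch)
    body = scratch.getvalue()
    outer.write(varint(len(body)))    # 2. write length to the outer stream
    outer.write(body)                 # 3. copy submessage to the outer stream
```

Every nesting level pays for one extra buffer and one extra copy of its body, which is exactly the cost that group encoding avoids.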

@Marc: Did you get comparable numbers, how does precalculating submessage length change things?

Marc Gravell

unread,
Aug 6, 2008, 12:01:06 AM8/6/08
to Protocol Buffers
> @Marc: Did you get comparable numbers, how does precalculating submessage
> length change things?

Originally, yes - but that was because I was doing a full length
calculation before writing the length, rather than serializing. I
swapped this to write to a single buffer throughout (which sounds
similar to what you have done?) and the numbers are now very similar
for group and length-prefix. I think this is a good thing, as I'd
rather the default was length-prefix (for commonality) - so I'm glad
they are now comparable. It sounds like yours is quicker again,
though!

Marc

Kenton Varda

unread,
Aug 6, 2008, 12:19:43 AM8/6/08
to Marc Gravell, Protocol Buffers
Again, note that the Google-authored protobuf libraries pre-compute all lengths and do it very fast.  Benchmarks of the C++ implementation in particular show that only about 10% of serialization time on average is spent computing the sizes, and precomputing sizes has other performance advantages (like being able to allocate an appropriate amount of space for the message ahead of time).

Torbjörn Gyllebring

unread,
Aug 6, 2008, 1:56:32 AM8/6/08
to Kenton Varda, Marc Gravell, Protocol Buffers

This is perfect, since it gives me a target for the length calculation, and as you say it should bring benefits when it comes to memory allocation that I suspect will offset the time needed.

Thanks.

Torbjörn Gyllebring

unread,
Aug 6, 2008, 2:02:52 AM8/6/08
to Marc Gravell, Protocol Buffers

My current default submessage writer creates a new memory stream for each submessage, writes to that, then copies the content to the target stream. Reusing the same scratch buffer shaves about 10% off that, but has the potential to keep a lingering buffer in memory long after its usefulness; using group encoding removes any need to move data around and hence lets me write directly to the target stream, taking away 40% of the original time. From Kenton's reply indicating that length calculation can be done in around 10% of the total serialization time, I'm guessing that length prefix can be as fast, or even faster, since it gives the option to preallocate a big enough buffer, enabling some potential optimizations when it comes to buffering. Need to investigate this further.

Kenton Varda

unread,
Aug 6, 2008, 2:13:34 AM8/6/08
to Torbjörn Gyllebring, Marc Gravell, Protocol Buffers
I should probably clarify that the relative time of length computation vs. serialization, like any benchmark, can vary drastically depending on the data.  E.g. if you have a message containing lots of strings, that will make length computation look cheaper since it doesn't have to scan those strings (in C++, at least, where we keep the bytes stored in UTF-8 at all times).  If you have a message containing lots of optional int32s, length computation will probably appear more expensive.  It also depends on whether the entire message object fits in cache.  If so, the length pass will just prep the cache for the serialization pass; otherwise, both passes are going to go directly to main memory, which could easily be the bottleneck.  In any case, 10% was just a rough median from glancing at the benchmarks.

Torbjörn Gyllebring

unread,
Aug 6, 2008, 2:23:03 AM8/6/08
to Kenton Varda, Marc Gravell, Protocol Buffers

Yes, of course - as always, mileage may vary :)

Probably the only way to find out for sure is to implement it, and that really wasn't worthwhile under the assumption that it would consume about as much time as actually doing the serialization - which is true if you disregard memory-movement costs. For me, what the group encoding experiment showed was basically that somewhere close to 40% of my serialization time is spent moving submessages between buffers (using the Northwind proto to have a common ground); in light of that, length calculation really doesn't need to be super speedy to start paying for itself in many scenarios. Also, the added benefit of removing some memory pressure from the GC and allocator is probably a good thing globally in many cases.

Marc Gravell

unread,
Aug 6, 2008, 2:48:01 AM8/6/08
to Protocol Buffers
> Reusing the same scratch buffer shaves about 10% off that, but has the
> potential to keep a lingering buffer in memory long after its usefulness,

Not if you do it right... if you look at
SerializationContext.WriteLengthPrefixed (note I have some hefty
changes to commit at some point...), it checks whether the current
stream is a MemoryStream; if it is, it manipulates it directly. If
not, it creates a MemoryStream and serializes to that, noting that any
deeply nested messages will find themselves writing to that
MemoryStream, so they can cheat. There is a smaller scratch buffer
(SerializationContext.Workspace), but that is used for very localised
byte processing.

Note also the use of a length-underestimate (again, to change
shortly) - this allows strings (as the classic example) to quickly say
"195 bytes or more", so we can allow a suitable guess for the size of
the length-prefix, and just blit the data if we guessed sufficiently
badly (a few high unicode values usually won't change the number of
bytes needed for the prefix). But like I say - all this is up in the
air at the moment.

What I do know is that *at least under my own implementation*, this
approach is considerably faster than accurately computing the length
when there are deeply nested sub-messages with complex data
(optionals, irksome decimal/DateTime, high-utf strings, etc). And it
also means we can use a pure "read once" approach for data sources
like IEnumerable<T>.

Anyway, those were my findings... ;-p

Marc

chi...@gmail.com

unread,
Sep 2, 2008, 8:37:26 PM9/2/08
to Protocol Buffers
Hi,

I just spotted this old thread and wanted to revive it with my 2
cents. I agree that the group syntax in the proto files needs to go
away. The message syntax is much more powerful and provides the same
functionality.

But I think that the encoding format of a message field needs to be
optionally controlled by a field-level option. Pre-computing the
message size does not add too much overhead, but it IS additional
overhead which does not always provide added benefit. This overhead
will only get larger on deeply nested complex messages. In
applications where 90% of your CPU is consumed by message marshaling,
even a 10% performance increase is a substantial savings.

So I want to re-iterate that I think it's vital that both encoding
forms be supported; let the end user choose which encoding form
best suits his needs. I think it would make everyone happy if we
supported something like:

message Bar {
  required Foo field1 = 1 [encode=STREAM];
  required Foo field2 = 2 [encode=FRAME];
}

BTW.. in the Java case, prefixing the size will not help
performance. I suspect that this would be the case in most garbage-
collected languages. In Java, the only way to get a performance boost
out of the size prefix would be to use deferred de-marshalling. But if
you did something like that, you would lose the ability to validate
the correctness of the message, since you don't fully de-marshal the
entire message. But even then, that might be a good option to have in
some use cases.

Regards,
Hiram
