encoding of embedded messages and repeated elements

etorri

unread,

Jun 23, 2009, 12:06:35 PM6/23/09

to Protocol Buffers

Hello,

The "length delimited" encoding basically tells that the following N
bytes belong to this field. Wouldn't it be easier to instead use the
number of elements that belong to the embedded message (repeated
element).

Now (as far as I have understood) the message needs to be built from
fragments and then collected together as the lengths are not known
beforehand and it would be expensive to calculate the byte-length of
the embedded message.

Instead, it would be relatively inexpensive to calculate just the
number of following elements that make the embedded message before
starting to encode it.

This would enable streaming of PB or encoding and sending the elements
right as they are encoded.

Sorry if I misunderstood something. I have just started looking at BP.

Alek Storm

unread,

Jun 23, 2009, 12:17:32 PM6/23/09

to Protocol Buffers

Hi etorri,

Embedded messages and strings have the exact same wire format. When parsing a message, it's impossible to know whether you're parsing one or the other, and since strings have to be encoded using their length in bytes, we can't do something different for embedded messages.

Cheers,
Alek

--
Alek Storm

etorri

unread,

Jun 23, 2009, 1:37:39 PM6/23/09

to Protocol Buffers

ok thanks. it is as it is.

(just looking at the feasibility of implementing the PB in Ada for my
own projects :-)

Kenton Varda

unread,

Jun 23, 2009, 3:07:28 PM6/23/09

to etorri, Protocol Buffers

The advantage of writing the length is that a parser can skip the entire sub-message easily without having to parse its contents. Otherwise, we would probably use the "group" encoding for sub-messages, where a special end tag marks the end of the message.

On Tue, Jun 23, 2009 at 9:06 AM, etorri <e...@torri.be> wrote:

etorri

unread,

Jun 24, 2009, 3:57:23 AM6/24/09

to Protocol Buffers

Does some existing parser actually implement that skipping feature?

There would not be any need for a end-tag. Let's assume that there
would be two different tags

2 - Length_Delimited, which could contain a packed list of bytes
(string, memory block) or other types where the parser needs to know
what is packed inside (no tags)
6 - Group or Element_Delimited - which would be like Length_Delimited
but have the number of elements that follow that belong to this field

So for an example message where the first field is a group

(1,6),3 - field numbered 1 of the message, type 6 = Group and 3
elements that follow belong to this group
(1,2),5,"Hello" -field number 1 of the embedded message would be a
string
(3,1),120 - field nr 3 of the embedded message, varint of value 120
(4,1),0 - field nr 4 of the embedded message, varint of value 0
(2,2),5,"World" - field nr 2 of the message

this would be the encoding of the following TheMessage

message Embedded {
required string greeting = 1;
optional int32 useless = 2;
required int32 good = 3;
required int32 evil = 4;
}

message TheMessage {
required Embedded e = 1;
required string target = 2;
}

So in this case there would not be need for an end tag. When
constructing the message it should be relatively easy to count the
number of embedded elements instead of knowing how much space they
occupy. This would enable streaming/serializing the elements
recursively out one by one.

Kenton Varda

unread,

Jun 24, 2009, 2:42:02 PM6/24/09

to etorri, Protocol Buffers

The end-tag approach is more efficient than your idea -- it's faster (no need to count elements at all) and it takes no more space (no need to write a count, which makes up for the extra space taken by the end tag).

But in any case, the encoding is not something we can change at this point, since protocol buffers is nothing without backwards-compatibility.

And yes, some existing parsers do, in fact, take advantage of the ability to skip over messages without parsing them, and there are many features that people are considering implementing (like lazy parsing) which would need this.

It actually turns out that pre-computing the size of the embedded message does not take very long compared to actually writing it.

Piotr Findeisen

unread,

Jun 25, 2009, 12:13:25 PM6/25/09

to Protocol Buffers

Hi!

On Jun 24, 8:42 pm, Kenton Varda <ken...@google.com> wrote:
> The end-tag approach is more efficient than your idea -- it's faster (no
> need to count elements at all) and it takes no more space (no need to write
> a count, which makes up for the extra space taken by the end tag).
> But in any case, the encoding is not something we can change at this point,
> since protocol buffers is nothing without backwards-compatibility.

As I read the code of C++ protobuf deserializer I found it supports
end-tag approach using END_GROUP constant -- or I just misunderstood
the code and/or this thread?

From my experiments it looks like I can stream messages one by one
separating them with END_GROUP tag, but -- again from comments in the
code -- it's deprecated. If "protocol buffers is nothing without
backwards-compatibility", can I assume that existing and future
implementation of C++ and also Java/Python deserializers will support
this approach?

best regards,
Piotr

Kenton Varda

unread,

Jun 25, 2009, 1:05:25 PM6/25/09

to Piotr Findeisen, Protocol Buffers

Yes, "groups" are never going to fully go away. But we recommend against using them in new code.

Reply all

Reply to author

Forward