Marshalling 2 messages will Unmarshall to 1 Message: Feature or Bug?


chi...@gmail.com

Sep 3, 2008, 3:07:47 PM
to Protocol Buffers
So it seems to me that by design, if you marshall 2 messages to a
stream and then try to read that stream, you will only get 1 message
back, the 2nd being merged into the 1st.

Is this a feature that folks actively look for? A beginner would
expect to be able to de-marshal the 2 messages back.

A quick fix for this would be to always add an end of group tag after
a message is marshalled. I think this should still be backward
compatible with older clients since they should ignore the end of
group tag.

Marc Gravell

Sep 3, 2008, 3:42:36 PM
to Protocol Buffers
It is, I understand, very much by design that the protocol allows you
to concatenate messages to (for example) add additional fields. If you
want to serialize multiple messages, the simplest approach is to
serialize an outer message that has a repeated message as content -
there will be only one outer message but multiple inner messages.
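
To make the merging behaviour concrete, here is a minimal sketch that does not use the protobuf library at all. It hand-encodes a hypothetical toy message (`optional int32 id = 1; repeated int32 values = 2;`, both varint fields) and parses two concatenated encodings as if they were one message. The names `Toy`, `Serialize`, `MergeFrom` and `ParseConcatenated` are assumptions made for illustration, not part of any real API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical toy message: optional int32 id = 1; repeated int32 values = 2;
struct Toy {
  bool has_id = false;
  std::uint64_t id = 0;
  std::vector<std::uint64_t> values;
};

// Appends v as a base-128 varint (low 7 bits first, high bit = "more bytes").
void PutVarint(Bytes& out, std::uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<std::uint8_t>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(v));
}

// Reads a varint starting at pos; returns false on truncated input.
bool GetVarint(const Bytes& in, std::size_t& pos, std::uint64_t& v) {
  v = 0;
  for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
    const std::uint8_t b = in[pos++];
    v |= static_cast<std::uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) return true;
  }
  return false;
}

// Writes each field as (tag varint, value varint); wire type 0 is varint.
Bytes Serialize(const Toy& m) {
  Bytes out;
  if (m.has_id) {
    PutVarint(out, (1u << 3) | 0u);
    PutVarint(out, m.id);
  }
  for (std::uint64_t v : m.values) {
    PutVarint(out, (2u << 3) | 0u);
    PutVarint(out, v);
  }
  return out;
}

// Parses until end of input. Nothing on the wire marks a message boundary,
// so fields from a second concatenated message simply merge into m.
bool MergeFrom(const Bytes& in, Toy& m) {
  std::size_t pos = 0;
  while (pos < in.size()) {
    std::uint64_t tag = 0, value = 0;
    if (!GetVarint(in, pos, tag) || !GetVarint(in, pos, value)) return false;
    if ((tag & 7) != 0) return false;  // toy parser: varint fields only
    switch (tag >> 3) {
      case 1: m.has_id = true; m.id = value; break;  // scalar: last one wins
      case 2: m.values.push_back(value); break;      // repeated: appended
      default: break;                                // unknown field: ignored
    }
  }
  return true;
}

// Serializes a and b back to back and parses the result as one message.
Toy ParseConcatenated(const Toy& a, const Toy& b) {
  Bytes stream = Serialize(a);
  const Bytes second = Serialize(b);
  stream.insert(stream.end(), second.begin(), second.end());
  Toy merged;
  const bool ok = MergeFrom(stream, merged);
  assert(ok);
  (void)ok;
  return merged;
}
```

In this sketch, parsing the concatenation of `{id: 1, values: [10]}` and `{id: 2, values: [20]}` gives a single message with id 2 and values 10, 20: the scalar is overwritten and the repeated field is extended.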

Marc

Chris

Sep 3, 2008, 5:13:56 PM
to Marc Gravell, Protocol Buffers
Pardon me, but now I need to ask questions.

If you marshal a message is it preceded by its length?
If so, then each message can be read individually.
If not, then there is nothing on the wire to indicate where the boundary
lies.

So the right way to put 2 messages next to each other is with their
length written first (like when they are fields).

Or am I missing something?

--
Chris

Marc Gravell

Sep 3, 2008, 5:22:55 PM
to Protocol Buffers
> If you marshal a message is it preceded by its length?
One message? no. Nested messages *are* either length-prefixed or group-
delimited, though.

> If not, then there is nothing on the wire to indicate where the boundary
> lies.
Correct

> So the right way to put 2 messages next to each other is with their
> length written first (like when they are fields).
Except that that isn't legal on the wire for an outer-message... the
first byte(s) is/are assumed to be the first field token.

So if you want to send two Foo messages, simply send a single Bar that
has (for example)

message Bar {
repeated Foo items = 1;
}

Then a Bar with 2 Foo items will be written as:

{field 1 | string wire-type}{length of first Foo}{body of first Foo}
{field 1 | string wire-type}{length of second Foo}{body of second Foo}
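
The layout above can be sketched with the standard library alone. This is only an illustration under stated assumptions: each Foo is taken to be already serialized into a byte string, and the helper names `PutVarint`, `MakeTag` and `AppendItem` are hypothetical, not protobuf API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends v as a base-128 varint.
void PutVarint(std::string& out, std::uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// A tag is (field_number << 3) | wire_type. Wire type 2 is length-delimited,
// the encoding shared by strings, bytes, and nested messages.
std::uint64_t MakeTag(std::uint32_t field_number, std::uint32_t wire_type) {
  assert(field_number > 0 && wire_type < 8);
  return (static_cast<std::uint64_t>(field_number) << 3) | wire_type;
}

// Appends one already-serialized Foo to the stream as an entry of
// `repeated Foo items = 1;` in Bar.
void AppendItem(std::string& bar_stream, const std::string& foo_bytes) {
  PutVarint(bar_stream, MakeTag(1, 2));     // {field 1 | wire type 2} = 0x0A
  PutVarint(bar_stream, foo_bytes.size());  // {length of Foo}
  bar_stream += foo_bytes;                  // {body of Foo}
}
```

Because each call only appends, a writer could emit items one at a time to a log or socket, and the bytes written so far would always form a valid Bar.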

Does that make sense?

And of course you can now also send other messages too (the data does
not have to be homogeneous).

Marc

chi...@gmail.com

Sep 3, 2008, 5:25:27 PM
to Protocol Buffers
Yes, but this tends not to work well when, for example, writing
messages to a log file or sending multiple messages asynchronously
over a long-lived socket.
While I could frame the messages myself, I kinda dislike that because
the wire encoding of the stream is then not 100% described by the
proto file.

Could I just manually delimit the end of a message by adding the end
of group tag to the stream myself after the message is encoded? And
if I did, would all implementations properly decode it to multiple
messages?

Kenton Varda

Sep 3, 2008, 5:36:10 PM
to chi...@gmail.com, Protocol Buffers
On Wed, Sep 3, 2008 at 2:25 PM, hi...@hiramchirino.com <chi...@gmail.com> wrote:
> Could I just manually delimit the end of message by adding in the end
> of group tag myself to the stream after the message is encoded?  And
> if did would all implementations properly decode to multiple messages?

Existing parsers will reject a message that contains an unexpected endgroup tag.

However, if you construct the CodedInputStream yourself on the receiving end, you can instruct it to expect an end-group tag rather than EOF.

Marc Gravell

Sep 4, 2008, 12:06:05 AM
to Protocol Buffers
See also the thread "Facilitate Object Streaming Options"

Marc

Chris

Sep 4, 2008, 3:22:17 AM
to Kenton Varda, chi...@gmail.com, Protocol Buffers
There is no need to break the wire format by adding unexpected end tags.

The plan of creating a "Bar { repeated Message foo = 1 }" and reading a
Bar means that every Message must be read in a single operation. One
might need to read each Message one at a time.

The only thing that needs to be done is to expose two operations in the API:

(1) Write the varint encoded size of the message followed by the message
(2) Read the varint encoded size of the message followed by the message

And these operations are already present, because this is how a Message
field is written and read!
Note that I do not propose writing a (field id,wire type) tag, just the
length.
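
As a rough illustration of operations (1) and (2), here is a standard-library-only sketch that treats each serialized message as an opaque byte string. The function names `WriteDelimited` and `ReadDelimited` are assumptions for this example and do not correspond to an existing protobuf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends v as a base-128 varint.
void PutVarint(std::string& out, std::uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// Reads a varint starting at pos; returns false on truncated input.
bool GetVarint(const std::string& in, std::size_t& pos, std::uint64_t& v) {
  v = 0;
  for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
    const std::uint8_t b = static_cast<std::uint8_t>(in[pos++]);
    v |= static_cast<std::uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) return true;
  }
  return false;
}

// (1) Write the varint-encoded size of the message followed by the message.
void WriteDelimited(std::string& stream, const std::string& message_bytes) {
  PutVarint(stream, message_bytes.size());
  stream += message_bytes;
}

// (2) Read the varint-encoded size, then exactly that many bytes.
// Returns false at end of stream or on a truncated frame; pos is advanced
// only on success.
bool ReadDelimited(const std::string& stream, std::size_t& pos,
                   std::string& message_bytes) {
  assert(pos <= stream.size());
  std::size_t cursor = pos;
  std::uint64_t size = 0;
  if (!GetVarint(stream, cursor, size)) return false;
  if (size > stream.size() - cursor) return false;
  message_bytes = stream.substr(cursor, static_cast<std::size_t>(size));
  pos = cursor + static_cast<std::size_t>(size);
  return true;
}
```

A reader would call `ReadDelimited` in a loop and hand each returned byte string to the ordinary message parser. Note that an empty message still occupies one byte (a zero length), so it remains distinguishable on the wire.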

All of the C++/Java/Python/Haskell/Lisp/C/C#/Matlab/... bindings
will always need (1) and (2) internally. So they need only expose
these in the external API.

In fact, I already had exposed (1) and (2) in Haskell because I assumed
they were going to be useful for this purpose.
In fact, I thought this was such a good idea that I assumed the other
APIs did this and did not realize this was not part of the other APIs.
Perhaps it is part of the current API?

Using this only affects the top-level of a given wire-stream, making it
look more like a protocol channel with a series of (potentially
different) messages.

Cheers,
Chris

Mats Kindahl

Sep 4, 2008, 6:25:00 AM
to Kenton Varda, chi...@gmail.com, Protocol Buffers
Kenton Varda wrote:
> On Wed, Sep 3, 2008 at 2:25 PM, hi...@hiramchirino.com <chi...@gmail.com>wrote:
>
>> Could I just manually delimit the end of message by adding in the end
>> of group tag myself to the stream after the message is encoded? And
>> if did would all implementations properly decode to multiple messages?
>
>
> Existing parsers will reject a message that contains an unexpected endgroup
> tag.
>
> However, if you construct the CodedInputStream yourself on the receiving
> end, you can instruct it to expect an end-group tag rather than EOF.


Check the post below for some information on how I handled it. Yes, I'm using
CodedInputStream and framing the messages myself.

http://mysqlmusings.blogspot.com/2008/08/missing-pieces-in-protobuf-binary-log.html

Just my few cents,
Mats Kindahl

Marc Gravell

Sep 4, 2008, 6:55:33 AM
to Protocol Buffers
> Note that I do not propose writing a (field id,wire type) tag, just the
> length.

Which means that it is no longer wire-compatible. If you want to
stream individual messages, then it should be possible: Jon had some
thoughts for his implementation (which derives from the java version).
protobuf-net has full support for streaming individual messages (both
read and write) in the guise of a wrapper message (the "repeated
Bar"), which retains full compatibility with other clients.
Additionally, sending the tag/wire-type allows you to a: send
different types of messages in the same stream [perfect for RPC], and
b: use groups instead of length-prefixed if you choose.

That's my tuppence, anyway... (see the end of the other chain I cited
for more information)

Marc

chi...@gmail.com

Sep 4, 2008, 9:29:23 AM
to Protocol Buffers


On Sep 3, 5:36 pm, "Kenton Varda" <ken...@google.com> wrote:
> On Wed, Sep 3, 2008 at 2:25 PM, hi...@hiramchirino.com <chir...@gmail.com>wrote:
>
> > Could I just manually delimit the end of message by adding in the end
> > of group tag myself to the stream after the message is encoded?  And
> > if did would all implementations properly decode to multiple messages?
>
> Existing parsers will reject a message that contains an unexpected endgroup
> tag.
>

Not the existing Java and C++ implementations, at least. Notice that
MergeFrom has a default switch case where the end of group tag is
checked, and if it is hit, parsing stops. Why do you say that
existing parsers would reject the message if an end group tag is
appended?

    default: {
    handle_uninterpreted:
      if (::google::protobuf::internal::WireFormat::GetTagWireType(tag) ==
          ::google::protobuf::internal::WireFormat::WIRETYPE_END_GROUP) {
        return true;
      }
      DO_(::google::protobuf::internal::WireFormat::SkipField(
          input, tag, mutable_unknown_fields()));
      break;
    }

Regards,
Hiram

Chris

Sep 4, 2008, 3:58:57 PM
to Protocol Buffers
Marc,

I do not know what protobuf-net's streaming is doing (and I read
http://groups.google.com/group/protobuf/browse_thread/thread/951ed9d0359184ea/7713405ac3599fb1?hl=en&lnk=gst&q=Facilitate#7713405ac3599fb1
).

Adding a wrapper message does nothing for the issue of delimiting the
outermost message.

And using a single wrapper message means that one has to read all the
input before using the first contained message.

I do not propose removing the current way of doing things. I merely
proposed adding to the API so that length prefixed messages can be
written and read. This could only be a problem if one does not know
whether a stream has such a length prefix, but in that case I doubt you
would also know which kind of message to try reading.

Question: I have not looked for the answer in the code yet, but how
does the service/method API delimit the request and answer?

--
Chris

Kenton Varda

Sep 5, 2008, 3:50:58 PM
to Chris, chi...@gmail.com, Protocol Buffers
So basically what you want is these two functions:

// Writes a varint length prefix followed by the serialized message.
bool WriteLengthDelimitedMessage(
    const protobuf::Message& message,
    protobuf::io::ZeroCopyOutputStream* output) {
  protobuf::io::CodedOutputStream coded_output(output);
  int size = message.ByteSize();  // also caches the sizes used below
  if (!coded_output.WriteVarint32(size)) return false;
  if (!message.SerializeWithCachedSizes(&coded_output)) return false;
  return true;
}

// Reads a varint length prefix, then parses exactly that many bytes.
bool ReadLengthDelimitedMessage(
    protobuf::io::ZeroCopyInputStream* input,
    protobuf::Message* message) {
  protobuf::io::CodedInputStream coded_input(input);
  uint32 size;
  // The varint must be read through the coded stream, not the raw input.
  if (!coded_input.ReadVarint32(&size)) return false;
  protobuf::io::CodedInputStream::Limit limit = coded_input.PushLimit(size);
  if (!message->ParseFromCodedStream(&coded_input)) return false;
  if (!coded_input.ConsumedEntireMessage()) return false;
  coded_input.PopLimit(limit);  // restore the limit returned by PushLimit()
  return true;
}

Feel free to use the above code (note: hasn't been compiled or tested).  I guess the question is whether or not this needs to be in libprotobuf itself.

Kenton Varda

Sep 5, 2008, 4:02:18 PM
to chi...@gmail.com, Protocol Buffers
After MergeFromCodedStream() returns, the caller checks that parsing ended on an endgroup tag (using CodedInputStream::LastTagWas()) or at EOF (CodedInputStream::ConsumedEntireMessage()).  Look at message.cc or wire_format_inl.h to see this.

You can get around these checks by creating a CodedInputStream yourself and calling ParseFromCodedStream() or MergeFromCodedStream() directly.  In this case it is up to you to decide whether or not parsing ended at the right place.

prot...@personal.mightyreason.com

Sep 6, 2008, 5:44:20 AM
to Kenton Varda, chi...@gmail.com, Protocol Buffers
Kenton Varda wrote:
> Feel free to use the above code (note: hasn't been compiled or
> tested). I guess the question is whether or not this needs to be in
> libprotobuf itself.

I am in favor of putting some kind of delimited message format into the
API, this is the requested feature. This will keep people from being
too creative and reinventing incompatible solutions to have delimited
messages. Such as the suggestions in this thread.

The LengthDelimited functions are the most obvious: all possible
implementations of messages must already have the functionality
internally (as your code shows). So a few lines of code in the API (not
the generated code files) takes care of this feature request. I have
not checked if the existing API for all languages exposes enough to
write the equivalent of the few lines of code you showed for C++.

The most obvious use to me is reading from a continuous (e.g. network)
stream of bytes. The outermost message needs to be delimited somehow,
currently by the application inventing more protocol rules.

Cheers,
Chris

Jon Skeet

Sep 8, 2008, 2:15:13 AM
to Protocol Buffers
On Sep 6, 10:44 am, proto...@personal.mightyreason.com wrote:
> Kenton Varda wrote:
> > Feel free to use the above code (note: hasn't been compiled or
> > tested).  I guess the question is whether or not this needs to be in
> > libprotobuf itself.
>
> I am in favor of putting some kind of delimited message format into the
> API, this is the requested feature.  This will keep people from being
> too creative and reinventing incompatible solutions to have delimited
> messages.  Such as the suggestions in this thread.

In my C# API I have a MessageStream (or something like that - I don't
have the source with me) which allows streaming by writing a sequence
of messages as if they were a repeated element with field 1 within a
container message. I then have a MessageStreamIterator which can
iterate through the messages in a stream in the obvious way.

The nice thing about this is that if you want to load everything in
one go, just create the virtual container message, and load that.

I agree that it would be nice to see this functionality in more
environments, and to agree on it as a common format for streaming. We
*could* add the flexibility of allowing the field number to be set on
both the iterator and the writer if that were deemed useful. I don't
think it would be good to mix message types within a stream, so just a
single number would suffice, perhaps defaulting to 1.

Jon

Kenton Varda

Sep 8, 2008, 1:10:26 PM
to prot...@personal.mightyreason.com, chi...@gmail.com, Protocol Buffers
Sure, I'd be OK with adding methods to Message like:

  bool SerializeDelimitedTo(CodedOutputStream* output);
  bool SerializeDelimitedToZeroCopyStream(ZeroCopyOutputStream* output);
  bool SerializeDelimitedToFile(int file_descriptor);
  bool SerializeDelimitedToOstream(ostream* output);

  bool ParseDelimitedFrom(CodedInputStream* input);
  bool ParseDelimitedFromZeroCopyStream(ZeroCopyInputStream* input);
  bool MergeDelimitedFrom(CodedInputStream* input);
  bool MergeDelimitedFromZeroCopyStream(ZeroCopyInputStream* input);
  // Note that we cannot parse a length-delimited message from
  // a file descriptor or an istream since these interfaces don't provide
  // a way to push data back into the stream if we read too far.

I'm pretty swamped, though.  Does someone want to write up a patch (with unit tests)?

Chris Kuklewicz

Sep 8, 2008, 1:40:33 PM
to Protocol Buffers
Hi Jon,

So we both agree there should be a way to delimit messages in a
stream. In your stream it sounds like there is

(A) The field number + wire type, which here is 1*8+2 = 10, written to
the stream as a varint in a single byte since it is <= 127.
(B) The length of the message as a varint
(C) The message body
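
For illustration, reading one such (A)+(B)+(C) record at a time could look like the sketch below. It uses only the standard library and treats the message body as opaque bytes; `ReadRecord` is a hypothetical helper name, not the C# or Haskell API under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Reads a varint starting at pos; returns false on truncated input.
bool GetVarint(const std::string& in, std::size_t& pos, std::uint64_t& v) {
  v = 0;
  for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
    const std::uint8_t b = static_cast<std::uint8_t>(in[pos++]);
    v |= static_cast<std::uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) return true;
  }
  return false;
}

// Reads one record: (A) tag, (B) length, (C) body.
// Returns false at end of stream, on truncation, or if the tag is not a
// length-delimited field (wire type 2) with a non-zero field number.
// pos is advanced only on success.
bool ReadRecord(const std::string& stream, std::size_t& pos,
                std::uint32_t& field_number, std::string& body) {
  assert(pos <= stream.size());
  std::size_t cursor = pos;
  std::uint64_t tag = 0, length = 0;
  if (!GetVarint(stream, cursor, tag)) return false;               // (A)
  if ((tag & 7) != 2 || (tag >> 3) == 0) return false;
  if (!GetVarint(stream, cursor, length)) return false;            // (B)
  if (length > stream.size() - cursor) return false;
  body = stream.substr(cursor, static_cast<std::size_t>(length));  // (C)
  field_number = static_cast<std::uint32_t>(tag >> 3);
  pos = cursor + static_cast<std::size_t>(length);
  return true;
}
```

Since the field number is returned to the caller, the same loop could also dispatch on it to handle different message types in one stream.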

And in your C# code, reading this one message at a time is supported
by a special API command.

So this is very close to my proposal (which is in my Haskell code). I
do not have (A) but I do have (B) and (C).

Does your C# API allow (in some way) the user to read (B)+(C) as a
single message from the stream? Does it allow the user to write such
a thing?

In my Haskell code I could make a wrapper message and an extension key
to allow it to write (A)+(B)+(C). And this could read all the
messages in the stream. But I lack a special API command to read a
single such message from the stream. The (A)+(B)+(C)+(A)+(B)+(C)+...
is wire compatible with such a repeated message but your reading API
is special. Since a special API is needed I chose (B)+(C)+(B)+(C)+(B)+
(C)+... and if the user wants them all then the user has to loop over
this collecting a single message at a time.

Where is your C# API documented, in case I wish to support your (A)+(B)
+(C) single message reading behavior?

As for mixing message types in a stream, I think the ability to have a
heterogeneous stream (or a stream embedded in another protocol) will
be needed by someone. So the main question, like with delimited
messages, is whether each application invents its own technique. One
could use a serialized DescriptorProto Message to announce the type
before each message. All the fields are optional, so the herald
message needs only to set the name field. Or perhaps serialize a
(nearly empty) FileDescriptorProto with the (nearly empty) embedded
DescriptorProto would make it more clear which object is being
serialized. But this is getting me off track.

The main point of the message is to see if the C# and Haskell code can
agree on common APIs and encodings for delimited messages in a
continuous stream.

Here is a generalization question: If we adopt a way to read (A)+(B)+
(C) then will that generalize beyond message types to groups or to
basic types like Double? What happens in the special API in your C#
code if one tries to do this or encounters such a stream? My Haskell
API only works with messages. I have not exposed a friendly way to
encode or decode anything else.

Cheers,
Chris

Jon Skeet

Sep 9, 2008, 4:25:23 AM
to Protocol Buffers
On Sep 8, 6:40 pm, Chris Kuklewicz <turingt...@gmail.com> wrote:
>    So we both agree there should be a way to delimit messages in a
> stream.  In your stream is sounds like there is
>
> (A) The field type + wire type, which here is 1*8+2 = 10 written as a
> varint the stream in a single byte since it is <= 127.
> (B) The length of the message as a varint
> (C) The message body
>
> And in you C# code reading this one message at a time is support by a
> special API command.

Correct.

> So this is very close to my proposal (which is in my Haskell code).  I
> do not have (A) but I do have (B) and (C).
>
> Does you C# API allow (in some way) the user to read (B)+(C) as a
> single message from the stream?  Does it allow the user to write such
> a thing?

You could do it using CodedInputStream/CodedOutputStream directly, but
it wouldn't be terribly pleasant.

> In my Haskell code could make a wrapper message and an extension key
> to allow it to write (A)+(B)+(C).  And this could read all the
> messages in the stream.  But I lack a special API command to read a
> single such message from the stream.  The (A)+(B)+(C)+(A)+(B)+(C)+...
> is wire compatible with such a repeated message but your reading API
> is special.  Since a special API is needed I chose (B)+(C)+(B)+(C)+(B)+
> (C)+... and if the user wants them all then the user has to loop over
> this collecting a single message at a time.
>
> Where is your C# API documented, in case I wish to support your (A)+(B)
> +(C) single message reading behavior?

I haven't got as far as docs yet, but the discussion around this is in
the earlier "streaming messages" thread.

> As for mixing message types in a stream, I think the ability to have a
> heterogeneous stream (or a stream embedded in another protocol) will
> be needed by someone.  So the main question, like with delimited
> messages, is whether each application invents its own technique.  One
> could use a serialized DescriptorProto Message to announce the type
> before each message.  All the fields are optional, so the herald
> message needs only to set the name field.  Or perhaps serialize a
> (nearly empty) FileDescriptorProto with the (nearly empty) embedded
> DescriptorProto would make it more clear which object is being
> serialized.  But this is getting me off track.

I think it's best to concentrate on the simple requirement first, and
not guess too much about what would be needed. Use cases for an
homogenous stream are easy to come up with - the simplest being
logging, for example.

> The main point of the message is to see if the C# and Haskell code can
> agree on common APIs and encodings for delimited messages in a
> continuous stream.

The benefit of the A+B+C format is that it's not really coming up with
a new encoding, so much as just a new way of reading repeated "field
1" messages. The whole data still ends up being a valid message of the
appropriate type.

But yes, I agree that a common API would be good to decide.

> Here is a generalization question:  If we adopt a way to read (A)+(B)+
> (C) then will the generalize beyond message types to groups or to
> basic types like Double ? What happens in the special API in your C#
> code if one tries to do this or encounters such a stream?  My Haskell
> API only works with messages.  I have not exposed a friendly way to
> encode or decode anything else.

If you want to encode a stream of doubles directly, I'm not sure that
PBs are the right way to go. However, with CodedInputStream and
CodedOutputStream I guess it would be feasible.

Jon

Jon Skeet

Sep 9, 2008, 4:38:33 AM
to Protocol Buffers
On Sep 9, 9:25 am, Jon Skeet <sk...@pobox.com> wrote:
> > As for mixing message types in a stream, I think the ability to have a
> > heterogeneous stream (or a stream embedded in another protocol) will
> > be needed by someone.  So the main question, like with delimited
> > messages, is whether each application invents its own technique.  One
> > could use a serialized DescriptorProto Message to announce the type
> > before each message.  All the fields are optional, so the herald
> > message needs only to set the name field.  Or perhaps serialize a
> > (nearly empty) FileDescriptorProto with the (nearly empty) embedded
> > DescriptorProto would make it more clear which object is being
> > serialized.  But this is getting me off track.
>
> I think it's best to concentrate on the simple requirement first, and
> not guess too much about what would be needed. Use cases for an
> homogenous stream are easy to come up with - the simplest being
> logging, for example.

I've just thought of a nice way to do this: provide a message
descriptor which would describe how the whole data could be observed
as a single message. This is backwardly compatible with my current
streaming API (which is still open to change, btw). You'd then iterate
through and get a sequence of field/value pairs, where each value is
of the correct type for the corresponding field in the "umbrella"
message.

Basically this is a "pull" version of the Observer pattern which has
been mentioned before.

The only benefit of my current API is that it doesn't require the
"umbrella" message to be defined beforehand.

Jon

Chris

Sep 10, 2008, 4:18:51 PM
to Jon Skeet, Protocol Buffers
Hi Jon,

>> The main point of the message is to see if the C# and Haskell code can
>> agree on common APIs and encodings for delimited messages in a
>> continuous stream.
>>
>
> The benefit of the A+B+C format is that it's not really coming up with
> a new encoding, so much as just a new way of reading repeated "field
> 1" messages. The whole data still ends up being a valid message of the
> appropriate type.
>
> But yes, I agree that a common API would be good to decide.
You are right, the key API innovation in your system is the "read/write
one field only" commands.
You fix the field number and wire encoding to be field number 1 and
length delimited (for messages).

Would a good choice for a new API be a generalization of those commands?
Hmmm...you also write:

> I think it's best to concentrate on the simple requirement first, and
> not guess too much about what would be needed. Use cases for an
> homogenous stream are easy to come up with - the simplest being
> logging, for example.

Is writing and reading a field at a time an overly complicated mechanism?

Your fixed field# API only sinks and sources message objects. A more general API would be able to set the field# when writing and return the field# when reading. A slight generalization would work on strings and bytes, since their wire encoding is identical to that of messages. A full generalization would work on all allowed field types.

These are each about 2 or 3 lines of Haskell (plus documentation), so I will probably add them all.

And it seems that I am coming around to your view that the A+B+C is better than B+C encoding.

Cheers,
Chris


Chris

Sep 11, 2008, 4:29:38 AM
to Jon Skeet, Protocol Buffers
I would like to point out this other thread on this same mailing list:

http://groups.google.com/group/protobuf/browse_thread/thread/19ab6bbb364fef35?hl=en#

It is about Alex integrating with Hadoop, where he writes:
> Now, when I see the stream coming in on the deserialization side, I get
> "<binary>my_string<binary>" The leading binary is the same as the
> original,
> however the trailing binary is something new entirely.
Where Kenton replies:
> No, it won't work. Protocol buffers are not self-delimiting. They assume
> that the input you provide is supposed to be one complete message, not a
> message possibly followed by other stuff.
>
> You will need to somehow communicate the size of the message and make sure
> to limit the input to that size.
And so we have a customer for a delimited message API to use in a mixed
protocol binary stream.

I have just posted a message in that thread pointing at this thread.

It looks like (Length + Message) on the wire would work.

I would also like to note that there is another (probably silly) way to
delimit a message: a trailing byte of value 0 to 7. The 0-7, as a wire
tag, decodes to 0 as a field number and 0-7 as the wire encoding. A
field number of 0 is disallowed by the ".proto" specification. Thus the
0-7 cannot be for the next field and could be used as punctuation after
the message by a new API.
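
The arithmetic behind this can be checked with a small sketch. It only decodes single-byte tags, and the names `Tag`, `DecodeTag` and `IsPunctuation` are hypothetical. A reader could apply such a check only where it expects a tag, since bytes inside field values may take any value, including 0 to 7.

```cpp
#include <cassert>
#include <cstdint>

struct Tag {
  std::uint32_t field_number;
  std::uint32_t wire_type;
};

// Splits a one-byte tag into its parts: the upper bits are the field
// number, the lower three bits are the wire type.
Tag DecodeTag(std::uint8_t byte) {
  assert((byte & 0x80) == 0);  // single-byte tags only in this sketch
  return Tag{static_cast<std::uint32_t>(byte >> 3),
             static_cast<std::uint32_t>(byte & 7)};
}

// True if a byte read at a tag position cannot start a real field because
// it decodes to field number 0, which .proto files do not allow.
bool IsPunctuation(std::uint8_t byte_at_tag_position) {
  return byte_at_tag_position < 8;
}
```

For comparison, the tag byte 10 used for field 1 with a length-delimited body decodes to field number 1 and wire type 2, so it is never mistaken for punctuation.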

I still prefer Tag+Length+Message or Length+Message. But there have
been long threads here with those that think precomputing the Length is
expensive and/or want a streaming write capability. These people might
want a punctuation delimited API.

Cheers,
Chris

alexlod...@gmail.com

Sep 11, 2008, 5:28:11 AM
to Protocol Buffers
Just in case it's helpful, here is the Hadoop JIRA that tracks the
progress of integrating Protocol Buffers into Hadoop:

https://issues.apache.org/jira/browse/HADOOP-3788

Alex

On Sep 11, 4:29 pm, Chris <turingt...@gmail.com> wrote:
> I would like to note that at this other thread in this same mailing list:
>
> http://groups.google.com/group/protobuf/browse_thread/thread/19ab6bbb...