Deserializing Messages of unknown type at compile-time

alexlod...@gmail.com

Sep 8, 2008, 7:33:11 AM
to Protocol Buffers
I have a scenario where I'm trying to create a Serializer and
Deserializer class that can handle any general Message, given a stream
(InputStream or OutputStream) and an instance of a particular Message
implementation.

How can I use this information to serialize and deserialize? I will
break things down slightly more:

Serializing:
This seems easy. Given a stream and an instance, just call
Message#writeTo(output) on the instance.

Deserializing:
More tricky. Given a stream and an instance, I'm trying to get the
Descriptor by calling Message#getDescriptorForType() on the instance
and passing the return value, along with an input stream, to
DynamicMessage#parseFrom(Descriptor,input). I then cast the
DynamicMessage that is returned by parseFrom to the same type of the
instance that is given to me.

The problem that I'm encountering is during deserialization. I'm
getting an InvalidProtocolBufferException. Here's the trace:

com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
        at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:52)
        at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:67)
        at com.google.protobuf.FieldSet.mergeFrom(FieldSet.java:397)
        at com.google.protobuf.DynamicMessage$Builder.mergeFrom(DynamicMessage.java:289)
        at com.google.protobuf.DynamicMessage$Builder.mergeFrom(DynamicMessage.java:213)
        at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:240)
        at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:329)
        at com.google.protobuf.DynamicMessage.parseFrom(DynamicMessage.java:102)

What's curious about this is that my .proto files each only have one
field in each Message, and each of those fields has a tag of 1. None
of my tags are 0.

I have a feeling that I'm probably misusing the API for
deserialization, or perhaps I may have mis-defined my .proto files.
Here's an example of a .proto file that I'm using:

message LongMessage {
  required int64 value = 1;
}

Any and all help is greatly appreciated. Thanks ahead of time for
your help :).

Alex

Kenton Varda

Sep 8, 2008, 1:16:21 PM
to alexlod...@gmail.com, Protocol Buffers
On Mon, Sep 8, 2008 at 4:33 AM, <alexlod...@gmail.com> wrote:
More tricky.  Given a stream and an instance, I'm trying to get the
Descriptor by calling Message#getDescriptorForType() on the instance
and passing the return value, along with an input stream, to
DynamicMessage#parseFrom(Descriptor,input).  I then cast the
DynamicMessage that is returned by parseFrom to the same type of the
instance that is given to me.

That won't work.  DynamicMessage is a different class; it does not know how to instantiate the protocol-compiler-generated version of the class.  Instead, you should do:

  Message result =
    messageInstance.newBuilderForType().mergeFrom(input).build();

Actually, you should check isInitialized() before calling build(), or use buildPartial() instead, but that's a separate issue.
 
The problem that I'm encountering is during deserialization.  I'm
getting an InvalidProtocolBufferException.  Here's the trace:

...


What's curious about this is that my .proto files each only have one
field in each Message, and each of those fields has a tag of 1.  None
of my tags are 0.

The protocol compiler would not allow you to use tag zero anyway.  It looks like your input data is not identical to the data written by the sender.
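To illustrate why stray or missing bytes tend to surface as these particular errors: on the wire, every field starts with a tag that packs the field number and a wire type into a single varint, so a zero byte where a tag is expected decodes to field number 0. The following is only a sketch in plain Java; `WireTag` and its methods are hypothetical helpers written for this thread, not part of the protobuf library.

```java
// Hypothetical helper for inspecting a protobuf tag value; not a protobuf API.
final class WireTag {
    // The field number lives in the bits above the low three.
    static int fieldNumber(int tag) {
        return tag >>> 3;
    }

    // The wire type lives in the low three bits.
    static int wireType(int tag) {
        return tag & 0x7;
    }

    // Field number 0 is never legal, and only wire types 0 through 5 are defined.
    static boolean isValid(int tag) {
        return fieldNumber(tag) != 0 && wireType(tag) <= 5;
    }

    static String describe(int tag) {
        return "field " + fieldNumber(tag) + ", wire type " + wireType(tag)
                + (isValid(tag) ? "" : " (invalid)");
    }
}
```

For the LongMessage above, the first byte of a serialized message should be 0x08 (field 1, wire type 0). Seeing anything else at the start of the received stream would suggest that the bytes were altered or shifted somewhere between sender and receiver.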

Alex Loddengaard

Sep 8, 2008, 9:18:54 PM
to Kenton Varda, Protocol Buffers
On Tue, Sep 9, 2008 at 1:16 AM, Kenton Varda <ken...@google.com> wrote:
That won't work.  DynamicMessage is a different class; it does not know how to instantiate the protocol-compiler-generated version of the class.  Instead, you should do:

  Message result =
    messageInstance.newBuilderForType().mergeFrom(input).build();

Actually, you should check isInitialized() before calling build(), or use buildPartial() instead, but that's a separate issue.

I changed my deserializing code to use the above, but I'm getting the same exception.  I also tried to call isInitialized() on the instance given to me, and the instance is not initialized.  That is, isInitialized() returned false.  I'm plugging in to a large framework that I'm not entirely familiar with (Hadoop), so I can only speculate what's going on here.  I think that the Message instance given to me was created with reflection and is not a valid Message.  I'm making this claim because isInitialized() is returning false.

Is there any other way to deserialize?  Can you provide any other good approaches to debugging this?  In the meantime, I'm going to take my example out of the large framework in hopes of better understanding the problem I'm having.
 
The protocol compiler would not allow you to use tag zero anyway.  It looks like your input data is not identical to the data written by the sender.

I'm confident that the sender data is the same data that is created when I serialize.  Perhaps I'm serializing incorrectly?  I'm creating a CodedOutputStream given an OutputStream and passing that to writeTo.  However, I'm not using a CodedInputStream to deserialize.  Should I be using Coded or non-Coded streams?

I stopped using CodedOutputStream when serializing and got the following exception:

com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
        at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:62)
        at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:410)
        at com.google.protobuf.FieldSet.mergeFieldFrom(FieldSet.java:454)
        at com.google.protobuf.FieldSet.mergeFrom(FieldSet.java:402)
        at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:248)
        at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:240)
        at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:329)
        at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:184)

Thanks for your help, Kenton!  I've got a good feeling that I'm almost there :).

Alex

Kenton Varda

Sep 8, 2008, 9:27:33 PM
to Alex Loddengaard, Protocol Buffers
On Mon, Sep 8, 2008 at 6:18 PM, Alex Loddengaard <alexlod...@gmail.com> wrote:
On Tue, Sep 9, 2008 at 1:16 AM, Kenton Varda <ken...@google.com> wrote:
That won't work.  DynamicMessage is a different class; it does not know how to instantiate the protocol-compiler-generated version of the class.  Instead, you should do:

  Message result =
    messageInstance.newBuilderForType().mergeFrom(input).build();

Actually, you should check isInitialized() before calling build(), or use buildPartial() instead, but that's a separate issue.

I changed my deserializing code to use the above, but I'm getting the same exception.  I also tried to call isInitialized() on the instance given to me, and the instance is not initialized.

That message instance is probably a default instance.  isInitialized() will always be false on those, unless it has no required fields at all.
 
 
The protocol compiler would not allow you to use tag zero anyway.  It looks like your input data is not identical to the data written by the sender.

I'm confident that the sender data is the same data that is created when I serialize.

The exceptions that you're reporting strongly suggest that you are *not* seeing the same data on both ends.  Try this:  serialize to a byte array or ByteString, compute a checksum of some sort for debugging, then write the bytes to your output.  On the other end, read the bytes back into a ByteString or byte array, checksum again, and see if it's the same.  Then parse from that.  I'm pretty confident that if the checksums are the same, you will not see the error you're seeing.
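A minimal sketch of that debugging step in plain Java, assuming CRC32 as the checksum (any checksum would do). `ChecksumDebug` and its method names are made up for illustration; on the sending side the byte array would come from something like `message.toByteArray()`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical debugging helper; not part of protobuf or Hadoop.
final class ChecksumDebug {
    // CRC32 of the whole array, returned as an unsigned 32-bit value in a long.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Drains the stream to EOF so the exact received bytes can be inspected.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // One line to log on each side: byte count plus checksum in hex.
    static String summary(byte[] data) {
        return "len=" + data.length + " crc32=" + Long.toHexString(crc32(data));
    }
}
```

Log `summary(bytes)` just before writing and again right after `readAll(input)` on the receiving side. If the lengths already differ, something in between is adding or dropping bytes, and parsing the received array will fail no matter how the parse is written.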
 
  Perhaps I'm serializing incorrectly?  I'm creating a CodedOutputStream given an OutputStream and passing that to writeTo.

This is redundant -- you can just pass the OutputStream to writeTo().
 
  However, I'm not using a CodedInputStream to deserialize.  Should I be using Coded or non-Coded streams?

It doesn't matter.  If given a normal stream, mergeFrom() / writeTo() will wrap it in a coded stream on their own.

Kenton Varda

Sep 8, 2008, 9:28:48 PM
to Alex Loddengaard, Protocol Buffers
On Mon, Sep 8, 2008 at 6:27 PM, Kenton Varda <ken...@google.com> wrote:


On Mon, Sep 8, 2008 at 6:18 PM, Alex Loddengaard <alexlod...@gmail.com> wrote:
On Tue, Sep 9, 2008 at 1:16 AM, Kenton Varda <ken...@google.com> wrote:
That won't work.  DynamicMessage is a different class; it does not know how to instantiate the protocol-compiler-generated version of the class.  Instead, you should do:

  Message result =
    messageInstance.newBuilderForType().mergeFrom(input).build();

Actually, you should check isInitialized() before calling build(), or use buildPartial() instead, but that's a separate issue.

I changed my deserializing code to use the above, but I'm getting the same exception.  I also tried to call isInitialized() on the instance given to me, and the instance is not initialized.

That message instance is probably a default instance.  isInitialized() will always be false on those, unless it has no required fields at all.

To clarify:  In my original message I was saying that you should call isInitialized() on the builder returned by mergeFrom(), to make sure the parsed message is complete, before you call build().

Alex Loddengaard

Sep 8, 2008, 9:42:26 PM
to Kenton Varda, Protocol Buffers
On Tue, Sep 9, 2008 at 9:28 AM, Kenton Varda <ken...@google.com> wrote:
To clarify:  In my original message I was saying that you should call isInitialized() on the builder returned by mergeFrom(), to make sure the parsed message is complete, before you call build().

Ah.  Now isInitialized() is returning true, though I'm still having problems deserializing.  Now that I'm using OutputStream and InputStream, I'm getting the following exception:


com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.

I'm going to take my code out of Hadoop to see if Hadoop is causing these issues.  I'm still wary of that, though, because other serialization frameworks such as Facebook's Thrift seem to work in the framework that I am using.

Thanks for your help, Kenton!  I'll check back soon with my progress.

Alex Loddengaard

Sep 8, 2008, 10:47:43 PM
to Kenton Varda, Protocol Buffers
After taking my code out of Hadoop, it looks as though my deserializing mechanism is working fine.  My problem lies with my integration with Hadoop.

Thanks for resolving this issue, Kenton!

Alex

Alex Loddengaard

Sep 9, 2008, 12:11:21 AM
to Kenton Varda, Protocol Buffers
I have a follow-up question:

Will using messageInstance.newBuilderForType().mergeFrom(input).build(); work for a stream that contains trailing binary information?

I'm asking this question for the following reason: I'm using a very simple example where my Message just contains a single String.  When I print the serialized message with a value of "my_string", I get "<binary>my_string".  Now, when I see the stream coming in on the deserialization side, I get "<binary>my_string<binary>".  The leading binary is the same as the original; the trailing binary, however, is something new entirely.  The trailing binary is probably being created by Hadoop.

Kenton, you have made it very clear that messageInstance.newBuilderForType().mergeFrom(input).build(); is the correct approach.  What could possibly be going wrong if the stream I'm trying to deserialize from contains trailing binary data?

Thanks ahead of time for your help.

Alex

Alex Loddengaard

Sep 9, 2008, 1:52:29 AM
to Kenton Varda, Protocol Buffers
One more follow-up (sorry for all these follow-ups):

I should revise my problem slightly.  I had said that I am given an instance of a Message class when deserializing.  This is true, though sometimes that instance is null.  In the cases when it's null, I'm not able to call newBuilderForType() on it.  I'm not able to call getDefaultInstance(), either.  This is now problematic, though there may be a work around.  Also given to me is a Class instance of the Message.  I'm using Reflection to instantiate a new Message instance, then getDefaultInstance() to get the default instance, and then I'm calling newBuilderForType().  Is this problematic?

Thanks again.  Sorry for all the spam!

Alex

Kenton Varda

Sep 9, 2008, 12:56:16 PM
to Alex Loddengaard, Protocol Buffers
On Mon, Sep 8, 2008 at 9:11 PM, Alex Loddengaard <alexlod...@gmail.com> wrote:
I have a follow-up question:

Will using messageInstance.newBuilderForType().mergeFrom(input).build(); work for a stream that contains trailing binary information?

No, it won't work.  Protocol buffers are not self-delimiting.  They assume that the input you provide is supposed to be one complete message, not a message possibly followed by other stuff.

You will need to somehow communicate the size of the message and make sure to limit the input to that size.
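One common way to do that, sketched here in plain Java, is to write a fixed 4-byte length in front of the serialized bytes and read back exactly that many on the other side. This is just one possible framing, not something protobuf does for you; `LengthPrefixed`, `writeFrame` and `readFrame` are hypothetical names, and the payload is assumed to be the output of `message.toByteArray()`.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical framing helper: 4-byte big-endian length, then the payload.
final class LengthPrefixed {
    static void writeFrame(OutputStream out, byte[] message) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(message.length); // size first, so the reader knows where to stop
        data.write(message);
        data.flush();
    }

    static byte[] readFrame(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();
        if (length < 0) {
            throw new IOException("negative frame length: " + length);
        }
        byte[] message = new byte[length];
        data.readFully(message); // reads exactly 'length' bytes, or throws EOFException
        return message;
    }
}
```

The array returned by `readFrame` contains only the message, so it can be handed to `messageInstance.newBuilderForType().mergeFrom(bytes)` without the parser ever seeing whatever follows it in the stream.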

Kenton Varda

Sep 9, 2008, 1:00:28 PM
to Alex Loddengaard, Protocol Buffers
On Mon, Sep 8, 2008 at 10:52 PM, Alex Loddengaard <alexlod...@gmail.com> wrote:
I should revise my problem slightly.  I had said that I am given an instance of a Message class when deserializing.  This is true, though sometimes that instance is null.  In the cases when it's null, I'm not able to call newBuilderForType() on it.  I'm not able to call getDefaultInstance(), either.  This is now problematic, though there may be a work around.  Also given to me is a Class instance of the Message.  I'm using Reflection to instantiate a new Message instance, then getDefaultInstance() to get the default instance, and then I'm calling newBuilderForType().  Is this problematic?

Hmm, I think the framework you are using is poorly designed -- it should always give you a non-null default instance.  Using Java reflection is ugly.

getDefaultInstance() is actually a static method.  So, you don't have to instantiate a new message instance first -- just call the static method without an instance.  You can't actually instantiate the message class directly anyway, since the constructors are private.
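If all the framework hands over is a Class object, the static method still has to be looked up reflectively, but no instance is needed. A rough sketch in plain Java follows; `FakeMessage` is a stand-in with the same shape as a generated class (private constructor, public static `getDefaultInstance()`), and `DefaultInstances` is a hypothetical helper, neither being part of protobuf.

```java
import java.lang.reflect.Method;

// Stand-in for a generated message class: private constructor, static accessor.
final class FakeMessage {
    private static final FakeMessage DEFAULT = new FakeMessage();

    private FakeMessage() {}

    public static FakeMessage getDefaultInstance() {
        return DEFAULT;
    }
}

// Hypothetical helper that calls the static accessor given only the Class.
final class DefaultInstances {
    static <T> T defaultInstanceOf(Class<T> type) {
        try {
            Method accessor = type.getMethod("getDefaultInstance");
            // Passing null as the receiver is how a static method is invoked reflectively.
            return type.cast(accessor.invoke(null));
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    type.getName() + " has no usable static getDefaultInstance()", e);
        }
    }
}
```

With a real generated class, the returned default instance could then be used for `newBuilderForType()` as before.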

Alex Loddengaard

Sep 10, 2008, 12:05:00 AM
to Kenton Varda, Protocol Buffers
Thanks for your feedback, Kenton!  You've answered all of my questions.

Alex

Chris

Sep 11, 2008, 4:18:52 AM
to Kenton Varda, Alex Loddengaard, Protocol Buffers
Hi Alex,

Kenton Varda wrote:
> On Mon, Sep 8, 2008 at 9:11 PM, Alex Loddengaard
> <alexlod...@gmail.com <mailto:alexlod...@gmail.com>> wrote:
>
> I have a follow-up question:
>
> Will using
> /messageInstance.newBuilderForType().mergeFrom(input).build();/
> work for a stream that contains trailing binary information?
>
>
> No, it won't work. Protocol buffers are not self-delimiting. They
> assume that the input you provide is supposed to be one complete
> message, not a message possibly followed by other stuff.
>
> You will need to somehow communicate the size of the message and make
> sure to limit the input to that size.

Aha. This <binary>message<binary> case is one of the heretofore
hypothetical use cases I am discussing in the adjacent thread on this
mailing list / group. The thread is online at

http://groups.google.com/group/protobuf/browse_thread/thread/b0ce2c7d8b05896e?hl=en
and was spawned from
http://groups.google.com/group/protobuf/browse_thread/thread/b0ce2c7d8b05896e?hl=en#

This is mainly myself, Jon, and Kenton slowly forming a consensus on the
right API for delimited messages. I had proposed simply adding the
length (varint) before the message, and Kenton demonstrated C++ code for
this. Jon proposed adding a field number / wiretype tag before the
length and message, which makes it look much more like a protocol-buffer
field on the wire.

What do you need, Alex?

--
Chris

Alex Loddengaard

Sep 11, 2008, 4:40:35 AM
to Chris, Kenton Varda, Protocol Buffers
Hi Chris,

Once I learned that Messages are not self-delimiting (thanks, Kenton!), I started working with Hadoop's source to stop the trailing bits from being included in the InputStream.  I've since fixed this issue, kind of at least ;).

Perhaps a good general solution is to allow a user to put an option in a .proto file or a Message declaration that makes Messages self-delimiting.  That way users who want speed don't need to use it, and users who want convenience can use it.  The implementation of this would probably be tricky, I'm sure.

Thanks for the follow up, Chris.  For now I'm good to go!  Let me know if I can provide any other feedback.

Alex

vi...@unleashnetworks.com

Sep 11, 2008, 11:20:53 PM
to Protocol Buffers
Kenton,

> No, it won't work. Protocol buffers are not self-delimiting. They assume
> that the input you provide is supposed to be one complete message, not a
> message possibly followed by other stuff.


There are a couple of related threads about delimiting the outer
message (with either a marker or a length). The need for this seems to
arise from streaming (especially when input would block such as on a
network socket).

Could this not be solved by a simple convention in the proto file ?
(Maybe I am missing something big here)

Let us say we have a proto as follows

message TRPProtocol
{
  message TRPPDU
  {
    // Every field needs its own tag number, unique within the message.
    required int32 version = 1;
    required int32 type = 2;

    optional HelloRequest hello_req = 3;
    optional HelloResponse hello_resp = 4;
    optional ConnectRequest connect_req = 5;
    // etc etc
  }
  required TRPPDU thepdu = 1;
}

On the wire the outer message is not length delimited, but the inner
message is. The inner message is represented by the 'required' field
'thepdu'.

It would then be possible to stream instances of the inner message
"TRPPDU". I hope my understanding is correct. Could you write
something like the following ?

TRPProtocol::TRPPDU Pdu;
// socket_fd has been opened and initialized earlier
Pdu.ParseFromFileDescriptor(socket_fd);

to read just one message, respond to that if needed, and then read the
next one.

Is my understanding correct ? Is this how it is done at Google when
using PB for client - server comms ?

Thanks,

Vivek

Kenton Varda

Sep 12, 2008, 1:25:58 PM
to vi...@unleashnetworks.com, Protocol Buffers
Even if the message contains only one, non-repeated field, ParseFrom*() will keep reading until EOF or an error.

At Google, we have lots of various container formats, for streaming, record-based files, database tables, etc., where each record is a protocol buffer.  All of these formats store the size of the message before the message itself.  Our philosophy is that because we have protocol buffers, all of these *other* formats and protocols can be designed to pass around arbitrary byte blobs, which greatly simplifies them.  An arbitrary byte blob is not necessarily self-delimiting, so it's up to these container formats to keep track of the size separately.
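As a rough illustration of the "size before the message" idea for a stream of many records, here is a sketch in plain Java that prefixes each opaque byte blob with its length as a base-128 varint. It is not a description of any actual Google container format; `RecordStream` and its methods are hypothetical, and each record is assumed to be an already serialized message.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical container: each record is a varint length followed by that many bytes.
final class RecordStream {
    static void writeRecord(OutputStream out, byte[] record) throws IOException {
        int value = record.length;
        // Base-128 varint: seven bits per byte, high bit set while more bytes follow.
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        out.write(record);
    }

    // Reads records until a clean end of stream at a record boundary.
    static List<byte[]> readRecords(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> records = new ArrayList<>();
        int first;
        while ((first = data.read()) != -1) {
            int length = readVarint(first, data);
            if (length < 0) {
                throw new IOException("record length out of range");
            }
            byte[] record = new byte[length];
            data.readFully(record); // exactly 'length' bytes belong to this record
            records.add(record);
        }
        return records;
    }

    // Decodes a varint whose first byte has already been read.
    private static int readVarint(int firstByte, InputStream in) throws IOException {
        int result = firstByte & 0x7F;
        int current = firstByte;
        for (int shift = 7; (current & 0x80) != 0; shift += 7) {
            if (shift >= 32) {
                throw new IOException("varint too long");
            }
            current = in.read();
            if (current == -1) {
                throw new EOFException("truncated varint");
            }
            result |= (current & 0x7F) << shift;
        }
        return result;
    }
}
```

Because the container tracks the sizes, the parser for each record only ever sees one complete message, which is the division of labor described above.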