reflection? dynamic generation?


Tom 5235

Jul 8, 2008, 12:08:59 PM
to Protocol Buffers
Thanks for making this software available. I have a couple of
questions.

Are there any reflection capabilities? That is, given a stream that
receives protocol buffer messages, is there any way of reconstructing
the .proto that was used to generate the messages on the stream?

I mean, Google seems to have its code base organized well enough that
you know that there are 12183 .proto files, but if protocol buffers
start being used outside Google, people will end up with protocol
buffer files for which the .proto files are simply lost.

Also, the Python usage seems to involve manually generating code in an
external file. I understand why that is the usual use case, but from
a Python API, I'd expect that I can load a .proto file dynamically or
even put the protocol buffer definition inline in the source code.
Did I just miss the functions to do this? Even a simple workaround
(invoking the external tool behind the scenes) would make development
much easier, since a lot of Python development simply does not involve
any kind of build process.

Tom

Nathan Schrenk

Jul 8, 2008, 12:20:29 PM
to Tom 5235, Protocol Buffers
On Tue, Jul 8, 2008 at 9:08 AM, Tom 5235 <tmb...@gmail.com> wrote:
>
> Are there any reflection capabilities? That is, given a stream that
> receives protocol buffer messages, is there any way of reconstructing
> the .proto that was used to generate the messages on the stream?

The serialized protocol buffer data is not self-describing. If you
have a stream of bytes generated by serializing some protocol buffer
type, you have to know what type to expect in order to deserialize it.

> I mean, Google seems to have its code base organized well enough that
> you know that there are 12183 .proto files, but if protocol buffers
> start being used outside Google, people will end up with protocol
> buffer files for which the .proto files are simply lost.

Then those people will be sad because it will be a difficult task to
recover that data unless the .proto files can be reconstructed.

Nathan

Tom 5235

Jul 8, 2008, 12:55:34 PM
to Protocol Buffers

> The serialized protocol buffer data is not self-describing.  If you
> have a stream of bytes generated by serializing some protocol buffer
> type you have to know what type to expect to deserialize it.

Then maybe it should be made self-describing in a future version.

A simple way would be to insert out-of-band text records containing
the compressed .proto file itself (at the beginning of any stream, and
possibly periodically in live streams). The overhead would be tiny
for most streams, it could still be turned off in situations where it
really matters, and the implementation should be nearly trivial.

> Then those people will be sad because it will be a difficult task to
> recover that data unless the .proto files can be reconstructed.

Come on, that's not a good answer. People have many decades of
experience with large scale data management and protocols in
heterogeneous, real-world environments. Protocol buffers is
doubtlessly a good solution within a single organization. But why not
take the small extra step to avoid making people "sad"?

Tom

Hein

Jul 8, 2008, 1:08:24 PM
to Protocol Buffers
On Jul 8, 9:55 am, Tom 5235 <tmb...@gmail.com> wrote:
> Then maybe it should be made self-describing in a future version.
>
> A simple way would be to insert out-of-band text records containing
> the compressed .proto file itself (at the beginning of any stream, and
> possibly periodically in live streams).

That's beyond the scope of protocol buffers -- note that the library
doesn't even include sequences or files of protocol buffers.

-Hein

Kenton Varda

Jul 8, 2008, 2:13:41 PM
to Tom 5235, Protocol Buffers
On Tue, Jul 8, 2008 at 9:08 AM, Tom 5235 <tmb...@gmail.com> wrote:
Are there any reflection capabilities?

Note that "reflection" has a different meaning for protocol buffers -- namely, it refers to the ability to programmatically iterate over and manipulate the fields of a protocol message object, after it has been parsed.
 
 That is, given a stream that
receives protocol buffer messages, is there any way of reconstructing
the .proto that was used to generate the messages on the stream?

You can reverse-engineer the serialized encoding into tag number / value pairs, but there's no way to determine the names of the fields without the .proto file.

I agree that this makes things difficult if you don't know what type you're decoding.  However, including self-description in serialized protocol buffers would make them much larger.
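To illustrate the tag number / value reverse-engineering, here is a rough Python sketch of a wire-format scanner (the function names are mine, and group wire types are skipped for brevity):

```python
def read_varint(data, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = data[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not b & 0x80:
            return result, pos
        shift += 7

def scan_fields(data):
    """Yield (field_number, wire_type, value) tuples from serialized bytes."""
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:        # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:      # 64-bit
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:      # length-delimited (string, bytes, sub-message)
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:      # 32-bit
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError("group wire types not handled in this sketch")
        yield field_number, wire_type, value
```

For example, the bytes 08 96 01 decode to field number 1, wire type 0, value 150 -- but, as noted, nothing here recovers the field *names*.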

One thing we could do is define a standardized "self-describing protocol buffer" format like so:

import "google/protobuf/descriptor.proto";
message SelfDescribingProto {
  repeated FileDescriptorProto proto_file = 1;
  required string type_name = 2;
  required bytes encoded_message = 3;
}

A FileDescriptorProto contains all the information available from a .proto file.
 
Also, the Python usage seems to involve manually generating code in an
external file.   I understand why that is the usual use case, but from
a Python API, I'd expect that I can load a .proto file dynamically or
even put the protocol buffer definition inline in the source code.
Did I just miss the functions to do this?  Even a simple workaround
(invoking the external tool behind the scenes) would make development
much easier, since a lot of Python development simply does not involve
any kind of  build process.

I agree, this would be useful.  We could write an extension module which links against libprotoc and invokes the Python code generator, then exec()s it.
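In the meantime, a cruder workaround is possible: shell out to protoc and import the generated module. A sketch (assuming protoc is on the PATH; the helper names are made up for illustration):

```python
import importlib.util
import subprocess
import tempfile
from pathlib import Path

def protoc_command(proto_path, out_dir):
    """Build the protoc invocation for Python output."""
    proto_path = Path(proto_path)
    return ["protoc",
            "--proto_path=%s" % proto_path.parent,
            "--python_out=%s" % out_dir,
            str(proto_path)]

def load_proto(proto_path):
    """Compile a .proto file behind the scenes and import the generated module."""
    out_dir = tempfile.mkdtemp()
    subprocess.check_call(protoc_command(proto_path, out_dir))
    stem = Path(proto_path).stem
    generated = Path(out_dir) / ("%s_pb2.py" % stem)
    spec = importlib.util.spec_from_file_location("%s_pb2" % stem, generated)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

This only sketches the mechanics; error handling, caching, and import paths for protos that import other protos are all glossed over.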

Tom 5235

Jul 8, 2008, 7:38:13 PM
to Protocol Buffers

> That's beyond the scope of protocol buffers -- note that the library
> doesn't even include sequences or files of protocol buffers.

I suppose it comes down to what Google is trying to accomplish with
open sourcing protocol buffers.

In our lab, we generate a lot of structured data.
Protocol buffers have most of the properties we need:
they're small, simple, and compact, they support the languages we use, etc.
We'd love to use them. Most of the other solutions in this space are
far more complicated and require completely separate approaches to
communications and storage.

But having streams be self-describing is a requirement for us, for two
reasons.

First, our data is valuable, and experience shows that data files and
source code will invariably become separated sooner or later.

Second, a lot of the tools we develop are tools that generically
operate on large data streams (on-line machine learning, log files,
network data, etc.), and for that, we need to be able to open a stream
and automatically analyze the data contained in it.

So, I think Google needs to decide whether archival storage and
support for machine learning and data mining are important future use
cases or not. Maybe that would be a good thing to communicate.

Tom

Tom 5235

Jul 8, 2008, 7:46:04 PM
to Protocol Buffers

> I agree that this makes things difficult if you don't know what type you're
> decoding.  However, including self-description in serialized protocol
> buffers would make them much larger

I don't see how. A protocol buffer definition is a fairly small
ASCII string, negligible compared to the total size of a typical
protocol buffer stream. For storage purposes, it only needs to be
inserted once at the beginning of the file. For self-describing
streams, it can be inserted every now and then into the stream. All
of this would be optional--the programmer could turn it on or off.

But if there is no standard for how to put this data into a stream,
and if the library doesn't make it easy to create self-describing
streams, few people are going to put this data into their streams, and
there won't be any point building tools for it.

See my other message for why it's important: data mining tools,
machine learning tools, and archival storage of scientific data simply
demand the presence of such metadata.

> I agree, this would be useful.  We could write an extension module which
> links against libprotoc and invokes the Python code generator, then exec()s
> it.

That sounds very useful.

Tom

Curt Micol

Jul 8, 2008, 8:14:06 PM
to Tom 5235, Protocol Buffers
On Tue, Jul 8, 2008 at 7:38 PM, Tom 5235 <tmb...@gmail.com> wrote:
> So, I think Google needs to decide whether archival storage and
> support for machine learning and data mining are important future use
> cases or not. Maybe that would be a good thing to communicate.

I am sorry, I do not intend this to be harsh, but wouldn't this be the
decision of your lab and not Google? Communicating all of the
possible use cases would either completely shut down further
development, due to all of the writing that would need to be done, or
would simply bog down the purpose of protobuf, which is to be
simple and usable in any imaginable way. Google already has their use
case; it's up to the users now to find interesting and effective ways
to expand upon that.

You can always fork a project and expand upon protobuf to fit your
needs. Especially since I am sure you or your lab wouldn't be the
only ones to benefit from such use cases.

--
# Curt Micol

Kenton Varda

Jul 8, 2008, 9:16:18 PM
to Tom 5235, Protocol Buffers
On Tue, Jul 8, 2008 at 4:46 PM, Tom 5235 <tmb...@gmail.com> wrote:
I don't see how.   A protocol buffer definition is a fairly small
ASCII string, negligible compared to the total size of a typical
protocol buffer stream.  For storage purposes, it only needs to be
inserted once at the beginning of the file.  For self-describing
streams, it can be inserted every now and then into the stream.  All
of this would be optional--the programmer could turn it on or off.

All this can be provided by a layer on top of protocol buffers, using the SelfDescribingProto format I described before.  I would rather not embed this into libprotobuf itself when it works fine as a library, since we want to avoid bloat.  It is sounding like we should start work on a "protobuf-utils" library that provides additional utilities like this.

In our internal usage, particularly in RPC messages, protocol buffers are usually very small -- hundreds of bytes, perhaps.  So, including self-description with them by default would be a big burden.

 
So, I think Google needs to decide whether archival storage and
support for machine learning and data mining are important future use
cases or not.  Maybe that would be a good thing to communicate.

Note that Google already uses protocol buffers extensively for exactly the purposes you list.

oob...@gmail.com

Jul 8, 2008, 10:11:16 PM
to Protocol Buffers
This requirement for self-describing binary data is exactly what Argot
does (www.einet.com.au). Argot has a message format with three parts:

Part 1: Meta Dictionary (self referencing binary description of data
dictionary elements)
Part 2: Data Dictionary (uses meta dictionary elements to describe the
data being stored)
Part 3: Data (uses only elements described in the data dictionary).

This format is actually used to store the format of the data
dictionaries themselves. I'm just about to update it with a new
release with some better documentation and improved meta dictionary.
I'd appreciate it if you had a look and let me know if it solves your
particular problem. I'm looking for ways to improve it, so if you
have any suggestions let me know.

Regards,
David.

Tom 5235

Jul 8, 2008, 10:52:21 PM
to Protocol Buffers
> All this can be provided by a layer on top of protocol buffers, using the
> SelfDescribingProto format I described before.  I would rather not embed
> this into libprotobuf itself when it works fine as a library, since we want
> to avoid bloat.

I think the metadata should be embedded by default. Without making it
the default, most people are not going to bother using it, or, worse,
they are going to have irrational fears that using metadata is going
to be expensive. Unless lots of people actually embed the metadata,
writing tools to deal with it is pointless.

I also don't follow the bloat argument. All you need to do is define
one standard message, a string, that contains the .proto source text.

> In our internal usage, particularly in RPC messages, protocol buffers are
> usually very small -- hundreds of bytes, perhaps.  So, including
> self-description with them by default would be a big burden.

As I was saying: you do NOT need to embed the metadata with each
message. Since the .proto file describes the entire stream format,
all you ever need to do is embed the source text for the protocol
buffer definition once, as a string, at the beginning (if you want to
be able to "cut into" a stream, you might also embed it occasionally
in a stream with a sync token).

> Note that Google already uses protocol buffers extensively for exactly the
> purposes you list.

Yes and no. Google probably is doing machine learning and data mining
on protocol buffer streams, but given the design of protocol buffers
right now, you can't write general purpose machine learning tools that
treat the protocol buffer variables themselves as "columns".

As for archiving, of course, you can store data without embedded
metadata, but decades of experience show that people will likely not
be able to use the data later.

I think it would be a shame if you didn't include metadata; the design
of protocol buffers makes doing so trivial, and it would make an
enormous difference in the range of applications that protocol buffers
can be used for.

Tom

schu...@gmail.com

Jul 8, 2008, 11:00:31 PM
to Protocol Buffers
I think there is a distinction between a "stream" of Protocol Buffers
and a Protocol Buffer itself. As far as I can tell, the library does
not provide any specific stream mechanism designed for storing large
numbers of protocol buffers; it leaves that up to the user.
I agree that if I were designing a file format for storing many
protocol buffers I might include the proto definition, but I might
also include checksums of some sort, which one could argue are just
as important. I think it would be trivial to design a ProtoStream
format of some sort that allows both, and it would be something
useful that could go in the protobuf-utils library Kenton has been
talking about.
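For example, a per-record frame with a length prefix and a CRC32 might look like this in Python (names illustrative, not any standard format):

```python
import struct
import zlib

def frame_record(payload):
    """Length-prefix a serialized message and append a CRC32 of the payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + payload + struct.pack(">I", crc)

def unframe_record(data, pos=0):
    """Return (payload, new_pos); raise if the checksum does not match."""
    length, = struct.unpack_from(">I", data, pos)
    payload = data[pos + 4:pos + 4 + length]
    stored, = struct.unpack_from(">I", data, pos + 4 + length)
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("corrupt record")
    return payload, pos + 8 + length
```

Eight bytes of overhead per record buys both resynchronization and corruption detection.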

thanks... mike

Tom 5235

Jul 8, 2008, 11:17:14 PM
to Protocol Buffers
On Jul 9, 2:14 am, "Curt Micol" <asen...@gmail.com> wrote:
> You can always fork a project and expand upon protobuf to fit your
> needs. Especially since I am sure you or your lab wouldn't be the
> only ones to benefit from such use cases.

We already have a high-performance stream format with complete
metadata. It's a tiny fraction of the code size of protocol buffers
(protocol buffers is 83+kloc!), has more functionality, and has the
same performance.

The attraction of protocol buffers is not technical, it's that the
effort of maintaining and documenting it is shared with others;
forking it would make little sense.

> Google already has their use
> case, it's up to the users now to find interesting and effective ways
> to expand upon that.

Right, and all I'm suggesting is that a trivial change might make the
library a lot more useful to a lot more people, even if that change
doesn't matter much to Google internally. Isn't that kind of feedback
why people open source things?

> I am sorry, I do not intend this to be harsh, but wouldn't this be the
> decision of your lab and not Google?

Well, of course the decision is ours, and it's a pretty
straightforward one: we need the kind of metadata that I mentioned.

Tom

Kenton Varda

Jul 8, 2008, 11:36:29 PM
to Tom 5235, Protocol Buffers
On Tue, Jul 8, 2008 at 7:52 PM, Tom 5235 <tmb...@gmail.com> wrote:
As I was saying: you do NOT need to embed the metadata with each
message.  Since the .proto file describes the entire stream format,
all you ever need to do is embed the source text for the protocol
buffer definition once, as a string, at the beginning (if you want to
be able to "cut into" a stream, you might also embed it occasionally
in a stream with a sync token).

As someone else pointed out, protocol buffers currently do not define any sort of "stream of messages" abstraction.  Therefore, there is no place to put this metadata unless we add such an abstraction to the library, and that would add a lot of bloat.

The protobuf library already has *extensive* support for representing and using the kind of metadata you want -- see the FileDescriptorProto class, the FileDescriptor class that can be constructed from it, and the DynamicMessageFactory class.  You can parse, serialize, and manipulate messages entirely based on runtime metadata, without knowing the types at compile time.  We simply leave it up to your application to decide where to actually put this metadata because, again, protocol buffers do not mandate any particular format for storing or transmitting more than one message, and attaching the data to every message would be too inefficient.
 
Yes and no.  Google probably is doing machine learning and data mining
on protocol buffer streams, but given the design of protocol buffers
right now, you can't write general purpose machine learning tools that
treat the protocol buffer variables themselves as "columns".

I think you need to look closer at the reflection interface and DynamicMessage.  It is very much possible to manipulate protocol messages without knowing their types at compile time.

Tom 5235

Jul 9, 2008, 12:33:08 AM
to Protocol Buffers
> I think you need to look closer at the reflection interface and
> DynamicMessage.  It is very much possible to manipulate protocol messages
> without knowing their types at compile time.

As far as I can tell, all DynamicMessage lets me do is manipulate
messages whose type isn't available in my compilation unit, but the
message type itself still needs to be compiled and linked into my
program somewhere.

But, in any case, even if there is a more dynamic reflection
capability...

> We simply leave it up to your application to decide where to actually put this metadata

This basically means that I cannot build a tool that looks at other
people's protocol buffer message stream and finds the metadata.

In any case, thanks for the extensive feedback and information; it
looks like ensuring that metadata gets added to streams would be
difficult at this point because the design was created mainly from the
point of view of only encoding single messages.

Tom

Kenton Varda

Jul 9, 2008, 1:05:48 AM
to Tom 5235, Protocol Buffers
On Tue, Jul 8, 2008 at 9:33 PM, Tom 5235 <tmb...@gmail.com> wrote:
As far as I can tell, all DynamicMessage lets me do is manipulate
messages whose type isn't available in my compilation unit, but the
message type itself still needs to be compiled and linked into my
program somewhere.

That's not correct.  You can create a DynamicMessage representing any type for which you have a Descriptor, and you can construct Descriptors dynamically at runtime through a number of means, such as parsing .proto files on-the-fly or constructing DescriptorProto messages manually.
 
> We simply leave it up to your application to decide where to actually put this metadata

This basically means that I cannot build a tool that looks at other
people's protocol buffer message stream and finds the metadata.

You can if the person who owns the data provides you with the .proto file.  You are correct that reverse-engineering protocols without the author's cooperation is non-trivial, though.  Sorry that this doesn't fit your needs.

Jim Bruce

Jul 9, 2008, 1:26:13 AM
to Protocol Buffers
On Jul 8, 9:33 pm, Tom 5235 <tmb...@gmail.com> wrote:
> > We simply leave it up to your application to decide where to actually put this metadata
>
> This basically means that I cannot build a tool that looks at other
> people's protocol buffer message stream and finds the metadata.

I think the recommended approach would be: Design and implement a
streaming format, get it upstream into protobuf-utils, and if it's the
first or best such format, it'll get used widely. Right now it's
beyond the scope of the core library, but if you build that util/
extension, I'm sure people would use it if they have similar needs.
- Jim

roman...@gmail.com

Jul 11, 2008, 10:58:57 AM
to Protocol Buffers, kenton...@google.com
Hi Kenton,

How exactly does one construct a message via DescriptorProto?

Suppose I am trying to build a generic receiver that should handle 5
msg types (Foo1-Foo5) but I get the data in serialized form (a char
array). I would like to be able to determine the type dynamically and
instantiate the right Message object. Then I want to use reflection
to iterate through the fields. Is this possible?

Thank you,

Roman

Kenton Varda

Jul 11, 2008, 4:18:00 PM
to roman...@gmail.com, Protocol Buffers
On Fri, Jul 11, 2008 at 7:58 AM, <roman...@gmail.com> wrote:
How exactly does one construct a message via DescriptorProto?

Suppose I am trying to build a generic receiver that should handle 5
msg types (Foo1-Foo5) but I get the data in serialized form (a char
array).  I would like to be able to determine the type dynamically and
instantiate the right Message object.  Then I want to use reflection
to iterate through the fields.   Is this possible?

Yes, that's possible.  You can use the DynamicMessage class to emulate an arbitrary message type, accessible via reflection.

You would need to communicate information about the message type along with the message itself.  You could use a set of FileDescriptorProtos for this (representing the set of .proto files that define the message type).  So, your complete message would be:

message SelfDescribingMessage {
  repeated FileDescriptorProto proto_files = 1;
  required string type_name = 2;  // names a type defined in proto_files
  required bytes message = 3;  // serialized message
}

When you receive this, you would create a DescriptorPool.

Load all of the proto_files into the DescriptorPool using the BuildFile() method, then use FindMessageTypeByName() to look up type_name.  That gives you a Descriptor.  You can then create a DynamicMessageFactory, pass the Descriptor to it, and it will give you a Message prototype.  Do prototype->New() to get a new object, then you can use that to parse the message.
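In the Python API, the same flow looks roughly like this (a sketch, not authoritative; the names "dynamic_demo.proto" and "demo.Sample" are made up, and the factory entry point has varied across protobuf releases, hence the fallback):

```python
from google.protobuf import descriptor_pb2, descriptor_pool, message_factory

# Build a FileDescriptorProto by hand; in practice it would arrive with the data.
fdp = descriptor_pb2.FileDescriptorProto()
fdp.name = "dynamic_demo.proto"
fdp.package = "demo"
msg_type = fdp.message_type.add()
msg_type.name = "Sample"
field = msg_type.field.add()
field.name = "x"
field.number = 1
field.type = descriptor_pb2.FieldDescriptorProto.TYPE_INT32
field.label = descriptor_pb2.FieldDescriptorProto.LABEL_OPTIONAL

pool = descriptor_pool.DescriptorPool()
pool.Add(fdp)                                    # analogous to C++ BuildFile()
descriptor = pool.FindMessageTypeByName("demo.Sample")

try:
    # Newer protobuf releases expose the factory as a free function.
    SampleClass = message_factory.GetMessageClass(descriptor)
except AttributeError:
    # Older releases: go through a MessageFactory instance instead.
    SampleClass = message_factory.MessageFactory(pool).GetPrototype(descriptor)

msg = SampleClass(x=150)
data = msg.SerializeToString()
round_tripped = SampleClass.FromString(data)
```

The resulting class behaves like any generated message class, even though no type was compiled in.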

Hmm, this is kind of complicated.  We should have a utility class that encapsulates all this.

Eric....@gmail.com

Aug 14, 2008, 3:51:06 AM
to Protocol Buffers
> Yes, that's possible.  You can use the DynamicMessage class to emulate an
> arbitrary message type, accessible via reflection:
>
> http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google...
>
> You would need to communicate information about the message type along with
> the message itself.  You could use a set of FileDescriptorProtos for this
> (representing the set of .proto files that define the message type).  So,
> your complete message would be:
>
> message SelfDescribingMessage {
>   repeated FileDescriptorProto proto_files = 1;
>   required string type_name = 2;  // names a type defined in proto_files
>   required bytes message = 3;  // serialized message
>
> }
>
> When you receive this, you would create a DescriptorPool:
>
> http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google...
>
> Load all of the proto_files into the DescriptorPool using the BuildFile()
> method, then use FindMessageTypeByName() to look up type_name.  That gives
> you a Descriptor.  You can then create a DynamicMessageFactory, pass the
> Descriptor to it, and it will give you a Message prototype.  Do
> prototype->New() to get a new object, then you can use that to parse the
> message.
>
> Hmm, this is kind of complicated.  We should have a utility class that
> encapsulates all this.

Will this make it into the next version?

Kenton Varda

Aug 14, 2008, 4:46:30 PM
to Eric....@gmail.com, Protocol Buffers
Sorry, I don't think so.  There is really no demand at all for self-describing protocol buffers in Google, so it's unlikely that anyone here is going to find time to write the code.

Kenton Varda

Aug 14, 2008, 4:46:58 PM
to Eric....@gmail.com, Protocol Buffers
(Of course, if someone outside of Google wants to write it, that'd be great! :) )