Protobuf's Missing Features

code_monkey_steve

Nov 7, 2008, 8:14:01 PM
to Protocol Buffers
After playing with protobuf for the last few months, I've decided that
it's not quite suitable for my purposes, due to some design decisions
(which I'm sure seemed like a good idea at the time). As much as
I hate reinventing the wheel, I've decided to create my own message
encoding framework implementing the features below ("And blackjack!
And hookers! Ah, who needs the framework"), while maintaining
wire-level compatibility with protobuf.

1. XML vs. Yet-Another-Proprietary-File-Format
The arguments against using XML at the wire-level are well documented,
but why, oh why, couldn't you have made the message definition format
(.proto) XML-based? Now every language has to code and debug (!)
their own parser, and there's no way to add meta-data to the message
definitions. What's wrong with just publishing a DTD/XSchema and
using an off-the-shelf XML parser?

This is my single biggest complaint, and the one reason protobuf is
unsuitable for my project: the message definitions need to include
enough information to dynamically generate the user interface for both
displaying and composing messages.

2. Message Inheritance (vs. Extensions?)
Are there any languages left that don't support single-inheritance,
even C? Reserving a zeroth message field for a base message class
adds almost no overhead, and allows for a nice message class
hierarchy.

Perhaps I just don't grok Extensions, but they seem more like a safety
feature than a re-usability mechanism.

3. Typedefs
E.g., "UUID=string", "Timestamp=double", etc. Syntactic sugar is
always good.

4. Built-in UUID Type
There are lots of other built-in types I'd like to have, but I think
this one's a must for a message encoder.

Alain M.

Nov 7, 2008, 8:51:53 PM
to ProtBuf List

code_monkey_steve wrote:

> This is my single biggest complaint, and the one reason protobuf is
> unsuitable for my project: the message definitions need to include
> enough information to dynamically generate the user interface for both
> displaying and composing messages.

I am new to this discussion, but it looks to me that all that
information should be *inside* your message and not in the structure of
the message. That is what I figure PB was made for.

> 3. Typedefs : "Timestamp=double"

That is good for you, but I intend to use machines without hardware
floating point, where that would be a huge problem. (Remember that PCs are
2% of the world's computers.)

just my 2c,
Alain

Kenton Varda

Nov 7, 2008, 10:06:35 PM
to code_monkey_steve, Protocol Buffers
On Fri, Nov 7, 2008 at 5:14 PM, code_monkey_steve <code.mon...@gmail.com> wrote:

> After playing with protobuf for the last few months, I've decided that
> it's not quite suitable for my purposes, due to some design decisions
> (which I'm sure seemed like a good idea at the time).  As much as
> I hate reinventing the wheel, I've decided to create my own message
> encoding framework implementing the features below ("And blackjack!
> And hookers!  Ah, who needs the framework"), while maintaining
> wire-level compatibility with protobuf.

Good luck with that.  It's more work than you might expect.

> 1. XML vs. Yet-Another-Proprietary-File-Format
> The arguments against using XML at the wire-level are well documented,
> but why, oh why, couldn't you have made the message definition format
> (.proto) XML-based?

Because XML is too verbose and, frankly, really hard to read.

<message name="Foo">
  <field name="foo" number="1" type="int32" label="optional"/>
  <field name="bar" number="2" type="string" label="repeated"/>
</message>

vs.

message Foo {
  optional int32 foo = 1;
  repeated string bar = 2;
}

> Now every language has to code and debug (!)
> their own parser, and there's no way to add meta-data to the message
> definitions.

Actually, libprotoc allows you to reuse protoc's implementation, so there's no need for anyone to write their own parser.


If you can't stand writing your code generator, you can always invoke protoc with the --descriptor_set_out option to parse the .proto files and convert them into a FileDescriptorSet, which is itself a protocol buffer (see src/google/protobuf/descriptor.proto).  You can then parse that in any language that supports protobufs and generate your code based on it.
 
> This is my single biggest complaint, and the one reason protobuf is
> unsuitable for my project: the message definitions need to include
> enough information to dynamically generate the user interface for both
> displaying and composing messages.

You can do this with custom options.  For example, to annotate fields with descriptions for use in a UI:

  import "google/protobuf/descriptor.proto";
  extend google.protobuf.FieldOptions {
    optional string description = 12345;
  }

  message Foo {
    optional int32 foo = 1 [(description) = "The foo field."];
    repeated string bar = 2 [(description) = "The bar field."];
  }

This is a new feature and I admit it is not adequately documented at the moment.
 
> 2. Message Inheritance (vs. Extensions?)
> Are there any languages left that don't support single-inheritance,
> even C?  Reserving a zeroth message field for a base message class
> uses almost no overhead, and allows for a nice message class
> hierarchy.
>
> Perhaps I just don't grok Extensions, but they seem more like a safety
> feature than a re-usability mechanism.

This question is asked so often that I have a canned response ready:

================================

Many people have observed that extensions solve similar problems to inheritance, and wonder why Protocol Buffers do not implement inheritance instead. The short answer is that extensions fit better into the Protocol Buffer model, whereas inheritance creates many difficult questions and significantly complicates both interface and implementation. The long answer (copied from an e-mail discussion) follows.

When people talk about protocol buffer inheritance, there are generally two distinct ways they want to use it: (1) Cases where the consumer of the message knows exactly which subclass they expect to receive. In this case, all the user really wants is to be able to define a message which has all the same fields as some base message plus some extras specific to their app. (2) Cases where the consumer does not necessarily know which subclass it will receive, and wants to be able to check what kind of message it has received after receiving it (like a "dynamic_cast" or "instanceof").

Our feeling about case 1 is that the best way to accomplish it is to simply embed an instance of the "base" message into your "derived" message. Sure, we could add a whole lot of code generation which makes this look like inheritance, but it does not seem worth the effort. Besides, this is arguably "implementation inheritance", which many believe is not good O-O design.

If we wanted to go further and make the wire format be compatible between base classes and derived classes (which it seems many people would expect), it would either add a bunch of complication to the parsing code or would require that each subclass contain a complete copy of the superclass's parser, extended with the subclass's additional fields.

Additionally, the descriptor and reflection interfaces would have to be updated to know about subclassing, etc., which is complicated.

Overall, it just doesn't seem worth the added complexity.

Case 2 is more interesting. This is the case extensions were designed to address. The previous solution -- MessageSet -- is used a lot, and there are many cases where a single MessageSet contains multiple messages. In Google, we frequently see MessageSets containing several messages.

The most obvious problem with using inheritance in this case is that we would need multiple inheritance even just to cover existing use cases. Many people object to multiple inheritance for many reasons.

Now, even if we pretended the multiple-extension use cases didn't exist, it would still be extremely difficult to solve this problem using an inheritance model. For example, if you don't know what message type you're receiving, how do you know what class to use to parse it? The wire format would have to identify this somehow -- before the actual data started -- which would have to be hacky (if not impossible) to do without breaking backwards-compatibility. Alternatively, you could put the data to the side and not actually instantiate the subclass until someone attempts to "down-cast" the object, but that's awkward.

Add this to all the same design issues listed in case 1 and the fact that people frequently want a single message to contain multiple extensions and we see that inheritance just is not the right solution here.

=============================
 
> 3. Typedefs
> E.g., "UUID=string", "Timestamp=double", etc.  Syntactic sugar is
> always good.

Even many fully-featured programming languages -- e.g. Java -- don't provide this.

> 4. Built-in UUID Type
> There are lots of other built-in types I'd like to have, but I think
> this one's a must for a message encoder.

What's wrong with defining your own UUID message?  What would we gain from having it built-in?

Marc Gravell

Nov 8, 2008, 4:53:02 AM
to Protocol Buffers
Re point 1: no, you don't.

To illustrate, protobuf-net's protogen actually re-writes the
descriptor as xml. If you want, you can use this "as is", just add a
new xslt and you're done:

protogen -i:foo.proto -o:bar.whatever -t:yourlanguage

Or if you want them as xml, this already exists:

protogen -i:foo.proto -o:foo.xml -t:xml

You are more than welcome to use protogen to extract the data you want
as xml

Re point 2: again, you can get around this at the implementation
level if you need to. protobuf-net will spoof inheritance as
extensions.

Re point 4: again, trivial via a bytes - no need for bespoke
support... protobuf-net will handle Guid data automatically
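
For anyone not on .NET, here is a minimal C++ sketch of the same "just use bytes" idea: it packs the canonical 36-character text form of a UUID into the 16 raw bytes you would put in a bytes field. The name uuid_to_bytes is made up for illustration; it is not part of protobuf or protobuf-net, and it says nothing about how protobuf-net itself lays out a Guid on the wire.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any protobuf library): converts a
// canonical UUID string such as "123e4567-e89b-12d3-a456-426614174000"
// into its 16 raw bytes, suitable for storing in a `bytes` field.
std::string uuid_to_bytes(const std::string& text) {
  if (text.size() != 36) {
    throw std::invalid_argument("UUID must be 36 characters");
  }
  std::string out;
  out.reserve(16);
  int high = -1;  // pending high nibble, or -1 if none is pending
  for (std::size_t i = 0; i < text.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(text[i]);
    // Dashes sit at fixed positions in the canonical form.
    if (i == 8 || i == 13 || i == 18 || i == 23) {
      if (c != '-') throw std::invalid_argument("missing '-' separator");
      continue;
    }
    if (!std::isxdigit(c)) throw std::invalid_argument("non-hex digit in UUID");
    const int value = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
    if (high < 0) {
      high = value;  // first half of a byte
    } else {
      out.push_back(static_cast<char>((high << 4) | value));
      high = -1;
    }
  }
  return out;
}
```

The returned std::string can be handed to the generated setter of a bytes field, so the UUID costs 16 bytes of payload instead of 36 characters of text.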

Marc

Marc Gravell

Nov 8, 2008, 4:54:14 AM
to Protocol Buffers
(I've recently added protogen to the download section, here):

http://code.google.com/p/protobuf-net/

codeazure

Nov 9, 2008, 7:07:36 PM
to Protocol Buffers
On Nov 8, 2:06 pm, Kenton Varda <ken...@google.com> wrote:
> 1. XML vs. Yet-Another-Proprietary-File-Format
> > The arguments against using XML at the wire-level are well documented,
> > but why, oh why, couldn't you have made the message definition format
> > (.proto) XML-based?
>
> Because XML is too verbose and, frankly, really hard to read.
> <message name="Foo">
> <field name="foo" number="1" type="int32" label="optional"/>
> <field name="bar" number="2" type="string" label="repeated"/>
> </message>
> vs.
> message Foo {
> optional int32 foo = 1;
> repeated string bar = 2;
> }

I agree - both PB & XML are machine parsable, but PB is much more
human parsable. Maybe some people really like reading XML, but I'm not
a member of that club :-) XML is a tool that is invaluable in some
situations, but it is inappropriate to try and apply it to everything.
As another poster has suggested, you can convert proto files into XML
& then do what you like with them.

> > This is my single biggest complaint, and the one reason protobuf is
> > unsuitable for my project: the message definitions need to include
> > enough information to dynamically generate the user interface for both
> > displaying and composing messages.
>
> You can do this with custom options. For example, to annotate fields with
> descriptions for use in a UI:
>
> import "google/protobuf/descriptor.proto";
> extend google.protobuf.FieldOptions {
> optional string description = 12345;
> }
>
> message Foo {
> optional int32 foo = 1 [(description) = "The foo field."];
> repeated string bar = 2 [(description) = "The bar field."];
> }
>
> This is a new feature and I admit it is not adequately documented at the
> moment.

This is a _really_ nice feature, very handy. I can see you put this
comment in the SVN logs, but it would be worth updating the main docs
soon so more people can find out about it.

> > 2. Message Inheritance (vs. Extensions?)
> > Perhaps I just don't grok Extensions, but they seem more like a safety
> > feature than a re-usability mechanism.
>
> Our feeling about case 1 is that the best way to accomplish it is to simply
> embed an instance of the "base" message into your "derived" message. Sure,
> we could add a whole lot of code generation which makes this look like
> inheritance, but it does not seem worth the effort.

I'm fine with this - even though it's not "real" inheritance, it
fulfils the data structuring needs I have. I would have some
misgivings about being able to make a more complex inheritance scheme
as portable to as many languages as PB is. It is so easy to have
converters into JSON and simple data systems like that & it would be a
shame to make that harder or impossible. PB keeps things simple but
scalable...

> The most obvious problem with using inheritance in this case is that we
> would need multiple inheritance even just to cover existing use cases. Many
> people object to multiple inheritance for many reasons.

This is a powerful argument to not go there. I use C++ all the time &
like the power and flexibility of the language, but oh my does it come
at a cost of complexity to handle things like this. If you only ever
intended to have PB connections between C++, Java, and other languages
of similar power, then it might make sense to consider inheritance,
but not with the wide range of language support that currently exists.

> > 3. Typedefs
> > E.g., "UUID=string", "Timestamp=double", etc. Syntactic sugar is
> > always good.
>
> Even many fully-featured programming languages -- e.g. Java -- don't provide
> this.

True, but it doesn't mean it's a bad idea. I would suggest using the C
preprocessor to add #define for the typedefs you want. This way, it
doesn't interfere with PB, but allows you to add typedefs to your
data. I wouldn't mind seeing something like this in PB natively, but
it's not a major thing.

I suppose you could use that "extend FieldOptions" feature you
described to add meta-data describing the type, but it seems a bit of
overkill.

> 4. Built-in UUID Type
>
> > There are lots of other built-in types I'd like to have, but I think
> > this one's a must for a message encoder.
>
> What's wrong with defining your own UUID message? What would we gain from
> having it built-in?

Agreed - this type is way too specialized to build into the language.
It may be common in some classes of application development, but
there's plenty of people like me who never use them. If PB added UUID,
then there would be calls for all kinds of application specific types
to be built in, such as date/time. In particular, since UUID is a
string, there seems little sense adding type handling for a formatted
string.

Jeff

Greg Copeland

Nov 15, 2008, 8:57:18 AM
to Protocol Buffers


On Nov 7, 7:14 pm, code_monkey_steve <code.monkey.st...@gmail.com>
wrote:
> After playing with protobuf for the last few months, I've decided that
> it's not quite suitable for my purposes, due to some design decisions
> (which I'm sure seemed like a good idea at the time).  As much as
> I hate reinventing the wheel, I've decided to create my own message
> encoding framework implementing the features below ("And blackjack!
> And hookers!  Ah, who needs the framework"), while maintaining wire-
> level compatibility with protobuf.
>
> 1. XML vs. Yet-Another-Proprietary-File-Format
> The arguments against using XML at the wire-level are well documented,
> but why, oh why, couldn't you have made the message definition format
> (.proto) XML-based?  Now every language has to code and debug (!)
> their own parser, and there's no way to add meta-data to the message
> definitions.  What's wrong with just publishing a DTD/XSchema and
> using an off-the-shelf XML parser?
>

I couldn't agree more. I actually started down the same road. XML as
an IDL is nearly an ideal use. I'm constantly amazed at how few people
consider this but frankly, it's just about the ideal use of XML.

> 2. Message Inheritance (vs. Extensions?)
> Are there any languages left that don't support single-inheritance,
> even C?  Reserving a zero'th message field for a base message class
> uses almost no overhead, and allows for a nice message class
> hierarchy.
>

I agree. The requirement of composition rather than inheritance
highlights a significant weakness of PBs. I've only just started
toying with PB, but this weakness jumped out at me the first time I
reviewed the documentation.

My other complaint is the lack of constructor options. Granted, one
has various copy options, but additional constructors are sorely
needed, even if options came with some caveats.

The interface really needs some tweaking too.
From the tutorial:
phone_number->set_number(number);

That's overly complicated. That should read as:
tutorial::Person::PhoneNumber* phone_number = person->add_phone( number );

Or better yet:
tutorial::Person::PhoneNumber* phone_number = person->add_phone( number, type );

Like I said, I'm just getting started but I've already identified some
PB shortcomings. Hopefully PB will continue to improve over time. I
may yet create my own tool but until then, I'll be playing more with
PB to get a feel for additional pros and cons.

Greg Copeland

Nov 15, 2008, 12:39:32 PM
to Protocol Buffers
On Nov 15, 7:57 am, Greg Copeland <gtcopel...@gmail.com> wrote:

> The interface really needs some tweaking too.
> From the tutorial:
>     phone_number->set_number(number);
>
> That's overly complicated. That should read as:
>     tutorial::Person::PhoneNumber* phone_number = person->add_phone( number );
>
> Or better yet:
>     tutorial::Person::PhoneNumber* phone_number = person->add_phone( number, type );
>
> Like I said, I'm just getting started but I've already identified some
> PB short comings. Hopefully PB will continue to improve over time.  I
> may yet create my own tool but until then, I'll be playing more with
> PB to get a feel for additional pros and cons. I may yet create my own
> tool but


Hmmm. I'm not sure what happened, but my post was destroyed. Perhaps
Firefox sneezed or I hit something right before I submitted. Sorry
about that. Sorry again that I'm replying to myself.

As for the counter XML argument, I believe it to be fairly weak. Yes,
XML is more verbose, yet I'd gladly trade that for not having to manually
enumerate each and every field in PB-IDL when 99+% of the time it's
obvious and needlessly tedious. Additionally, Google's own comments about
requiring a pure Python implementation are justification alone to use XML
as the IDL rather than the PB-IDL. It is the same argument. XML is
ubiquitous in just about every language's library that matters. It can be
readily validated, has *many* rich tools, and regardless of the
implementation's performance, it's only a compile-time cost, never a
runtime cost - save only for perhaps dynamic message generation at
runtime. And even then, that's unlikely to be a performance issue for
any well-written application. Once you consider the number of XML
editors and the number of users who can already easily
grok a well-structured format like XML, it's difficult to imagine the
need for yet another IDL (PB-IDL). Doubly so once you consider that
IDLs represent an almost ideal use case for XML in the first place.

And believe me, that's saying a lot from me considering I believe XML
is one of the most overused technologies to date. In fact, I'd argue
XML is commonly used where it makes absolutely no sense whatsoever
to even be considered for a project, let alone become the ubiquitous
interface - in whatever form it takes.

> The interface really needs some tweaking too.
> From the tutorial:
> phone_number->set_number(number);

Should read:
tutorial::Person::PhoneNumber* phone_number = person->add_phone() ;
phone_number->set_number(number);

And while I'm at it, I'll go ahead and offer a couple more comments.

"Message Type" is frequently mentioned in the documentation yet there
appears to be no message type information available; as is,
no message type exists. Rather, there are message instances and
message classes, which are referred to as types. I say this because
there does not appear to be anything which actually specifies a type
as commonly understood from a protocol perspective. The closest thing
I have identified would be a user created field which requires manual
population. Perhaps a default value would help reduce human errors,
but that seems kludged at best. If I'm even close here, there is no
such thing as a "message type" in PBs; only classes and instances.

Also, I may have identified a significant weakness which may in
itself be a complete show stopper. Please tell me I'm wrong. It appears
there exists no mechanism to automatically generate message types or
parse multiple potential messages from an ambiguous stream or buffer.
The tutorials are really lacking here too, as they only deal with
a single message class at any given time, almost entirely negating the
need for something like PB in the first place. There currently exist
no examples which address this issue. At least none that I've found.
Some of the documentation directly addresses the issue but I've yet to
see it provide a real solution without lots of additional work.
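
To make concrete what I mean by "additional work": the sketch below is roughly the framing layer one ends up hand-rolling so that several messages can share one stream. It is only a minimal illustration under my own assumptions (a 1-byte type tag followed by a 4-byte big-endian length, then the serialized payload); Frame, write_frame and read_frame are made-up names and none of this is provided by protobuf.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical framing layer, not protobuf API. Each frame on the wire is:
//   1 byte type tag | 4 bytes big-endian payload length | payload bytes
struct Frame {
  std::uint8_t tag;
  std::string payload;  // would hold one serialized message
};

// Appends one frame to `out`.
void write_frame(std::string& out, std::uint8_t tag, const std::string& payload) {
  const std::uint32_t length = static_cast<std::uint32_t>(payload.size());
  out.push_back(static_cast<char>(tag));
  for (int shift = 24; shift >= 0; shift -= 8) {
    out.push_back(static_cast<char>((length >> shift) & 0xFF));
  }
  out += payload;
}

// Reads the frame starting at `pos` and advances `pos` past it.
// Returns false (leaving `pos` untouched) if a complete frame is not available.
bool read_frame(const std::string& in, std::size_t& pos, Frame& frame) {
  const std::size_t header = 5;
  if (pos > in.size() || in.size() - pos < header) return false;
  std::uint32_t length = 0;
  for (std::size_t i = 1; i <= 4; ++i) {
    length = (length << 8) | static_cast<unsigned char>(in[pos + i]);
  }
  if (in.size() - pos - header < length) return false;
  frame.tag = static_cast<std::uint8_t>(in[pos]);
  frame.payload = in.substr(pos + header, length);
  pos += header + length;
  return true;
}
```

The receiver would then pick the message class from the tag and parse the payload with it, which is exactly the bookkeeping I was hoping the library would generate for me.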

Let's say I have three message classes whereby their sizes alone make
message disambiguation impossible. I have a single input stream. How
can I automatically do something like the following pseudocode?

// Factory invokes the handler registered for a given message type
factory.registerHandler( Msg1, handler1 ) ;
factory.registerHandler( Msg2, handler2 ) ;
factory.registerHandler( Msg3, handler3 ) ;
...
factory << stream ; // I don't know of any existing factory - nor have I found one
factory.dispatch() ; // invoke the proper message handler

Or maybe something like this:

Message *baseMsg = factory << stream ; // I don't know of any existing factory - nor have I found one
switch( baseMsg->type() ) {
  case msg1.type: {
    Msg1 *msg = static_cast<Msg1 *>( baseMsg ) ;
    myReactor.dispatch( msg ) ; // let our registered handler for this message be invoked
    break ;
  }
  case msg2.type: {
    Msg2 *msg = static_cast<Msg2 *>( baseMsg ) ;
    myReactor.dispatch( msg ) ;
    break ;
  }
  case msg3.type: {
    Msg3 *msg = static_cast<Msg3 *>( baseMsg ) ;
    myReactor.dispatch( msg ) ;
    break ;
  }
}

At best, it appears we have to do something like the following, which
is rather error prone.

enum MsgType {
  MSG1 = 1;
  MSG2 = 2;
  MSG3 = 3;
}

message Msg1 {
  required int32 field = 1;
}

message Msg2 {
  required int32 field = 1;
}

message Msg3 {
  required int32 field = 1;
}

message BaseMessage {
  required MsgType type = 1;
  optional Msg1 msg1 = 2;
  optional Msg2 msg2 = 3;
  optional Msg3 msg3 = 4;
}

Followed by some peeking on the stream and a switch case which looks
something like the following on the receiving side. Which really isn't
that bad.
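BaseMessage baseMsg ;
baseMsg.ParseFromIstream( &stream ) ;

// Each accessor returns a const reference to the embedded message
switch( baseMsg.type() ) {
  case MSG1: {
    const Msg1 &msg1 = baseMsg.msg1() ;
    break ;
  }
  case MSG2: {
    const Msg2 &msg2 = baseMsg.msg2() ;
    break ;
  }
  case MSG3: {
    const Msg3 &msg3 = baseMsg.msg3() ;
    break ;
  }
}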

The problem I have is on the sending side, I'm forced to do the
following which is highly error prone.
BaseMessage baseMsg ;
baseMsg.set_type( MSG1 ) ;
Msg1 *msg1 = baseMsg.mutable_msg1() ;

Notice the last options have become almost procedural in nature and are
fairly lengthy and error prone to boot - especially compared to
the more desirable OO-style options provided above. Unlike what I
originally stated above, it's not entirely clear to me a default value
for the message type is even an option here. Hopefully this brings home
the point: as far as I can tell, PB has no concept of a message type; only
classes and instances. I know I've read extensions are supposed to help
here, but it's not entirely clear to me exactly how they help ease
these pains as there still is no message type.
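
For what it's worth, the kind of wrapper I would end up writing on top of PB looks roughly like the following. This is only a sketch under my own assumptions: plain structs stand in for the generated classes so that it compiles on its own, and Envelope, Dispatcher, on and make_envelope are hypothetical names, not anything protobuf generates.

```cpp
#include <functional>
#include <map>
#include <utility>

// Stand-ins for protoc-generated types (assumed here for illustration only).
enum MsgType { MSG1 = 1, MSG2 = 2, MSG3 = 3 };

struct Envelope {
  MsgType type;
  int field;  // stands in for the single payload field of Msg1/Msg2/Msg3
};

// Hypothetical registry mapping a type tag to the handler for that message.
class Dispatcher {
 public:
  using Handler = std::function<void(const Envelope&)>;

  // Registers (or replaces) the handler for one message type.
  void on(MsgType type, Handler handler) {
    handlers_[type] = std::move(handler);
  }

  // Invokes the registered handler; returns false if none is registered.
  bool dispatch(const Envelope& envelope) const {
    const auto it = handlers_.find(envelope.type);
    if (it == handlers_.end()) return false;
    it->second(envelope);
    return true;
  }

 private:
  std::map<MsgType, Handler> handlers_;
};

// Sending-side helper: the tag and the payload are set in one place, so the
// caller cannot set one and forget the other.
Envelope make_envelope(MsgType type, int field) {
  Envelope envelope;
  envelope.type = type;
  envelope.field = field;
  return envelope;
}
```

With something like this the receiving switch disappears behind dispatch() and the sender never touches the type field directly, but it is all hand-written code layered on top of the generated classes.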

Lastly, to make matters worse, given the above, every time I add or
revise a message type, all code which references *any* message is
forced to recompile even if it is dependent on a message which did not
change. For large and/or complex projects, that's very undesirable. I
suppose PIMPL may help there. Perhaps you have a more elegant solution
at hand via extensions and includes?

I suppose if we ignore these last couple of critiques, much of my
OO-desires can ultimately be created on top of PB, but given these are
typical use cases, one expects these basics to be provided up front.
After all, if it is not doing this work for us, not much more effort
is required to create our own tool which does all this for us, while
providing additional safety in code and increased usability. That
seems like more than a fair trade in time.

Are there any plans to address these usability issues? Am I simply all
wet here and overlooking the obvious?