Designing a C# protocol buffer compiler: three plans of attack

Jon Skeet

Jul 9, 2008, 5:01:31 AM
to Protocol Buffers
As mentioned previously, I'm interested in writing a .NET protocol
buffer compiler, which would initially generate C# code. I can
envisage three overall designs for this:

1) Write the generator in C++, like the Java one. Keep the generator
itself purely in native code.
Pros:
* No interop work required
* Can be run on machines with no CLI implementation

Cons:
* Same "you need to know C++" barrier to entry for the next .NET
language (more on this below)
* May involve quite a lot of C++, which I'm not hugely familiar with

2) Write a managed generator interface to the native protocol compiler
Pros:
* Managed interface can be used by other projects, e.g. to target
VB.NET, F# etc - or even non-.NET languages if the generator author
happens to prefer a .NET language to C++
* Most of the real generator code can be in C# :)

Cons:
* May require more detailed C++ and interop knowledge than I currently
have (i.e. less C++ code, but harder)
* Harder to run the same generator on all platforms

3) Write a managed protocol compiler from scratch
Pros:
* No nasty interop to worry about

Cons:
* Pain of forking and keeping it all up to date
* Wastes a lot of existing effort
* Lots of work!


Now, one question is whether any other generators for .NET languages
will really be needed. Once you've got C#, you can compile to a .NET
assembly and use that from any other .NET language, right? Well,
that's true - but there may be advantages in using the generated
source within a project instead of as a separate assembly:
1) The protocol buffers can be internal instead of public. In some
cases the PBs will be part of a public API, but in others they're just
an implementation detail which shouldn't be exposed.
2) With partial classes, PBs can start to gain behaviour without nasty
inheritance tricks. Obviously this is open to abuse, but I think in
some cases it will avoid creating unnecessary wrappers for data types
which really want to have some logic in them.

As you can't mix and match source languages within one assembly (at
the moment) it may be useful to have non-C# generators.

Any thoughts?

Jon

Nesser

Jul 9, 2008, 1:55:30 PM
to Protocol Buffers
I would think that using C++ is the best bet for staying close to the
original code base, portability and leveraging the lower level methods
already available. As you mentioned in the cons, you probably won't
get much help from .Net developers with this implementation as the C++
learning curve may be too much.

For the .NET output, it would be neat to have both source and binary
outputs available, selected by options to the `protoc` compiler.
Simply look for an option in the argument list that points to a
{CSC, MCS, GMCS} compiler, then generate an `AssemblyInfo.cs` file
for each .proto file and build the source files into binary DLLs.

Why would you want your data objects (PBs) to have behavior or logic
in them? Aren't these just containers?
I think the basic idea is that these are supposed to be generic lists
of {get,set} properties.
I've been wrong before.

Looking to sharpen my C++ skills. Let me know if I can help. Looking
forward to some code.

Cheers,
Chris

Kenton Varda

Jul 9, 2008, 2:21:28 PM
to Jon Skeet, Protocol Buffers
Writing a code generator in C++ does not require a terribly deep understanding of the language.  It's just a bunch of print statements, really.  I don't know what is involved in your option #2, but it sounds more cumbersome.  I don't like Option #3 since I'd prefer that everyone reuse the existing parser implementation, so that if we add new language features you automatically are able to parse them.  So I think I'd recommend option #1.
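
To make the "bunch of print statements" point concrete, here is a minimal sketch of what such a generator could look like. The `Field` and `Message` structs and the `GenerateCSharp` function are hypothetical stand-ins invented for illustration; they are not protoc's real descriptor classes or generator interface, and the emitted C# shape (a partial class with simple properties) is only an assumption about what the output might resemble.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the descriptors a generator
// receives; the real protoc descriptor classes are much richer.
struct Field {
    std::string csharp_type;  // e.g. "string", "int"
    std::string name;         // e.g. "Name"
};

struct Message {
    std::string name;
    std::vector<Field> fields;
};

// Emits a C# class for one message by appending text to a stream.
// make_internal selects "internal" instead of "public" visibility, and the
// class is marked partial so hand-written code can extend it.
std::string GenerateCSharp(const Message& message, bool make_internal) {
    std::ostringstream out;
    out << (make_internal ? "internal" : "public")
        << " sealed partial class " << message.name << " {\n";
    for (const Field& field : message.fields) {
        out << "  public " << field.csharp_type << " " << field.name
            << " { get; set; }\n";
    }
    out << "}\n";
    return out.str();
}
```

Calling `GenerateCSharp` with a message named `Person` that has a `string Name` field and `make_internal` set to true would return the text of an `internal sealed partial class Person` with one property. A real generator would also have to emit serialization code, which is where most of the work lies.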

Jon Skeet

Jul 9, 2008, 2:36:34 PM
to Protocol Buffers
On Jul 9, 6:55 pm, Nesser <chris.n...@ge.com> wrote:
> I would think that using C++ is the best bet for staying close to the
> original code base, portability and leveraging the lower level methods
> already available.  As you mentioned in the cons, you probably won't
> get much help from .Net developers with this implementation as the C++
> learning curve may be too much.

Yes. I think this is the way I'll go.

> For the .NET output, it would be neat to have both source and binary
> outputs available, selected by options to the `protoc` compiler.
> Simply look for an option in the argument list that points to a
> {CSC, MCS, GMCS} compiler, then generate an `AssemblyInfo.cs` file
> for each .proto file and build the source files into binary DLLs.

I'm not sure there's very much value in immediately compiling. As far
as I'm aware this doesn't happen for Java, does it? I think it would
increase the complexity for relatively little gain. Other tools know
how to build C# code better than we're likely to be able to support.

> Why would you want your data objects (PBs) to have behavior or logic
> in them?  Aren't these just containers?
> I think the basic idea is that these are supposed to be generic lists
> of {get,set} properties.
> I've been wrong before.

Well, part of object orientation is to keep behaviour and data
together. I know that my experience with protocol buffers so far has
occasionally meant I'd like to be able to just add behaviour directly
instead of creating an extra wrapper class.

Given the lack of data hiding, wrapping a protocol buffer for the sake
of other APIs may well be valid - but internally I like the idea of
being able to provide behaviour directly without the hassle of
wrapping.

> Looking to sharpen my C++ skills.  Let me know if I can help.  Looking
> forward to some code.

I'm hoping that'll be fairly soon. I'm about to start building... see
my reply to Kenton in a minute for more details.

Jon

Jon Skeet

Jul 9, 2008, 2:38:57 PM
to Protocol Buffers
On Jul 9, 7:21 pm, "Kenton Varda" <ken...@google.com> wrote:
> Writing a code generator in C++ does not require a terribly deep
> understanding of the language.  It's just a bunch of print statements,
> really.  I don't know what is involved in your option #2, but it sounds more
> cumbersome.  I don't like Option #3 since I'd prefer that everyone reuse the
> existing parser implementation, so that if we add new language features you
> automatically are able to parse them.  So I think I'd recommend option #1.

Yup, I think that's the way I'll go. My new strategy is:

1) Get the current codebase building and running
2) Clone the Java code into csharp_generator etc
3) Get that building and running (generating Java in .cs files)
4) Generate very simple PBs.
5) Hand-edit the files to look like C#
6) Change the generation so that it generates the hand-crafted file
7) Use a different feature of PBs, and go back to step 5
8) Lather, rinse, repeat :)

Oh, and as a parallel task implement the supporting libraries in C#.
That's likely to be the simplest part :)

Jon

DannO

Jul 11, 2008, 1:45:12 AM
to Protocol Buffers
Is there a project setup so more people can help? I would hate to
have multiple people all make their own versions.

Dann

Jon Skeet

Jul 11, 2008, 2:38:49 AM
to Protocol Buffers
On Jul 11, 6:45 am, DannO <dann.orm...@gmail.com> wrote:
> Is there a project setup so more people can help?  I would hate to
> have multiple people all make their own versions.

I have a git project, but I haven't worked out the details of pulling/
pushing etc yet. (I have less than 24 hours of git experience!)

At the moment to be honest it's probably fastest to let me just pump
out a straw man which does *something*, then we can collaborate on
improvements. There's a lot of donkey-work to be done which
unfortunately doesn't lend itself to splitting up if we want to get
any sort of consistency. (I'm currently bouncing around an awful lot
of files.)

I expect something to be ready for a public viewing fairly soon though
- within a week, certainly.

Jon

plague...@gmail.com

Jul 13, 2008, 6:11:47 PM
to Protocol Buffers
Might I suggest that you create it entirely in managed code, using
System.Reflection to compile only classes with specific attributes?
This should be simple enough. The (de)serializer could also be
implemented in the same fashion, using marshalling to force the image
of a class to portable data types. It's just an idea, but I'd be willing
to help code it if you like. We would also be able to easily generate
proxy objects using System.Reflection and save them to a new assembly
that can be included with applications that wish to use the proxy to
communicate back to the object, using Protocol Buffers as an RPC
transport. Reflection can also be used to extract an object and create
an interface to it. This would make creating the PBs much, much simpler.

Alek Storm

Jul 13, 2008, 6:24:03 PM
to Protocol Buffers
On Jul 9, 4:01 am, Jon Skeet <sk...@pobox.com> wrote:
> 1) Write the generator in C++, like the Java one. Keep the generator
> itself purely in native code.
>
> 2) Write a managed generator interface to the native protocol compiler

I would recommend these two, as it would be difficult to keep the
grammar in sync with 3.

> 3) Write a managed protocol compiler from scratch

If you do pick this one, which would be more portable, this reference
grammar might be useful:
http://groups.google.com/group/protobuf/browse_thread/thread/1cccfc624cd612da#33102cfc0c57d449
(make sure you use the attached file at the bottom of the thread).

DrunkenCoder

Jul 13, 2008, 6:35:29 PM
to Protocol Buffers


On 14 July, 00:24, Alek Storm <alek.st...@gmail.com> wrote:
> On Jul 9, 4:01 am, Jon Skeet <sk...@pobox.com> wrote:
>
> > 1) Write the generator in C++, like the Java one. Keep the generator
> > itself purely in native code.
>
> > 2) Write a managed generator interface to the native protocol compiler
>
> I would recommend these two, as it would be difficult to keep the
> grammar in sync with 3.
>
> > 3) Write a managed protocol compiler from scratch
>
> If you do pick this one, which would be more portable, this reference
> grammar might be useful: http://groups.google.com/group/protobuf/browse_thread/thread/1cccfc62...
> (make sure you use the attached file at the bottom of the thread).

Doing it the managed way from scratch would also make it possible to
utilize CodeDom providers and hence generate code for any .NET language
that supplies an implementation of the generator interface found there.
AFAIK VB, C#, F# and Boo all do so.

Jon Skeet

Jul 14, 2008, 1:45:09 AM
to Protocol Buffers
On Jul 13, 11:11 pm, plaguethe...@gmail.com wrote:
> Might I suggest that you create it entirely in managed code, using
> System.Reflection to compile only classes with specific attributes?

I'm not entirely sure what you mean, to be honest. Are you talking
about the compiler or the backing library? What classes wouldn't be
compiled?

> This should be simple enough. The (de)serializer could also be
> implemented in the same fashion, using marshalling to force the image
> of a class to portable data types. It's just an idea, but I'd be willing
> to help code it if you like. We would also be able to easily generate
> proxy objects using System.Reflection and save them to a new assembly
> that can be included with applications that wish to use the proxy to
> communicate back to the object, using Protocol Buffers as an RPC
> transport. Reflection can also be used to extract an object and create
> an interface to it. This would make creating the PBs much, much simpler.

I think it's fairly simple to go from the .proto file to C# already -
getting exactly the right patterns to generate .proto files from the
C# could be quite tricky.

I suspect I don't fully understand your ideas - could you elaborate on
them?

Jon

Jon Skeet

Jul 14, 2008, 1:48:25 AM
to Protocol Buffers
On Jul 13, 11:35 pm, DrunkenCoder <torbjorn.gyllebr...@gmail.com>
wrote:
> > If you do pick this one, which would be more portable, this reference
> > grammar might be useful: http://groups.google.com/group/protobuf/browse_thread/thread/1cccfc62...
> > (make sure you use the attached file at the bottom of the thread).
>
> Doing it the managed way from scratch would also make it possible to
> utilize CodeDom providers and hence be able to generate code for
> any .NET language supplying an implementation of the generator
> interface found there. AFAIK VB, C#, F# and Boo all do so.

That's true - but it does mean quite possibly creating more bugs by
rewriting the parsing aspect.

One intriguing idea might be to write a generator (in C++) which
didn't create the source code for the protocol buffers themselves -
but instead created CodeDOM which could then be used to generate VB,
F# etc.
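
As a rough sketch of that idea, the generator could first build a small language-neutral model and then hand it to one renderer per target language. Everything below (`ClassModel`, `PropertyModel`, `Renderer` and the two renderer classes) is a hypothetical, hand-rolled stand-in for illustration only; it is not the real System.CodeDom object model, and the emitted property syntax is just one plausible output shape.

```cpp
#include <sstream>
#include <string>
#include <vector>

// A tiny language-neutral code model (hypothetical; not System.CodeDom).
enum class FieldType { kInt32, kString };

struct PropertyModel {
    FieldType type;
    std::string name;
};

struct ClassModel {
    std::string name;
    std::vector<PropertyModel> properties;
};

// Each target language supplies its own renderer over the same model.
class Renderer {
public:
    virtual ~Renderer() = default;
    virtual std::string Render(const ClassModel& model) const = 0;
};

class CSharpRenderer : public Renderer {
public:
    std::string Render(const ClassModel& model) const override {
        std::ostringstream out;
        out << "public class " << model.name << " {\n";
        for (const PropertyModel& p : model.properties) {
            out << "  public " << TypeName(p.type) << " " << p.name
                << " { get; set; }\n";
        }
        out << "}\n";
        return out.str();
    }

private:
    static const char* TypeName(FieldType t) {
        return t == FieldType::kInt32 ? "int" : "string";
    }
};

class VisualBasicRenderer : public Renderer {
public:
    std::string Render(const ClassModel& model) const override {
        std::ostringstream out;
        out << "Public Class " << model.name << "\n";
        for (const PropertyModel& p : model.properties) {
            // Auto-implemented property syntax (needs a recent VB compiler).
            out << "  Public Property " << p.name << " As "
                << TypeName(p.type) << "\n";
        }
        out << "End Class\n";
        return out.str();
    }

private:
    static const char* TypeName(FieldType t) {
        return t == FieldType::kInt32 ? "Integer" : "String";
    }
};
```

The parsing and model-building code is shared, and only the renderer differs per language, which is roughly the separation the CodeDOM approach would buy.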

To start with I'll just stick with the "generate simple C# from C++"
approach, but there's nothing to say we can't try different things
over time...

Jon

Alek Storm

Jul 14, 2008, 2:04:47 AM
to Protocol Buffers
On Jul 14, 12:48 am, Jon Skeet <sk...@pobox.com> wrote:
> One intriguing idea might be to write a generator (in C++) which
> didn't create the source code for the protocol buffers themselves -
> but instead created CodeDOM which could then be used to generate VB,
> F# etc.

I don't know why you would want to. The whole point of .NET is for
multi-language interoperability, i.e., *not* having to generate
separate libraries for each language. Generate the code in
whatever .NET language you want, compile it to bytecode, and every
other .NET language can call that assembly. Much easier and simpler,
and you get the benefits of .NET's interoperability.

DrunkenCoder

Jul 14, 2008, 2:26:57 AM
to Protocol Buffers
You're also stuck with a binary blob in your repository and an
inconvenient debugging scenario. Generating code in the same
language as the host project makes it possible to compile in both
debug and release modes, and makes stepping through the source a snap.

But sure, just emitting an assembly is an easy and convenient way to get
things rolling fast. And as long as things work, it's really good.

Jon Skeet

Jul 14, 2008, 2:39:02 AM
to Protocol Buffers
For the most part, I completely agree. However, if you ever *did* want
to use partial classes to add behaviour, or perhaps make the protocol
buffers themselves internal to a project (providing a separate API for
callers, and using PBs just for storage, perhaps) then having multi-
language support could be handy.

There's also the minor issue that anyone wanting to see what their PB
code is doing behind the scenes (or debug it) will currently need to
understand C#.

Having said all that, my first approach is still going to be C#-only
generation, which will be fine for the vast majority of use cases, I
suspect. Getting that complete, stable, idiomatic and fast are my
first priorities.

Jon

Matthias Ernst

Jul 14, 2008, 2:52:25 AM
to Protocol Buffers
Jon,

one other approach:
* I've created an output format for protoc that simply dumps the
FileDescriptorProto to disk. It's in the issue tracker #15:
http://code.google.com/p/protobuf/issues/detail?id=15
* If you bootstrap the C# code for reading FileDescriptorProtos (maybe
from the C++ code, I have no idea how easy that is with managed C++)
then you could write the code generator in C#.

It would take an extra step, requiring people to run .protos through
"protoc --desc_out" and then through your generator, but it may make it
much more comfortable to implement.

Matthias

DrunkenCoder

Jul 14, 2008, 3:12:52 AM
to Protocol Buffers


On 14 July, 08:52, Matthias Ernst <ernst.matth...@gmail.com> wrote:
> Jon,
>
> one other approach:
> * I've created an output format for protoc that simply dumps the
> FileDescriptorProto to disk. It's in the issue tracker #15: http://code.google.com/p/protobuf/issues/detail?id=15
> * If you bootstrap the C# code for reading FileDescriptorProtos (maybe
> from the C++ code, I have no idea how easy that is with managed C++)
> then you could write the code
> generator in C#.
>
> It would take an extra step, require people to run .protos through
> "protoc --desc_out" and then through your generator but it may make it
> much more comfortable to implement.
>
> Matthias

This sounds like a really nice route to take.
The .NET compiler (or one for any other language and platform) could
simply wrap protoc, effectively decoupling the parsing from the code
generation. This is excellent!
It also has the nice property that different language implementations
can be self-hosted. Implementing protocol buffers would be a matter of:
1: Manually coding support for reading FileDescriptorProtos
2: Emitting code from the FileDescriptorProto that was read
3: Self-hosting!
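
Step 1 is smaller than it may sound, because the wire format is built from a few primitives. The sketch below, written in C++ purely for illustration, hand-decodes base-128 varints and field keys and pulls a string field out of a serialized message. The helper names `ReadVarint` and `FindStringField` are hypothetical, only wire types 0 (varint) and 2 (length-delimited) are handled, and it assumes the input is a single serialized FileDescriptorProto, in which `name` is field 1 and `package` is field 2.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reads one base-128 varint starting at pos and advances pos past it.
// Returns false if the input is truncated or the varint is over-long.
bool ReadVarint(const std::vector<uint8_t>& data, std::size_t& pos,
                uint64_t& value) {
    value = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        if (pos >= data.size()) return false;
        uint8_t byte = data[pos++];
        value |= static_cast<uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;  // high bit clear: last byte
    }
    return false;
}

// Scans a serialized message for the first length-delimited field with the
// wanted field number and copies its bytes into out. Unsupported wire types
// stop the scan with a failure.
bool FindStringField(const std::vector<uint8_t>& data, uint32_t wanted,
                     std::string& out) {
    std::size_t pos = 0;
    while (pos < data.size()) {
        uint64_t key = 0;
        if (!ReadVarint(data, pos, key)) return false;
        uint32_t field_number = static_cast<uint32_t>(key >> 3);
        uint32_t wire_type = static_cast<uint32_t>(key & 0x7);
        if (wire_type == 0) {
            uint64_t ignored = 0;  // skip over a varint value
            if (!ReadVarint(data, pos, ignored)) return false;
        } else if (wire_type == 2) {
            uint64_t length = 0;
            if (!ReadVarint(data, pos, length)) return false;
            if (length > data.size() - pos) return false;
            if (field_number == wanted) {
                const char* start =
                    reinterpret_cast<const char*>(data.data()) + pos;
                out.assign(start, static_cast<std::size_t>(length));
                return true;
            }
            pos += static_cast<std::size_t>(length);
        } else {
            return false;
        }
    }
    return false;
}
```

A bootstrap reader for a new language would need the same handful of primitives plus the remaining wire types and nested messages; once it can read descriptor.proto's own descriptor, the generated code can replace the hand-written reader.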

Is the FileDescriptorProto documented on the site, or is it only part
of the source drop?

//Torbjörn

Jon Skeet

Jul 14, 2008, 3:14:07 AM
to Protocol Buffers
On Jul 14, 7:52 am, Matthias Ernst <ernst.matth...@gmail.com> wrote:
> one other approach:
> * I've created an output format for protoc that simply dumps the
> FileDescriptorProto to disk. It's in the issue tracker #15: http://code.google.com/p/protobuf/issues/detail?id=15
> * If you bootstrap the C# code for reading FileDescriptorProtos (maybe
> from the C++ code, I have no idea how easy that is with managed C++)
> then you could write the code
> generator in C#.
>
> It would take an extra step, require people to run .protos through
> "protoc --desc_out" and then through your generator but it may make it
> much more comfortable to implement.

That's a *very* interesting idea. I like that a lot. The bootstrap
problem will be "hard" once but then only as hard as it would be for
any FileDescriptor change. (It would mean doubling the scale of the
problem, in terms of first having to deal with protoc changing, then
dealing with the C# generator changing, but neither stage would be
hugely tricky.)

I'm working on the backing library at the moment anyway, so I'll
revisit the decision when I've finished that. It might be worth
completing the half-finished native generator code anyway, to provide
a comparison point with the managed version - but it's certainly
something to investigate further.

Jon

Matthias Ernst

Jul 14, 2008, 12:48:12 PM
to Protocol Buffers
On Jul 14, 9:12 am, DrunkenCoder <torbjorn.gyllebr...@gmail.com>
wrote:
FileDescriptorProto is part of the public API:
Message.getDescriptor().getFile().toProto().
I find it a very nice property that the whole proto object model is
available as protos. It also provides for some nice mind-bending
bootstrapping properties... (see descriptor.proto: descriptor.proto
must be optimized for speed because reflection-based algorithms don't
work during bootstrapping)

Jason C.

Jul 14, 2008, 2:13:24 PM
to Protocol Buffers


On Jul 14, 1:45 am, Jon Skeet <sk...@pobox.com> wrote:
> I'm not entirely sure what you mean, to be honest. Are you talking
> about the compiler or the backing library? What classes wouldn't be
> compiled?

I mean, the protocol buffers compiler could create a .NET assembly out
of an existing .proto file (using System.Reflection), or create
a .proto file out of a class marked with a specific attribute (again,
using System.Reflection). But I was also talking about being able to
(de)serialize to and from any stream, written 100% in managed code
with portability for Mono in mind. So there would be both the backing
library and a base class in a separate .NET assembly (.dll), much in
the way that it seems it's done for C++, except without code. All of
the things that would be coded in the resulting C++ file could be
implemented in the .NET assembly generated by the compiler, and the
class could be declared either abstract or allow methods to be
overridden, so the class can be inherited and modified to work as
needed while still being able to be serialized to and from the PB
format (on the wire).

I guess I'm not the best at explaining my ideas. Maybe I should write a
small amount of nearly related, but incomplete, code to demonstrate my
idea?

>
> I think it's fairly simple to go from the .proto file to C# already -
> getting exactly the right patterns to generate .proto files from the
> C# could be quite tricky.

Definitely could be tricky, especially trying to handle all of the
various base data types and compiler-inserted methods and variables
(GetHashCode(), ToString()), but I do believe reflection could solve
all of this. I will write some sample code and get back to you with
that to see if that explains it better.

Alek Storm

Jul 14, 2008, 4:19:25 PM
to Protocol Buffers
On Jul 14, 1:52 am, Matthias Ernst <ernst.matth...@gmail.com> wrote:
> * I've created an output format for protoc that simply dumps the
> FileDescriptorProto to disk. It's in the issue tracker #15: http://code.google.com/p/protobuf/issues/detail?id=15
> * If you bootstrap the C# code for reading FileDescriptorProtos (maybe
> from the C++ code, I have no idea how easy that is with managed C++)
> then you could write the code
> generator in C#.

Hmm... run that sucker on descriptor.proto (contains
FileDescriptorProto and friends) and you've got a completely self-
describing PB stream.

Marc Gravell

Jul 15, 2008, 5:35:37 AM
to Protocol Buffers
> I mean, the protocol buffers compiler could create a .NET Assembly out
> of an existing .proto file (Using System.Reflection) , or create
> a .proto file out of a class marked with a specific attribute (Again,
> using System.Reflection)

We might be talking about the same thing; I have a working
implementation (well, partial implementation) of some of the
fundamental building blocks:
http://groups.google.com/group/protobuf/browse_thread/thread/8e27d92bd6bf22e9#

At the moment it uses the WCF data-contract / service-contract
attributes, since this gives us a lot for free (including a complete
RPC stack). But for mono etc it could happily use something else - but
it seems that WCF gives us a lot of bang per buck.

For member access (for performance reasons) it doesn't use reflection
(GetValue/SetValue) directly; it builds typed delegates (one pair per
property) and re-uses them. An alternative approach would
be something like HyperDescriptor (just search), but that is more
complex.

I don't know the best way to share what code I have... any
suggestions, let me know ;-p

Marc

jonas....@gmail.com

Jul 16, 2008, 2:12:16 AM
to Protocol Buffers
Please don't use WCF, stick with .Net 2.0.
I can do the communication framework (have done quite a few) when
everything else is done.

On 15 July, 11:35, Marc Gravell <marc.grav...@gmail.com> wrote:
> > I mean, the protocol buffers compiler could create a .NET Assembly out
> > of an existing .proto file (Using System.Reflection) , or create
> > a .proto file out of a class marked with a specific attribute (Again,
> > using System.Reflection)
>
> We might be talking about the same thing; I have a working
> implementation (well, partial implementation) of some of the
> fundamental building blocks: http://groups.google.com/group/protobuf/browse_thread/thread/8e27d92b...

Marc Gravell

Jul 16, 2008, 2:58:52 AM
to Protocol Buffers
> Please don't use WCF, stick with .Net 2.0.

Too late - already working ;-p

That said, if you know any good ways (formatters, etc) to remove the
base-64 then I'm all ears... but getting the regular MTOM formatter to
work is my next thing to try...

However, one simple option would be to support multiple attributes
(i.e. check for our own, falling back to DataMemberAttribute if there
isn't one). This means that we could use #SYMBOLS to turn off the .NET
3.0 things for a pure .NET 2.0 build (we did something similar in
"MiscUtil" to support 2.0 vs 3.5).

However; I truly think that supporting the regular .NET 3.0 attributes
is key to making it highly usable: simply, this (at a stroke) makes it
instantly compatible with a whole host of existing code (LINQ-to-SQL,
WCF, mex, etc). [see other thread for more details why using
DataContract gives us many good things; equally, they are doing the
same thing, so it makes good sense to have commonality].

Marc

Marc Gravell

Jul 16, 2008, 3:09:09 AM
to Protocol Buffers
Re "communication framework"; if you mean a standalone RPC channel
(outside of WCF), then that would be fantastic! Not something I've
done before, so I'd greatly appreciate it.

Minor aside: having worked with WCF a bit... I'm not saying it is
fantastic, but I really do like the interface-based contract; it makes
it a breeze to swap the implementation for unit tests and/or dependency
injection. It also means that (taking the .NET 3.0 discussion
separately) a single service/data contract [classes and an interface]
could be used seamlessly over *either* WCF *or* a custom RPC channel,
without the caller needing to know which.

From a code generation perspective, it also means that the core .proto
compiler only has to emit an interface (possibly attributed)
initially, with the ability to add extensions for generating an RPC
client/server later. Finally, it also makes it a breeze to directly
compare the three (regular WCF / WCF with custom serialization / custom
RPC) without any other variables.

So: in the same way that I think supporting .NET 3.0 attributes on
data is a good thing [with #SYMBOLS to disable], I also think that it
would be a good design choice to use the same interface/attribute
pattern for any custom RPC layer; and again, we could use #SYMBOLS or
command-line switches on a standalone .proto compiler to choose
between .NET 2.0 and .NET 3.0.

Thoughts?

Marc

jonas....@gmail.com

Jul 16, 2008, 10:44:09 AM
to Protocol Buffers
What I mean is that I need to get the serialization information from
somewhere (how an object is serialized to/from text/binary), then I'll
build conversion objects at runtime (i.e. creating C# classes at runtime
and compiling them) to get native serialization helpers for each object
type. I've done this in my DAL (http://www.codeplex.com/tinydal)
previously and it improves the performance a bit.

I don't know what you mean by an RPC channel, but I'll create a
socket implementation which does the serialization/deserialization
automatically. You'll receive/send .NET objects through the
connection.

Marc Gravell

Jul 16, 2008, 3:44:06 PM
to Protocol Buffers
I wonder how big the difference in speed would be... unfortunately
there is only one way to know; but I have some preliminary
(pre-optimisation) metrics from the current code, and it is no slouch. I
hope to do some better metrics on the train tomorrow. Unfortunately,
until Jon's port gets a bit further down the line I have nothing
meaningful to compare against to know if it is any good. I also want to
see if I can hook it into the .NET binary serializer to give remoting a
kick (for WCF over MTOM I can get a 5-fold speed boost, based on large
payloads).

Marc