Re: Cap'n proto generics

356 views
Skip to first unread message

Kenton Varda

unread,
Feb 21, 2014, 5:17:11 PM2/21/14
to Carter Schonwald, capnproto
Moving this to the mailing list for open discussion...

It occurs to me that my requirement (2) essentially means "type erasure".  I've never been good at using the correct academic terminology.  :)


On Sat, Feb 15, 2014 at 11:56 PM, Kenton Varda <temp...@gmail.com> wrote:
Hi Carter,

I think generics should look something like this:

    struct Foo(Bar, Baz) {
      # Foo is generic.  Bar and Baz are type parameters.

      a @0 :Int32;
      b @1 :Bar;
      c @2 :Text;
      d @3 :List(Baz);
    }

    struct Qux {
      e @0 :Foo(Text, SomeStruct);
      # Use of Foo().
    }

Notice that we use function call syntax for generics.  `Foo` is, after all, a type function.  I hate that C++ and Java use <> because they introduce all sorts of ambiguity with arithmetic operators, and there really is no reason to distinguish from value functions.

There are two important requirements I want to highlight:

1) Generic Cap'n Proto schemas should be implementable using the generics available in common programming languages -- that is, for the declarations above, the C++ code generator should generate a template class Foo<Bar, Baz>, a Java code generator should use Java generics, etc.

2) It should be possible to take an existing specific type and "genericize" it without breaking wire compatibility, and vice-versa.  That is, if I defined:

    struct Foo2 {
      a @0 :Int32;
      b @1 :Text;
      c @2 :Text;
      d @3 :List(SomeStruct);
    }

This type has exactly the same wire format as `Foo(Text, SomeStruct)`.

Unfortunately, I think the above requirements imply a sad limitation:  we can only allow type parameters to bind to pointer types, not primitives, much like in Java.  The reason for this is that the algorithm to compute offsets of primitives within the struct's data section is complicated, with every field's offset depending on the sizes of all the fields before it.  It seems unfeasible to compute the offset dynamically in template or runtime code.

If we were willing to give up either of the two requirements, then we could possibly get around this.  For instance, if we dropped (1) and instead said that code would only be generated for specific instances used in the schema files, then we could compute all offsets in advance the same way we do now.  Or, if we were willing to drop (2) then we could say that fields of generic type always occupy a whole word in each of the sections, which is enough space to hold any kind of value.

However, I think requirements (1) and (2) are both more important to me than the ability to parameterize non-pointer types.  The pointer-only requirement can be worked around, after all, by "boxing" primitive types as one would in Java, which at worst wastes 127 bits (which is much better than, say, boxing in Java).

If we agree that this is an acceptable limitation, then I think the remaining details fall into place intuitively.  For any given generic type, we generate one version of the type where all the type parameters are bound to `AnyPointer`, and then we generate a C++ template/Java generic/whatever wrapper around that type which interprets all of the AnyPointers according to the specific parameterization.  The most complicated implementation issues are around the schema and dynamic API in C++, but that's only because there are some unusual implementation requirements in C++ (mostly around avoiding heap allocation); I'm guessing C++-specific details don't interest you.  :)

Thoughts?

-Kenton


On Sat, Feb 15, 2014 at 10:16 PM, Carter Schonwald <carter.s...@gmail.com> wrote:
Not yet subscribed to the mailing list (amidst travel this week).

Anywho, please feel welcome to brain dump away.   Having a nice story for "template"/ monomorphized at run time generics (or whatever you want to call em) would settle one of the bigger design smells I think I'm (seemingly) stuck on wrt the protocol. (I like it overall mind you)

Anywho, id love to hear your brain dump on this topic.

-Carter


Jason Paryani

unread,
Feb 21, 2014, 6:51:05 PM2/21/14
to Kenton Varda, Carter Schonwald, capnproto
If we were willing to give up either of the two requirements, then we could possibly get around this.  For instance, if we dropped (1) and instead said that code would only be generated for specific instances used in the schema files, then we could compute all offsets in advance the same way we do now.  Or, if we were willing to drop (2) then we could say that fields of generic type always occupy a whole word in each of the sections, which is enough space to hold any kind of value.

This seems like a pretty big limitation. I imagine a lot of people (ie. me) would want to use this in the context of numerics, and be unable to without paying the cost of boxing. I haven't given it too much thought, but is there any way to keep (1), but use template specialization to deal with pre-computing offsets and what-not? I'm guessing this would be a C++ only solution though, since most other language's generics don't allow for specialization.
 

If we agree that this is an acceptable limitation, then I think the remaining details fall into place intuitively.  For any given generic type, we generate one version of the type where all the type parameters are bound to `AnyPointer`, and then we generate a C++ template/Java generic/whatever wrapper around that type which interprets all of the AnyPointers according to the specific parameterization.  The most complicated implementation issues are around the schema and dynamic API in C++, but that's only because there are some unusual implementation requirements in C++ (mostly around avoiding heap allocation); I'm guessing C++-specific details don't interest you.  :)

Would there be any big changes to how the dynamic API looks? Tangentially, what kind of changes would have to happen at the parsed schema level?

Kenton Varda

unread,
Feb 21, 2014, 7:17:25 PM2/21/14
to Jason Paryani, Carter Schonwald, capnproto
On Fri, Feb 21, 2014 at 3:51 PM, Jason Paryani <jpar...@gmail.com> wrote:
This seems like a pretty big limitation. I imagine a lot of people (ie. me) would want to use this in the context of numerics, and be unable to without paying the cost of boxing. I haven't given it too much thought, but is there any way to keep (1), but use template specialization to deal with pre-computing offsets and what-not? I'm guessing this would be a C++ only solution though, since most other language's generics don't allow for specialization.

Specialization doesn't really work because you'd need at least (number of types)^(number of type parameters) specializations to cover everything.  With only one type parameter it seems plausible, but with two or more it clearly isn't.

Also, as you say, that seems like it only works for generated C++ code.  Not only are there still difficulties with other languages, but also with dynamic schemas.

Regarding numerics, would it be fair to say that most of the problem is with lists of numbers, not individual fields?  As it turns out, the overhead of boxing magically goes away if we're talking about a list, because lists of structs are flattened and compacted.  E.g. if you have a struct containing a single UInt16 field, then a list of this struct will in fact only use two bytes per element (rounded up to one 64-bit word, plus a one-word tag).
 
Would there be any big changes to how the dynamic API looks? Tangentially, what kind of changes would have to happen at the parsed schema level?

The schema API would need some changes.  I haven't completely figured out what these would look like yet.  The `Schema` type would somehow need to contain not just a pointer to the schema proto but also a pointer to the type parameter assignments.  Doing this while keeping the property that getting a Schema for any compile-time-known specialization requires no allocation will be tricky, but doable I think.  For runtime specialization you'll of course need to allocate something, but perhaps that can be something you use SchemaLoader for.

The dynamic API (dynamic.h) probably won't change at all.  Once you have a Schema representing the specialization you're interested in, there's no reason for the dynamic API to treat it any differently from a non-generic type.

-Kenton

Carter Schonwald

unread,
Feb 21, 2014, 7:30:43 PM2/21/14
to Jason Paryani, Kenton Varda, capnproto
Jason, i don't know what you're working on, but i'm doing some hpc numerical stuff myself, and the solution is to have it be a pointer to an unboxed array, rather than eg an array of pointers


I'm ok with type erasure personally / Boxing.

 It does mean that a consumer of the generic type  will likely need a wee "type witness" value too unless they've arranged to have the right code in place to consume  the generic value. And thats fine! In my use cases I very much want to provide serialization formats that are "trust but verify" in their usage.  Type erasure is generally a *good* thing! Lots of weird issues happen when parametricity / type erasure die.


point being, i like this general design direction :) 
 


Carter Schonwald

unread,
Feb 21, 2014, 7:31:17 PM2/21/14
to Kenton Varda, Jason Paryani, capnproto
ok cool, yeah, some sort of wee "type witness" is exactly the piece i'm wondering about. 

Jason Paryani

unread,
Feb 21, 2014, 8:07:50 PM2/21/14
to Carter Schonwald, Kenton Varda, capnproto
Regarding numerics, would it be fair to say that most of the problem is with lists of numbers, not individual fields?  As it turns out, the overhead of boxing magically goes away if we're talking about a list, because lists of structs are flattened and compacted.  E.g. if you have a struct containing a single UInt16 field, then a list of this struct will in fact only use two bytes per element (rounded up to one 64-bit word, plus a one-word tag).

Ah ok, I forgot about the fact that lists will compact structs. I guess that would solve my complaint in most scenarios. Does this compaction still occur if there's a nested struct? ie:

struct Foo(Bar) {
  a @0 :Int32;
  b @1 :Bar;
}

struct Int32Box {
  value @0 :Int32;
}

struct Qux {
  e @0 :List(Foo(Int32Box));
}


Kenton Varda

unread,
Feb 21, 2014, 8:13:14 PM2/21/14
to Jason Paryani, Carter Schonwald, capnproto
On Fri, Feb 21, 2014 at 5:07 PM, Jason Paryani <jpar...@gmail.com> wrote:
Does this compaction still occur if there's a nested struct?

No, only the first level gets compacted.  In your example case, Foo.b will always be a pointer to an independently-allocated object.

Andreas Stenius

unread,
Feb 22, 2014, 2:05:15 PM2/22/14
to Kenton Varda, capnproto
2014-02-21 23:17 GMT+01:00 Kenton Varda <temp...@gmail.com>:
[...]
However, I think requirements (1) and (2) are both more important to me than the ability to parameterize non-pointer types.  The pointer-only requirement can be worked around, after all, by "boxing" primitive types as one would in Java, which at worst wastes 127 bits (which is much better than, say, boxing in Java).


Why is (1) so important?
Also, how about languages that don't have generics in the first place.. there point (1) doesn't even apply (at least not in the way as it was stated; there may still be some form of limitations to fulfill pt 1).
I feel (albeit vaguely) that it ought to be enough to have generated code for every specialization used in the schema, rather than to provide a generic schema type.. 


Andrew Lutomirski

unread,
Feb 22, 2014, 8:10:38 PM2/22/14
to Andreas Stenius, capnproto, Kenton Varda

C++ templates are Turing complete, so C++ can, in theory, have the best of both worlds.

/me cowers :)

>
> --
> You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+...@googlegroups.com.
> Visit this group at http://groups.google.com/group/capnproto.

Carter Schonwald

unread,
Feb 22, 2014, 9:10:42 PM2/22/14
to Andrew Lutomirski, Andreas Stenius, capnproto, Kenton Varda
does anyone have a real use case for the magic unboxed generics? 

Carter Schonwald

unread,
Feb 22, 2014, 10:23:07 PM2/22/14
to Andrew Lutomirski, Andreas Stenius, capnproto, Kenton Varda
i perhaps said that too harshly.

that said, i'm interested in generics support that I can reasonably hope to work reliably in a wide variety of client languages, compiled / interpreted / vmed all, and have it just work™.  

Kenton, its up to you at the end of the day, but the boxed + type witness style approach sounds great to me (and If I can help move that forward,  please tell me how :) )

Kenton Varda

unread,
Feb 23, 2014, 4:01:59 PM2/23/14
to Andreas Stenius, capnproto
On Sat, Feb 22, 2014 at 11:05 AM, Andreas Stenius <g...@astekk.se> wrote:
2014-02-21 23:17 GMT+01:00 Kenton Varda <temp...@gmail.com>:
[...]
However, I think requirements (1) and (2) are both more important to me than the ability to parameterize non-pointer types.  The pointer-only requirement can be worked around, after all, by "boxing" primitive types as one would in Java, which at worst wastes 127 bits (which is much better than, say, boxing in Java).


Why is (1) so important?

Two important reasons:

1) It's what people expect, therefore it will be more intuitive and require less learning.

2) People will want to write generic code in their language of choice which operates on the generic types.  That's probably going to be difficult if the types are not defined using the generics of the target language.  E.g. if I have a Cap'n Proto map struct Map(Key, Value), I may want to define some helpers in C++ like:

  template <typename K, typename V>
  Maybe<ReaderFor<V>> lookup(Map::Reader<K, V> map, ReaderFor<K> key);

  template <typename K, typename V>
  Builder<V> insert(Map::Builder<K, V> map, ReaderFor<K> key);

If the generated types weren't templates then I wouldn't be able to write these helpers generically.
 
Also, how about languages that don't have generics in the first place.. there point (1) doesn't even apply (at least not in the way as it was stated; there may still be some form of limitations to fulfill pt 1).

In those languages, the goal would be to use whatever idioms normally replace generics.  In the worst case I think we'd just generate the type as if all the type parameters were bound to `AnyPointer`.  (Note again that this is only sufficient if we say that type parameters can only be bound to pointer types.)
 
I feel (albeit vaguely) that it ought to be enough to have generated code for every specialization used in the schema, rather than to provide a generic schema type.. 

Another problem with this approach comes in cases where multiple independent schemas import a shared schema and then specialize one of its generic types in the same way.  Where does that specialization get generated?  Do we end up with two copies of the same type?  C++ templates have this problem and compilers and linkers do a lot of hacky things to deal with it; not sure I want to reproduce all that in the Cap'n Proto compiler.

-Kenton

Andrew Lutomirski

unread,
Feb 23, 2014, 8:28:29 PM2/23/14
to Kenton Varda, Andreas Stenius, capnproto
On Sun, Feb 23, 2014 at 2:01 PM, Kenton Varda <temp...@gmail.com> wrote:
> On Sat, Feb 22, 2014 at 11:05 AM, Andreas Stenius <g...@astekk.se> wrote:
>>
>> 2014-02-21 23:17 GMT+01:00 Kenton Varda <temp...@gmail.com>:
>>>>
>>>> [...]
>>>> However, I think requirements (1) and (2) are both more important to me
>>>> than the ability to parameterize non-pointer types. The pointer-only
>>>> requirement can be worked around, after all, by "boxing" primitive types as
>>>> one would in Java, which at worst wastes 127 bits (which is much better
>>>> than, say, boxing in Java).
>>>>
>>
>> Why is (1) so important?
>
>
> Two important reasons:
>
> 1) It's what people expect, therefore it will be more intuitive and require
> less learning.
>
> 2) People will want to write generic code in their language of choice which
> operates on the generic types. That's probably going to be difficult if the
> types are not defined using the generics of the target language. E.g. if I
> have a Cap'n Proto map struct Map(Key, Value), I may want to define some
> helpers in C++ like:
>
> template <typename K, typename V>
> Maybe<ReaderFor<V>> lookup(Map::Reader<K, V> map, ReaderFor<K> key);
>
> template <typename K, typename V>
> Builder<V> insert(Map::Builder<K, V> map, ReaderFor<K> key);
>
> If the generated types weren't templates then I wouldn't be able to write
> these helpers generically.

For C++ in particular, this is a non-issue: a generic could be a
template with no definition, and each type in a .capnp file could
generate an explicit specialization.

That does not address the equivalent problem in Java, though, AFAIK.
I don't know about Haskell and such, nor do I know how .NET handles
this.

--Andy

Carter Schonwald

unread,
Feb 24, 2014, 9:18:20 PM2/24/14
to Andrew Lutomirski, Kenton Varda, Andreas Stenius, capnproto
As a point of order, i'm totally ok with the case where making a preexisting field generic breaks wire compatibility if that enables some flexibility in how much unboxing or not can be done on generic fields. Totally totally ok with it. You'd better have versioning anyways :)  


Carter Schonwald

unread,
Feb 24, 2014, 9:21:09 PM2/24/14
to Andrew Lutomirski, Kenton Varda, Andreas Stenius, capnproto
likewise, Haskell has pretty good support for handling unboxed / struct-like data in a generic way (that can be pretty human friendly to the end user). IN principle tooling for other langs can do the same.

 I understand that Java is slightly hosed because it can't even have an unsigned Int type, but thats a separate topic :) 

Kenton Varda

unread,
Feb 25, 2014, 7:41:20 PM2/25/14
to Carter Tazio Schonwald, capnproto
Oops, apparently I didn't reply-all before.


On Tue, Feb 25, 2014 at 4:33 PM, Carter Tazio Schonwald <carter.s...@gmail.com> wrote:
ok, thats super reasonable. I'm honestly still new to dealing with those sorts of issues. :)

-Carter


On Tuesday, February 25, 2014 at 1:44 AM, Kenton Varda wrote:

On Mon, Feb 24, 2014 at 6:18 PM, Carter Schonwald <carter.s...@gmail.com> wrote:
As a point of order, i'm totally ok with the case where making a preexisting field generic breaks wire compatibility if that enables some flexibility in how much unboxing or not can be done on generic fields. Totally totally ok with it. You'd better have versioning anyways :)  

Well, I'm not.  :)

Cap'n Proto inherits the Protobuf philosophy that eschews "versioning" in favor of a more fluid kind of compatibility where old code simply ignores new fields it doesn't know about and so there's no need to update the world after every change.  If updating everyone is necessary then people will often choose to implement ugly hacks where they shoe-horn one piece of data into some other type that wasn't designed to hold it and eventually the whole protocol descends into madness.

-Kenton


Carter Schonwald

unread,
Feb 25, 2014, 8:59:37 PM2/25/14
to Kenton Varda, capnproto
ahh, I thought it was a private aside :)

so in that sense, capn proto is meant to "make sure the schema writer" plans for growth in their format?

Greg Stark

unread,
Feb 26, 2014, 4:38:00 AM2/26/14
to Kenton Varda, Carter Tazio Schonwald, capnproto

Is there a way to programmatically test two schemas to check if the new one is compatible with the old one?

--
greg

Andreas Stenius

unread,
Feb 26, 2014, 9:05:22 AM2/26/14
to Greg Stark, Kenton Varda, Carter Tazio Schonwald, capnproto
2014-02-26 10:38 GMT+01:00 Greg Stark <st...@mit.edu>:

Is there a way to programmatically test two schemas to check if the new one is compatible with the old one?


I don't know the inner workings of it, but the evolution test for the compiler might be the closest thing:

Cheers

Kenton Varda

unread,
Feb 26, 2014, 5:33:36 PM2/26/14
to Greg Stark, Carter Tazio Schonwald, capnproto
On Wed, Feb 26, 2014 at 1:38 AM, Greg Stark <st...@mit.edu> wrote:

Is there a way to programmatically test two schemas to check if the new one is compatible with the old one?

In fact, there is.  If you make a SchemaLoader and use it to load two different schemas with the same type ID, it will check that one is a compatible upgrade from the other, and will use that one.

That said, I'm not very happy with the SchemaLoader class right now.  The interface is weird and the implementation feels awkward.  So there's a possibility things will change in the future.  But I definitely think that checking compatibility is something that should be provided.  E.g. I think it would be awesome if people could set up a git hook that checks that changes they make to a .capnp file do not break compatibility.
Reply all
Reply to author
Forward
0 new messages