Only one memory allocation for Zero-Copy

dschatz

unread,

Sep 9, 2013, 9:52:04 AM9/9/13

to capn...@googlegroups.com

Is there a way to ensure that a capnproto message only allocates memory once from its Builder?

I work on a system that allows for Zero Copy networking as follows:

1. A client allocates a buffer from the networking subsystem:
char* network_allocate(size_t num_bytes);

the networking subsystem in turn will allocate enough memory to contain packet headers and footers.

2. Client fills the buffer with payload data

3. Client calls the networking subsystem to send the buffer without copying
void network_send(char* buffer)

Using this and capnproto is very appealing if I could achieve both zero copy and no (de)serialization for low latency message passing. But it requires that I only allocate once, from the networking system. Is this possible within capnproto?

Andrew Lutomirski

unread,

Sep 9, 2013, 1:04:36 PM9/9/13

to dschatz, capnproto

You can subclass MessageBuilder for this.

Looking at it, though, I have another question: I bet it's possible to
shave 16 bytes off of single-segment serialized messages. A
multiple-segment message starts with a count of segments. Assuming
that the only allowable message roots are structs (is this true? can
I serialize an Int16?), the first segment of a message will always
start with a struct pointer. So, if the format changed such that the
struct pointer tag wasn't zero, then as long as a multi-segment
message could have no more than, say, 2^62 - 1 segments, a segment
count could never look like a struct pointer. Then single-segment
messages could omit the segment count and the length of the first
message.

Another possible tweak would be to always omit the length of the last
segment, since I doubt that anyone will use the segment table as a way
to make the messages self-delimiting.

(In the case where you're serializing into a circular buffer for, say,
shared-memory IPC, you'd probably just reserve a couple of header
bytes for the total length, and the receiver could check whether the
message wrapped to distinguish between the one- and two-segment
cases.)

--Andy

Andrew Lutomirski

unread,

Sep 9, 2013, 1:06:22 PM9/9/13

to dschatz, capnproto

On Mon, Sep 9, 2013 at 10:04 AM, Andrew Lutomirski <an...@luto.us> wrote:
> On Mon, Sep 9, 2013 at 6:52 AM, dschatz <schatzb...@gmail.com> wrote:
>> Is there a way to ensure that a capnproto message only allocates memory once
>> from its Builder?
>>
>> I work on a system that allows for Zero Copy networking as follows:
>>
>> 1. A client allocates a buffer from the networking subsystem:
>> char* network_allocate(size_t num_bytes);
>>
>> the networking subsystem in turn will allocate enough memory to contain
>> packet headers and footers.
>>
>> 2. Client fills the buffer with payload data
>>
>> 3. Client calls the networking subsystem to send the buffer without copying
>> void network_send(char* buffer)
>>
>> Using this and capnproto is very appealing if I could achieve both zero copy
>> and no (de)serialization for low latency message passing. But it requires
>> that I only allocate once, from the networking system. Is this possible
>> within capnproto?
>
> You can subclass MessageBuilder for this.

Sorry, I read your message wrong. This is difficult with capnproto or
otherwise unless that network_send function takes a pointer and a
length as parameters -- you'd have to know the serialized length in
advance. protobuf can do that, but that's only because protobuf needs
to make a pass over the message to serialize it.

--Andy

dschatz

unread,

Sep 9, 2013, 1:45:25 PM9/9/13

to capn...@googlegroups.com, dschatz, an...@luto.us

The network_send function takes a previously allocated buffer (the size was determined at allocation time, not send time).

This would require saying up front what the "shape" of the message would be before actually giving the data to be placed into the message so that the size could be computed and the memory allocated. In this respect protobuf is also inadequate because I must construct the message in memory in the deserialized format first, then compute the length of the serialized message, allocate the buffer, then copy + serialize to the buffer. Capnproto gets rid of the serialization but it sounds like I still have the memory to memory copy which I was hoping could be avoided.

How difficult would it be to modify capnproto to allow for this?

David Renshaw

unread,

Sep 9, 2013, 1:52:40 PM9/9/13

to dschatz, capnproto, Andrew Lutomirski

MallocMessageBuilder lets you pass in a buffer to use as the message's
first segment:

https://github.com/kentonv/capnproto/blob/master/c%2B%2B/src/capnp/message.h#L312

> --
> You received this message because you are subscribed to the Google Groups
> "Cap'n Proto" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to capnproto+...@googlegroups.com.
> Visit this group at http://groups.google.com/group/capnproto.

Andrew Lutomirski

unread,

Sep 9, 2013, 1:55:27 PM9/9/13

to dschatz, capnproto

On Mon, Sep 9, 2013 at 10:45 AM, dschatz <schatzb...@gmail.com> wrote:
> The network_send function takes a previously allocated buffer (the size was
> determined at allocation time, not send time).
>
> This would require saying up front what the "shape" of the message would be
> before actually giving the data to be placed into the message so that the
> size could be computed and the memory allocated. In this respect protobuf is
> also inadequate because I must construct the message in memory in the
> deserialized format first, then compute the length of the serialized
> message, allocate the buffer, then copy + serialize to the buffer. Capnproto
> gets rid of the serialization but it sounds like I still have the memory to
> memory copy which I was hoping could be avoided.
>
> How difficult would it be to modify capnproto to allow for this?

Presumably impossible. How is capnproto supposed to know how big the
message is before you fill in the message?

What zero copy API are you using? It sounds much more sensible to fix
the API to allow you to ask for space, use part of it, and give the
rest back.

OpenOnload, for example, explicitly supports this when using onload_zc_send.

--Andy

Kenton Varda

unread,

Sep 9, 2013, 2:14:45 PM9/9/13

to dschatz, capnproto

Hi Dan,

I think you could accomplish what you want as long as you can make sure to over-estimate the space needed at allocation time. With Cap'n Proto this should not be too hard since structs have fixed widths. You can then use capnp::FlatMessageBuilder (from capnp/message.h) to build your message in-place, or implement your own MessageBuilder subclass. When it comes time to actually send the message, you can call builder.getSegmentsForOutput()[0].size() to find out the actual size (in words) of the segment if you want to avoid sending the trailing unused space. Alternatively, it's fine if you send the trailing space, as it will be ignored on the receiving end. On the receiving end, you'll want to use capnp::SegmentArrayMessageReader and pass it the whole message as a single segment.

So that just leaves the problem of how to estimate space usage. The Cap'n Proto library can't really help you with this since how would you express a request to estimate the size of a message without actually building it? What you could do instead is build a sample message at start-up time and find out how big in turns out to be. Then, as long as your subsequent messages have the same "shape", you can assume they will be the same size.

If your messages vary widely in size, then there are two options remaining:

- Forgo zero-copy. Build you message as a MallocMessageBuilder and then call the root struct's totalSizeInWords() method and add 1 to find out the size needed to represent the message as a single segment. (You need to add 1 for the root pointer.) Then you can create a FlatMessageBuilder and use its setRoot() method to copy over the whole struct.

- Improve the API of your zero-copy transport such that it allows you to allocate additional segments later if needed. Then you can implement a subclass of MessageBuilder in the obvious way.

-Kenton

On Mon, Sep 9, 2013 at 6:52 AM, dschatz <schatzb...@gmail.com> wrote:

--

dschatz

unread,

Sep 9, 2013, 3:59:59 PM9/9/13

to capn...@googlegroups.com

Hi Kenton,

Some of my messages have lists and are therefore variable length. My hope was that there might be a way to fully describe the shape of the message at run-time (say how big each variable length item in a message would be) then ask how large that would be before actually giving it the data.

Unfortunately the NIC I work with does not support scatter-gather so sending multiple segments would require a copy into a single network buffer (therefore it would no longer be zero-copy).

I will likely go the route of using a FlatMessageBuilder and estimating or adjusting the size of an allocation heuristically. Thanks

Kenton Varda

unread,

Sep 9, 2013, 4:24:38 PM9/9/13

to dschatz, capnproto

On Mon, Sep 9, 2013 at 12:59 PM, dschatz <schatzb...@gmail.com> wrote:

Hi Kenton,

Some of my messages have lists and are therefore variable length. My hope was that there might be a way to fully describe the shape of the message at run-time (say how big each variable length item in a message would be) then ask how large that would be before actually giving it the data.

Honestly, I'm not sure there's be much to gain from a separate API for this. You'd have to make an extra pass over your data to describe the message shape. At that point you may have already lost a lot of the benefit of avoiding a copy later.

Unfortunately the NIC I work with does not support scatter-gather so sending multiple segments would require a copy into a single network buffer (therefore it would no longer be zero-copy).

What about sending each segment as a separate datagram, and then re-assemble them at the receiving end? If your message is big enough to need multiple segments then it probably doesn't fit into a single packet anyway.

-Kenton

Kenton Varda

unread,

Sep 9, 2013, 4:37:41 PM9/9/13

to Andrew Lutomirski, dschatz, capnproto

On Mon, Sep 9, 2013 at 10:04 AM, Andrew Lutomirski <an...@luto.us> wrote:

Looking at it, though, I have another question: I bet it's possible to
shave 16 bytes off of single-segment serialized messages. A
multiple-segment message starts with a count of segments. Assuming
that the only allowable message roots are structs (is this true? can
I serialize an Int16?), the first segment of a message will always
start with a struct pointer. So, if the format changed such that the
struct pointer tag wasn't zero, then as long as a multi-segment
message could have no more than, say, 2^62 - 1 segments, a segment
count could never look like a struct pointer. Then single-segment
messages could omit the segment count and the length of the first
message.

Interesting line of thought. But note that the segment count is 4 bytes, and each segment size is also 4 bytes, so the segment table for a single-segment message is actually only 8 bytes. Also, since the "tag" bits of a pointer are the least-significant bits, I'm not sure how having a non-zero tag for structs would help distinguish them from a segment count.

It would probably be possible to make a rule that the segment count cannot be more than 2^31, so if the top bit is set, then we can interpret the segment table in a completely different way. We could, for instance, interpret bytes 4-7 as the upper bytes of a struct pointer (indicating the section sizes) and assume the lower bytes of that pointer to be all-zero (indicating a struct pointer pointing to the very next word, which is the norm for a message root).

I'm not really sure if this is worth the complication, though.

Another possible tweak would be to always omit the length of the last
segment, since I doubt that anyone will use the segment table as a way
to make the messages self-delimiting.

Actually, it's very much intended for such use. It currently works just fine to write multiple Cap'n Proto messages to a stream without any additional delimiting.

In cases where the transport provides its own framing, people are welcome to skip the "standard" serialization and instead call getSegmentsForOutput() and frame them however they want.

-Kenton

Andrew Lutomirski

unread,

Sep 9, 2013, 4:55:10 PM9/9/13

to Kenton Varda, dschatz, capnproto

On Mon, Sep 9, 2013 at 1:37 PM, Kenton Varda <temp...@gmail.com> wrote:
> On Mon, Sep 9, 2013 at 10:04 AM, Andrew Lutomirski <an...@luto.us> wrote:
>>
>> Looking at it, though, I have another question: I bet it's possible to
>> shave 16 bytes off of single-segment serialized messages. A
>> multiple-segment message starts with a count of segments. Assuming
>> that the only allowable message roots are structs (is this true? can
>> I serialize an Int16?), the first segment of a message will always
>> start with a struct pointer. So, if the format changed such that the
>> struct pointer tag wasn't zero, then as long as a multi-segment
>> message could have no more than, say, 2^62 - 1 segments, a segment
>> count could never look like a struct pointer. Then single-segment
>> messages could omit the segment count and the length of the first
>> message.
>
>
> Interesting line of thought. But note that the segment count is 4 bytes,
> and each segment size is also 4 bytes, so the segment table for a
> single-segment message is actually only 8 bytes. Also, since the "tag" bits
> of a pointer are the least-significant bits, I'm not sure how having a
> non-zero tag for structs would help distinguish them from a segment count.

Whoops -- I read that backwards.

>
> It would probably be possible to make a rule that the segment count cannot
> be more than 2^31, so if the top bit is set, then we can interpret the
> segment table in a completely different way. We could, for instance,
> interpret bytes 4-7 as the upper bytes of a struct pointer (indicating the
> section sizes) and assume the lower bytes of that pointer to be all-zero
> (indicating a struct pointer pointing to the very next word, which is the
> norm for a message root).
>
> I'm not really sure if this is worth the complication, though.

A simpler approach might be to send the number of segments as ((N-1)
<< 1) | 1 -- that is, ensure that the low bit is always set. That
also makes sure that struct pointers never look like segment tables.
In the simplest case, doing just this means that trying to read a
segment table as a struct or vice versa is guaranteed to fail cleanly.

Your fancier thing has the benefit that this kind of struct is still
self-delimiting, which is nice (given the below).

>
>>
>> Another possible tweak would be to always omit the length of the last
>> segment, since I doubt that anyone will use the segment table as a way
>> to make the messages self-delimiting.
>
>
> Actually, it's very much intended for such use. It currently works just
> fine to write multiple Cap'n Proto messages to a stream without any
> additional delimiting.
>
> In cases where the transport provides its own framing, people are welcome to
> skip the "standard" serialization and instead call getSegmentsForOutput()
> and frame them however they want.

--Andy

>
> -Kenton

Kenton Varda

unread,

Sep 9, 2013, 5:06:28 PM9/9/13

to Andrew Lutomirski, dschatz, capnproto

The problem is, I think it's a bit late to change the serialization format. There are users now, and I have ambiguously promised not to break wire compatibility for a while (while making no such promise about the API).

We could eventually introduce a new framing format that users would have to switch to explicitly, though.

Andrew Lutomirski

unread,

Sep 9, 2013, 5:28:10 PM9/9/13

to Kenton Varda, dschatz, capnproto

On Mon, Sep 9, 2013 at 2:06 PM, Kenton Varda <temp...@gmail.com> wrote:
> The problem is, I think it's a bit late to change the serialization format.
> There are users now, and I have ambiguously promised not to break wire
> compatibility for a while (while making no such promise about the API).
>
> We could eventually introduce a new framing format that users would have to
> switch to explicitly, though.
>

Fair enough. I will either grumble a little bit about the wasted word
per message or I might just use single segments exclusively.

What does "for a while" mean?

--Andy

Kenton Varda

unread,

Sep 9, 2013, 11:36:55 PM9/9/13

to Andrew Lutomirski, dschatz, capnproto

On Mon, Sep 9, 2013 at 2:28 PM, Andrew Lutomirski <an...@luto.us> wrote:

Fair enough. I will either grumble a little bit about the wasted word
per message or I might just use single segments exclusively.

FWIW, capnp's encode and decode methods recognize a flag "--flat" which indicates that the message is encoded as a single segment, with no framing. So there is precedent for that. It's just tricky to use if you want to support arbitrarily-sized messages without a copy.

Also, you can always write your own custom framing format. It's not too hard. serialize.c++ is pretty simple and only relies on public interfaces, so fork it and change it as you see fit. :)

What does "for a while" mean?

I think in the 0.1 release announcement I claimed that the wire format wouldn't be changing, since otherwise I felt it would be dishonest to say that the code was "ready to use".

Though I did slightly change the way unions are laid out in 0.2. But I don't think anyone was actually using it yet. There are definitely users now.

-Kenton

Jan Huwald

unread,

Sep 10, 2013, 1:49:18 PM9/10/13

to capn...@googlegroups.com

On 09/09/2013 09:59 PM, dschatz wrote:
> Hi Kenton,
>
> Some of my messages have lists and are therefore variable length. My hope
> was that there might be a way to fully describe the shape of the message at
> run-time (say how big each variable length item in a message would be) then
> ask how large that would be before actually giving it the data.
>
> Unfortunately the NIC I work with does not support scatter-gather so
> sending multiple segments would require a copy into a single network buffer
> (therefore it would no longer be zero-copy).
>
> I will likely go the route of using a FlatMessageBuilder and estimating or
> adjusting the size of an allocation heuristically. Thanks

You can implement a non-copying realloc() if you manage the address
space of your application yourself: Just mmap() a small buffer at a
given offset and add another mmap() directly after it once you need more
space. Multiple buffers are separated by a large address difference
(e.g. 1G).

signature.asc

Reply all

Reply to author

Forward