Shared memory transport


sanj...@gmail.com

Nov 13, 2014, 4:58:00 AM
to capn...@googlegroups.com

Hi Ken,

Is anyone (including you) working on a shared memory transport for Cap'n Proto? If yes, when is it scheduled for release? If no, any idea what intricacies one would have to keep in mind to add such a feature? (In case the undersigned newbie were to undertake it :-) )

The feature is listed on the 0.5 roadmap:
https://github.com/kentonv/capnproto/blob/master/doc/roadmap.md

And there has been earlier discussion on it:
https://groups.google.com/d/msg/capnproto/PFfM82VRGw8/VnaUGHoMM1MJ
and
https://groups.google.com/d/msg/capnproto/rHsPvZtbuEk/MPlUUm6acwgJ

-Sandeep




Kenton Varda

Nov 14, 2014, 11:56:21 PM
to sanj...@gmail.com, capnproto
Hi Sandeep,

I am co-founder of Sandstorm.io, which also has a lot to gain from the shared memory transport. But, we see this as an optimization, and we have so far had higher priorities. Currently we are using RPC over unix sockets instead. When we implement the shared memory transport, I intend to do it in such a way that it kicks in automatically whenever you use a unix socket for RPC.

Can you tell me a bit about your use case? Do you need shared memory RPC urgently, or could you use unix sockets for now knowing that it will be optimized later?

-Kenton

--
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+...@googlegroups.com.
Visit this group at http://groups.google.com/group/capnproto.

sanj...@gmail.com

Nov 16, 2014, 12:27:45 PM
to capn...@googlegroups.com, sanj...@gmail.com

Kenton,

Comments inline below.


On Saturday, November 15, 2014 10:26:21 AM UTC+5:30, Kenton Varda wrote:
Hi Sandeep,

I am co-founder of Sandstorm.io, which also has a lot to gain from the shared memory transport. But, we see this as an optimization, and we have so far had higher priorities. Currently we are using RPC over unix sockets instead. When we implement the shared memory transport, I intend to do it in such a way that it kicks in automatically whenever you use a unix socket for RPC.

As the primary author, it is possible that you would come up with an elegant implementation, but if you don't mind, could you share how you would go about it? I would love to know, and maybe if I get time, I could hack in something temporary which you could pull later if you find the changes acceptable.
 

Can you tell me a bit about your use case? Do you need shared memory RPC urgently, or could you use unix sockets for now knowing that it will be optimized later?

It's for a storage cluster project. You also responded to a private query of ours. I have already tried unix domain sockets, but it's only an incremental improvement. IIRC, as you stated elsewhere on this group, shared memory transport has infinite bandwidth.


 

Kenton Varda

Nov 16, 2014, 7:25:40 PM
to sanj...@gmail.com, capnproto
Hi Sandeep,

I haven't worked out all the details yet, but I think it would look something like this:

Each end of the connection performs the following steps to set up:
- Establish a regular unix socket connection.
- Create a temporary file for outgoing messages. Try to create the file on a tmpfs if possible so that it isn't written to disk.
- mmap the temporary file into memory read-write. This is the "outgoing" buffer.
- Open a second file descriptor to the same file with O_RDONLY.
- Send this read-only file descriptor over the unix socket as an SCM_RIGHTS message (see http://man7.org/linux/man-pages/man7/unix.7.html).
- Receive the read-only file descriptor sent by the peer for their end of the connection.
- mmap the received descriptor read-only. This is the "incoming" buffer.

Now, communication can happen something like this:
- You have "incoming" and "outgoing" buffers (per above).
- When you construct a new message, allocate space for it from your "outgoing" buffer. You will need to write a custom subclass of capnp::MessageBuilder which does this allocation, as an alternative to MallocMessageBuilder. Many different allocation strategies could make sense.
- Once the message is complete and ready to be sent, you need to notify the peer that a message is available and transmit the segment table -- that is, a list of offsets within the buffer where each message segment appears and the size of each segment. (More on this below.)
- The peer receives notification that your message is available and creates a MessageReader for it.
- The MessageReader is passed to application code, etc.
- When the MessageReader is destroyed, the peer sends notification back that the message is no longer needed.
- You may then mark the buffer space used by the message for reuse.

Two problems left:
- How exactly does notification (of message availability, and then message consumption) work?
- How is the segment table transmitted?

It may make sense to just use the unix socket for these. These notifications would be short, so they should find their way to the other process quickly. The segment table could be transmitted as part of this message. This keeps things simple.

Another option would be to maintain a linked list of "notifications" inside the buffers themselves. Start from the first 8 bytes of the buffer. These act as a pointer to the first notification. Initially, the pointer is null. When the first notification is ready, its offset is written to these first 8 bytes. The notification itself contains a pointer to the following notification, which again starts null. Every time the list is extended, the sender must also signal the receiving end to consume the new notifications; this can be done in a number of ways, such as a unix signal, or perhaps on Linux by using futex(2) on the memory location itself. The sender of a notification can free the notification's linked list node as soon as the peer has sent a notification indicating that they have consumed it.

Note that I don't have any idea if signals or futex will be any faster in practice than sending notifications to the unix socket. (Maybe Andy could comment on this?)

Note also that I don't actually have much experience with shared memory, so it's possible that this isn't a great idea and there's something much better out there. Perhaps it would make more sense for the notifications to be kept in a ring buffer rather than a linked list, for example.

If you want to use Cap'n Proto RPC on top of your shared memory message passing, you just need to write a custom subclass of capnp::VatNetwork based on your code. You can look at capnp::TwoPartyVatNetwork as a starting point for this.

Eventually I would like to extend TwoPartyVatNetwork itself to auto-detect when the connection is a Unix socket and use the strategy above. I would probably start out by writing notifications to the socket rather than try to do the linked list thing.

-Kenton

Andrew Lutomirski

Nov 18, 2014, 8:24:25 PM
to Kenton Varda, sanj...@gmail.com, capnproto
On Sun, Nov 16, 2014 at 4:25 PM, Kenton Varda <ken...@sandstorm.io> wrote:
> Hi Sandeep,
>
> I haven't worked out all the details yet, but I think it would look
> something like this:
>
> Each end of the connection performs the following steps to set up:
> - Establish a regular unix socket connection.
> - Create a temporary file for outgoing messages. Try to create the file on a
> tmpfs if possible so that it isn't written to disk.
> - mmap the temporary file into memory read-write. This is the "outgoing"
> buffer.
> - Open a second file descriptor to the same file with O_RDONLY.

In the interest of avoiding nasty DoS issues, I would suggest
depending on very new kernels and doing it a bit differently:

- Create a memfd using memfd_create with mode 0000 or perhaps 0400
- Set its size
- Set F_SEAL_SHRINK using fcntl
- Open a read-only fd using /proc (hmm, not ideal)
- Pass that fd using SCM_RIGHTS
- Receiver checks for the F_SEAL_SHRINK seal and mmaps it.

If the /proc dependency is a problem, maybe that can be fixed in the kernel.

--Andy

Kenton Varda

Nov 19, 2014, 12:57:49 AM
to Andrew Lutomirski, sanj...@gmail.com, capnproto
On Tue, Nov 18, 2014 at 5:24 PM, Andrew Lutomirski <an...@luto.us> wrote:
In the interest of avoiding nasty DoS issues, I would suggest
depending on very new kernels and doing it a bit differently:

Is this just SIGBUS you're talking about?

(For others' sake: the problem is that if a file that has been mmap'd is then truncated, and then the memory map is accessed beyond the truncation point, SIGBUS may be raised. This could potentially be exploited to crash the peer process.)

I think the general advice here should be: do not use a shared memory transport to talk to a process you don't trust unless you are really sure you know what you're doing. Even with memfd and seals, TOCTOU vulnerabilities are still a big problem.

For Sandstorm, SIGBUS is a non-issue since the supervisor process is specific to the grain anyway and killing it will just cause the grain to die. TOCTOU can be solved by making sure that each field is read at most once. (Of course, Cap'n Proto itself is currently vulnerable to TOCTOU issues which will need to be fixed before Sandstorm uses a shared memory transport.)

-Kenton

Alex Elsayed

Nov 19, 2014, 2:26:32 AM
to capn...@googlegroups.com
As I understand it, TOCTOU is considerably less of an issue with memfds
because seals can only be added, never removed.

So, you could actually choose where on the TOCTOU tradeoff continuum you
want to sit:

F_SEAL_SHRINK: Can't be SIGBUS'd
F_SEAL_SHRINK|F_SEAL_WRITE: Can't be TOCTOU'd, but you need a new buffer for each message sent (performance cost)

After all, memfds were originally designed _explicitly_ for cases where
TOCTOU attacks are unacceptable - passing buffers to Wayland compositors, or
large messages with KDBUS.

Just a heads-up performance wise, though - Linus is fond of saying that
zero-copy is almost never worth it, and generally speaking he has a point.

A significant amount of benchmarking by the KDBUS guys showed that 512K is a
surprisingly universal tipping point for zero-copy messages (where you
fiddle with memory mappings once per message).

It holds across architectures (ARM, x86, x86-64, PPC) and machine size
(phones and tablets through many-core huge servers) - below that, you lose
more performance to poking the necessary bits related to memory protection
domains than you gain from avoiding copies.

And if you do a persistent shared region, you tend to then start losing on
cache contention surprisingly early.

Sandeep Joshi

Nov 19, 2014, 2:58:16 AM
to Kenton Varda, capnproto
On Mon, Nov 17, 2014 at 5:55 AM, Kenton Varda <ken...@sandstorm.io> wrote:
Hi Sandeep,

I haven't worked out all the details yet, but I think it would look something like this:

Each end of the connection performs the following steps to set up:
- Establish a regular unix socket connection.
- Create a temporary file for outgoing messages. Try to create the file on a tmpfs if possible so that it isn't written to disk.
- mmap the temporary file into memory read-write. This is the "outgoing" buffer.
- Open a second file descriptor to the same file with O_RDONLY.
- Send this read-only file descriptor over the unix socket as an SCM_RIGHTS message (see http://man7.org/linux/man-pages/man7/unix.7.html).
- Receive the read-only file descriptor sent by the peer for their end of the connection.
- mmap the received descriptor read-only. This is the "incoming" buffer.

Now, communication can happen something like this:
- You have "incoming" and "outgoing" buffers (per above).
- When you construct a new message, allocate space for it from your "outgoing" buffer. You will need to write a custom subclass of capnp::MessageBuilder which does this allocation, as an alternative to MallocMessageBuilder. Many different allocation strategies could make sense.
- Once the message is complete and ready to be sent, you need to notify the peer that a message is available and transmit the segment table -- that is, a list of offsets within the buffer where each message segment appears and the size of each segment. (More on this below.)
- The peer receives notification that your message is available and creates a MessageReader for it.
- The MessageReader is passed to application code, etc.
- When the MessageReader is destroyed, the peer sends notification back that the message is no longer needed.
- You may then mark the buffer space used by the message for reuse.

Two problems left:
- How exactly does notification (of message availability, and then message consumption) work?
- How is the segment table transmitted?

It may make sense to just use the unix socket for these. These notifications would be short so should find their way to the other process quickly. The segment table could be transmitted as part of this message. This keeps things simple.


The shared memory message builder and reader are not complicated to implement, but I had a couple of questions regarding notifications after looking at the code. I guess I am not yet intimately familiar with the design philosophy that was followed.

1) Should notifications be exchanged in an arbitrary format, or is there some wire protocol to be followed? Is it expected that any data structure that goes over the wire has to be in Cap'n Proto format (i.e., segment table + data if non-shared)? How does one handle the case where the notification, instead of being transmitted over a socket fd, gets stored in a file and reread later?

2) If one has to add a ring buffer in shared memory to Cap'n Proto, can one use Boost, or does one have to reimplement the logic so as to fit into the "kj" framework (like the existing kj::Vector, String, and Tuple classes)? I noticed that Cap'n Proto doesn't use Boost. Was this a conscious decision?

-Sandeep

Kenton Varda

Nov 20, 2014, 12:01:36 AM
to Alex Elsayed, capnproto
On Tue, Nov 18, 2014 at 11:26 PM, Alex Elsayed <etern...@gmail.com> wrote:
As I understand it, TOCTOU is considerably less of an issue with memfds
because seals can only be added, never removed. 

So, you could actually choose where on the TOCTOU tradeoff continuum you
want to sit:

F_SEAL_SHRINK: Can't be SIGBUS'd
F_SEAL_SHRINK|F_SEAL_WRITE: Can't be TOCTOU'd, but need a new buffer for each message sent (performance cost)

Right. My sense is that this performance cost would defeat the purpose. It's almost certainly cheaper to just do a memcpy of the content than to allocate a whole new memfd for every message sent.
 
A significant amount of benchmarking by the KDBUS guys showed that 512K is a
surprisingly universal tipping point for zero-copy messages (where you
fiddle with memory mappings once per message).

Looks like my sense is right.
 
And if you do a persistent shared region, you tend to then start losing on
cache contention surprisingly early.

This I'd like to know more about. Why is a persistent shared region bad for cache? Shouldn't it be strictly better than allocating the message in one region in the sender process, then copying to kernel buffers, then copying back out into space in the receiver?

-Kenton

Kenton Varda

Nov 20, 2014, 12:12:58 AM
to Sandeep Joshi, capnproto
On Tue, Nov 18, 2014 at 11:58 PM, Sandeep Joshi <sanj...@gmail.com> wrote:
The Shared message builder and reader are not complicated to implement but I had a couple of questions regarding notifications after looking at the code.   I guess I am not yet intimate on the design philosophy which was followed.

1) Should notifications be exchanged in an arbitrary format or is there some wire protocol to be followed ?

You're inventing the protocol. :) This need for "notifications" is specific to shared memory messaging -- the "notification" being "a new message is now available in the buffer" or "I am done with this message; you can free it".
 
Is it expected that any data structure that goes over the wire has to be in Cap'n Proto format (i.e., segment table + data if non-shared)?

Again, up to you. There are no existing expectations about how this should work.

If the goal is to support Cap'n Proto RPC, then what you're ultimately trying to implement is a two-way stream of Cap'n Proto messages (in the format defined in rpc.capnp). But at a lower level, you may have information (especially these "notifications") that is represented in some other format.
 
How does one handle the case where the notification, instead of being transmitted over a socket fd, gets stored in a file and reread later ?

In general the RPC protocol makes no sense if saved to a file.
 
2) If one has to add a ring buffer in shared memory to CapnProto, can one use Boost or does one have to reimplement the logic so as to fit into the "kj" framework (like the existing kj::vector, string, tuple classes) ?  I noticed that CapnProto doesn't use Boost.  Was this a conscious decision?

You can use whatever you want, but keep in mind that the memory segment may be mapped at different addresses in the two processes; therefore, pointers written by the sending process won't point to the right place on the receiving end. You will need to use relative pointers instead, which probably means you can't pass any regular C++ class over shared memory. (This is part of Cap'n Proto's reason for existing: Cap'n Proto messages use relative pointers and therefore can be transmitted.)

Cap'n Proto does not use boost and even avoids a lot of the C++ standard library because I feel that C++11 is such a huge change in the nature of C++ that a lot of existing libraries are now obsolete -- designed in ways that are no longer ideal. KJ is a clean framework library in pure C++11 style. But, there's nothing preventing you from using Boost or other C++ libraries together with it.

-Kenton

Alex Elsayed

Nov 20, 2014, 12:13:09 AM
to capn...@googlegroups.com
Kenton Varda wrote:

> On Tue, Nov 18, 2014 at 11:26 PM, Alex Elsayed <etern...@gmail.com>
> wrote:
>
>> As I understand it, TOCTOU is considerably less of an issue with memfds
>> because seals can only be added, never removed.
>
>
>> So, you could actually choose where on the TOCTOU tradeoff continuum you
>> want to sit:
>>
>> F_SEAL_SHRINK: Can't be SIGBUS'd
>> F_SEAL_SHRINK|F_SEAL_WRITE: Can't be TOCTOU'd, but need a new buffer for
>> each message sent (performance cost)
>>
>
> Right. My sense is that this performance cost would defeat the purpose.
> It's almost certainly cheaper to just do a memcpy of the content than to
> allocate a whole new memfd for every message sent.
>
>> A significant amount of benchmarking by the KDBUS guys showed that 512K
>> is a
>> surprisingly universal tipping point for zero-copy messages (where you
>> fiddle with memory mappings once per message).
>>
>
> Looks like my sense is right.

Yes and no. For small messages, definitely, but there are plenty of
exceptions.

One exception, and why memfds are interesting to the Wayland folks, is that
an uncompressed 8-bit-per-channel 1080p buffer is right around six
_megabytes_ - far beyond the turnover point.

Even if it was subsampled YUV, it'd still be _well_ past 512k.

>> And if you do a persistent shared region, you tend to then start losing
>> on cache contention surprisingly early.
>>
>
> This I'd like to know more about. Why is a persistent shared region bad
> for cache? Shouldn't it be strictly better than allocating the message in
> one region in the sender process, then copying to kernel buffers, then
> copying back out into space in the receiver?
>
> -Kenton
>

This paper provides a good look at the issues:
http://sbesc.lisha.ufsc.br/sbesc2014/dl219

"Results have shown that the execution time of an application is affected by
the contention for shared memory (up to 3.8 times slower)"

For reference, MOESI and MESIF are the most-current cache control protocols
used by AMD and Intel, respectively. MESI is the direct predecessor of both.

Kenton Varda

Nov 20, 2014, 6:13:57 PM
to Alex Elsayed, capnproto
On Wed, Nov 19, 2014 at 9:12 PM, Alex Elsayed <etern...@gmail.com> wrote:
> Looks like my sense is right.

Yes and no. For small messages, definitely, but there are plenty of
exceptions.

Right, but Cap'n Proto RPC messages are rarely more than a couple kB.
 
This paper provides a good look at the issues:
http://sbesc.lisha.ufsc.br/sbesc2014/dl219 

Thanks. Will try to read this at some point.

Naively, it seems to me that even if you make a copy (e.g. by writing through a unix socket), the data still needs to change hands from core A to core B at some point, and you'll suffer the same caching performance hit then.

If you simply make sure that each cache line belongs to exactly one message -- and therefore is not going to be accessed by both cores at the same time -- does that solve the problem?

-Kenton

Andrew Lutomirski

Nov 20, 2014, 6:25:37 PM
to Kenton Varda, Alex Elsayed, capnproto
I think that, if you cacheline align things, then sharing the memory
straight through is likely to be at least as fast and probably faster
than any other way of sending data back and forth. (With one
exception: memcpy will cause the hardware to notice that the copy is
streaming, so it will prefault well. If you randomly poke at
different cachelines in shared memory, you might defeat that.)

--Andy

Rajiv Kurian

Nov 21, 2014, 6:15:30 PM
to capn...@googlegroups.com, ken...@sandstorm.io, etern...@gmail.com, an...@luto.us
Assuming a single reader and a single writer, why not just use a couple of SPSC ring buffers (for bidirectional communication), each written on top of a memory-mapped file (tmpfs or otherwise)? The producer and consumer sequences would be in the mapped buffer too. This would lead to a wait-free algorithm that is easy to code using C++11, or in languages like Java with a sophisticated enough memory model (memory_order_consume + memory_order_release). In fact, the producer and consumer could be written in different languages (Java/C++?). There is inherent back pressure due to the fixed size of the ring buffer. The consumer would have to poll with some kind of waiting strategy (some combination of busy spin + _mm_pause, sleep, yield, whatever) to detect new messages. If you need notifications, they would have to be provided out of band.

Rajiv

Rajiv Kurian

Nov 21, 2014, 6:24:03 PM
to capn...@googlegroups.com, ken...@sandstorm.io, etern...@gmail.com, an...@luto.us

Actually, just memory_order_acquire and memory_order_release would be sufficient. Given that the messages are variable length, we'll need some way to tell the consumer to skip to the end of the buffer when the producer requires more space than remains there. Maybe a one-byte header for every message, with a bit that tells the consumer to skip ahead. The producer can then start writing from the beginning of the buffer again, knowing that the consumer will skip the unused space.

Kenton Varda

Nov 21, 2014, 7:53:29 PM
to Rajiv Kurian, capnproto, Alex Elsayed, Andrew Lutomirski
Hi Rajiv,

Yes, a ring buffer might also be a good option (though wait-freeness and back pressure can both be accomplished with a linked list as well).

But I don't think spinning or polling when the queue is empty is a good idea. The queue is going to spend most of its time empty, but when a message arrives, you want to respond to it with minimal latency. Spinning makes sense only when you expect waits to be short. In fact, now that I think of it, even futex() is probably the wrong primitive here, because it is again designed for short waits. In particular, a futex is not epoll-able, which is a pretty critical feature for a message pipe. I suspect that ultimately writing a byte to a unix socket is the best signaling mechanism here.

-Kenton


Andrew Lutomirski

Nov 21, 2014, 8:11:04 PM
to Kenton Varda, Rajiv Kurian, capnproto, Alex Elsayed
On Fri, Nov 21, 2014 at 4:53 PM, Kenton Varda <ken...@sandstorm.io> wrote:
> Hi Rajiv,
>
> Yes, a ring buffer might also be a good option (though wait-freeness and
> back pressure can both be accomplished with a linked list as well).
>
> But I don't think spinning or polling when the queue is empty is a good
> idea. The queue is going to spend most of its time empty, but when a message
> arrives, you want to respond to it with minimal latency. Spinning makes
> sense only when you expect waits to be short. In fact, now that I think of
> it, even futex() is probably the wrong primitive here, because it is again
> designed for short waits. In particular, a futex is not epoll-able, which is
> a pretty critical feature for a message pipe. I suspect that ultimately
> writing a byte to a unix socket is the best signaling mechanism here.

Blech. We have eventfd for that :)

--Andy

Kenton Varda

Nov 21, 2014, 8:14:44 PM
to Andrew Lutomirski, Rajiv Kurian, capnproto, Alex Elsayed
On Fri, Nov 21, 2014 at 5:10 PM, Andrew Lutomirski <an...@luto.us> wrote:
Blech.  We have eventfd for that :)

Indeed! Forgot about that. (But is it any faster, or just a nicer interface?)

Andrew Lutomirski

Nov 21, 2014, 8:18:27 PM
to Kenton Varda, Rajiv Kurian, capnproto, Alex Elsayed
It's considerably faster than pipes due to (I think) mtime issues.
It's probably faster than sockets, too, because it won't have to
allocate memory.

Tim Brandt

Oct 19, 2015, 5:52:09 PM
to Cap'n Proto, ken...@sandstorm.io, geet...@gmail.com, etern...@gmail.com, an...@luto.us
Has anyone made a shared memory transport? I am working on one, and it is not fitting into my architecture.

Tim

jus...@specialbusservice.com

Feb 14, 2017, 7:13:53 PM
to Cap'n Proto, ken...@sandstorm.io, geet...@gmail.com, etern...@gmail.com, an...@luto.us
I made a Go memfd shared memory Cap'n Proto transport: https://godoc.org/github.com/justincormack/go-memfd

My primary use case is to experiment with sending messages to privileged processes, hence a fully sealed memfd without reuse, but you can do what you like with the seals. I have not looked at performance; I haven't implemented proper remap support yet, so growing the arena is definitely going to be slow.

Justin