Hi Sandeep,
I haven't worked out all the details yet, but I think it would look something like this:
Each end of the connection performs the following steps to set up:
- Establish a regular unix socket connection.
- Create a temporary file for outgoing messages, on a tmpfs if possible so that it is never actually written to disk.
- mmap the temporary file into memory read-write. This is the "outgoing" buffer.
- Open a second file descriptor to the same file with O_RDONLY.
- Send this read-only descriptor to the peer over the unix socket (via SCM_RIGHTS).
- Receive the read-only file descriptor sent by the peer for their end of the connection.
- mmap the received descriptor read-only. This is the "incoming" buffer.
Now, communication can happen something like this:
- You have "incoming" and "outgoing" buffers (per above).
- When you construct a new message, allocate space for it from your "outgoing" buffer. You will need to write a custom subclass of capnp::MessageBuilder which does this allocation, as an alternative to MallocMessageBuilder. Many different allocation strategies could make sense.
- Once the message is complete and ready to be sent, you need to notify the peer that a message is available and transmit the segment table -- that is, a list of offsets within the buffer where each message segment appears and the size of each segment. (More on this below.)
- The peer receives notification that your message is available and creates a MessageReader for it.
- The MessageReader is passed to application code, etc.
- When the MessageReader is destroyed, the peer sends notification back that the message is no longer needed.
- You may then mark the buffer space used by the message for reuse.
Two problems left:
- How exactly does notification (of message availability, and then message consumption) work?
- How is the segment table transmitted?
It may make sense to just use the unix socket for both. These notifications would be short, so they should reach the other process quickly, and the segment table could be transmitted as part of the notification message. This keeps things simple.
Another option would be to maintain a linked list of "notifications" inside the buffers themselves. The first 8 bytes of the buffer act as a pointer to the first notification; initially, this pointer is null. When the first notification is ready, its offset is written into those 8 bytes. The notification itself contains a pointer to the following notification, which again starts out null. Every time the list is extended, the sender must also signal the receiving end to consume the new notifications; this could be done in a number of ways, such as a unix signal, or perhaps on Linux by using futex(2) on the memory location itself. The sender can free a notification's linked list node as soon as the peer has sent a notification back indicating that it has been consumed.
Note that I don't have any idea if signals or futex will be any faster in practice than sending notifications to the unix socket. (Maybe Andy could comment on this?)
Note also that I don't actually have much experience with shared memory, so it's possible that this isn't a great idea and there's something much better out there. Perhaps it would make more sense for the notifications to be kept in a ring buffer rather than a linked list, for example.
If you want to use Cap'n Proto RPC on top of your shared memory message passing, you just need to write a custom subclass of capnp::VatNetwork based on your code. You can look at capnp::TwoPartyVatNetwork as a starting point for this.
Eventually I would like to extend TwoPartyVatNetwork itself to auto-detect when the connection is a Unix socket and use the strategy above. I would probably start out by writing notifications to the socket rather than trying to do the linked list thing.
-Kenton