RPC protocol straw man


Kenton Varda

Jun 27, 2013, 5:50:30 PM
to capnproto

Andrew Lutomirski

Jun 27, 2013, 8:54:34 PM
to Kenton Varda, capnproto
On Thu, Jun 27, 2013 at 2:50 PM, Kenton Varda <temp...@gmail.com> wrote:
> https://gist.github.com/kentonv/5880714

32-bit callIds may not be enough without some wrapping support.

What does releasing a capability mean? Is it guaranteed that a new
call cannot target a released capability? Can the endpoint that
created a capability destroy it asynchronously?

Cancel may be problematic -- the callee will probably have to store
the list of capabilities returned by a successful call indefinitely so
it knows to release them if the call is later canceled.

Should Fail contain an Object to encode a richer exception? (Quite
possibly not -- the runtime interface could get unnecessarily
complicated.)

--Andy

>
> Thoughts?
>

Kenton Varda

Jun 27, 2013, 10:17:06 PM
to Andrew Lutomirski, capnproto
Thanks for the feedback!

On Thu, Jun 27, 2013 at 5:54 PM, Andrew Lutomirski <an...@luto.us> wrote:
On Thu, Jun 27, 2013 at 2:50 PM, Kenton Varda <temp...@gmail.com> wrote:
> https://gist.github.com/kentonv/5880714

32-bit callIds may not be enough without some wrapping support.

Hmm.  I think it would be safe to reuse call IDs.  The rule would be that a client cannot reuse a call ID until the previous call has returned or failed, to avoid ambiguity.  Also, a Cancel frame always refers to the most-recent call with the corresponding ID...  though I'll have to think about how this affects out-of-order packets.

OTOH, extending the call ID to 64 bits is probably OK, too...
 
What does releasing a capability mean?  Is it guaranteed that a new
call cannot target a released capability?  Can the endpoint that
created a capability destroy it asynchronously?

It means that the caller no longer holds the capability, which implies that no further calls will be made to it.

The endpoint can do whatever it wants.  The target object can spontaneously become inaccessible (i.e. all calls fail) at any time.  Releasing is only advisory.
 
Cancel may be problematic -- the callee will probably have to store
the list of capabilities returned by a successful call indefinitely so
it knows to release them if the call is later canceled.

If the capability cares about Release then presumably the callee has to keep track of it anyway.  Once all the capabilities returned by a call are released, the call ID can be forgotten entirely.

That said, now that I think about it, it makes a lot more sense to just say that if the call returns normally after a Cancel then the caps are live and need to be released.
 
Should Fail contain an Object to encode a richer exception?  (Quite
possibly not -- the runtime interface could get unnecessarily
complicated.)

I don't think so.  My philosophy on exceptions is that they aren't meant to be handled programmatically, except for the limited purpose of deciding whether to retry and what to log/display to the user.  The RPC exception struct corresponds closely to kj::Exception, which does not contain any such auxiliary data and is not intended to.  Any kind of error that needs to be handled programmatically should be encoded in the return value instead of thrown.

But this could easily be added later if we really want it.

-Kenton

Kenton Varda

Jun 28, 2013, 5:14:37 PM
to capnproto
Hmm, one thing that's definitely missing from my straw man is the ability to pass a capability back to its origin and not have it proxy back and forth over the connection redundantly.  Although, you could make an argument that the proxying is a feature, not a bug, in that it's useful to be able to revoke all capabilities passed over a particular connection by simply killing the connection.  But in the case of passing a capability back to its own source, that's probably a strained argument...

A simple solution is to extend CapDescriptor with a bit saying "I came from you".  The recipient then knows to interpret the cap ID as living on its own end, rather than the other end, of the connection.

That solves the case where a capability is sent back through the path from which it came, which is probably enough to serve 95% of use cases well.  It doesn't solve the case where a capability is passed around in a loop, eventually returning to its own origin.  Unfortunately I think this case gets massively more complicated.  If a capability crosses multiple hops, presumably the cap ID needs to be wrapped somehow.  To be secure, either the wrapping has to involve cryptography or each hop has to keep some local state that allows it to route calls.  Either approach would make it non-trivial to detect a loop, especially if we want to keep crypto out of the core protocol (which seems wise).

I think the core protocol should not attempt to solve the loop case.  Instead, this problem should be lumped into the more general problem of short-circuiting capabilities that have taken multiple hops.  That is, if a capability is passed from A to B to C to D, it's useful for performance if D has some way to connect directly to A to make calls, rather than proxying through C and B.  This won't be part of the core protocol because it depends deeply on network topology, something to which the core protocol is completely agnostic.  It may also involve crypto, as I mentioned.  In any case, a solution to the short-circuit problem could obviously be used to solve the loop problem as well.

-Kenton


Geoffrey Romer

Jun 28, 2013, 6:15:13 PM
to Kenton Varda, capnproto
How will you handle fault-tolerance? You mention out-of-order delivery, but what about frames that are delivered more or less than one time?


Kenton Varda

Jun 28, 2013, 8:05:01 PM
to Geoffrey Romer, capnproto
On Fri, Jun 28, 2013 at 3:15 PM, Geoffrey Romer <gro...@google.com> wrote:
How will you handle fault-tolerance? You mention out-of-order delivery, but what about frames that are delivered more or less than one time?

An excellent question.

I don't think that this is something that can be done completely transparently.  Two generals' problem, etc.  At the end of the day, all RPC methods are going to have to be idempotent at the application level.  At that point, the RPC system itself can implement optimizations which squelch known repeats, but these don't have to be 100% reliable.  The callee app may receive a call twice, and simply has to carry it out twice.  On the caller side, the RPC system could retry automatically, although my experience at Google is that most apps disabled this and implemented their own retry semantics (to be fair, I mostly worked on interactive servers), so I'm not sure how much effort should be put into that.

All that said, I do want the protocol to be amenable to UDP.  This probably means we need to define retry rules for individual frames.  Maybe something like...

- Call should be retried until it is ack'd by a Return or a Fail.  (Do we need a separate way to ack calls that take a while?)
- Cancel is not ack'd; the caller knows it succeeded when it gets a Return or Fail.
- If a spurious Cancel (one whose call ID matches no current call) is received, a Fail should be sent back.
- Spurious Returns or Fails should be ignored.

Open problems:
- I think Release needs to be ack'd.  This may imply that Release messages need their own IDs.
- I don't think call IDs are reusable in this scenario, at least in theory.  In practice, wrap-around would probably take long enough to purge any hung calls.  But perhaps we should go ahead and use 64-bit IDs anyway?
- A lower layer managing the UDP connection may want to send pings and such.  Should it define its own outer frame to distinguish these from RPC packets or should we extend Frame itself?
- Anything else?

-Kenton

Andrew Lutomirski

Jun 28, 2013, 9:05:47 PM
to Kenton Varda, Geoffrey Romer, capnproto
On Fri, Jun 28, 2013 at 5:05 PM, Kenton Varda <temp...@gmail.com> wrote:
> On Fri, Jun 28, 2013 at 3:15 PM, Geoffrey Romer <gro...@google.com> wrote:
>>
>> How will you handle fault-tolerance? You mention out-of-order delivery,
>> but what about frames that are delivered more or less than one time?
>
>
> An excellent question.
>
> I don't think that this is something that can be done completely
> transparently. Two generals' problem, etc. At the end of the day, all RPC
> methods are going to have to be idempotent at the application level.

[...]

Please don't do any of the above (or below), or at least leave it to
whatever session / transport layer lives underneath the RPC layer. I
think that an RPC session should speak (reliable) messages below and
capnp/function calls above. Function calls and callIds don't outlast
their RPC session.

If the RPC session wants to have graceful (and non-idempotent, please)
support for unreliable transports, that could be useful. But IMO this
should be a conscious choice of the user (i.e. one RPC user could run
in reliable mode -- the whole session can die, but other than that,
everything is in order; another could use idempotent mode, where
reordering and repeats could happen; yet another could have spurious
failures due to drops (in either direction, perhaps) but would never
repeat anything).

I use both protobufs and thrift here (b/c thrift predates open-source
protobufs (and yes, I was at Google long enough ago to have used
protobuf v1)), and in both cases I don't use their RPC systems at all
because they force me into infrastructure that I don't want. If capnp
ends up flexible enough that I can run it with sensible semantics over
my own transport and doesn't get in the way, then I'll use it. "My
own transport" is awesome, very fast, in-order, never loses messages,
and is not really idempotent. I'll open-source it someday.

(Shameless plug: if any of you know of a really good infrastructure
developer who wants an unusual sort of infrastructure job, please send
them to www.amacapital.net -- this will significantly speed up the
open-sourcing of said transport. These are fun, too:
https://en.wikipedia.org/wiki/NVDIMM)

--Andy


Kenton Varda

Jun 29, 2013, 3:11:20 PM
to Andrew Lutomirski, Geoffrey Romer, capnproto
Oh, to be clear, all the retry semantics I described would be optional.  You'd tell the RPC layer whether your transport was reliable or not and it would behave accordingly.

I think there is potential for big wins from using UDP rather than TCP, but I think most of those wins require building fault tolerance into the RPC protocol itself rather than pushing it down to a lower layer.  So, I want to keep that possibility available, but not impose it.

Geoffrey Romer

Jul 1, 2013, 1:11:40 PM
to Kenton Varda, capnproto
First, a caveat: this is very much not my area of expertise, so take everything I say with plenty of salt.

On Fri, Jun 28, 2013 at 5:05 PM, Kenton Varda <temp...@gmail.com> wrote:
On Fri, Jun 28, 2013 at 3:15 PM, Geoffrey Romer <gro...@google.com> wrote:
How will you handle fault-tolerance? You mention out-of-order delivery, but what about frames that are delivered more or less than one time?

An excellent question.

I don't think that this is something that can be done completely transparently.  Two generals' problem, etc.  At the end of the day, all RPC methods are going to have to be idempotent at the application level.  At that point, the RPC system itself can implement optimizations which squelch known repeats, but these don't have to be 100% reliable.  The callee app may receive a call twice, and simply has to carry it out twice.  On the caller side, the RPC system could retry automatically, although my experience at Google is that most apps disabled this and implemented their own retry semantics (to be fair, I mostly worked on interactive servers), so I'm not sure how much effort should be put into that.

That doesn't match my experience, but then my experience is with backend servers. I've certainly seen systems implement their own retry semantics, but I don't recall seeing them turn off the lower-level retry. In any event, Google is such an outlier in most respects that experience there may not be directly applicable: Cap'n Proto should certainly support Google-scale applications, but it also needs to scale down to support projects with much more constrained bandwidth, less reliable networks, and fewer engineers to throw at RPC optimization.
 

All that said, I do want the protocol to be amenable to UDP.  This probably means we need to define retry rules for individual frames.  Maybe something like...

- Call should be retried until it is ack'd by a Return or a Fail.  (Do we need a separate way to ack calls that take a while?)

Yes, I think you need a separate ack, at least as an option (probably the default, since it's doubtful if most clients will have access to Andrew's magical transport layer).

As you probably know, tail latency (latency of slow requests, relative to the median latency) is a critical performance characteristic of many distributed systems, and retries are a key tool for controlling it. However, retry-based strategies only work if a likely failure can be detected within a small fraction of the total expected latency (if you can't detect a failure except by noticing that the final answer is slow to arrive, you've doubled the failure-case latency). This is compounded by the fact that Cap'n Proto doesn't know how long a given call is expected to take, so it can't tell on its own when a reply is delayed enough to warrant a retry. The client could of course tell it how long to wait, but this adds complexity to the client, and isn't always feasible in the first place (my perception may be skewed here, since I maintain an RPC service that answers SQL queries, so the expected latency is extremely difficult to predict even approximately). 

An ack message neatly solves these problems, since a missing ack message can be noticed quickly, potentially well before the whole operation is expected to complete, and the expected latency of an ack message is far more predictable than the latency of the final reply (the RPC layer could potentially even infer it, without guidance from the client). Without an ack message, though, clients who have tight latency requirements will have to fake it by specifying their protocol to return placeholder messages right away, with the 'real' reply sent later as a separate Call message in the reverse direction. In other words, these clients will be forced to specify an asynchronous protocol.

This sucks because protocol definitions should ideally be synchronous at the specification level, and Cap'n Proto should avoid forcing protocol designers to use asynchronous designs. There are several related reasons for this:
- Synchrony is a better default choice: sync APIs are much easier to work with, and async APIs are generally avoided unless forced by performance requirements.
- Given a synchronous protocol definition, Cap'n Proto can (and should) optionally generate an async client API, but the reverse is not true.
- By the same token, the choice to make a protocol definition asynchronous is irreversible, whereas clients can switch at will (and independently) between sync and async generated APIs if the protocol specification is synchronous.
- The RPC implementation is likely to be asynchronous, which means it can provide an async API efficiently; it would be wasteful for client code to implement another independent layer of async bookkeeping.
These are all basically different ways of saying that synchrony is a higher level of description, and asynchrony is an implementation detail; protocol definitions should avoid specifying implementation details if possible, in order to maximize implementation freedom.

A harder problem is how the client can detect failures after the initial ack; it's obviously the destination's job to retry the Return/Fail until it gets an acknowledgement, but the source may need to be able to recover from things like server crashes and network partitions. One obvious option would be for the destination to send periodic "keepalive" messages.

(Side note: I'm trying to avoid saying "client" and "server" here since the protocol is symmetric, but might it make sense to use those terms in the context of a single Call? Speaking of which, you use those terms in the documentation of CapDescriptor.needsRelease, but it's not clear to me what they mean in that context.)

A related question: have you thought about supporting segmented messages and/or streaming at the protocol level? Ack and keepalive messages could conceivably be implemented as initial segments of a multi-segment message or stream (although the semantics are a little different since you don't need to fill in gaps).
 
- Cancel is not ack'd; the caller knows it succeeded when it gets a Return or Fail.
- If a spurious Cancel (one whose call ID matches no current call) is received, a Fail should be sent back.

Doesn't this make Cancels non-idempotent? How should a client handle a Fail response?
 
- Spurious Returns or Fails should be ignored.

I assume you mean they're not exposed to the client code; they should still at least be acked, right?
 

Open problems:
- I think Release needs to be ack'd.  This may imply that Release messages need their own IDs.

You're probably going to have to recover from misbehaving clients that fail to Release properly for one reason or another anyway, and the mechanism for doing so (e.g. LRU garbage collection?) might handle dropped Releases without the need for explicit acks. OTOH it might not (e.g. if the strategy is to just detect and disconnect misbehaving clients), so maybe you do need this to maximize implementation flexibility.

The only reason I see not to support this (with maybe an option to turn it off) is the concern about adding an extra field to Release, but why not just reuse the capID itself in the ack or, failing that, a hash of the capID?
 
- I don't think call IDs are reusable in this scenario, at least in theory.  In practice, wrap-around would probably take long enough to purge any hung calls.  But perhaps we should go ahead and use 64-bit IDs anyway?

I can't think of any plausible use case where 32 bits would be insufficient (assuming of course that you skip IDs that are still live), but this is starting to feel like "640K ought to be enough for anybody". Among other things, 64 bits would give you the option of assigning IDs randomly, which could be useful in some settings.
 
- A lower layer managing the UDP connection may want to send pings and such.  Should it define its own outer frame to distinguish these from RPC packets or should we extend Frame itself?

There's nothing forcing this decision now, right? I suggest assuming the former unless/until you find a reason it doesn't work; you can always extend Frame later if you have to.
 
- Anything else?

-Kenton

Kenton Varda

Jul 1, 2013, 9:17:01 PM
to Geoffrey Romer, capnproto
Thanks, Geoffrey, you brought up a lot of good points I hadn't really thought of.  (Unfortunately, I am not an expert at RPC either...  only at serialization.)

Yes, I agree that calls need to be ack'd -- even over TCP -- to keep tail latency low.

I wonder if we could cleverly separate these transport issues from the handling of the RPC layer itself.  Imagine this interface...

  class Transport {
  public:
    class Request {
    public:
      virtual void wait(Time timeout = -1) = 0;
      // Start trying to send the request, blocking until a reply is received.
      // Throws an exception if the timeout is reached or if the transport
      // fails permanently.  The transport takes care of re-sending the message
      // as appropriate depending on the reliability of the underlying medium.

      virtual void receivedReply() = 0;
      // Call when a reply to this request has been received.  wait() will
      // then return.
    };

    virtual Own<Request> startRequest(const Message& message) = 0;
    // Start a new request.  Call wait() on it to begin sending the message.

    virtual void sendReply(const Message& message) = 0;
    // Sends a reply message.  There is no guarantee that the message actually
    // reaches its recipient, but the assumption is that the requester will retry
    // the request if it does not receive a reply.

    virtual void setReceiveCallback(Function<void(Message&&)> callback) = 0;
    // Set callback to be called when a message is received.  The callback is
    // only executed once at a time, so it must return quickly to ensure the
    // network remains responsive.  Normally all the callback should do is
    // call receivedReply() on the appropriate outstanding request or arrange
    // for further processing on some other thread.
    //
    // Some received messages may be duplicates.  The receiver should be
    // designed to deal gracefully with this possibility.
  };

I think this interface encapsulates the different ack strategies needed by different transports.

TCP:  wait() sends the message and waits for an ack.  The receiver sends the ack back immediately on receiving the request, before it begins processing.  If the sender does not receive an ack within a reasonable amount of time (determined as a function of the link latency), it is assumed the connection is dead, and some handler is invoked to re-establish the connection or give up.  Otherwise, if an ack is received, the sender politely waits for the reply.  Perhaps a periodic ping is sent to verify connection health.

UDP:  wait() sends the message repeatedly -- with some heuristic back-off -- until it receives an ack or the timeout expires.  The receiver ack's requests immediately before processing only so that the sender knows it can stop sending.

Unix domain socket:  No retry mechanism is necessary.

On Mon, Jul 1, 2013 at 10:11 AM, Geoffrey Romer <gro...@google.com> wrote:
That doesn't match my experience, but then my experience is with backend servers. I've certainly seen systems implement their own retry semantics, but I don't recall seeing them turn off the lower-level retry.

Yeah, probably the difference is that I mostly worked on servers that dealt with interactive requests.  Batch processing is completely different.  Cap'n Proto should handle both.
 
In any event, Google is such an outlier in most respects that experience there may not be directly applicable: Cap'n Proto should certainly support Google-scale applications, but it also needs to scale down to support projects with much more constrained bandwidth, less reliable networks, and fewer engineers to throw at RPC optimization.

Agreed.
 

- Cancel is not ack'd; the caller knows it succeeded when it gets a Return or Fail.
- If a spurious Cancel (one whose call ID matches no current call) is received, a Fail should be sent back.

Doesn't this make Cancels non-idempotent? How should a client handle a Fail response?

The idea here is that a spurious Cancel either belongs to a Call that already returned or a Call that was never received.  In the former case, returning a Fail is fine because the caller will have already received the Return and so will ignore the Fail.  In the latter case, returning a Fail is correct because that's what would be returned anyway if the Call had actually been received.

- Spurious Returns or Fails should be ignored.

I assume you mean they're not exposed to the client code; they should still at least be acked, right?

I'm not sure they need to be ack'd, since the caller will retry if it doesn't get a response.  The callee can hold on to the response for a little while in order to be able to re-send it in response to duplicate calls.

If the response contains capabilities that need release, things get a little tricky.  If the callee receives a duplicate Call shortly after sending the corresponding Return, it doesn't know if the caller sent this Call just before getting the Return (in which case it successfully received and is using the returned caps) or if it failed to receive the Return altogether (in which case it will use the caps in the second Return).  So, the callee better return exactly the same response with the same caps.  To that end, it will have to hold on to the response message for a little while.  Once it has dropped the response message, it will have to respond to any duplicate Call by releasing the caps and replying with a Fail.
 
 

Open problems:
- I think Release needs to be ack'd.  This may imply that Release messages need their own IDs.

You're probably going to have to recover from misbehaving clients that fail to Release properly for one reason or another anyway, and the mechanism for doing so (e.g. LRU garbage collection?) might handle dropped Releases without the need for explicit acks. OTOH it might not (e.g. if the strategy is to just detect and disconnect misbehaving clients), so maybe you do need this to maximize implementation flexibility.

All caps are implicitly released on disconnect.  Disconnecting clients that allocate too many caps seems like a good idea regardless.

But yes, it seems clear that Releases need to be ack'd.
 
- I don't think call IDs are reusable in this scenario, at least in theory.  In practice, wrap-around would probably take long enough to purge any hung calls.  But perhaps we should go ahead and use 64-bit IDs anyway?

I can't think of any plausible use case where 32 bits would be insufficient (assuming of course that you skip IDs that are still live), but this is starting to feel like "640K ought to be enough for anybody". Among other things, 64 bits would give you the option of assigning IDs randomly, which could be useful in some settings.

I think it's plausible to imagine a long-running connection that performs more than 4G requests.  Remotely plausible, but plausible, especially if you implemented a persistent transport layer.  I don't think IDs can be reused even after the call completes since this would make it hard for the server to detect duplicate calls and efficiently reply with the same response...

Kenton Varda

Jul 1, 2013, 11:09:38 PM
to Geoffrey Romer, capnproto
On Mon, Jul 1, 2013 at 6:17 PM, Kenton Varda <temp...@gmail.com> wrote:
- Spurious Returns or Fails should be ignored.

I assume you mean they're not exposed to the client code; they should still at least be acked, right?

I'm not sure they need to be ack'd, since the caller will retry if it doesn't get a response.  The callee can hold on to the response for a little while in order to be able to re-send it in response to duplicate calls.

If the response contains capabilities that need release, things get a little tricky.  If the callee receives a duplicate Call shortly after sending the corresponding Return, it doesn't know if the caller sent this Call just before getting the Return (in which case it successfully received and is using the returned caps) or if it failed to receive the Return altogether (in which case it will use the caps in the second Return).  So, the callee better return exactly the same response with the same caps.  To that end, it will have to hold on to the response message for a little while.  Once it has dropped the response message, it will have to respond to any duplicate Call by releasing the caps and replying with a Fail.

I guess the other approach here is for a Return to expect an ack...  but to avoid a two-generals problem, the callee has to keep trying to send the Return until it is either acked or the connection is deemed dead.  Otherwise there's no way to know whether the caps are live.

Kenton Varda

Jul 2, 2013, 12:26:16 AM
to Geoffrey Romer, capnproto
Missed these...

On Mon, Jul 1, 2013 at 10:11 AM, Geoffrey Romer <gro...@google.com> wrote:
(Side note: I'm trying to avoid saying "client" and "server" here since the protocol is symmetric, but might it make sense to use those terms in the context of a single Call? Speaking of which, you use those terms in the documentation of CapDescriptor.needsRelease, but it's not clear to me what they mean in that context.)

I try to use the terms "caller" and "callee", although I keep catching myself mixing them up.  I think it's fine to say "client" and "server" instead (when talking about specific interactions where it's obvious which is which).  In the case of needsRelease, I meant "client" to be the side holding the reference, and "server" to be the side implementing it; I'll try to make it clearer.
 
A related question: have you thought about supporting segmented messages and/or streaming at the protocol level? Ack and keepalive messages could conceivably be implemented as initial segments of a multi-segment message or stream (although the semantics are a little different since you don't need to fill in gaps).

Messages are already arranged in segments, though the intent was to allow progressive memory allocation, not so much to help with transport.

I think using the Transport interface I specified, it would make a lot of sense for the transport itself to interleave large messages so that they don't starve smaller ones.  Interestingly, we can actually allow the receiver to start operating on a message as soon as the first segment is received, since it'll make a virtual call the first time each segment is accessed, giving the transport an opportunity to block.

All that said, for a protocol involving large file transfers, the right thing to do is to create a capability representing the stream and then make multiple calls to that capability to transfer small chunks of data.  This way the app has the ability to display a progress bar or do other things that require being aware of the transfer progress...  and the serialization layer doesn't need to be well-optimized for ginormous messages.

Andrew Lutomirski

Jul 2, 2013, 3:58:34 PM
to Geoffrey Romer, Kenton Varda, capnproto
On Mon, Jul 1, 2013 at 10:11 AM, Geoffrey Romer <gro...@google.com> wrote:
>
>
> Yes, I think you need a separate ack, at least as an option (probably the
> default, since it's doubtful if most clients will have access to Andrew's
> magical transport layer).

Mine's local anyway, so it doesn't help with this particular use.

>
> As you probably know, tail latency (latency of slow requests, relative to
> the median latency) is a critical performance characteristic of many
> distributed systems, and retries are a key tool for controlling it. However,
> retry-based strategies only work if a likely failure can be detected within
> a small fraction of the total expected latency (if you can't detect a
> failure except by noticing that the final answer is slow to arrive, you've
> doubled the failure-case latency).

I wonder if this will be necessary once TCP tail loss probing is
widely available. See:

https://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6ba8a3b19e764b6a65e4030ab0999be50c291e6c

The TCP / other lower layer stack probably has a better idea of RTT
than capnp anyway.

Kenton Varda

Jul 2, 2013, 4:20:55 PM
to Andrew Lutomirski, Geoffrey Romer, capnproto
On Tue, Jul 2, 2013 at 12:58 PM, Andrew Lutomirski <an...@luto.us> wrote:
On Mon, Jul 1, 2013 at 10:11 AM, Geoffrey Romer <gro...@google.com> wrote:
>
>
> Yes, I think you need a separate ack, at least as an option (probably the
> default, since it's doubtful if most clients will have access to Andrew's
> magical transport layer).

Mine's local anyway, so it doesn't help with this particular use.

Local as in IPC?  Well that's...  somewhat less magical.  :P  Do you use shared memory?  I'm planning to build a transport that uses shared memory and futexes to mostly avoid the kernel.
 
> As you probably know, tail latency (latency of slow requests, relative to
> the median latency) is a critical performance characteristic of many
> distributed systems, and retries are a key tool for controlling it. However,
> retry-based strategies only work if a likely failure can be detected within
> a small fraction of the total expected latency (if you can't detect a
> failure except by noticing that the final answer is slow to arrive, you've
> doubled the failure-case latency).

I wonder if this will be necessary once TCP tail loss probing is
widely available.  See:

https://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6ba8a3b19e764b6a65e4030ab0999be50c291e6c

The TCP / other lower layer stack probably has a better idea of RTT
than capnp anyway.

This helps, but note that this is a different meaning of "tail" than what Geoffrey meant.  It looks like TLP helps recover quickly in the case of one-off packet loss, but if the target machine has died completely then we need some mechanism to quickly time out the whole connection and move on to a different server.
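The dead-server case can be sketched as a simple failover loop: abandon a connection whose deadline has passed and retry the request against another replica. This is an editorial sketch with hypothetical names (`ServerDown`, `call_with_failover`), not a proposed API.

```python
# Illustrative sketch: if a server is unresponsive past a deadline, give
# up on that connection entirely and retry against a different replica.

class ServerDown(Exception):
    pass

def call_with_failover(servers, request, timed_out):
    """Try each server in turn; timed_out(server) simulates whether the
    connection to that server has exceeded its deadline."""
    last_error = None
    for server in servers:
        try:
            if timed_out(server):
                raise ServerDown(server["name"] + ": connection timed out")
            return server["handler"](request)
        except ServerDown as e:
            last_error = e       # abandon this connection, move on
    raise last_error or RuntimeError("no servers available")
```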

Andrew Lutomirski

unread,
Jul 2, 2013, 6:04:17 PM7/2/13
to Kenton Varda, Geoffrey Romer, capnproto
On Tue, Jul 2, 2013 at 1:20 PM, Kenton Varda <temp...@gmail.com> wrote:
> On Tue, Jul 2, 2013 at 12:58 PM, Andrew Lutomirski <an...@luto.us> wrote:
>>
>> On Mon, Jul 1, 2013 at 10:11 AM, Geoffrey Romer <gro...@google.com> wrote:
>> >
>> >
>> > Yes, I think you need a separate ack, at least as an option (probably
>> > the
>> > default, since it's doubtful if most clients will have access to
>> > Andrew's
>> > magical transport layer).
>>
>> Mine's local anyway, so it doesn't help with this particular use.
>
>
> Local as in IPC? Well that's... somewhat less magical. :P Do you use
> shared memory? I'm planning to build a transport that uses shared memory
> and futexes to mostly avoid the kernel.

My transport has two main modes: one-to-one and one-to-many. In both
cases, the sender has a background thread that fallocates and mmaps a
new file whenever output space is getting low, and the receiver has a
background thread that maps those files read-only. (This is a
terrible idea if you care about jitter, as it turns out. Linux
handles this workload very poorly if you try to prefault the pages.)

The presence of a complete message is signalled by a write to the byte
just before that message. This requires no barrier for the writer and
is unlikely to dirty an unnecessary cacheline. (Down side: if the
reader polls, then it's possible to bounce that cacheline a couple
times.)
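A toy model of this flag-byte scheme: in the real (C) implementation the interesting part is memory ordering and cacheline behavior, which Python cannot show, so this sketch only illustrates the layout and polling logic. The layout `[flag][len][payload]` is an assumption for the sake of the example.

```python
# Toy model of "write a byte just before the message to signal
# completion": fill the payload and length first, flip the flag last.

READY = 1

def append_message(buf, offsets, msg):
    """Writer: reserve [flag][len][payload]; the flag byte is the last
    thing written, so a set flag implies a complete message."""
    start = len(buf)
    buf.extend(b"\x00")                        # flag byte, not yet ready
    buf.extend(len(msg).to_bytes(4, "little"))
    buf.extend(msg)
    offsets.append(start)
    buf[start] = READY                         # publish: single byte write

def poll_message(buf, offset):
    """Reader: the message is visible only once its flag byte is set."""
    if buf[offset] != READY:
        return None
    n = int.from_bytes(buf[offset + 1:offset + 5], "little")
    return bytes(buf[offset + 5:offset + 5 + n])
```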

This transport is persistent in the sense that I keep the entire
transcript forever. The interesting part is that messages aren't just
sequences of bytes; they're sequences of bytes with extra metadata.
The metadata allows me to tag messages with zero or more "tags", where
each occurrence of a tag type must have an associated value that's
strictly greater than the previous value. This gives me log-time
search by timestamp (and more). The whole backend is compressed, too,
since the rate of uncompressed data I produce is insane.
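The tag scheme described here can be sketched as a per-tag sorted index: since each occurrence of a tag carries a strictly increasing value (e.g. a timestamp), binary search gives log-time lookup. The class and method names below are hypothetical.

```python
# Minimal sketch of a tagged transcript: strictly increasing tag values
# make a per-tag index sorted by construction, enabling bisect search.

import bisect

class Transcript:
    def __init__(self):
        self.messages = []
        self.index = {}            # tag -> sorted list of (value, msg_no)

    def append(self, msg, tags):
        msg_no = len(self.messages)
        self.messages.append(msg)
        for tag, value in tags.items():
            entries = self.index.setdefault(tag, [])
            assert not entries or value > entries[-1][0], \
                "tag values must be strictly increasing"
            entries.append((value, msg_no))

    def first_at_or_after(self, tag, value):
        """Log-time: first message whose tag value is >= value."""
        entries = self.index.get(tag, [])
        i = bisect.bisect_left(entries, (value, -1))
        return self.messages[entries[i][1]] if i < len(entries) else None
```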

Using futexes would be a little bit complicated in broadcast mode, so
I use eventfd instead. (Another reason I use eventfd is that
everything is asynchronous, and futexes don't have a usable
nonblocking mode.)

If I were doing this from scratch (which I will some day), each writer
would allocate a circular buffer backed by tmpfs, ramfs, or NVDIMM and
a background process would persist data from that buffer.

With futexes, the IMO most interesting part is how to arrange all the
control metadata to minimize cache coherency traffic.

--Andy

Kenton Varda

unread,
Jul 2, 2013, 6:32:21 PM7/2/13
to Andrew Lutomirski, Geoffrey Romer, capnproto
Wow...  I really hope you're interested in helping with or at least reviewing the Cap'n Proto shared memory transport, when I get to it.  :)


Andrew Lutomirski

unread,
Jul 2, 2013, 6:48:29 PM7/2/13
to Kenton Varda, Geoffrey Romer, capnproto
On Tue, Jul 2, 2013 at 3:32 PM, Kenton Varda <temp...@gmail.com> wrote:
> Wow... I really hope you're interested in helping with or at least
> reviewing the Cap'n Proto shared memory transport, when I get to it. :)
>

Certainly :)

The main reason that I'm so interested in Cap'n Proto is that I want
to run it over my transport. I currently have an unholy mix of Thrift
(dog-slow), protocol buffers (better AFAICT, but far from perfect),
and a custom binary thing that's decently fast but inflexible, and I'm
hoping that capnp will eventually replace all of the above.

(If you ever want to learn how not to design a serialization format,
take a look at the FIX protocol. Apparently someone at the FIX
protocol standards body thinks that ASN.1 would be an improvement!)

--Andy

iisv...@gmail.com

unread,
Jul 4, 2013, 12:36:19 PM7/4/13
to capn...@googlegroups.com
This discussion is really interesting. Kudos to everyone.

To @Andrew Lutomirski: Could you outline the reasons for Thrift being "dog slow"? I run Thrift in several projects and I haven't noticed performance issues so far... but I don't run a massive network service.


Kenton Varda

unread,
Jul 4, 2013, 5:38:11 PM7/4/13
to iisv...@gmail.com, capnproto
On Thu, Jul 4, 2013 at 9:36 AM, <iisv...@gmail.com> wrote:
To @Andrew Lutomirski: Could you outline the reasons for Thrift 
being "dog slow"? I run Thrift in several projects and I haven't 
noticed performance issues so far... but I don't run a
massive network service.

I'm not Andrew, but last I knew -- which was many years ago, so things could easily have changed -- Thrift's C++ implementation simply wasn't very well-optimized.  E.g. in an attempt to allow pluggable low-level encodings they would make a virtual call to write each integer to the wire, whereas protobufs inlined such calls.  If you aren't pushing a lot of data it's not a big deal.

Andrew Lutomirski

unread,
Jul 4, 2013, 8:41:57 PM7/4/13
to Kenton Varda, iisv...@gmail.com, capnproto
I admit that I haven't benchmarked against protobufs in a fair
comparison, and our thrift messaging is a bit, ahem, verbose, but we
take tens of microseconds in thrift serialization and deserialization
where I really don't think that much time is warranted. Several years
ago, we explicitly benchmarked it, and IIRC protobuf won by a
considerable margin. But protobufs win on more than just speed -- the
API is far less unpleasant.

It doesn't help that we have to talk to the thrift library via shared
pointers even for small messages.

To be somewhat more fair, we're on an old version of thrift. That's
because thrift's packaging is atrocious (I don't blame distros in the
slightest for not packaging thrift) and I don't want to deal with
upgrading it. At some point we'll replace it with Cap'n Proto,
protobufs, or possibly an entirely new design that gets rid of the
remaining bit of our code that still uses thrift.

FWIW, this (actual, production) code is an embarrassment:

// May throw apache::thrift::TException
template<typename Message>
void DeserializeThrift(Message *msg, const void *buf, size_t len)
{
  apache::thrift::transport::TMemoryBuffer
      tbuf(const_cast<uint8_t *>((uint8_t *)buf), len);
  apache::thrift::protocol::TBinaryProtocol
      proto(boost::shared_ptr<apache::thrift::transport::TMemoryBuffer>(
          &tbuf, thrift_util_detail::null_deleter));
  msg->read(&proto);
  if (tbuf.available_read() != 0)
    throw apache::thrift::transport::TTransportException(
        apache::thrift::transport::TTransportException::CORRUPTED_DATA,
        "read did not consume all available data");
}

The protobuf equivalent is a function called ParseFromArray. To use
it, all you have to do is call it.

Geoffrey Romer

unread,
Jul 6, 2013, 12:56:06 AM7/6/13
to Kenton Varda, capnproto
On Mon, Jul 1, 2013 at 6:17 PM, Kenton Varda <temp...@gmail.com> wrote:
Thanks, Geoffrey, you brought up a lot of good points I hadn't really thought of.  (Unfortunately, I am not an expert at RPC either...  only at serialization.) 

Yes, I agree that calls need to be ack'd -- even over TCP -- to keep tail latency low.

I wonder if we could cleverly separate these transport issues from the handling of the RPC layer itself.  Imagine this interface... 

  class Transport {
  public:
    class Request {
    public:
      virtual void wait(Time timeout = -1) = 0;
      // Start trying to send the request, blocking until a reply is received.
      // Throws an exception if the timeout is reached or if the transport
      // fails permanently.  The transport takes care of re-sending the message
      // as appropriate depending on the reliability of the underlying medium.

This doesn't seem right to me; the timeout seems like a concern of the RPC layer, not the transport, and permanent transport failure needs to be handled by another mechanism anyway (so the RPC layer can find out about failures that occur after a Call is acked, but before the Return/Fail arrives). And if wait() can't fail, there's no need for it to block.

BTW, are you assuming that threads are plentiful and context switches are cheap? I hope so, since that programming model is so much cleaner, but it's not trivial to achieve, and if not, blocking here could be a substantial performance problem. 


      virtual void receivedReply() = 0;
      // Call when a reply to this request has been received.  wait() will
      // then return.

I'd suggest calling this something more generic like "stop"; it seems like there may be situations where you want the transport layer to give up even if no reply has been seen, and calling "receivedReply" in those circumstances would be confusing.
 
    };

    virtual Own<Request> startRequest(const Message& message) = 0;
    // Start a new request.  Call wait() on it to begin sending the message.

Why have separate startRequest() and wait() calls? Can't startRequest() do both?
That makes sense, so long as a Cancel is not treated as spurious if its Call has returned, but the Return has not been acked.
 

- Spurious Returns or Fails should be ignored.

I assume you mean they're not exposed to the client code; they should still at least be acked, right?

I'm not sure they need to be ack'd, since the caller will retry if it doesn't get a response.  The callee can hold on to the response for a little while in order to be able to re-send it in response to duplicate calls.

I don't think that works; the RPC layer can't handle retrying Calls that are acked but not completed, because it can't tell when to expect completion, and the reliability of the RPC layer shouldn't depend on the behavior of the client in such a way. By the same token, the RPC layer can't know how long to hold onto the response to re-send. Furthermore, waiting for a call-retry adds latency in proportion to the amount by which the client overestimates the completion time; I don't think performance should depend on correct estimation to such an extent.
 

If the response contains capabilities that need release, things get a little tricky.  If the callee receives a duplicate Call shortly after sending the corresponding Return, it doesn't know if the caller sent this Call just before getting the Return (in which case it successfully received and is using the returned caps) or if it failed to receive the Return altogether (in which case it will use the caps in the second Return).  So, the callee better return exactly the same response with the same caps.  To that end, it will have to hold on to the response message for a little while.  Once it has dropped the response message, it will have to respond to any duplicate Call by releasing the caps and replying with a Fail.

As you noted in a followup, the alternative is for Returns to require an ack. That seems simpler and easier to get right, and for the reasons noted above, I think it's necessary anyway.
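The ack'd-Return scheme can be sketched as a response cache on the callee: responses are keyed by call ID so a duplicate Call is answered with the identical Return, and an entry is freed only once the caller acks it. Names below (`Callee`, `on_call`, `on_return_ack`) are hypothetical.

```python
# Sketch of "Returns require an ack": cache each response per call ID,
# resend it verbatim on duplicate Calls, and drop it only on ack.

class Callee:
    def __init__(self, handler):
        self.handler = handler
        self.pending = {}                  # call_id -> cached response

    def on_call(self, call_id, params):
        if call_id in self.pending:        # duplicate: resend, don't re-run
            return self.pending[call_id]
        response = self.handler(params)
        self.pending[call_id] = response
        return response

    def on_return_ack(self, call_id):
        # Safe to forget the response (and release any caps it carried).
        self.pending.pop(call_id, None)
```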
 
 
 

Open problems:
- I think Release needs to be ack'd.  This may imply that Release messages need their own IDs.

You're probably going to have to recover from misbehaving clients that fail to Release properly for one reason or another anyway, and the mechanism for doing so (e.g. LRU garbage collection?) might handle dropped Releases without the need for explicit acks. OTOH it might not (e.g. if the strategy is to just detect and disconnect misbehaving clients), so maybe you do need this to maximize implementation flexibility.

All caps are implicitly released on disconnect.  Disconnecting clients that allocate too many caps seems like a good idea regardless.

But yes, it seems clear that Releases need to be ack'd.
 
- I don't think call IDs are reusable in this scenario, at least in theory.  In practice, wrap-around would probably take long enough to purge any hung calls.  But perhaps we should go ahead and use 64-bit IDs anyway?

I can't think of any plausible use case where 32 bits would be insufficient (assuming of course that you skip IDs that are still live), but this is starting to feel like "64K ought to be enough for anybody". Among other things, 64 bits would give you the option of assigning IDs randomly, which could be useful in some settings.

I think it's plausible to imagine a long-running connection that performs more than 4G requests.  Remotely plausible, but plausible, especially if you implemented a persistent transport layer.  I don't think IDs can be reused even after the call completes since this would make it hard for the server to detect duplicate calls and efficiently reply with the same response...

In principle, I think IDs can be reused once enough time has elapsed that the risk of a duplicate call is negligible, but it's probably not worth the effort to track which IDs are viable for reuse. 
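The "skip IDs that are still live" rule can be sketched as a wrapping allocator (the name `CallIdAllocator` is hypothetical; the test uses 2-bit IDs so wrap-around is visible in a few steps):

```python
# Sketch of a call-ID allocator that wraps around but never hands out an
# ID that is still in flight.

class CallIdAllocator:
    def __init__(self, bits=32):
        self.mask = (1 << bits) - 1
        self.next_id = 0
        self.live = set()

    def allocate(self):
        for _ in range(self.mask + 1):
            cid = self.next_id
            self.next_id = (self.next_id + 1) & self.mask
            if cid not in self.live:       # skip IDs still in flight
                self.live.add(cid)
                return cid
        raise RuntimeError("all call IDs in flight")

    def release(self, cid):
        self.live.discard(cid)
```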


On Mon, Jul 1, 2013 at 9:26 PM, Kenton Varda <temp...@gmail.com> wrote:
Missed these...

On Mon, Jul 1, 2013 at 10:11 AM, Geoffrey Romer <gro...@google.com> wrote:
(Side note: I'm trying to avoid saying "client" and "server" here since the protocol is symmetric, but might it make sense to use those terms in the context of a single Call? Speaking of which, you use those terms in the documentation of CapDescriptor.needsRelease, but it's not clear to me what they mean in that context.)

I try to use the terms "caller" and "callee", although I keep catching myself mixing them up.  I think it's fine to say "client" and "server" instead (when talking about specific interactions where it's obvious which is which).  In the case of needsRelease, I meant "client" to be the side holding the reference, and "server" to be the side implementing it; I'll try to make it clearer.
 
A related question: have you thought about supporting segmented messages and/or streaming at the protocol level? Ack and keepalive messages could conceivably be implemented as initial segments of a multi-segment message or stream (although the semantics are a little different since you don't need to fill in gaps).

Messages are already arranged in segments, though the intent was to allow progressive memory allocation, not so much to help with transport.

I think using the Transport interface I specified, it would make a lot of sense for the transport itself to interleave large messages so that they don't starve smaller ones.  Interestingly, we can actually allow the receiver to start operating on a message as soon as the first segment is received, since it'll make a virtual call the first time each segment is accessed, giving the transport an opportunity to block.

All that said, for a protocol involving large file transfers, the right thing to do is to create a capability representing the stream and then make multiple calls to that capability to transfer small chunks of data.  This way the app has the ability to display a progress bar or do other things that require being aware of the transfer progress...  and the serialization layer doesn't need to be well-optimized for ginormous messages.

You're probably right that the decomposition of a stream into manageably-sized messages should be visible to the client, but I think there's still a role for the RPC layer in e.g. deciding when to transfer each chunk, and optimizing the underlying protocol for streaming (for example, I think you might be able to omit a lot of acks). Definitely not a v1 feature, but something to keep in mind for the future.

Kenton Varda

unread,
Jul 6, 2013, 3:41:56 AM7/6/13
to capnproto
I met with Mark Miller today to discuss CapTP, the capability-based RPC protocol used by E, and probably the definitive work in this space currently.

http://www.erights.org/elib/distrib/captp/ -- I think I actually understand everything on this page now.  :)

We discussed lots of stuff, but the most relevant points at the moment are:

- CapTP is based on one-way messages with two-way request-reply interactions built on top, whereas Cap'n Proto implements only two-way RPC.  This is not necessarily a problem, but we need to think carefully about it.  My current intuition is that at the application level you always want request-reply anyway.

- Mark has experience suggesting that out-of-order call delivery is a common source of bugs for applications.  We'll need to be careful that calls made from the same client thread retain their ordering on the server side, even if the calls don't appear to depend on each other.  (Calls from independent threads obviously have no expectation of ordering, though.)

- It's important that capabilities be equals-comparable (by identity), even if they were received across different connections.  Certain important capability-based design patterns depend on this.  Fortunately CapTP has solved this problem before and we can emulate it.
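The capability-identity point can be sketched by interning proxies: a capability received over any connection resolves to the same proxy object when it denotes the same underlying export, so identity comparison just works. The names (`CapTable`, `resolve`, "vat") are hypothetical, loosely after CapTP terminology.

```python
# Sketch: intern capability proxies by (exporting vat, export id) so the
# same remote object compares equal across connections.

class Capability:
    """Client-side proxy for a remote object."""
    def __init__(self, origin_vat, export_id):
        self.origin_vat = origin_vat
        self.export_id = export_id

class CapTable:
    def __init__(self):
        self._interned = {}

    def resolve(self, origin_vat, export_id):
        key = (origin_vat, export_id)
        if key not in self._interned:
            self._interned[key] = Capability(origin_vat, export_id)
        return self._interned[key]
```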

I'll soon overhaul the protocol proposal to incorporate what I've learned, and then it'll probably make sense to reboot the discussion as a lot of sub-threads here might not be relevant anymore.


On Thu, Jun 27, 2013 at 2:50 PM, Kenton Varda <temp...@gmail.com> wrote:

Stanislav Ivochkin

unread,
Jul 6, 2013, 9:09:25 AM7/6/13
to Kenton Varda, capnproto
Hi, all!
 
- CapTP is based on one-way messages with two-way request-reply interactions built on top, whereas Cap'n Proto implements only two-way RPC.  This is not necessarily a problem, but we need to think carefully about it.  My current intuition is that at the application level you always want request-reply anyway.

I think one-way calls are also needed - for instance, to implement server-side notifications without fake replies from the client.
 
--
Regards,

Stas.

Kenton Varda

unread,
Jul 6, 2013, 2:08:25 PM7/6/13
to Stanislav Ivochkin, capnproto
You can always have a call that returns Void.

The trouble with one-way messages is that they don't have a clear way to propagate application errors.  Usually one-way messages are ack'd, but the ack happens before processing, because there is no clear point where processing of the message is "done".  So if the receiver acks the message and then promptly loses power, the sender never knows about it.  There may be some use cases where the sender really doesn't care, but in the majority of cases it's a bug if the sender ignores the possibility of error.  So I'd rather have people build their interfaces using call/return and let them return Void to emulate "one-way" messages.
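This error-propagation difference can be shown in a few lines (an editorial sketch; the function names are hypothetical): a one-way message is acked before processing, so a later failure is invisible to the sender, while a Void-returning call surfaces the error.

```python
# Sketch contrasting one-way messaging with a call returning Void.

def one_way(handler, msg):
    ack = "acked"            # ack happens before processing
    try:
        handler(msg)
    except Exception:
        pass                 # sender never learns about this failure
    return ack

def call_return_void(handler, msg):
    handler(msg)             # an exception propagates to the caller
    return None              # "Void" result still confirms completion
```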

Stanislav Ivochkin

unread,
Jul 7, 2013, 7:04:53 AM7/7/13
to Kenton Varda, capnproto

On Jul 6, 2013, at 22:08, Kenton Varda wrote:

> The trouble with one-way messages is that they don't have a clear way to propagate application errors. Usually one-way messages are ack'd, but the ack happens before processing, because there is no clear point where processing of the message is "done". So if the receiver acks the message and then promptly loses power, the sender never knows about it. There may be some use cases where the sender really doesn't care, but in the majority of cases it's a bug if the sender ignores the possibility of error. So I'd rather have people build their interfaces using call/return and let them return Void to emulate "one-way" messages.

Imagine a public service based on a symmetric RPC protocol - either server or client can act as caller/callee. Client-side requests can use the call/return schema; even if the return value is Void, there is no issue. But for server-side requests this approach can lead to a potential vulnerability. If the server's call is acked, the server must keep the context of the call to handle the response properly. After the call is acked, the callee (client-side, potentially unfriendly) may (accidentally or intentionally) forget to send the response message while keeping the connection/call alive. A bunch of such clients can perform a DoS attack, because the server cannot distinguish between a real long-running call and an attack. The idea is similar to the HTTP slow-writer vulnerability in threaded/prefork HTTP servers. But if the call was not originally acked, the server may drop the client after some attempts to reinitialize the RPC session.

P.S. Please excuse me if I'm stating something obvious; I just want to understand the original intention.

--
Regards,

Stas.

Kenton Varda

unread,
Jul 7, 2013, 2:27:36 PM7/7/13
to Stanislav Ivochkin, capnproto
I'm not sure how one-way messages solve that problem.  A malicious client could refuse to ack the TCP packets just as easily as it could refuse to send a return message at the RPC level.  How can you have guaranteed delivery without this vulnerability?

Note that RPCs can have timeouts.  In the scenario you describe, I'd advise the server to protect itself by setting a reasonable timeout.

Stanislav Ivochkin

unread,
Jul 8, 2013, 3:57:27 AM7/8/13
to Kenton Varda, capnproto
2013/7/7 Kenton Varda <temp...@gmail.com>
I'm not sure how one-way messages solve that problem.  A malicious client could refuse to ack the TCP packets just as easily as it could refuse to send a return message at the RPC level.  How can you have guaranteed delivery without this vulnerability?
 
Packet losses should be handled at the TCP level so as not to break isolation. The RPC transport layer built on top of TCP should not care about them. Am I right?


Note that RPCs can have timeouts.  In the scenario you describe, I'd advise the server to protect itself by setting a reasonable timeout.

I agree that timeouts can solve it. I'm not arguing that the server should protect itself in the case it waits for response from the client. I just want to note that the issue may be solved for this particular case (server-to-client call with no mandatory "return" message) without introducing timeouts to the RPC layer. I don't think it is an option to forbid.

One more example to illustrate what I'm talking about: a multiplayer game. The server keeps the current state of the in-game world and notifies clients of world updates. The rate of these updates varies and has high peaks corresponding to active in-game interactions. Although updates may be received in any order, they must be handled sequentially. The timeout for handling such a message may be rather small, because handling doesn't involve any complex computation. But due to low network bandwidth or a high packet loss rate, a particular "world-update" could be postponed until the previous updates arrive. Reaching the timeout then leads to message retransmission and further degradation of the client's latency.
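The client-side handling this example calls for can be sketched as a reorder buffer: out-of-order updates are parked until the gap fills, then applied in sequence. The name `UpdateBuffer` is hypothetical.

```python
# Sketch: world updates may arrive out of order but must be applied in
# sequence number order, so the client buffers across gaps.

class UpdateBuffer:
    def __init__(self, apply_update):
        self.apply_update = apply_update
        self.expected = 0
        self.pending = {}

    def on_update(self, seq, update):
        self.pending[seq] = update
        # Apply every update now contiguous with what we've applied.
        while self.expected in self.pending:
            self.apply_update(self.pending.pop(self.expected))
            self.expected += 1
```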


--
Regards,

Stas.

Kenton Varda

unread,
Jul 8, 2013, 4:15:22 AM7/8/13
to Stanislav Ivochkin, capnproto
On Mon, Jul 8, 2013 at 12:57 AM, Stanislav Ivochkin <i...@extrn.org> wrote:
2013/7/7 Kenton Varda <temp...@gmail.com>
I'm not sure how one-way messages solve that problem.  A malicious client could refuse to ack the TCP packets just as easily as it could refuse to send a return message at the RPC level.  How can you have guaranteed delivery without this vulnerability?
 
Packet losses should be handled at the TCP level so as not to break isolation. The RPC transport layer built on top of TCP should not care about them. Am I right?

That's beside the point.  A malicious client can attack the TCP layer just as easily as the RPC layer.  So it doesn't make sense to go out of our way to defend against attacks at the RPC layer which aren't defended at the TCP layer.
 
One more example to illustrate what I'm talking about: a multiplayer game. The server keeps the current state of the in-game world and notifies clients of world updates. The rate of these updates varies and has high peaks corresponding to active in-game interactions. Although updates may be received in any order, they must be handled sequentially. The timeout for handling such a message may be rather small, because handling doesn't involve any complex computation. But due to low network bandwidth or a high packet loss rate, a particular "world-update" could be postponed until the previous updates arrive. Reaching the timeout then leads to message retransmission and further degradation of the client's latency.

You can always write a client that just puts the message in a queue and then returns immediately, if that's what you want.  I'm not preventing that.

Stanislav Ivochkin

unread,
Jul 8, 2013, 7:03:46 AM7/8/13
to Kenton Varda, capnproto
2013/7/8 Kenton Varda <temp...@gmail.com>
On Mon, Jul 8, 2013 at 12:57 AM, Stanislav Ivochkin <i...@extrn.org> wrote:
2013/7/7 Kenton Varda <temp...@gmail.com>
I'm not sure how one-way messages solve that problem.  A malicious client could refuse to ack the TCP packets just as easily as it could refuse to send a return message at the RPC level.  How can you have guaranteed delivery without this vulnerability?
 
Packet losses should be handled at the TCP level so as not to break isolation. The RPC transport layer built on top of TCP should not care about them. Am I right?

That's beside the point.  A malicious client can attack the TCP layer just as easily as the RPC layer.  So it doesn't make sense to go out of our way to defend against attacks at the RPC layer which aren't defended at the TCP layer.

Agreed.

 
One more example to illustrate what I'm talking about: a multiplayer game. The server keeps the current state of the in-game world and notifies clients of world updates. The rate of these updates varies and has high peaks corresponding to active in-game interactions. Although updates may be received in any order, they must be handled sequentially. The timeout for handling such a message may be rather small, because handling doesn't involve any complex computation. But due to low network bandwidth or a high packet loss rate, a particular "world-update" could be postponed until the previous updates arrive. Reaching the timeout then leads to message retransmission and further degradation of the client's latency.

You can always write a client that just puts the message in a queue and then returns immediately, if that's what you want.  I'm not preventing that.

Thanks for your clarification. +1 for call/return semantics.

--
Regards,

Stas.