capnp vs thrift rpc benchmark


sanj...@gmail.com

Jan 31, 2016, 2:54:23 AM
to Cap'n Proto

Kenton,

I am finding that Capnp performs almost twice as slowly as Apache Thrift for a simple RPC call that echoes back an 8-byte payload from the server.

I was wondering if you could take a look to see if I am doing something wrong or if there is anything in capnp that can be optimized.

See the code at
https://github.com/sanjosh/rpcbench/

I am running Cap'n Proto version 0.6-dev against Thrift version 1.0.0-dev

-Sandeep

David Renshaw

Feb 2, 2016, 11:02:51 PM
to Sandeep Joshi, Cap'n Proto
Hi Sandeep,

I tried running your benchmark and I was able to approximately reproduce your results.

A major difference between the Cap'n Proto example and the Thrift example is that the former uses asynchronous I/O while the latter is apparently totally synchronous. It's possible to adjust the Cap'n Proto example to take fuller advantage of async I/O; instead of waiting for each call to return before sending the next, we can send off all the calls immediately and then deal with the responses as they come in. I tried [1] implementing this and found that it made Cap'n Proto's numbers look much better. Do you know whether your end application is likely to benefit from such asynchrony? Does Thrift have any similar way of making asynchronous calls?

[1] https://github.com/dwrensha/rpcbench/blob/253c10ee22fcff65d30c36bcee43e8127b499279/capnp/CapnpClient.cpp
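
For reference, a rough sketch of that "send everything, then wait" pattern, using the EzRpc convenience classes. The Echo interface, its echo() method with a Text payload, the echo.capnp.h header, and the address are all placeholders; see [1] for the actual change to the benchmark.

  #include <capnp/ez-rpc.h>
  #include <kj/async.h>
  #include <kj/vector.h>
  #include "echo.capnp.h"  // hypothetical generated header

  int main() {
    capnp::EzRpcClient client("localhost:12345");
    Echo::Client echo = client.getMain<Echo>();
    auto& waitScope = client.getWaitScope();

    // Issue all calls up front instead of waiting for each response in turn.
    kj::Vector<kj::Promise<void>> inFlight;
    for (int i = 0; i < 10000; ++i) {
      auto req = echo.echoRequest();
      req.setPayload("8bytes!!");
      inFlight.add(req.send().then(
          [](capnp::Response<Echo::EchoResults>&&) {
            // Responses are handled here as they arrive; a no-op for a pure echo.
          }));
    }

    // Block only once, after everything has been dispatched.
    kj::joinPromises(inFlight.releaseAsArray()).wait(waitScope);
    return 0;
  }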



- David




Andrew Lutomirski

Feb 3, 2016, 12:27:31 AM
to David Renshaw, Sandeep Joshi, Cap'n Proto
I tried playing with the original asynchronous version. Performance is *much* worse than it deserves to be.

For simplicity, I forced everything onto one CPU (taskset -c 0). It claims about 84 µs per call. On my laptop, a cross-process context switch is about 1.1 µs + some userspace performance loss due to TLB misses [1]. A syscall has about 50 ns of overhead. That means that a whole ton of time is being spent doing useless things.

The server seems to do roughly this for each RPC:

read(7, "\0\0\0\0\4\0\0\0", 8) = 8
read(7, "\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0", 32) = 32
read(7, "\0\0\0\0\20\0\0\0", 8) = 8
read(7, "\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0\0\0\0\0\0\0\0\0"..., 128) = 128
read(7, 0x1fe77d4, 8) = -1 EAGAIN (Resource temporarily unavailable)
writev(7, [{"\0\0\0\0\v\0\0\0", 8}, {"\0\0\0\0\1\0\1\0\3\0\0\0\0\0\0\0\0\0\0\0\2\0\1\0\0\0\0\0\1\0\0\0"..., 88}], 2) = 96
epoll_wait(3, [{EPOLLOUT, {u32=33431864, u64=33431864}}], 16, -1) = 1
epoll_wait(3, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=33431864, u64=33431864}}], 16, -1) = 1

This isn't obviously horrible except that the first epoll_wait seems pointless (and possibly problematic under some conditions, given that the server appears to want to read, not write, so if the write buffer is full this could block unnecessarily).


The client is doing:

writev(6, [{"\0\0\0\0\4\0\0\0", 8}, {"\0\0\0\0\1\0\1\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0", 32}], 2) = 40
writev(6, [{"\0\0\0\0\20\0\0\0", 8}, {"\0\0\0\0\1\0\1\0\2\0\0\0\0\0\0\0\0\0\0\0\3\0\3\0\0\0\0\0\0\0\0\0"..., 128}], 2) = 136
epoll_wait(3, [{EPOLLIN|EPOLLOUT, {u32=34385496, u64=34385496}}], 16, -1) = 1
read(6, "\0\0\0\0\v\0\0\0", 8) = 8
read(6, "\0\0\0\0\1\0\1\0\3\0\0\0\0\0\0\0\0\0\0\0\2\0\1\0\0\0\0\0\1\0\0\0"..., 88) = 88
read(6, 0x20cd3a4, 8) = -1 EAGAIN (Resource temporarily unavailable)

It would be nicer to coalesce these I/Os, but it still doesn't look that terrible.

The biggest single userspace offender shown by perf is malloc.

Most of the kernel time is scattered all over the place in the network stack, TCP code, netfilter, etc. I bet performance would improve a *lot* if you switched to unix sockets.
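
For what it's worth, switching Cap'n Proto to a unix socket should mostly come down to changing the address string, since KJ accepts a "unix:<path>" address wherever it parses one. A hedged sketch using the EzRpc convenience classes (EchoImpl, the Echo interface, the header name, and the socket path are placeholders):

  #include <capnp/ez-rpc.h>
  #include "echo.capnp.h"  // hypothetical generated header

  void runServer() {
    capnp::EzRpcServer server(kj::heap<EchoImpl>(), "unix:/tmp/rpcbench.sock");
    kj::NEVER_DONE.wait(server.getWaitScope());  // serve forever
  }

  void runClient() {
    capnp::EzRpcClient client("unix:/tmp/rpcbench.sock");
    Echo::Client echo = client.getMain<Echo>();
    // ... issue calls exactly as before ...
  }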

[1] The penalty for cross-process switches (as opposed to thread switches within a process) may drop significantly in a future version of Linux. :)

--Andy

Kenton Varda

Feb 3, 2016, 3:39:50 AM
to Andrew Lutomirski, David Renshaw, Sandeep Joshi, Cap'n Proto
On Tue, Feb 2, 2016 at 9:27 PM, Andrew Lutomirski <an...@luto.us> wrote:
...

epoll_wait(3, [{EPOLLOUT, {u32=33431864, u64=33431864}}], 16, -1) = 1
epoll_wait(3, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=33431864, u64=33431864}}], 16, -1) = 1

This isn't obviously horrible except that the first epoll_wait seems pointless (and possibly problematic under some conditions, given that the server appears to want to read, not write, so if the write buffer is full this could block unnecessarily).

I think you're misinterpreting the second parameter as input; it's actually output. We're not asking the kernel to wait for writability, we're asking it for the next event and it's telling us "ok, the next event is that the socket has become writable".

-Kenton

Kenton Varda

Feb 3, 2016, 4:07:14 AM
to Sandeep Joshi, Cap'n Proto
Hi Sandeep,

Some comments:

- An 8-byte byte string as the parameter / result is not really exercising Cap'n Proto's serialization layer much, which is the part that we claim to be much faster than alternatives. For a much larger, structurally complex payload, I'd expect Cap'n Proto to do better.

- If you are testing local loopback (not over a slow network), then network latency is effectively zero, and the "latency" you are measuring is really CPU time. Since your application is a no-op, you are basically measuring the CPU complexity of the RPC stack. Note that in most real applications, the RPC stack itself -- aside from the serialization -- is not a particularly hot spot, so probably doesn't impact overall application performance that much.

- Cap'n Proto's RPC is much more complicated than Thrift's. Last I knew, Thrift RPC was FIFO, which makes for a pretty trivial protocol and state machine, but tends to become problematic quickly in complicated distributed systems. Cap'n Proto, meanwhile, is not just asynchronous, but is a full capability protocol with promise pipelining (see the sketch after these notes). This allows some very powerful designs and avoidance of network latency, but it means that basic operations are going to be slower. It does not surprise me at all that it would use 3x the CPU time -- in fact I'm surprised it's only 3x.

- We haven't done much serious optimization work on Cap'n Proto's RPC layer.

- Andy notes that a lot of time is spent in malloc (unsurprising, since promises do a lot of heap allocation), so the first thing you might want to try is using a different allocator, like tcmalloc or jemalloc.
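
As a brief illustration of what promise pipelining looks like from the caller's side (a sketch only, using a hypothetical schema in which Directory.open() returns a File capability with a read() method; nothing like this is in the benchmark):

  #include <kj/async.h>
  #include "fs.capnp.h"  // hypothetical generated header defining Directory and File

  void readFoo(Directory::Client directory, kj::WaitScope& waitScope) {
    auto openReq = directory.openRequest();
    openReq.setName("foo.txt");
    auto file = openReq.send().getFile();            // pipelined File capability; nothing waits yet

    auto readReq = file.readRequest();               // call made on the not-yet-resolved capability
    auto response = readReq.send().wait(waitScope);  // both calls complete in one round trip
    // use response.getData() ...
  }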

-Kenton


sanj...@gmail.com

Feb 3, 2016, 5:38:42 AM
to Cap'n Proto, sanj...@gmail.com

Hi Kenton and David,

I provided an 8-byte payload example to help narrow down the capnp RPC overhead. I have been running other benchmarks on larger payloads and have also used an asynchronous capnp client. With an asynchronous capnp client which issued 100 commands at a time and used kj::joinPromises() to wait on them, I was able to do about 26K RPC commands/sec.

i.e.
   array = kj::heapArrayBuilder<kj::Promise<...()
   dispatch 100 commands at a time
   auto comboPromise = kj::joinPromises(array.finish())
   comboPromise.wait(ioContext.waitScope);
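
A filled-in version of that sketch, for concreteness (the 100-command batch and ioContext come from the description above; the Echo interface, its echo() method, and the Text payload are assumptions):

  auto array = kj::heapArrayBuilder<kj::Promise<void>>(100);
  for (int i = 0; i < 100; ++i) {
    auto req = echoClient.echoRequest();        // echoClient: the capability for this connection
    req.setPayload("8bytes!!");
    array.add(req.send().then(
        [](capnp::Response<Echo::EchoResults>&&) {}));  // discard each response as it arrives
  }
  auto comboPromise = kj::joinPromises(array.finish());
  comboPromise.wait(ioContext.waitScope);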

On our system, a basic tcp echo client shows 12 microsec RTT in loopback and 20 microsec over a LAN.  The synchronous Thrift client adds about 5-10 microsec to this value. 

I realize that capnp is explicitly designed to be asynchronous. In the "perf" output, I saw a lot of calls related to TransformPromiseNode, ChainPromiseNode, etc., which are required because the server assumes the client has made an asynchronous call.

Thrift also has an asynchronous client which I can compare against. Let me see if changing the allocator improves the performance.

Is there any chance that capnp might offer a low-overhead synchronous RPC call, by generating a synchronous variant for some or all functions in the IDL?

-Sandeep

sanj...@gmail.com

Feb 4, 2016, 3:00:24 AM
to Cap'n Proto, sanj...@gmail.com


On Wednesday, February 3, 2016 at 2:37:14 PM UTC+5:30, Kenton Varda wrote:
- Andy notes that a lot of time is spent in malloc (unsurprising, since promises do a lot of heap allocation), so the first thing you might want to try is using a different allocator, like tcmalloc or jemalloc.

Switching to tcmalloc saves about 20 microseconds.

Kenton Varda

Feb 4, 2016, 3:29:56 AM
to Sandeep Joshi, Cap'n Proto
Hi Sandeep,

If you're OK with synchronous, FIFO behavior, it should be pretty easy to write such a thing on top of Cap'n Proto serialization, skipping the RPC system. The server would, in a loop, use StreamFdMessageReader to read a message, process it, and writeMessage() the result. Instead of declaring an interface with methods, you would probably want to declare a big union of all the request types.
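
A minimal sketch of that loop (the Request/Response structs, the echo union member, and the header name are hypothetical; EOF and error handling are omitted):

  #include <capnp/message.h>
  #include <capnp/serialize.h>
  #include "echo.capnp.h"  // hypothetical schema defining Request and Response

  void serveConnection(int fd) {
    for (;;) {
      // Block until one complete message has been read from the socket.
      capnp::StreamFdMessageReader request(fd);
      auto req = request.getRoot<Request>();

      capnp::MallocMessageBuilder responseBuilder;
      auto resp = responseBuilder.initRoot<Response>();

      switch (req.which()) {
        case Request::ECHO:
          resp.setEcho(req.getEcho());  // echo the payload straight back
          break;
        // ... other members of the request union ...
      }

      // Write the response synchronously on the same socket, preserving FIFO order.
      capnp::writeMessageToFd(fd, responseBuilder);
    }
  }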

-Kenton

sanj...@gmail.com

Feb 4, 2016, 6:09:14 AM
to Cap'n Proto, sanj...@gmail.com


On Thursday, February 4, 2016 at 1:59:56 PM UTC+5:30, Kenton Varda wrote:
Hi Sandeep,

If you're OK with synchronous, FIFO behavior, it should be pretty easy to write such a thing on top of Cap'n Proto serialization, skipping the RPC system. The server would, in a loop, use StreamFdMessageReader to read a message, process it, and writeMessage() the result. Instead of declaring an interface with methods, you would probably want to declare a big union of all the request types.

-Kenton



Let me try that.

Related question: right now, all newly accepted sockets continue to be handled by the same thread. If the server is using capnp::TwoPartyServer, is it possible to provide a thread pool to handle the accepted connections? Can I specialize a TwoPartyVatNetwork or ConnectionReceiver and provide it a parameter somehow? Are there any non-thread-safe objects, like Promises, that I need to worry about?

Kenton Varda

Feb 4, 2016, 5:34:51 PM
to Sandeep Joshi, Cap'n Proto
On Thu, Feb 4, 2016 at 3:09 AM, <sanj...@gmail.com> wrote:
Related question: right now, all newly accepted sockets continue to be handled by the same thread. If the server is using capnp::TwoPartyServer, is it possible to provide a thread pool to handle the accepted connections? Can I specialize a TwoPartyVatNetwork or ConnectionReceiver and provide it a parameter somehow? Are there any non-thread-safe objects, like Promises, that I need to worry about?

You could write an old-fashioned accept() loop (not using KJ) which, upon accepting a new connection, starts a thread, and then that thread sets up a KJ event loop for itself and a TwoPartyVatNetwork around that one socket. However, you'll have to keep in mind that promises created in one thread cannot be used in any way in another thread. You'll need to mutex-protect shared data and use pipes (or similar) to signal cross-thread events.
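
A minimal sketch of that arrangement (EchoImpl, the echo.capnp.h header, and the port are placeholders; error handling is omitted):

  #include <capnp/rpc-twoparty.h>
  #include <kj/async-io.h>
  #include <kj/thread.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include "echo.capnp.h"  // hypothetical generated header

  int main() {
    // Plain accept() loop, outside KJ.
    int listenFd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(12345);
    bind(listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listenFd, SOMAXCONN);

    for (;;) {
      int connFd = accept(listenFd, nullptr, nullptr);
      if (connFd < 0) continue;

      // One thread per connection; each thread owns its own KJ event loop,
      // so the promises it creates never cross threads.
      kj::Thread([connFd]() {
        auto io = kj::setupAsyncIo();
        auto stream = io.lowLevelProvider->wrapSocketFd(
            connFd, kj::LowLevelAsyncIoProvider::TAKE_OWNERSHIP);
        capnp::TwoPartyVatNetwork network(*stream, capnp::rpc::twoparty::Side::SERVER);
        auto rpcSystem = capnp::makeRpcServer(network, kj::heap<EchoImpl>());
        network.onDisconnect().wait(io.waitScope);  // serve until the client disconnects
      }).detach();
    }
  }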

-Kenton

sanj...@gmail.com

Mar 28, 2016, 7:32:56 AM
to Cap'n Proto, sanj...@gmail.com



Follow-up on an earlier thread.

I decided to test how far the throughput scales with the number of clients and found that the request rate flattens out at about 80K req/sec. This is on a 20-core machine with the capnproto client and server connected over localhost (127.0.0.1).

See https://github.com/sanjosh/rpcbench/blob/master/async_capnp

The benchmark.txt file has a record of the run as well as the system configuration.

number of clients : requests/sec
--------------------------------
 1 : 37322
 2 : 58619
 4 : 80756
 8 : 84614
16 : 83693

I haven't done any analysis yet. I just wanted to first verify if this is expected.

-Sandeep

Kenton Varda

Mar 28, 2016, 1:46:10 PM
to Sandeep Joshi, Cap'n Proto
Hi Sandeep,

Is it maxing out a core at that point?

In order to take advantage of 20 cores you would of course need to run 20 instances of the server, since Cap'n Proto is currently single-threaded.

-Kenton


Sandeep Joshi

Mar 28, 2016, 11:48:11 PM
to Kenton Varda, Cap'n Proto
On Mon, Mar 28, 2016 at 11:15 PM, Kenton Varda <ken...@sandstorm.io> wrote:
Hi Sandeep,

Is it maxing out a core at that point?

In order to take advantage of 20 cores you would of course need to run 20 instances of the server, since Cap'n Proto is currently single-threaded.

-Kenton


Yes, it is maxing out the core. I forgot to mention that I had bound it to a core using 'taskset'.

So I guess this is the max throughput you can get with one capnp server in the current setup?

-Sandeep

Kenton Varda

Apr 1, 2016, 8:50:32 PM
to Sandeep Joshi, Cap'n Proto
Hi Sandeep,

Possibly. There is probably optimization that could be done in the Cap'n Proto code to improve this, but it likely won't match Thrift's FIFO-oriented protocol when the payload is a simple byte array.

-Kenton