Benchmarking gRPC Stack


Kostis Kaffes

Jan 14, 2019, 6:59:56 PM
to grpc.io
Hi folks,

As part of a research project, I am trying to benchmark a C++ gRPC application. More specifically, I want to find out how much time is spent in each layer of the stack as described here. I tried using perf, but the output is too convoluted. Any ideas on tools I could use, or existing results for this type of benchmarking?

Thanks!
Kostis

robert engels

Jan 14, 2019, 7:11:15 PM
to Kostis Kaffes, grpc.io
If you use “perf report” per thread, you should have all the information you need, unless you are running a single-threaded test.

Stating “convoluted” doesn’t really help - maybe an example of what you mean?


robert engels

Jan 14, 2019, 7:16:21 PM
to Kostis Kaffes, grpc.io
But I will also save you some time - it is a fraction of the time spent in IO - so don’t even bother measuring it. gRPC is simple buffer translation at its heart - trivially simple.

MAYBE if you had a super complex protocol message you could get the CPU time in those areas to register with any significance compared to the IO time, but it’s doubtful.

By IO time, I mean even on a local machine with no “physical network”.

Any CPU time used will be dominated by malloc/free - so a messaging system with no dynamic memory allocation will probably outperform gRPC - but it will still be dominated by the IO.

This is based on my testing of gRPC in Go.



Kostis Kaffes

Jan 14, 2019, 7:34:13 PM
to grpc.io
Thanks! I have tried the per-thread option. I have attached a call graph so you can see what I mean by convoluted. There are also some unknowns, which do not help the situation.

I am using the in-process transport in order to avoid being dominated by IO. My goal is to see whether it is feasible to lower gRPC latency to a few μs, and what that might require.
Hence, even small overheads might matter.
[Attachment: output.png]

robert engels

Jan 14, 2019, 7:42:13 PM
to Kostis Kaffes, grpc.io
I think the tree view rather than the graph would be easier to understand.



robert engels

Jan 14, 2019, 7:43:03 PM
to Kostis Kaffes, grpc.io
Also, remove any reports that are less than 1% of the total time - much easier to see the dominators.

robert engels

Jan 14, 2019, 7:44:19 PM
to Kostis Kaffes, grpc.io
Lastly, you have a lot of “unknown” entries. You need to compile without omitting frame pointers (e.g. -fno-omit-frame-pointer on GCC/Clang), and make sure you have all debug symbols (-g).

hcas...@google.com

Jan 23, 2019, 2:27:08 PM
to grpc.io
Hi Kostis,

One tool you might find useful is FlameGraph, which will visualize data collected from perf (https://github.com/brendangregg/FlameGraph). 

I will describe the in-process transport architecture a bit so you get a better idea of which gRPC overheads are included in your measurements. The architecture centers around the following ideas:
  • Avoid serialization, framing, wire-formatting
    • Transfer metadata and messages as slices/slice-buffers, unchanged from how they enter the transport (note that while this avoids serializing from slices to HTTP2 frames, this still performs serialization from protos to byte buffers)
  • Avoid polling or other external notification
    • Each side of a stream directly triggers the scheduling of the other side’s operation completion tags
  • Maintain communication and concurrency model of gRPC core
    • No direct invocation of procedures from opposite side of stream
    • No direct memory sharing; data shared only as RPC requests and responses
Some possible performance optimizations for gRPC's in-process transport:
  • Optimized implementations of structs for small cases
    • E.g., investigate a more efficient completion queue for a small number of concurrent events
  • Identify where locks can be replaced with atomics, or atomics avoided altogether

For tiny messages over the in-process transport, it should be feasible to get a few microseconds of latency, but it may not be possible with moderately sized messages because of the serialization/deserialization cost between proto and ByteBuffer.

Hope this helps!

Konstantinos Kaffes

Jan 23, 2019, 2:44:20 PM
to hcas...@google.com, grpc.io
Thank you! I will let you know when I have an update.



--
Kostis Kaffes
PhD Student in Electrical Engineering
Stanford University