Benchmarking gRPC Stack


Kostis Kaffes

Jan 14, 2019, 6:59:56 PM
to grpc.io
Hi folks,

As part of a research project, I am trying to benchmark a C++ gRPC application. More specifically, I want to find out how much time is spent in each layer of the stack as described here. I tried using perf, but the output is too convoluted. Any ideas on tools I could use, or existing results for this type of benchmarking?

Thanks!
Kostis

robert engels

Jan 14, 2019, 7:11:15 PM
to Kostis Kaffes, grpc.io
If you use “perf report” per thread, you should have all the information you need, unless you are running a single-threaded test.

Stating “convoluted” doesn’t really help - maybe an example of what you mean?


robert engels

Jan 14, 2019, 7:16:21 PM
to Kostis Kaffes, grpc.io
But I will also save you some time - it is a fraction of the time spent in IO - so don’t even bother measuring it. gRPC is simple buffer translation at its heart - trivially simple.

MAYBE if you had a super complex protocol message you could get the CPU time in those areas to register with any significance compared to the IO time, but it’s doubtful.

By IO time, I mean even on a local machine with no “physical network”.

Any CPU time used will be dominated by malloc/free - so a messaging system with no dynamic memory allocation will probably outperform gRPC - but it will still be dominated by the IO.

This is based on my testing of gRPC in Go.



Kostis Kaffes

Jan 14, 2019, 7:34:13 PM
to grpc.io
Thanks! I have tried the per-thread option. I have attached a call graph so you can see what I mean by convoluted. There are also some unknowns, which do not help the situation.

I am using the in-process transport in order to avoid being dominated by IO. My goal is to see whether it is feasible to lower gRPC latency to a few μs, and what that might require.
Hence, even small overheads might matter.
[Attachment: output.png]

robert engels

Jan 14, 2019, 7:42:13 PM
to Kostis Kaffes, grpc.io
I think the tree view rather than the graph would be easier to understand.



robert engels

Jan 14, 2019, 7:43:03 PM
to Kostis Kaffes, grpc.io
Also, remove any reports that are less than 1% of the total time - much easier to see the dominators.

robert engels

Jan 14, 2019, 7:44:19 PM
to Kostis Kaffes, grpc.io
Lastly, you have a lot of “unknown” entries. You need to compile without omitting frame pointers (e.g. -fno-omit-frame-pointer on GCC/Clang), and make sure you have all debug symbols (-g).

hcas...@google.com

Jan 23, 2019, 2:27:08 PM
to grpc.io
Hi Kostis,

One tool you might find useful is FlameGraph, which will visualize data collected from perf (https://github.com/brendangregg/FlameGraph). 

I will describe the in-process transport architecture a bit so you get a better idea of which gRPC overheads are included in your measurements. The architecture centers around the following ideas:
  • Avoid serialization, framing, wire-formatting
    • Transfer metadata and messages as slices/slice-buffers, unchanged from how they enter the transport (note that while this avoids serializing from slices to HTTP2 frames, this still performs serialization from protos to byte buffers)
  • Avoid polling or other external notification
    • Each side of a stream directly triggers the scheduling of the other side’s operation completion tags
  • Maintain communication and concurrency model of gRPC core
    • No direct invocation of procedures from opposite side of stream
    • No direct memory sharing; data shared only as RPC requests and responses
Some possible performance optimizations for gRPC's in-process transport:
  • Optimized implementations of structs for small cases
    • E.g., investigate a more efficient completion queue for a small number of concurrent events
  • Identify where locks can be replaced with atomics, or atomics avoided altogether

For tiny messages over the in-process transport, it should be feasible to get a few microseconds of latency, but it may not be possible with moderately sized messages because of the serialization/deserialization cost between proto and ByteBuffer.

Hope this helps!

Konstantinos Kaffes

Jan 23, 2019, 2:44:20 PM
to hcas...@google.com, grpc.io
Thank you! I will let you know when I have an update.



--
Kostis Kaffes
PhD Student in Electrical Engineering
Stanford University