How does number of channels (and other config) affect QPS in grpc java?


niko...@gmail.com

Jun 3, 2018, 7:33:09 AM
to grpc.io
Hi,

Disclaimer: I'm not very familiar with gRPC internals and netty, but I'm trying to get a solid grasp of both technologies.

I have been running Java gRPC benchmarks and figured out that adding more channels improves the throughput. I have read somewhere (can't find the link) that the gRPC channels should be reused, which makes sense because they represent the connection AFAIK and it would be expensive to open a new connection often.

However, I wasn't able to find in the official documentation anything that suggests how many channels to use or how to configure gRPC server and client to get the best performance.

My benchmarks show that each bidi stream caps at about 30-40k ops/s on my machine, and more streams are required if I want to get higher throughput. They also show that adding more streams via a single channel doesn't always improve the throughput; however, creating a new channel and adding more streams does.

Could anyone from the gRPC team confirm or negate the sanity of the above results and conclusions? In particular, is it expected to have more gRPC channels and reuse them or should I be able to get the same results even if I used only one gRPC channel?

Additionally, it would be very nice if anyone could explain the correlation between the number of worker event loops (in both client and server) with respect to the number of channels / streams. I understand that netty assigns a channel to exactly one worker event loop (and I don't think it ever moves them around?) -- please correct me if I'm wrong -- which means that if there was a single gRPC channel, having more than one worker event loops wouldn't help at all.

Also, how does the gRPC channel relate to the netty channel? I thought that netty channel is created per gRPC channel, but I may be wrong according to this post: https://groups.google.com/forum/#!searchin/grpc-io/grpc$20reuse$20channels|sort:date/grpc-io/LrnAbWFozb0/8_te_-GmAgAJ (explanation about Boss Event Loop Group suggests otherwise).

Thanks in advance.

Cheers,
Nikola

Carl Mastrangelo

Jun 4, 2018, 9:55:44 PM
to grpc.io
Hello!  Responses inline:


On Sunday, June 3, 2018 at 4:33:09 AM UTC-7, Nikola Stojiljkovic wrote:
Hi,

Disclaimer: I'm not very familiar with gRPC internals and netty, but I'm trying to get a solid grasp of both technologies.

I have been running Java gRPC benchmarks and figured out that adding more channels improves the throughput. I have read somewhere (can't find the link) that the gRPC channels should be reused, which makes sense because they represent the connection AFAIK and it would be expensive to open a new connection often.

Channels should be reused.  The connection expense is usually low enough that it isn't an issue, though if you are sending more than 10gbps per connection it starts to have an impact.   Very few users are able to reach this, and if they do, they probably don't use Java!   More seriously, how many connections are you using, and to how many backend servers?  Are the client and server running on the same machine?
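For readers landing here from search: the "reuse channels" advice boils down to building one ManagedChannel per target and sharing it across stubs and threads. A minimal sketch of that pattern (the target address is a placeholder, the Greeter stub mentioned in the comment is a hypothetical generated class, and this assumes grpc-java is on the classpath):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class SharedChannel {
    // Build once per target, process-wide; channels are thread-safe and
    // multiplex many concurrent RPCs over their underlying connections.
    static final ManagedChannel CHANNEL = ManagedChannelBuilder
            .forAddress("example.com", 50051)   // placeholder target
            .usePlaintext()                     // demo only; use TLS in production
            .build();

    // Stubs are cheap; create as many as you like over the same channel:
    //   GreeterGrpc.GreeterStub stub = GreeterGrpc.newStub(CHANNEL);
    // (GreeterGrpc is a hypothetical generated service class.)
}
```
This is configuration, not a runnable benchmark; the point is only that the channel is built once and shared, while stubs are created freely.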
 

However, I wasn't able to find in the official documentation anything that suggests how many channels to use or how to configure gRPC server and client to get the best performance.

My benchmarks show that each bidi stream caps at about 30-40k ops/s on my machine, and more streams are required if I want to get higher throughput. They also show that adding more streams via a single channel doesn't always improve the throughput; however, creating a new channel and adding more streams does.

Can I ask how many concurrent RPCs you are using?  Also, can I ask the payload size?  While each RPC can be interleaved, the messages of a given RPC must be processed in order.  Also, if you aren't using ForkJoinPool as your executor, you are almost certainly running into contention at these levels.   
 

Could anyone from the gRPC team confirm or negate the sanity of the above results and conclusions? In particular, is it expected to have more gRPC channels and reuse them or should I be able to get the same results even if I used only one gRPC channel?

Source code exhibiting the slowdown is ideal, and would allow us to find common bottlenecks.   Aside from that, profiling the job is the next best thing.
 

Additionally, it would be very nice if anyone could explain the correlation between the number of worker event loops (in both client and server) with respect to the number of channels / streams. I understand that netty assigns a channel to exactly one worker event loop (and I don't think it ever moves them around?) -- please correct me if I'm wrong -- which means that if there was a single gRPC channel, having more than one worker event loops wouldn't help at all.

Netty does assign a *netty* channel to a single loop, but not a gRPC channel.  gRPC's channel is a multiplexing channel across each of the subchannels provided by the load balancer.  The subchannels are more aligned with a netty channel, but this is only approximately the truth (I can explain further if interested, but I don't think it's relevant to your question).  

As for event loops:   The event loop usage in gRPC is limited to SSL coding, HTTP/2 framing, HPACK coding, and wire level flow control as well as the actual syscalls.  While each of these operations is pretty fast, none of them is free.   Thus, if each connection is being heavily used, it makes sense to have multiple connections.  (Again, you need to be talking about hundreds of megabits - gigabits per connection before this really applies to you).  This data is passed up to the gRPC layer, and handled on the Executor that you pass in at Channel / Server creation.  In particular, the ClientCall.Listener and ServerCall.Listener callbacks represent the boundary between application threads and netty threads.
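To make that boundary concrete, here is a hedged sketch of where those knobs are set on grpc-java's Netty transport builders (the address, port, pool sizes, and the commented-out service are all placeholders, and this assumes the grpc-netty artifact; exact builder methods have shifted slightly across grpc-java versions):

```java
import io.grpc.netty.NettyChannelBuilder;
import io.grpc.netty.NettyServerBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import java.util.concurrent.ForkJoinPool;

public final class TransportConfig {
    public static void main(String[] args) {
        // A handful of netty threads do the wire work (framing, HPACK, syscalls)...
        NioEventLoopGroup wireThreads = new NioEventLoopGroup(2);
        // ...while the app executor runs ClientCall.Listener / ServerCall.Listener callbacks.
        ForkJoinPool appThreads = new ForkJoinPool(16);

        NettyChannelBuilder.forAddress("example.com", 50051) // placeholder target
                .eventLoopGroup(wireThreads)
                .executor(appThreads)      // where gRPC hands work up to the application
                .usePlaintext()
                .build();

        NettyServerBuilder.forPort(50051)
                .bossEventLoopGroup(new NioEventLoopGroup(1))
                .workerEventLoopGroup(wireThreads)
                .executor(appThreads)
                // .addService(new DummyServiceImpl())  // hypothetical service impl
                .build();
    }
}
```
The split shown here matches the description above: few netty threads, a larger application executor.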

This is a pretty large departure from the way netty code is normally written.   In gRPC it is more common to have only a handful of netty threads (often just 1), and many more application threads.

nik...@improbable.io

Jun 18, 2018, 8:34:28 AM
to grpc.io
Hi Carl,

Thanks for the response.

Sorry I wasn't able to reply earlier.

Answers inline.

On Tuesday, 5 June 2018 02:55:44 UTC+1, Carl Mastrangelo wrote:
Hello!  Responses inline:

On Sunday, June 3, 2018 at 4:33:09 AM UTC-7, Nikola Stojiljkovic wrote:
Hi,

Disclaimer: I'm not very familiar with gRPC internals and netty, but I'm trying to get a solid grasp of both technologies.

I have been running Java gRPC benchmarks and figured out that adding more channels improves the throughput. I have read somewhere (can't find the link) that the gRPC channels should be reused, which makes sense because they represent the connection AFAIK and it would be expensive to open a new connection often.

Channels should be reused.  The connection expense is usually low enough that it isn't an issue, though if you are sending more than 10gbps per connection it starts to have an impact.   Very few users are able to reach this, and if they do, they probably don't use Java!   More seriously, how many connections are you using, and to how many backend servers?  Are the client and server running on the same machine?
 
Ideally I would like to scale this to dozens or hundreds of machines. At the moment I've benchmarked the end-to-end throughput with up to 8 servers and 8 client machines with 32 cores each.
Regarding the number of connections, it depends on the benchmark, but I'm trying to optimize for a single connection / channel to start with. The machines communicate with each other internally, so there won't be many connections.

The client and the server do not run on the same machine. However, I have benchmarks that run both client and the server on the same machine (within the same process), as well as on different machines. The throughput doesn't seem to be affected by the network at this stage.

 

However, I wasn't able to find in the official documentation anything that suggests how many channels to use or how to configure gRPC server and client to get the best performance.

My benchmarks show that each bidi stream caps at about 30-40k ops/s on my machine, and more streams are required if I want to get higher throughput. They also show that adding more streams via a single channel doesn't always improve the throughput; however, creating a new channel and adding more streams does.

Can I ask how many concurrent RPCs you are using?  Also, can I ask the payload size?  While each RPC can be interleaved, the messages of a given RPC must be processed in order.  Also, if you aren't using ForkJoinPool as your executor, you are almost certainly running into contention at these levels.   
 
I tried sending between 100-1000 concurrent RPCs with a payload size of 50-60 bytes. I am using ForkJoinPool. The number of concurrent RPCs doesn't affect the throughput (as long as it's above some threshold).

 

Could anyone from the gRPC team confirm or negate the sanity of the above results and conclusions? In particular, is it expected to have more gRPC channels and reuse them or should I be able to get the same results even if I used only one gRPC channel?

Source code exhibiting the slowdown is ideal, and would allow us to find common bottlenecks.   Aside from that, profiling the job is the next best thing.
 
I have spent some time profiling the parts of the system, as well as the whole system. The benchmarks show that as soon as I start sending requests through the gRPC channel, the throughput drops significantly -- I have many benchmarks to measure the throughput of each layer, e.g. how does the data-structure throughput compare to the throughput of the service that wraps the data-structure.

This made me write JMH benchmarks for gRPC itself too, with a dummy service that implements a unary and a streaming call. The service sends a DummyResponse for each DummyRequest with a payload size specified by the client (about 50 bytes). I have run these benchmarks on a single 64-core machine within a single process, not over the network, but using NettyChannelBuilder and NettyServer. Both unary and streaming benchmarks have shown that adding more channels improves the throughput significantly.

I can extract the source code of the gRPC benchmark if that helps since I'm observing similar behaviour in the gRPC benchmarks of DummyService and benchmarks of the actual system.

Profiler shows that both server and channel worker event loops are not doing IO all the time (see the screenshot below). This probably means that I am not using the channel's full capacity, but somehow adding more channels makes up for some bottleneck that I can't spot. The machines are not CPU bound. There doesn't seem to be any obvious thread contention, and if there is any, it is not present when the requests don't go through the gRPC service.

Adding more JMH worker threads that use the same channel to send more requests in parallel doesn't improve the throughput. Neither does the target number of outstanding RPCs.

I've also tried changing the flow control window sizes, however, this doesn't improve anything -- this seems expected as the data going over HTTP/2 shouldn't be large in this case. Underlying TCP window scaling is enabled, so I don't think that is the problem either.

I do believe that I can end up using a single channel between the client and the server, but I'm not sure what exactly needs to be configured to get the same throughput as with multiple channels. I am using NioEventLoopGroups instead of Epoll, but if I understand correctly this shouldn't make a difference for a single (or a few) connection(s).

 

Additionally, it would be very nice if anyone could explain the correlation between the number of worker event loops (in both client and server) with respect to the number of channels / streams. I understand that netty assigns a channel to exactly one worker event loop (and I don't think it ever moves them around?) -- please correct me if I'm wrong -- which means that if there was a single gRPC channel, having more than one worker event loops wouldn't help at all.

Netty does assign a *netty* channel to a single loop, but not a gRPC channel.  gRPC's channel is a multiplexing channel across each of the subchannels provided by the load balancer.  The subchannels are more aligned with a netty channel, but this is only approximately the truth (I can explain further if interested, but I don't think it's relevant to your question).  

While this may not be relevant to the question, I am keen on learning a bit more about the gRPC internals. Can I find out more about it online, or is looking around the gRPC GitHub repo the best way?
 

As for event loops:   The event loop usage in gRPC is limited to SSL coding, HTTP/2 framing, HPACK coding, and wire level flow control as well as the actual syscalls.  While each of these operations is pretty fast, none of them is free.   Thus, if each connection is being heavily used, it makes sense to have multiple connections.  (Again, you need to be talking about hundreds of megabits - gigabits per connection before this really applies to you).  This data is passed up to the gRPC layer, and handled on the Executor that you pass in at Channel / Server creation.  In particular, the ClientCall.Listener and ServerCall.Listener callbacks represent the boundary between application threads and netty threads.

This suggests that I can make much better use of a single channel -- while the system may end up sending huge amounts of data between the client and the server in the near future (hundreds of megabits per second), it is not there yet.
 

This is a pretty large departure from the way netty code is normally written.   In gRPC it is more common to have only a handful of netty threads (often just 1), and many more application threads.

 

Also, how does the gRPC channel relate to the netty channel? I thought that netty channel is created per gRPC channel, but I may be wrong according to this post: https://groups.google.com/forum/#!searchin/grpc-io/grpc$20reuse$20channels|sort:date/grpc-io/LrnAbWFozb0/8_te_-GmAgAJ (explanation about Boss Event Loop Group suggests otherwise).

Thanks in advance.

Cheers,
Nikola


To extract some key points from the above answers:
  • I have written benchmarks for my service, as well as a gRPC service that sends dummy requests and responses.
  • All benchmarks that go through the gRPC channel and service show that adding more channels improves the overall throughput.
  • Server and channel worker event loops don't seem to be spending all of their time doing IO.
  • I would like to start by getting the best throughput over a single gRPC channel for a DummyService, i.e. a service that sends a response back to the client for each request, if possible.
Assuming the client is sending about 100 Mbps and the service responds with the same amount of data as soon as it gets a request (it doesn't run any logic except generating a random string of the specified length, about 50 bytes), how can I find out where the bottleneck is? What should I be searching for in the profiler, since most of the threads are not doing much work? Note that the throughput improves simply by adding more channels (assuming there are multiple streams in the case of streaming benchmarks).

Here is a thread profile example of an experiment that runs 4 bidi streams over a single channel with a ForkJoinPool of size 4 on both server and client side, within a single process. Each bidi stream has a target of 200 outstanding RPCs at any time and runs on a JMH worker thread -- that's why these threads are blocked most of the time, i.e. a semaphore ensures there are at most 200 in-flight requests.
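For reference, the closed-loop pattern described above (a semaphore capping in-flight requests, released from the response callback) can be sketched in plain JDK terms. The executor here stands in for the async stream, and the sizes are arbitrary; this is a shape illustration, not the actual benchmark code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public final class ClosedLoop {
    // Sends `total` fake RPCs while keeping at most `maxInFlight` outstanding.
    static int run(int total, int maxInFlight) throws InterruptedException {
        Semaphore inFlight = new Semaphore(maxInFlight);
        ExecutorService fakeServer = Executors.newFixedThreadPool(4);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < total; i++) {
            inFlight.acquire();               // blocks once maxInFlight RPCs are outstanding
            fakeServer.execute(() -> {        // stands in for the async response callback
                completed.incrementAndGet();
                inFlight.release();           // a "response" frees one slot
            });
        }
        inFlight.acquire(maxInFlight);        // drain: wait for every outstanding "response"
        fakeServer.shutdown();
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed=" + run(10_000, 200)); // prints completed=10000
    }
}
```
In the real benchmark the `execute` call would be an async stub call and the `release` would sit in its StreamObserver callback.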



This gets up to 120,000 QPS. 

If it uses 2 channels and server worker event loop group and channel event loop group both have a size of 2, the QPS gets up to 160,000. Do you think something is misconfigured in our system that would cause this?


Thanks.

Kind regards,
Nikola

Carl Mastrangelo

Jun 19, 2018, 6:04:14 PM
to grpc.io
Inline:


On Monday, June 18, 2018 at 5:34:28 AM UTC-7, Nikola Stojiljkovic wrote:
Hi Carl,

Thanks for the response.

Sorry I wasn't able to reply earlier.

Answers inline.

On Tuesday, 5 June 2018 02:55:44 UTC+1, Carl Mastrangelo wrote:
Hello!  Responses inline:

On Sunday, June 3, 2018 at 4:33:09 AM UTC-7, Nikola Stojiljkovic wrote:
Hi,

Disclaimer: I'm not very familiar with gRPC internals and netty, but I'm trying to get a solid grasp of both technologies.

I have been running Java gRPC benchmarks and figured out that adding more channels improves the throughput. I have read somewhere (can't find the link) that the gRPC channels should be reused, which makes sense because they represent the connection AFAIK and it would be expensive to open a new connection often.

Channels should be reused.  The connection expense is usually low enough that it isn't an issue, though if you are sending more than 10gbps per connection it starts to have an impact.   Very few users are able to reach this, and if they do, they probably don't use Java!   More seriously, how many connections are you using, and to how many backend servers?  Are the client and server running on the same machine?
 
Ideally I would like to scale this to dozens or hundreds of machines. At the moment I've benchmarked the end-to-end throughput with up to 8 servers and 8 client machines with 32 cores each.
Regarding the number of connections, it depends on the benchmark, but I'm trying to optimize for a single connection / channel to start with. The machines communicate with each other internally, so there won't be many connections.

The client and the server do not run on the same machine. However, I have benchmarks that run both client and the server on the same machine (within the same process), as well as on different machines. The throughput doesn't seem to be affected by the network at this stage.

 

However, I wasn't able to find in the official documentation anything that suggests how many channels to use or how to configure gRPC server and client to get the best performance.

My benchmarks show that each bidi stream caps at about 30-40k ops/s on my machine, and more streams are required if I want to get higher throughput. They also show that adding more streams via a single channel doesn't always improve the throughput; however, creating a new channel and adding more streams does.

Can I ask how many concurrent RPCs you are using?  Also, can I ask the payload size?  While each RPC can be interleaved, the messages of a given RPC must be processed in order.  Also, if you aren't using ForkJoinPool as your executor, you are almost certainly running into contention at these levels.   
 
I tried sending between 100-1000 concurrent RPCs with a payload size of 50-60 bytes. I am using ForkJoinPool. The number of concurrent RPCs doesn't affect the throughput (as long as it's above some threshold).

This is probably not enough.  Our own benchmarks are either open loop (i.e. send as fast as possible), or closed loop with ~20,000 active RPCs.  FJP (and other executors) go to sleep when there isn't any work, and wake back up when there is work.  When the work level is too low, they spend a huge amount of time waking up and sleeping.  See answer below for more detail.
 

 

Could anyone from the gRPC team confirm or negate the sanity of the above results and conclusions? In particular, is it expected to have more gRPC channels and reuse them or should I be able to get the same results even if I used only one gRPC channel?

Source code exhibiting the slowdown is ideal, and would allow us to find common bottlenecks.   Aside from that, profiling the job is the next best thing.
 
I have spent some time profiling the parts of the system, as well as the whole system. The benchmarks show that as soon as I start sending requests through the gRPC channel, the throughput drops significantly -- I have many benchmarks to measure the throughput of each layer, e.g. how does the data-structure throughput compare to the throughput of the service that wraps the data-structure.

This made me write JMH benchmarks for gRPC itself too, with a dummy service that implements a unary and a streaming call. The service sends a DummyResponse for each DummyRequest with a payload size specified by the client (about 50 bytes). I have run these benchmarks on a single 64-core machine within a single process, not over the network, but using NettyChannelBuilder and NettyServer. Both unary and streaming benchmarks have shown that adding more channels improves the throughput significantly.

I can extract the source code of the gRPC benchmark if that helps since I'm observing similar behaviour in the gRPC benchmarks of DummyService and benchmarks of the actual system.

Profiler shows that both server and channel worker event loops are not doing IO all the time (see the screenshot below). This probably means that I am not using the channel's full capacity, but somehow adding more channels makes up for some bottleneck that I can't spot. The machines are not CPU bound. There doesn't seem to be any obvious thread contention, and if there is any, it is not present when the requests don't go through the gRPC service.

Adding more JMH worker threads that use the same channel to send more requests in parallel doesn't improve the throughput. Neither does the target number of outstanding RPCs.

I've also tried changing the flow control window sizes, however, this doesn't improve anything -- this seems expected as the data going over HTTP/2 shouldn't be large in this case. Underlying TCP window scaling is enabled, so I don't think that is the problem either.

I do believe that I can end up using a single channel between the client and the server, but I'm not sure what exactly needs to be configured to get the same throughput as with multiple channels. I am using NioEventLoopGroups instead of Epoll, but if I understand correctly this shouldn't make a difference for a single (or a few) connection(s).

 

Additionally, it would be very nice if anyone could explain the correlation between the number of worker event loops (in both client and server) with respect to the number of channels / streams. I understand that netty assigns a channel to exactly one worker event loop (and I don't think it ever moves them around?) -- please correct me if I'm wrong -- which means that if there was a single gRPC channel, having more than one worker event loops wouldn't help at all.

Netty does assign a *netty* channel to a single loop, but not a gRPC channel.  gRPC's channel is a multiplexing channel across each of the subchannels provided by the load balancer.  The subchannels are more aligned with a netty channel, but this is only approximately the truth (I can explain further if interested, but I don't think it's relevant to your question).  

While this may not be relevant to the question, I am keen on learning a bit more about the gRPC internals. Can I find out more about it online, or is looking around the gRPC GitHub repo the best way?

It's complicated (for reasons), but the Subchannel can itself have multiple connections to the same "equivalent" backend.  Typically this happens if there is a GOAWAY sent due to the stream ID limit being hit.   In the gRPC Java repo this is the "InternalSubchannel" class.  Like I said, it's complicated in order to make the LoadBalancer API simpler to implement (i.e. we eat the complexity).


 
 

As for event loops:   The event loop usage in gRPC is limited to SSL coding, HTTP/2 framing, HPACK coding, and wire level flow control as well as the actual syscalls.  While each of these operations is pretty fast, none of them is free.   Thus, if each connection is being heavily used, it makes sense to have multiple connections.  (Again, you need to be talking about hundreds of megabits - gigabits per connection before this really applies to you).  This data is passed up to the gRPC layer, and handled on the Executor that you pass in at Channel / Server creation.  In particular, the ClientCall.Listener and ServerCall.Listener callbacks represent the boundary between application threads and netty threads.

This suggests that I can make much better use of a single channel -- while the system may end up sending huge amounts of data between the client and the server in the near future (hundreds of megabits per second), it is not there yet.
 

This is a pretty large departure from the way netty code is normally written.   In gRPC it is more common to have only a handful of netty threads (often just 1), and many more application threads.

 

Also, how does the gRPC channel relate to the netty channel? I thought that netty channel is created per gRPC channel, but I may be wrong according to this post: https://groups.google.com/forum/#!searchin/grpc-io/grpc$20reuse$20channels|sort:date/grpc-io/LrnAbWFozb0/8_te_-GmAgAJ (explanation about Boss Event Loop Group suggests otherwise).

Thanks in advance.

Cheers,
Nikola


To extract some key points from the above answers:
  • I have written benchmarks for my service, as well as a gRPC service that sends dummy requests and responses.
  • All benchmarks that go through the gRPC channel and service show that adding more channels improves the overall throughput.
  • Server and channel worker event loops don't seem to be spending all of their time doing IO.
  • I would like to start by getting the best throughput over a single gRPC channel for a DummyService, i.e. a service that sends a response back to the client for each request, if possible.
Assuming the client is sending about 100 Mbps and the service responds with the same amount of data as soon as it gets a request (it doesn't run any logic except generating a random string of the specified length, about 50 bytes), how can I find out where the bottleneck is? What should I be searching for in the profiler, since most of the threads are not doing much work? Note that the throughput improves simply by adding more channels (assuming there are multiple streams in the case of streaming benchmarks).

My personal favorite is the sampling CPU profiler, which fires ~100x a second and collects stack traces of some running thread.  Then sort bottom-up to see which stack frame shows up most often.    Regarding FJP, you will notice FJP.scan or park/unpark in this profile if there is not enough work.  If this is the case, consider cutting the parallelism to the next lower power of 2.  This reduces the number of buckets that need to be scanned for work stealing, and also keeps threads from going to sleep so often.
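The "next lower power of 2" adjustment can be computed with `Integer.highestOneBit`. A small JDK-only sketch (the halving heuristic in `main` is just an illustration of "cut to the next lower power of 2", not official gRPC guidance):

```java
import java.util.concurrent.ForkJoinPool;

public final class PoolSizing {
    // Largest power of two <= n, e.g. 24 -> 16, 32 -> 32, 5 -> 4.
    static int floorPowerOfTwo(int n) {
        return Integer.highestOneBit(n);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Start one power of two below the core count, per the advice above.
        int parallelism = Math.max(1, floorPowerOfTwo(cores) / 2);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        System.out.println("cores=" + cores + " parallelism=" + pool.getParallelism());
        pool.shutdown();
    }
}
```
On the 32-core machines described in this thread, this would yield a pool parallelism of 16.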

Nikola Stojiljkovic

Jun 20, 2018, 5:54:44 AM
to grpc.io
Hi Carl,

Thanks for the response.

I don't think the number of concurrent RPCs is the problem at this stage. I believe the bottleneck is somewhere else.

The reason I believe that is that changing the parallelism from 100 to 3200 per thread makes no difference to the throughput -- note that this is per JMH worker thread, where each thread handles a single bidi stream. If there wasn't enough work, then increasing the parallelism should significantly change the results, but it didn't.

I think you are right that executors go to sleep and have to be woken up, which probably takes time; however, I don't know how to give them more work. Also, using a direct executor instead of a ForkJoinPool on both server and client improves the throughput 10 times. I have no explanation for this -- it's possible that the hop to an executor is more expensive than the time required to process the request, but I would not expect a 90% reduction in throughput.

Is there any configuration for the channels or server that I'm missing that is required for high QPS environments? 
Do I need to set some options to the channel?


These benchmarks say that Java can process about 680k ping-pong QPS using bidi streams. What is the configuration used to achieve this? Do these benchmarks use ForkJoinPool, and a single channel or multiple channels?

Thanks,
Nikola

Carl Mastrangelo

Jun 20, 2018, 1:14:01 PM
to grpc.io
Well, as I said, you can try lowering the number of threads in your pool to see if it goes faster.  If you are on a 32 core machine, try a parallelism of 16.  A direct executor effectively handles RPC callbacks on the netty threads.   Very few people should do this, unless they know they won't block.  The thread hop time is about 10-25us, so if your workload is less than that per RPC, you may just be seeing overhead.
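The quoted 10-25us hop can be sanity-checked with a crude JDK-only microbenchmark (no JMH here, so treat the output as a rough order of magnitude that varies by machine and JVM, not a precise figure):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class HopCost {
    // Average nanoseconds for a submit-and-wait round trip to another thread.
    static long avgHopNanos(int iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Runnable noop = () -> {};
        for (int i = 0; i < 1_000; i++) pool.submit(noop).get(); // warm-up
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) pool.submit(noop).get();
        long avg = (System.nanoTime() - t0) / iterations;
        pool.shutdown();
        return avg;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("avg cross-thread hop: " + avgHopNanos(10_000) + " ns");
    }
}
```
If per-RPC work is cheaper than this round trip, the executor hop dominates, which is consistent with the direct-executor speedup reported above.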

The performance dashboard is running code at https://github.com/grpc/grpc-java/tree/master/benchmarks/src/main/java/io/grpc/benchmarks   It isn't particularly easy to run, but all the code is open source.  Search for ForkJoinPool in that directory to see how we set ours up.