Hi,

Disclaimer: I'm not very familiar with gRPC internals and Netty, but I'm trying to get a solid grasp of both technologies.

I have been running Java gRPC benchmarks and found that adding more channels improves throughput. I have read somewhere (I can't find the link) that gRPC channels should be reused, which makes sense because, as far as I know, they represent the connection, and it would be expensive to open a new connection often.
However, I wasn't able to find anything in the official documentation that suggests how many channels to use or how to configure the gRPC server and client for the best performance.

My benchmarks show that each bidi stream caps out at about 30-40k ops/s on my machine, and more streams are required if I want higher throughput. They also show that adding more streams over a single channel doesn't always improve throughput; however, creating a new channel and adding more streams does.
Could anyone from the gRPC team confirm or refute the above results and conclusions? In particular, is it expected that I need more gRPC channels (reusing each of them), or should I be able to get the same results with only one gRPC channel?
Additionally, it would be very nice if anyone could explain the relationship between the number of worker event loops (in both client and server) and the number of channels/streams. I understand that Netty assigns a channel to exactly one worker event loop (and I don't think it ever moves them around? -- please correct me if I'm wrong), which would mean that with a single gRPC channel, having more than one worker event loop wouldn't help at all.
Hello! Responses inline:
On Sunday, June 3, 2018 at 4:33:09 AM UTC-7, Nikola Stojiljkovic wrote:

> Hi,
>
> Disclaimer: I'm not very familiar with gRPC internals and Netty, but I'm trying to get a solid grasp of both technologies.
>
> I have been running Java gRPC benchmarks and found that adding more channels improves throughput. I have read somewhere (I can't find the link) that gRPC channels should be reused, which makes sense because, as far as I know, they represent the connection, and it would be expensive to open a new connection often.

Channels should be reused. The connection expense is usually low enough that it isn't an issue, though if you are sending more than 10 Gbps per connection it starts to have an impact. Very few users are able to reach this, and if they do, they probably don't use Java! More seriously, how many connections are you using, and to how many backend servers? Are the client and server running on the same machine?
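If it turns out a few connections are needed, the usual pattern is a small, fixed pool of long-lived channels picked round-robin, created once and reused. A minimal sketch of that pattern (the element type is left generic so it runs standalone; with gRPC you would fill it with ManagedChannel instances, which is an assumption about your setup):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Fixed pool of long-lived items handed out round-robin.
// With gRPC, the items would be ManagedChannel instances (assumption).
final class RoundRobinPool<T> {
    private final List<T> items;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinPool(List<T> items) {
        this.items = items;
    }

    T get() {
        // floorMod keeps the index valid even after the counter overflows
        return items.get(Math.floorMod(next.getAndIncrement(), items.size()));
    }
}
```

The key point is that the pool is built once at startup and reused for the life of the process, never rebuilt per RPC.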
> However, I wasn't able to find anything in the official documentation that suggests how many channels to use or how to configure the gRPC server and client for the best performance.
>
> My benchmarks show that each bidi stream caps out at about 30-40k ops/s on my machine, and more streams are required if I want higher throughput. They also show that adding more streams over a single channel doesn't always improve throughput; however, creating a new channel and adding more streams does.

Can I ask how many concurrent RPCs you are using? Also, can I ask the payload size? While RPCs can be interleaved with each other, the messages of a given RPC must be processed in order. Also, if you aren't using ForkJoinPool as your executor, you are almost certainly running into contention at these levels.
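To illustrate why the executor matters: per-message callbacks are many tiny tasks, and ForkJoinPool's work-stealing deques hold up much better under that shape of load than a single locked queue. A standalone sketch (the hookup to a channel builder via `.executor(...)` is only hinted at in the comment and is not exercised here; the demo itself is plain JDK):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

// gRPC runs application callbacks on the Executor you pass at build time
// (e.g. builder.executor(ForkJoinPool.commonPool()) -- not shown running here).
// This demo just drives many tiny tasks through the common pool, which is
// roughly the shape of load per-message callbacks produce.
final class ExecutorDemo {
    static long run(int tasks) {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        LongAdder done = new LongAdder();   // striped counter, low contention
        pool.submit(() ->
            IntStream.range(0, tasks).parallel().forEach(i -> done.increment())
        ).join();                           // wait for all subtasks to finish
        return done.sum();
    }
}
```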
> Could anyone from the gRPC team confirm or refute the above results and conclusions? In particular, is it expected that I need more gRPC channels (reusing each of them), or should I be able to get the same results with only one gRPC channel?

Source code exhibiting the slowdown is ideal, and would allow us to find common bottlenecks. Aside from that, profiling the job is the next best thing.
> Additionally, it would be very nice if anyone could explain the relationship between the number of worker event loops (in both client and server) and the number of channels/streams. I understand that Netty assigns a channel to exactly one worker event loop (and I don't think it ever moves them around? -- please correct me if I'm wrong), which would mean that with a single gRPC channel, having more than one worker event loop wouldn't help at all.

Netty does assign a *netty* channel to a single loop, but not a gRPC channel. gRPC's channel is a multiplexing channel across each of the subchannels provided by the load balancer. The subchannels are more closely aligned with a netty channel, but this is only approximately the truth (I can explain further if interested, but I don't think it's relevant to your question).
As for event loops: the event loop usage in gRPC is limited to SSL coding, HTTP/2 framing, HPACK coding, and wire-level flow control, as well as the actual syscalls. While each of these operations is pretty fast, they aren't free. Thus, if each connection is being heavily used, it makes sense to have multiple connections. (Again, you need to be talking about hundreds of megabits to gigabits per connection before this really applies to you.) This data is passed up to the gRPC layer and handled on the Executor that you pass in at Channel/Server creation. In particular, the ClientCall.Listener and ServerCall.Listener callbacks represent the boundary between application threads and netty threads.
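For concreteness, the split between netty threads and application threads can be configured along these lines (a sketch only: it assumes grpc-netty and netty on the classpath, and the address is a placeholder):

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;
import java.util.concurrent.ForkJoinPool;

// One netty event loop for framing/HPACK/flow control/syscalls...
EventLoopGroup loop = new NioEventLoopGroup(1);

ManagedChannel channel = NettyChannelBuilder
    .forAddress("server.example", 8080)   // placeholder address
    .eventLoopGroup(loop)
    .channelType(NioSocketChannel.class)  // must match the NIO event loop group
    .executor(ForkJoinPool.commonPool())  // ...and many app threads for callbacks
    .usePlaintext()
    .build();
```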
This is a pretty large departure from the way netty code is normally written. In gRPC it is more common to have only a handful of netty threads (often just 1) and many more application threads.

> Also, how does the gRPC channel relate to the netty channel? I thought that a netty channel is created per gRPC channel, but I may be wrong according to this post: https://groups.google.com/forum/#!searchin/grpc-io/grpc$20reuse$20channels|sort:date/grpc-io/LrnAbWFozb0/8_te_-GmAgAJ (the explanation about the Boss Event Loop Group suggests otherwise).
>
> Thanks in advance.
>
> Cheers,
> Nikola
Hi Carl,

Thanks for the response. Sorry I wasn't able to reply earlier.
Answers inline.
On Tuesday, 5 June 2018 02:55:44 UTC+1, Carl Mastrangelo wrote:

> Hello! Responses inline:
> On Sunday, June 3, 2018 at 4:33:09 AM UTC-7, Nikola Stojiljkovic wrote:
>
>> Hi,
>>
>> Disclaimer: I'm not very familiar with gRPC internals and Netty, but I'm trying to get a solid grasp of both technologies.
>>
>> I have been running Java gRPC benchmarks and found that adding more channels improves throughput. I have read somewhere (I can't find the link) that gRPC channels should be reused, which makes sense because, as far as I know, they represent the connection, and it would be expensive to open a new connection often.
>
> Channels should be reused. The connection expense is usually low enough that it isn't an issue, though if you are sending more than 10 Gbps per connection it starts to have an impact. Very few users are able to reach this, and if they do, they probably don't use Java! More seriously, how many connections are you using, and to how many backend servers? Are the client and server running on the same machine?

Ideally I would like to scale this to dozens or hundreds of machines. At the moment I've benchmarked the end-to-end throughput with up to 8 server and 8 client machines with 32 cores each.

Regarding the number of connections, it depends on the benchmark, but I'm trying to optimize for a single connection/channel to start with. The machines communicate with each other internally, so there won't be many connections.

The client and the server do not run on the same machine. However, I have benchmarks that run the client and the server on the same machine (within the same process), as well as on different machines. The throughput doesn't seem to be affected by the network at this stage.

>> However, I wasn't able to find anything in the official documentation that suggests how many channels to use or how to configure the gRPC server and client for the best performance.
>>
>> My benchmarks show that each bidi stream caps out at about 30-40k ops/s on my machine, and more streams are required if I want higher throughput. They also show that adding more streams over a single channel doesn't always improve throughput; however, creating a new channel and adding more streams does.
>
> Can I ask how many concurrent RPCs you are using? Also, can I ask the payload size? While RPCs can be interleaved with each other, the messages of a given RPC must be processed in order. Also, if you aren't using ForkJoinPool as your executor, you are almost certainly running into contention at these levels.

I tried sending between 100 and 1000 concurrent RPCs with a payload size of 50-60 bytes. I am using ForkJoinPool. The number of concurrent RPCs doesn't affect the throughput (as long as it's above some threshold).
>> Could anyone from the gRPC team confirm or refute the above results and conclusions? In particular, is it expected that I need more gRPC channels (reusing each of them), or should I be able to get the same results with only one gRPC channel?
>
> Source code exhibiting the slowdown is ideal, and would allow us to find common bottlenecks. Aside from that, profiling the job is the next best thing.

I have spent some time profiling parts of the system as well as the whole system. The benchmarks show that as soon as I start sending requests through the gRPC channel, the throughput drops significantly -- I have many benchmarks to measure the throughput of each layer, e.g. how the data-structure throughput compares to the throughput of the service that wraps the data structure.

This led me to write JMH benchmarks for gRPC itself too, with a dummy service that implements a unary and a streaming call. The service sends a DummyResponse for each DummyRequest, with a payload size specified by the client (about 50 bytes). I have run these benchmarks on a single 64-core machine within a single process, not over the network, but using NettyChannelBuilder and NettyServer. Both the unary and the streaming benchmarks have shown that adding more channels improves the throughput significantly.

I can extract the source code of the gRPC benchmark if that helps, since I'm observing similar behaviour in the gRPC benchmarks of DummyService and in the benchmarks of the actual system.

The profiler shows that both the server and channel worker event loops are not doing IO all the time (see the screenshot below). This probably means that I am not using the channel's full capacity, but somehow adding more channels makes up for some bottleneck that I can't spot. The machines are not CPU-bound. There doesn't seem to be any obvious thread contention, and if it is there, it is not present when the requests don't go through the gRPC service.

Adding more JMH worker threads that use the same channel to send more requests in parallel doesn't improve the throughput. Neither does raising the target number of outstanding RPCs.

I've also tried changing the flow control window sizes; however, this doesn't improve anything -- which seems expected, as the data going over HTTP/2 shouldn't be large in this case. Underlying TCP window scaling is enabled, so I don't think that is the problem either.

I do believe that I should be able to end up using a single channel between the client and the server, but I'm not sure what exactly needs to be configured to get the same throughput as with multiple channels. I am using NioEventLoopGroups instead of Epoll, but if I understand correctly this shouldn't make a difference for a single (or a few) connection(s).

>> Additionally, it would be very nice if anyone could explain the relationship between the number of worker event loops (in both client and server) and the number of channels/streams. I understand that Netty assigns a channel to exactly one worker event loop (and I don't think it ever moves them around? -- please correct me if I'm wrong), which would mean that with a single gRPC channel, having more than one worker event loop wouldn't help at all.
>
> Netty does assign a *netty* channel to a single loop, but not a gRPC channel. gRPC's channel is a multiplexing channel across each of the subchannels provided by the load balancer. The subchannels are more closely aligned with a netty channel, but this is only approximately the truth (I can explain further if interested, but I don't think it's relevant to your question).

While this may not be relevant to the question, I am keen on learning a bit more about the gRPC internals. Can I find out more about it online, or is the best way to look around the gRPC GitHub repo?
> As for event loops: the event loop usage in gRPC is limited to SSL coding, HTTP/2 framing, HPACK coding, and wire-level flow control, as well as the actual syscalls. While each of these operations is pretty fast, they aren't free. Thus, if each connection is being heavily used, it makes sense to have multiple connections. (Again, you need to be talking about hundreds of megabits to gigabits per connection before this really applies to you.) This data is passed up to the gRPC layer and handled on the Executor that you pass in at Channel/Server creation. In particular, the ClientCall.Listener and ServerCall.Listener callbacks represent the boundary between application threads and netty threads.

This suggests that I can make much better use of a single channel -- while the system may end up sending huge amounts of data between the client and the server in the near future (hundreds of megabits per second), it is not there yet.

> This is a pretty large departure from the way netty code is normally written. In gRPC it is more common to have only a handful of netty threads (often just 1) and many more application threads.
>
>> Also, how does the gRPC channel relate to the netty channel? I thought that a netty channel is created per gRPC channel, but I may be wrong according to this post: https://groups.google.com/forum/#!searchin/grpc-io/grpc$20reuse$20channels|sort:date/grpc-io/LrnAbWFozb0/8_te_-GmAgAJ (the explanation about the Boss Event Loop Group suggests otherwise).
>>
>> Thanks in advance.
>>
>> Cheers,
>> Nikola

To extract some key points from the above answers:
- I have written benchmarks for my service, as well as for a gRPC service that sends dummy requests and responses.
- All benchmarks that go through the gRPC channel and service show that adding more channels improves the overall throughput.
- Server and channel worker event loops don't seem to be spending all of their time doing IO.
- I would like to start by getting the best throughput over a single gRPC channel for a DummyService, i.e. a service that sends a response back to the client for each request, if possible.
Assuming the client is sending about 100 Mbps of data and the service responds with the same amount of data as soon as it gets a request (it doesn't run any logic except generating a random string of the specified length, about 50 bytes), how can I find out where the bottleneck is? What should I be searching for in the profiler, since most of the threads are not doing much work? Note that the throughput improves simply by adding more channels (assuming there are multiple streams in the case of the streaming benchmarks).
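For what it's worth, a back-of-envelope estimate of the message rate that 100 Mbps of ~50-byte payloads implies (the ~40 bytes of per-message HTTP/2 framing and header overhead is an assumption; real overhead varies with HPACK compression):

```java
// Back-of-envelope: messages/s needed to carry a given bitrate with small payloads.
// The overhead figure is an assumed per-message HTTP/2 framing + header cost.
final class RateEstimate {
    static long messagesPerSecond(long bitsPerSecond, int payloadBytes, int overheadBytes) {
        long bytesPerSecond = bitsPerSecond / 8;
        return bytesPerSecond / (payloadBytes + overheadBytes);
    }
}
```

With 100_000_000 bps, 50-byte payloads, and an assumed 40 bytes of overhead, this gives roughly 139k messages/s; against a per-stream cap of 30-40k ops/s, needing around four streams (or channels) to carry it would be consistent with the observations above.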