gRPC 1.55 Throughput Drop when Client Send Rate Increases


Yunhao Bai

May 12, 2024, 2:03:59 PM
to grpc.io
Hi gRPC dev team,
  Recently, I have been working on a pub/sub project on a RHEL 7 system. We use gRPC as the messaging layer: clients publish to the gRPC server with unary RPC calls, and the server then delivers each message to subscribers over a server-streaming RPC.
While running performance tests to measure the gRPC server's throughput, we found very strange behavior: as we increase the unary RPC rate to the server (e.g., from 15k rps to 50k rps), throughput drops after some peak point.
[attachment: throughput.png]
After investigating, we found that many "default-executor" threads occupy CPU whenever there is a throughput drop; when those threads are not active, everything runs fine. I have also attached the CPU utilization with and without the drop, both with the client at 50k rps.
[attachment: rps_50k.png]
[attachment: rps_50k_2.png]
Is there any reasonable explanation for this? In theory, when the system is saturated, throughput should plateau rather than drop.

P.S. We know that gRPC has moved from the "default-executor" threads to the EventEngine thread model, but we also see a throughput drop when we adopt EventEngine in gRPC 1.62 with our CQ-based async server. Could you help us understand why it behaves this way?

Thanks in advance! I have also attached all the code we use to reproduce the issue; please take a look.

Eric Anderson

May 14, 2024, 2:09:40 PM
to Yunhao Bai, grpc.io
On Sun, May 12, 2024 at 11:03 AM Yunhao Bai <cloudb...@gmail.com> wrote:
While running performance tests to measure the gRPC server's throughput, we found very strange behavior: as we increase the unary RPC rate to the server (e.g., from 15k rps to 50k rps), throughput drops after some peak point.

I'm surprised that you are surprised. I thought it was well known that when you over-saturate a server, performance eventually decreases. Although that looks pretty dramatic (not that I can read the image).

After investigating, we found that many "default-executor" threads occupy CPU whenever there is a throughput drop; when those threads are not active, everything runs fine. I have also attached the CPU utilization with and without the drop, both with the client at 50k rps.

Are you using grpc-java? "default-executor" definitely sounds like Java, but you mention event engine later. Event engine is only for the C-based implementation.

Assuming you are using grpc-java, specifying executor() on the server (and channels) to a thread pool you manage can yield more performance. The default executor is a basic unbounded Executors.newCachedThreadPool(). If you know your maximum concurrency needs, limiting the maximum number of threads can improve performance (either via ThreadPoolExecutor or newFixedThreadPool()). Also, some workloads see benefit in using ForkJoinPool (which is a fixed-sized thread pool).

If your application is mostly asynchronous, then choosing the number of threads based on number of cores makes sense, somewhere between a bit less than the number of cores to 2x. If you have a lot of blocking, then you'll just have to experiment.

Yunhao Bai

May 15, 2024, 12:53:15 PM
to grpc.io
Hi Eric,
  Thanks for the quick reply; I did not expect an answer so fast. Actually, we are using the gRPC C++ APIs. We implement the server in the CQ-based async style, just like the qps test server in the gRPC repo (roughly the pattern sketched below), and the client issues unary RPCs at a fixed interval. We run the test on a 48-core server machine with an E5-2650 CPU at 2.20 GHz. The test code is here: https://github.com/yuanyuanJ/gRPC-pertest.
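For context, the handling loop follows roughly this shape (a simplified sketch, not our exact code; the Publisher service and the Publish/PublishRequest/PublishReply names are placeholders, and the real code is in the repo linked above):

#include <grpcpp/grpcpp.h>
// Assumes the generated header for a hypothetical Publisher service with a
// unary Publish(PublishRequest) -> PublishReply method.

class CallData {
 public:
  CallData(Publisher::AsyncService* service, grpc::ServerCompletionQueue* cq)
      : service_(service), cq_(cq), responder_(&ctx_), status_(CREATE) {
    Proceed();
  }

  void Proceed() {
    if (status_ == CREATE) {
      // Ask gRPC to deliver the next Publish call to this object.
      status_ = PROCESS;
      service_->RequestPublish(&ctx_, &request_, &responder_, cq_, cq_, this);
    } else if (status_ == PROCESS) {
      // Immediately post a new handler so further calls can be accepted.
      new CallData(service_, cq_);
      // ... fan the message out to the server-streaming subscribers here ...
      status_ = FINISH;
      responder_.Finish(reply_, grpc::Status::OK, this);
    } else {
      delete this;  // FINISH: this RPC is done.
    }
  }

 private:
  Publisher::AsyncService* service_;
  grpc::ServerCompletionQueue* cq_;
  grpc::ServerContext ctx_;
  PublishRequest request_;
  PublishReply reply_;
  grpc::ServerAsyncResponseWriter<PublishReply> responder_;
  enum CallStatus { CREATE, PROCESS, FINISH } status_;
};

// One of these loops runs per completion queue; these are the
// pubhandleRPCT threads visible in the attached CPU screenshots.
void HandleRpcs(Publisher::AsyncService* service, grpc::ServerCompletionQueue* cq) {
  new CallData(service, cq);
  void* tag;
  bool ok;
  while (cq->Next(&tag, &ok)) {
    static_cast<CallData*>(tag)->Proceed();
  }
}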
  In C++, we cannot control the number of "default-executor" threads in the gRPC server, so we eventually end up with 96 of them in total (2x the number of CPU cores on our machine).
  As for the performance drop: we do not see it as long as these "default-executor" threads are not using any CPU. However, once the default-executor threads kick in, CPU utilization of our CQ-processing threads (pubhandleRPCT) drops, and throughput drops with it.
For example, in the second image only the pubhandleRPCT threads occupy CPU, and the server's throughput stays flat and stable even when it is over-saturated. In the third image, however, "default-executor" threads occupy CPU as well, and throughput drops.
  What confuses us is under what circumstances the "default-executor" comes into play in gRPC 1.55 (or any version before EventEngine). If it is meant to relieve saturation in I/O event handling, why does it make things even worse?

Thanks,
Cloud