C++ Async server performance of pollset_work


Pere Díaz Bou

Feb 2, 2023, 11:09:10 AM
to grpc.io
Hello to whoever sees this :)

I've been working on adding a gRPC server to one of my projects, and I've been running into performance "issues" when handling heavy loads of incoming requests. The benchmark is simple: nthreads threads each call, ntimes times, a simple endpoint that returns global_variable++;.

After running with 4 threads and 100k parallel requests, the flamegraph produced (attached below) showed that CompletionQueue::Next was taking 80% of the CPU during the high-load period. Specifically, begin_worker and end_worker took a lot of CPU time instead of real processing.

The server had one CompletionQueue for each handleRpcs thread, with one CallData instance in each CompletionQueue.

I'm not entirely sure what the best way to handle incoming RPCs is, or the most performant way of reusing CallData instances. Maybe this pollset_work overhead is expected? Anyway, some pointers here would be helpful.

2023-02-02_17-05.png

Craig Tiller

Feb 3, 2023, 2:15:32 AM
to Pere Díaz Bou, grpc.io
For the async API you need to ensure that the number of queued request calls on the server is sufficient to never hit zero entries in that queue: performance is worst case when the queue depletes.


Pere Díaz Bou

Feb 3, 2023, 4:42:58 AM
to grpc.io
How can I explicitly debug the size of the completion queue? I tried `export GRPC_TRACE=pending_tags` and there were always pending tags, since I fill the queue with the RPC I'm benchmarking plus another one that is unused. I've also tried replicating the RPC on the same queue multiple times, and the performance only changed when I added huge numbers of RPC replicas, which caused performance degradation.

Craig Tiller

Feb 3, 2023, 11:07:15 AM
to Pere Díaz Bou, grpc.io
For our benchmarks it looks like we've hard coded requesting 5000 calls per completion queue: https://github.com/grpc/grpc/blob/master/test/cpp/qps/server_async.cc.
And I think we target 3 threads per completion queue to manage contention.

I'd hazard that 5000 was roughly the smallest number for that benchmark that gave good results.
