Hello to whoever sees this :)
I've been adding a gRPC server to one of my projects, and I've been running into performance "issues" when it handles heavy loads of incoming requests. The benchmark is simple: nthreads threads each call, ntimes, a simple endpoint that returns global_variable++.
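To make the shape of the benchmark concrete, here is a minimal sketch of it in plain C++. It is not my actual code: handle_request and run_benchmark are hypothetical names, the gRPC call is replaced by a direct function call, and I'm assuming the counter is atomic so the increment is safe across threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for the server-side state touched by the endpoint.
// Assumption: an atomic counter, so concurrent increments are well defined.
std::atomic<std::int64_t> global_variable{0};

// Stand-in for the endpoint body: returns global_variable++.
std::int64_t handle_request() {
    return global_variable.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical driver: nthreads threads each issue ntimes calls.
// In the real benchmark each call would be an RPC sent through a gRPC stub;
// here it is a direct function call so the sketch stays self-contained.
std::int64_t run_benchmark(int nthreads, int ntimes) {
    assert(nthreads > 0 && ntimes >= 0);
    global_variable.store(0);

    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(nthreads));
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([ntimes] {
            for (int i = 0; i < ntimes; ++i) {
                handle_request();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    // Total number of requests handled across all threads.
    return global_variable.load();
}
```

Since the handler itself does almost no work, nearly all of the time measured in the real benchmark is gRPC overhead rather than request processing.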
After running with 4 threads and 100k parallelized requests, the flamegraph produced (attached below) showed that CompletionQueue::Next was taking 80% of the CPU during the high-load period. Specifically, begin_worker and end_worker took a lot of CPU time instead of real processing.
The server had one CompletionQueue per handleRpcs thread, with one CallData instance in each CompletionQueue.
I'm not entirely sure what the best way to handle incoming RPCs is, or what the most performant way of reusing CallData instances is. Maybe this pollset_work overhead is expected? Anyway, any pointers here would be appreciated.