Finding out why grpc async service got a longer latency in client side than server


river

Feb 8, 2021, 12:28:42 AM
to grpc.io


I have a gRPC C++ async server, set up like the tutorial. I added 20 completion queues (while benchmarking I changed this to 8/64/256; it made no difference), each with its own dedicated thread.

Each thread loops endlessly: it fetches a tag from its completion queue, runs our business transaction, then calls the responder->Finish method.


But under quite high pressure, the client side sees a much longer latency than the time our business transaction takes (about 5-6 seconds longer, so it can't be network lag). I think the requests are queued for a long time in the completion queue, but as I said, adding more threads and completion queues didn't seem to help.


Is there some way to debug this? I tried GRPC_TRACE=all, but the logs are overwhelming, and the tags reported in those logs differ from the tags we set, which is also weird.
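Narrowing the tracer set usually keeps the logs manageable. GRPC_TRACE and GRPC_VERBOSITY are real gRPC C-core environment variables; the binary names below are just placeholders for your own server and client. A command fragment, assuming the trace output goes to stderr:

```shell
# "http" limits tracing to transport-level events, which helps distinguish
# network stalls from completion-queue queueing.
GRPC_VERBOSITY=debug GRPC_TRACE=http ./your_async_server 2> server_trace.log

# On the client, tracing the surface API narrows it further.
GRPC_VERBOSITY=debug GRPC_TRACE=api ./your_client 2> client_trace.log
```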

river

Feb 8, 2021, 10:45:53 AM
to grpc.io
I found a reason that may be related to the problem: we spawn a new CallData only after the current request is done processing. Everything else is almost straight from the example:

    if (status_ == CREATE) {
        // Make this instance progress to the PROCESS state.
        status_ = PROCESS;

        // As part of the initial CREATE state, we *request* that the system
        // start processing SayHello requests. In this request, "this" acts as
        // the tag uniquely identifying the request (so that different CallData
        // instances can serve different requests concurrently); here it is
        // the memory address of this CallData instance.
        service_->RequestSayHello(&ctx_, &request_, &responder_, cq_, cq_, this);
    } else if (status_ == PROCESS) {
        // ... our own business logic for the service runs here ...

        // We add the new CallData only after handling the current request;
        // until then, no new RequestSayHello is outstanding on this completion
        // queue. In fact, without something like a work-stealing algorithm, we
        // shouldn't bind one thread to one completion queue, but should try
        // something like a reactor model instead.

        // Spawn a new CallData instance to serve new clients while we process
        // the one for this CallData. The instance will deallocate itself as
        // part of its FINISH state.
        new CallDataAsync(service_, cq_);

        // And we are done! Let the gRPC runtime know we've finished, using the
        // memory address of this instance as the uniquely identifying tag for
        // the event.
        status_ = FINISH;
        responder_.Finish(reply_, grpc::Status::OK, this);
    } else {
        GPR_ASSERT(status_ == FINISH);
        // Once in the FINISH state, deallocate ourselves (CallData).
        delete this;
    }

river

Feb 8, 2021, 10:49:53 AM
to grpc.io
Under high pressure, our business logic for the service takes almost 4 s to finish. If we spawn the new CallData first, in most cases it doubles the time used.

veb...@google.com

May 26, 2022, 4:51:19 PM
to grpc.io
Traces aren't that helpful for understanding what happened unless you know gRPC's internals. Maybe GRPC_TRACE=http gives you some clue as to whether it's a network issue or not.
As for the client-side latency, I don't understand it either. The server code looks good, and since it works well without a heavy workload, I don't think there is an obvious mistake there.
