Finding out why grpc async service got a longer latency in client side than server


river

Feb 8, 2021, 12:28:42 AM
to grpc.io


I have a gRPC C++ async server, set up like the tutorial. I added 20 completion queues (while benchmarking I changed this to 8/64/256; it made no difference), each with its own dedicated thread.

Each thread loops endlessly: it fetches a tag from its completion queue, runs our business transaction, then calls the responder->Finish method.


But under quite high pressure, the client side sees a much longer latency than the time our business transaction takes (about 5-6 seconds longer, so it can't be network lag). I think the requests are queued for a long time in the completion queue, but as I said, adding more threads and completion queues didn't seem to help.


Is there some way to debug this? I tried GRPC_TRACE=all, but the logs are overwhelming, and the tags reported in those logs differ from the tags we set, which is also weird.
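Narrowing the tracer set usually keeps the logs manageable. GRPC_TRACE and GRPC_VERBOSITY are real gRPC C-core environment variables; the binary names below are just placeholders for your own server and client. A command fragment, assuming the trace output goes to stderr:

```shell
# "http" limits tracing to transport-level events, which helps distinguish
# network stalls from completion-queue queueing.
GRPC_VERBOSITY=debug GRPC_TRACE=http ./your_async_server 2> server_trace.log

# On the client, tracing the surface API narrows it further.
GRPC_VERBOSITY=debug GRPC_TRACE=api ./your_client 2> client_trace.log
```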

river

Feb 8, 2021, 10:45:53 AM
to grpc.io
I found a reason that may be related to the problem: we spawn a new CallData only after the current request is done processing. Everything else is almost straight from the example:

    if (status_ == CREATE) {
        // Make this instance progress to the PROCESS state.
        status_ = PROCESS;

        // As part of the initial CREATE state, we *request* that the system
        // start processing SayHello requests. In this request, "this" acts as
        // the tag uniquely identifying the request (so that different CallData
        // instances can serve different requests concurrently); here it is
        // the memory address of this CallData instance.
        service_->RequestSayHello(&ctx_, &request_, &responder_, cq_, cq_, this);
    } else if (status_ == PROCESS) {
        // ... our own business logic for the service runs here ...

        // We add the new CallData only after handling the current request;
        // until then, no new RequestSayHello is outstanding on this completion
        // queue. In fact, without something like a work-stealing algorithm, we
        // shouldn't bind one thread to one completion queue, but should try
        // something like a reactor model instead.

        // Spawn a new CallData instance to serve new clients while we process
        // the one for this CallData. The instance will deallocate itself as
        // part of its FINISH state.
        new CallDataAsync(service_, cq_);

        // And we are done! Let the gRPC runtime know we've finished, using the
        // memory address of this instance as the uniquely identifying tag for
        // the event.
        status_ = FINISH;
        responder_.Finish(reply_, grpc::Status::OK, this);
    } else {
        GPR_ASSERT(status_ == FINISH);
        // Once in the FINISH state, deallocate ourselves (CallData).
        delete this;
    }

river

Feb 8, 2021, 10:49:53 AM
to grpc.io
Under high pressure, our business logic for the service takes almost 4 s to finish. If we spawn the new CallData first, in most cases it doubles the time used.

veb...@google.com

May 26, 2022, 4:51:19 PM
to grpc.io
Traces aren't that helpful for understanding what happened unless you know gRPC's internals. Maybe GRPC_TRACE=http gives you some clue as to whether it's a network issue or not.
As for the client-side latency, I don't understand it either. The server code looks good, and since it works well without a heavy workload, I don't think there is an obvious mistake there.
