Hello, I have a client (A) / server (B) program built on gRPC 1.37.
We mainly use BlockingUnaryCall in a multithreaded environment, and no timeout is set on the client side yet.
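For context on the missing timeout: without a deadline, a blocking call that never completes blocks its thread forever. The sketch below shows one way a relative timeout could be attached, assuming the context type exposes `set_deadline` accepting a `std::chrono::system_clock` time point (as `grpc::ClientContext` does). `SetRelativeDeadline` and `FakeContext` are hypothetical names; `FakeContext` only stands in for `grpc::ClientContext` so the snippet compiles without the gRPC headers.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: turns a relative timeout into an absolute deadline
// and stores it on any context type that offers set_deadline(time_point).
template <typename Context>
void SetRelativeDeadline(Context& ctx, std::chrono::milliseconds timeout) {
  ctx.set_deadline(std::chrono::system_clock::now() + timeout);
}

// Stand-in for grpc::ClientContext (assumption: used here only so the
// sketch is self-contained; it just records the deadline it is given).
struct FakeContext {
  std::chrono::system_clock::time_point deadline{};

  template <typename T>
  void set_deadline(const T& d) {
    deadline = d;
  }
};
```

With a real client this would be called as `SetRelativeDeadline(context, std::chrono::seconds(5));` before invoking the stub method. A deadline may turn a silent hang into a `DEADLINE_EXCEEDED` status, though it might not help if the thread is blocked on an internal mutex rather than waiting for the response.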
After several calls (with different RPC methods), some pending RPCs occasionally never return. The call stack looks like this:
Thread 31 (Thread 0x7ff695ffb700 (LWP 3073) "grpcpp_sync_ser"):
#0 0x00007ff707ee87f9 in syscall () from /lib64/libc.so.6
#1 0x00007ff708509987 in absl::lts_20230802::synchronization_internal::FutexWaiter::WaitUntil(std::atomic<int>*, int, absl::lts_20230802::synchronization_internal::KernelTimeout) () from /usr/lib64/libabsl_synchronization.so.2308.0.0
#2 0x00007ff708509a6a in absl::lts_20230802::synchronization_internal::FutexWaiter::Wait(absl::lts_20230802::synchronization_internal::KernelTimeout) () from /usr/lib64/libabsl_synchronization.so.2308.0.0
#3 0x00007ff708509c71 in AbslInternalPerThreadSemWait_lts_20230802 () from /usr/lib64/libabsl_synchronization.so.2308.0.0
#4 0x00007ff70850bbd3 in absl::lts_20230802::Mutex::Block(absl::lts_20230802::base_internal::PerThreadSynch*) () from /usr/lib64/libabsl_synchronization.so.2308.0.0
#5 0x00007ff70850c776 in absl::lts_20230802::Mutex::LockSlowLoop(absl::lts_20230802::SynchWaitParams*, int) () from /usr/lib64/libabsl_synchronization.so.2308.0.0
#6 0x00007ff70850cdac in absl::lts_20230802::Mutex::LockSlowWithDeadline(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, absl::lts_20230802::synchronization_internal::KernelTimeout, int) () from /usr/lib64/libabsl_synchronization.so.2308.0.0
#7 0x00007ff70850934a in absl::lts_20230802::Mutex::LockSlow(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, int) () from /usr/lib64/libabsl_synchronization.so.2308.0.0
#8 0x00007ff708a3c216 in ?? () from /usr/lib64/libgrpc.so.37
#9 0x00007ff708a3fef5 in ?? () from /usr/lib64/libgrpc.so.37
#10 0x00007ff708a49265 in grpc_pollset_work(grpc_pollset*, grpc_pollset_worker**, grpc_core::Timestamp) () from /usr/lib64/libgrpc.so.37
#11 0x00007ff708b5215e in ?? () from /usr/lib64/libgrpc.so.37
#12 0x00007ff709067a5d in grpc::CompletionQueue::Pluck (this=0x7ff695ff9b60, tag=0x7ff695ff9ba0) at /usr/include/grpcpp/completion_queue.h:322
#13 0x00007ff709071bce in grpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl (this=0x7ff695ff9f00, channel=0xb14cc0, method=..., context=0x7ff695ffa030, request=..., result=0x7ff695ffa210) at /usr/include/grpcpp/impl/client_unary_call.h:80
#14 0x00007ff70906ed76 in grpc::internal::BlockingUnaryCall<bam_grpc::bam_get_prealloc_chunks_args, bam_grpc::bam_get_prealloc_chunks_res, google::protobuf::MessageLite, google::protobuf::MessageLite> (channel=0xb14cc0, method=..., context=0x7ff695ffa030, request=..., result=0x7ff695ffa210) at /usr/include/grpcpp/impl/client_unary_call.h:51
......
However, the underlying socket still has data waiting to be read:
e5b2384394a9:/ # ss -antp | grep 49495
LISTEN 0 4096 [::ffff:127.0.0.46]:49495 *:* users:(("B",pid=3003,fd=11))
CLOSE-WAIT 397 0 [::ffff:127.0.0.1]:58096 [::ffff:127.0.0.46]:49495 users:(("A",pid=3042,fd=8))
There are 397 bytes in the Recv-Q that never get read. (The socket is in CLOSE-WAIT because B has closed the connection.)
Can anyone suggest how to debug this further? Thanks.
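To make the socket state above concrete, here is a minimal sketch (not taken from our program) that reproduces it on loopback with plain POSIX sockets on Linux: the peer sends some bytes and closes, while the local side never calls read(). `UnreadBytes` and `ReproduceUnreadCloseWait` are hypothetical helper names, and using `FIONREAD` to read the pending byte count is an assumption about how one might inspect it from inside the process.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes the kernel has received on `fd` that the
// application has not read yet (what ss reports as Recv-Q for a
// connected TCP socket). Returns -1 on error.
int UnreadBytes(int fd) {
  int pending = 0;
  if (ioctl(fd, FIONREAD, &pending) < 0) return -1;
  return pending;
}

// Hypothetical helper: the peer sends `payload_size` bytes and closes,
// the local side never reads. Returns the unread byte count, or -1.
int ReproduceUnreadCloseWait(std::size_t payload_size) {
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  if (listener < 0) return -1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;  // let the kernel pick a free port
  socklen_t len = sizeof(addr);
  if (bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
      listen(listener, 1) < 0 ||
      getsockname(listener, reinterpret_cast<sockaddr*>(&addr), &len) < 0) {
    close(listener);
    return -1;
  }

  int client = socket(AF_INET, SOCK_STREAM, 0);
  if (client < 0 ||
      connect(client, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    if (client >= 0) close(client);
    close(listener);
    return -1;
  }

  int peer = accept(listener, nullptr, nullptr);
  if (peer < 0) {
    close(client);
    close(listener);
    return -1;
  }

  std::string payload(payload_size, 'x');
  ssize_t sent = send(peer, payload.data(), payload.size(), 0);
  close(peer);  // peer sends FIN: the client socket moves to CLOSE-WAIT

  int pending = -1;
  if (sent == static_cast<ssize_t>(payload.size())) {
    pollfd pfd{client, POLLIN, 0};
    // Wait until the data is visible locally, but never read it.
    if (poll(&pfd, 1, 1000) > 0) pending = UnreadBytes(client);
  }

  close(client);
  close(listener);
  return pending;
}
```

Calling `ReproduceUnreadCloseWait(397)` should report 397 unread bytes, matching the Recv-Q value above. The point is that the kernel has already delivered the data, so the missing step is the application (here, the gRPC poller) reading from the fd.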