gRPC C++ connection failure (getaddrinfo) and deadlock


Amit Waisel

Apr 24, 2017, 6:53:56 AM
to grpc.io
I have a C++ client that connects to a C# server. The connection is made by an RPC function, InitializeStream(), which sends a single request and receives a stream of responses from the server. This RPC is executed with the 'max' timeout (if the server is unavailable, a later call to stream->Read() will return an error, which is good enough for me).
I encountered a weird bug on a VM. I couldn't reproduce it on any other machine, but it reproduces easily on that VM: there, the InitializeStream() RPC function never returns.

Further debugging of this issue revealed the following:
  1. The main thread (thread #1) is blocked inside InitializeStream(), in grpc_iocp_work(). The exact line is iocp_windows.c@83, in Windows's GetQueuedCompletionStatus() function.
    As far as I understand, this is where we wait for a task to complete, with no time limit (I used the 'max' timeout).
      [External Code]
    > Test.exe!grpc_iocp_work(grpc_exec_ctx * exec_ctx, gpr_timespec deadline) Line 83 C
      Test.exe!grpc_pollset_work(grpc_exec_ctx * exec_ctx, grpc_pollset * pollset, grpc_pollset_worker * * worker_hdl, gpr_timespec now, gpr_timespec deadline) Line 140 C
      Test.exe!grpc_completion_queue_pluck(grpc_completion_queue * cc, void * tag, gpr_timespec deadline, void * reserved) Line 614 C
      Test.exe!grpc::CoreCodegen::grpc_completion_queue_pluck(grpc_completion_queue * cq, void * tag, gpr_timespec deadline, void * reserved) Line 70 C++
      Test.exe!grpc::CompletionQueue::Pluck(grpc::CompletionQueueTag * tag) Line 230 C++
      Test.exe!grpc::ClientReader<test::TestRequest>::ClientReader<test::TestRequest><test::InitMessage>(grpc::ChannelInterface * channel, const grpc::RpcMethod & method, grpc::ClientContext * context, const test::InitMessage & request) Line 151 C++
      Test.exe!test::testInterface::Stub::InitializeStreamRaw(grpc::ClientContext * context, const test::InitMessage & request) Line 46 C++
      Test.exe!test::testInterface::Stub::InitializeStream(grpc::ClientContext * context, const test::InitMessage & request) Line 86 C++
      Test.exe!WinMain(HINSTANCE__ * __formal, HINSTANCE__ * __formal, char * __formal, int __formal) Line 17 C++
      [External Code]

  2. One of gRPC's thread-pool threads (thread #2) called do_request_thread() in resolve_address_windows.c@153, which called grpc_blocking_resolve_address() (a blocking function, judging by its name), which called getaddrinfo(), which never returns!
My guess is that thread #1 waits (in GetQueuedCompletionStatus()) for thread #2's task to complete. getaddrinfo() never returns, so GetQueuedCompletionStatus() keeps blocking and the main thread is stuck.
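
The interaction guessed at above can be sketched with the standard library alone. This is a minimal illustration, not gRPC's actual code: ResolveState, simulated_blocking_lookup() and resolve_with_deadline() are hypothetical names, and the "lookup" is a stand-in that blocks until a test hook releases it. It shows a caller that waits for a worker's completion with a deadline, so that a lookup which never returns does not hang the caller forever.

```cpp
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

// Shared state between the waiting thread and the worker (hypothetical type).
struct ResolveState {
    std::mutex mu;
    std::condition_variable cv;
    bool done = false;     // set by the worker once the lookup has finished
    bool release = false;  // hook that lets the simulated lookup return
};

// Hypothetical stand-in for a blocking resolver call. It does not return
// on its own; it blocks until `release` is set by another thread.
void simulated_blocking_lookup(ResolveState& s) {
    std::unique_lock<std::mutex> lock(s.mu);
    s.cv.wait(lock, [&] { return s.release; });
}

// Runs the lookup on a detached worker thread and waits for completion
// for at most `timeout`. Returns true if the lookup finished in time,
// false if the deadline expired first.
bool resolve_with_deadline(std::shared_ptr<ResolveState> state,
                           std::chrono::milliseconds timeout) {
    std::thread([state] {
        simulated_blocking_lookup(*state);
        {
            std::lock_guard<std::mutex> lock(state->mu);
            state->done = true;
        }
        state->cv.notify_all();
    }).detach();

    std::unique_lock<std::mutex> lock(state->mu);
    return state->cv.wait_for(lock, timeout, [&] { return state->done; });
}
```

Calling resolve_with_deadline(std::make_shared<ResolveState>(), std::chrono::milliseconds(50)) returns false in this sketch, because nothing releases the simulated lookup. Note that the worker thread stays blocked after the timeout; the deadline only frees the caller, which is why a timeout works around the hang rather than fixing it.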

Have you encountered this error before? Do you have any idea what I can do (besides adding a timeout to the function, which I consider a workaround and not a solution)?
I use gRPC v1.2.0 for both C++ and C#.

Thanks

Nicolas Noble

Aug 8, 2017, 5:57:56 PM
to grpc.io
Having getaddrinfo() not return is disturbing. While it's true that all of the OS's DNS resolution functions are synchronous and will block until the OS comes back with a response, it's usually expected that the OS returns eventually, either with an error (such as a timeout) or with some results. Not returning at all is neither sane nor expected behavior.
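
For reference, the expected contract (a synchronous call that comes back with either results or an error code) looks roughly like this. It is a small sketch assuming POSIX headers; on Windows the same getaddrinfo() call lives in ws2tcpip.h and requires WSAStartup() first. The helper name resolve_numeric() is hypothetical, and the numeric-only flags are used so that the example involves no DNS traffic.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <cstring>
#include <string>

// Resolves a numeric IPv4/IPv6 literal and a numeric port.
// Returns 0 on success, or a non-zero EAI_* error code on failure.
int resolve_numeric(const std::string& host, const std::string& port) {
    addrinfo hints;
    std::memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      // accept IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;  // TCP
    // Numeric-only: the call parses the strings and performs no name lookup.
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;

    addrinfo* result = nullptr;
    int rc = getaddrinfo(host.c_str(), port.c_str(), &hints, &result);
    if (rc == 0) {
        freeaddrinfo(result);  // release the list allocated by getaddrinfo
    }
    return rc;
}
```

With these flags, resolve_numeric("127.0.0.1", "50051") succeeds, while a non-numeric host string yields a non-zero error code; in both cases the call returns.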

Now, your phrasing is a bit confusing. Are you saying that the DNS resolution thread is stuck resolving the address? Or that you think it somehow did resolve, and then got the rest of the library confused and stuck?

Amit Waisel

Aug 9, 2017, 5:01:12 AM
to grpc.io
Hi Nicolas,
Thank you for your help.
I agree that this behavior is disturbing. I also believed that getaddrinfo should return at some point, but in that case it never did.
I must say that it happened on one VM and never reproduced on any other machine. I haven't encountered this bug since.

As far as I understand the design of gRPC's thread pool: I debugged the process and found out that the RPC function called the ClientReader constructor, which issued an async name-resolution operation and waited for its completion. An arbitrary thread from the pool picked up the resolve task and (eventually) called getaddrinfo. Because this call never returned, the thread in the pool never completed the task, so the ClientReader constructor never finished waiting for the async name resolution.

I hope this clears things up a bit.