gRPC streams unexpected disconnections

Amit Waisel

Sep 18, 2017, 9:25:18 AM
to grpc.io

I encountered a weird behavior in gRPC.


The symptom: an active RPC stream is signaled as cancelled on the server side, although the client is still active and the stream should not have been closed. It happens from time to time, and I couldn't correlate it with any other event in the environment.

It happens for streams initialized as response streams in RPC calls from both C++ and NodeJS clients. It happened on gRPC 1.3.6 and still happens on gRPC v1.6.0.

The problem is not easy to reproduce: the system has to run under heavy load for many hours before it happens.


In my code, I have 2 main types of streams:

  1. Control stream (C++→C#) - the client initiates an RPC call to the server, which keeps the RPC's response stream open.
    These streams are used as control channels with the C++ clients and are kept open to allow server-to-client requests. When one is closed, both client and server clean up all data related to the connection, so the control stream is critical to the session.
    The server registers for call-cancellation notifications:
     ServerCallContext context; // Received from the RPC call as a parameter
     // ...
     context.CancellationToken.Register(() =>
         System.Threading.ThreadPool.QueueUserWorkItem(async obj => { handle_disconnection(...); }));

    The total number of open control streams (i.e., the number of connected C++ clients) is ~1200.
  2. Command stream (NodeJS→C#) - there are many other streams for server-to-client command/response communication, which the server keeps open in parallel with the NodeJS clients. The total number of open command streams is 20K-30K.

The problem is noticeable when the control streams get disconnected.

Further investigation of the client (C++) and server (C#) logs around a control stream disconnection revealed the following:

  1. For some reason, the server's cancellation token (the one registered above) is signaled, and the server does its cleanup (`handle_disconnection`, which also intentionally closes many command streams). According to the client, the connection should have remained open.
  2. After some time, the client realizes the connection was closed unexpectedly and does its cleanup - throwing the error I posted here (NodeJS in that case). The client disconnects only after the server has closed the connection and the control stream.
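
One way to confirm the ordering described in the list above is to merge the client and server logs for a single stream and check which side recorded the close first, and how long the other side took to notice. The following is only a minimal Node.js sketch: the record shape (`ts`, `side`, `event`) and the event names are hypothetical, not anything gRPC emits, and it assumes the client and server clocks are synchronized.

```javascript
// Hypothetical log records: { ts: epoch millis, side: 'client' | 'server', event: string }.
// Assumes both hosts log against a synchronized clock.
function firstToObserveClose(events, closeEvents = ['cancelled', 'closed']) {
  // Keep only close-type events and order them by time.
  const closes = events
    .filter((e) => closeEvents.includes(e.event))
    .sort((a, b) => a.ts - b.ts);
  if (closes.length === 0) return null;

  const first = closes[0];
  // Earliest close event recorded by the opposite side, if any.
  const other = closes.find((e) => e.side !== first.side);
  return {
    side: first.side,
    ts: first.ts,
    lagMs: other ? other.ts - first.ts : null,
  };
}

// Made-up sample data, for illustration only.
const sample = [
  { ts: 1000, side: 'client', event: 'message_sent' },
  { ts: 1500, side: 'server', event: 'cancelled' },
  { ts: 4200, side: 'client', event: 'closed' },
];
console.log(firstToObserveClose(sample));
```

A `side` of `server` with a positive `lagMs` would match the sequence above: the server's token fires first and the client only notices afterwards.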

Another note - I set the servers' RequestCallTokensPerCompletionQueue value to 32768 (32K) per completion queue, for both the C++ and the NodeJS client interfaces.

I have 2 server interfaces (one for NodeJS clients and one for C++ clients, which have different APIs) and 4 completion queues (on an 8-core machine). I don't really know whether the 4 completion queues are global or per-server.

Do you think it might cause those streams to be closed under heavy load?
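
To make the open question about global versus per-server completion queues concrete, the configured numbers can be tallied under both readings. This is a back-of-the-envelope sketch only: it assumes each server requests `RequestCallTokensPerCompletionQueue` calls on every queue it uses, which the thread does not confirm, and it says nothing about whether these values cause the disconnections.

```javascript
// Numbers from the post above. How they combine is an assumption of this sketch:
// each server is taken to request `tokensPerQueue` calls on every queue it uses.
function tokenTally({ tokensPerQueue, queueCount, serverCount, queuesShared }) {
  const totalTokens = tokensPerQueue * queueCount * serverCount;
  // Shared queues: both servers post onto the same `queueCount` queues.
  // Per-server queues: each server owns its own set of `queueCount` queues.
  const distinctQueues = queuesShared ? queueCount : queueCount * serverCount;
  return {
    totalTokens,
    distinctQueues,
    tokensPerDistinctQueue: totalTokens / distinctQueues,
  };
}

const base = { tokensPerQueue: 32768, queueCount: 4, serverCount: 2 };
console.log('shared queues:    ', tokenTally({ ...base, queuesShared: true }));
console.log('per-server queues:', tokenTally({ ...base, queuesShared: false }));
```

Under these assumptions the total number of outstanding tokens is the same either way; what changes is how many land on each individual queue.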

 

In any case, my suspicion falls on the C# server's behavior: the CancellationToken is signaled for no apparent reason.

I haven't ruled out network instability yet, although both clients and server are located on the same ESX server with 10-gigabit virtual adapters between them, so that is quite a long shot.

 

Do you have any idea how to solve this?

Thanks!

Jan Tattermusch

Oct 6, 2017, 3:37:30 AM
to grpc.io
The "serverCallContext.CancellationToken" is triggered any time the server-side call terminates prematurely with an error; it doesn't necessarily mean there was a cancellation. Based on what you're saying (it takes hours of heavy load to reproduce), this may well be a rarely triggered bug in the C# or C-core stack, but it's hard to tell with certainty unless you can come up with some evidence (such as traces showing exactly what went wrong).
Can you reproduce this on a single machine (that would rule out network problems)? Can you reproduce it with a much lower number of concurrently open streams? Can you reproduce it when using only C++ clients and servers (that would point to a potential problem in C-core)?
Have you checked values of relevant channel arguments? (e.g. https://github.com/grpc/grpc/blob/master/include/grpc/impl/codegen/grpc_types.h#L146)
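
As a starting point for that check, the sketch below lists a few channel-argument keys from `grpc_types.h` that bear on connection lifetime and stream limits, and reports which of them a given options object leaves unset. The selection of keys is an assumption of this sketch (which ones exist depends on the gRPC version in use), and the values shown are placeholders, not recommendations.

```javascript
// Channel-argument keys as spelled in grpc_types.h. Which ones exist depends on
// the gRPC version in use; this particular selection is an assumption of the sketch.
const ARGS_TO_REVIEW = [
  'grpc.max_concurrent_streams',
  'grpc.max_connection_idle_ms',
  'grpc.max_connection_age_ms',
  'grpc.keepalive_time_ms',
  'grpc.keepalive_timeout_ms',
  'grpc.keepalive_permit_without_calls',
];

// Returns the reviewed keys that `options` does not set,
// i.e. the ones left at the library defaults.
function argsLeftAtDefault(options) {
  return ARGS_TO_REVIEW.filter((key) => !(key in options));
}

// Placeholder values for illustration only, not recommendations.
const exampleOptions = {
  'grpc.keepalive_time_ms': 60000,
  'grpc.keepalive_timeout_ms': 20000,
};

console.log(argsLeftAtDefault(exampleOptions));
```

The same key strings are used across language bindings: a Node client takes them in the options object passed when the client is constructed, and Grpc.Core takes them as a list of `ChannelOption` entries when creating a channel or server.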