Client channel unusable after a network reset


antons...@verbsurgical.com

Apr 1, 2019, 1:10:40 PM
to grpc.io
grpc github issue: https://github.com/grpc/grpc/issues/18554

Summary: If a client channel is in the READY state and the network is disconnected, the channel becomes unusable, and the client does not attempt to reconnect to the server once the network connection is re-established. The channel does not transition from READY to TRANSIENT_FAILURE on a DEADLINE_EXCEEDED error (the deadline is set by my client application).


What version of gRPC and what language are you using?

1.17.2
Same issue experienced in version 1.11.x
C++


What operating system (Linux, Windows,...) and version?

Client running on Ubuntu 16.04.
Server running Windows Enterprise.


What did you do?

Server and client are both started on a connected network. I can successfully make calls and receive responses from the server. When the network is turned off, the server receives a "Disconnected client - Endpoint read failed" error. Some other relevant fields in this debug message - "grpc_status":14 (UNAVAILABLE), "occured_during_write":0, "description":"An established connection was aborted by the software in your host machine".

At the time of network disconnect, the client does not print out any logs at all (using GRPC_TRACE=connectivity_state,call_error,op_failure,server_channel,client_channel,channel GRPC_VERBOSITY=DEBUG).

Once the network is turned on again, no logs appear on either the server or the client. Attempting to make a call with the client (sending a launch request) repeatedly results in a DEADLINE_EXCEEDED error. Turning off the network connection at this point does not produce a server-side "Disconnected client" error.

The client context is set to use a deadline (tested with 2 and 10 seconds). Synchronous calls are used in this case.


Code snippets:
/rpc_service.proto

syntax = "proto3";

import "google/rpc/status.proto";

message RpcRequest {
}

message RpcResponse {
}

service RpcService {
  rpc Call(RpcRequest) returns (RpcResponse);
}


/client.cc
Initialization:

std::unique_ptr<RpcService::Stub> stub_ = RpcService::NewStub(
    ::grpc::CreateChannel(server_endpoint, ::grpc::InsecureChannelCredentials()));

Sending a rpc request:

::grpc::ClientContext context;
context.set_deadline(
    gpr_time_from_micros(call_timeout_.InMicroseconds(), GPR_TIMESPAN));
RpcRequest request;
RpcResponse response;
::grpc::Status grpc_status = stub_->Call(&context, request, &response);

/server.cc

grpc::ServerBuilder builder;
builder.AddListeningPort(endpoint, ::grpc::InsecureServerCredentials());
builder.RegisterService(&rpc_service);
std::unique_ptr<grpc::Server> grpc_server_ = builder.BuildAndStart();

What did you expect to see?

Client should make a successful call after a network reset.


What did you see instead?

Client fails to receive a response from the server.


Anything else we should know about your project / environment?

When the network connection is re-established and the client fails to receive a response from the server, tcpdump captures the client sending out some packets.
Starting both client and server with the network ON and then unplugging the network does not produce any error messages until a call is attempted. The result is the same as when both client and server are started with the network disconnected. Once a call is attempted, the client transitions from IDLE to CONNECTING and then bounces back and forth between the CONNECTING and TRANSIENT_FAILURE states (attempting to reconnect with exponential backoff) until the connection is re-established.
If the client is started with the network connected but does not send a request, and the network is then disconnected, the server does not get a "Disconnected client" error. Until a call is made, the client stays in the IDLE state.
If a client is initialized and a call is made on a disconnected network, the client enters the CONNECTING state (retrying with exponential backoff, up to a maximum of 2 minutes, during which the client is in the TRANSIENT_FAILURE state). Once the network is connected, the connection is re-established the next time the channel enters the CONNECTING state, and the client then enters the READY state. After this, each call succeeds until the network is reset.
Disconnecting the network after the client is in the READY state does not transition the client out of the READY state.
In summary: Until a call is made, the client stays in the IDLE state regardless of the network status. Once a call is made, the client attempts to connect by entering the CONNECTING state. If no connection can be made, it bounces between the CONNECTING and TRANSIENT_FAILURE states. Once a connection is made, the client goes into the READY state. From here, if the connection is lost, the client does not attempt to enter the CONNECTING state again.
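
To make the observed transitions easier to follow, here is a minimal sketch that models them as a small state machine. It is plain C++ and does not use gRPC; the State and Event enums and the Next function are hypothetical names invented for this illustration, and the READY branch encodes the behavior I am reporting, not the behavior gRPC is supposed to have.

```cpp
// Hypothetical model of the observed client behavior; this is not gRPC code.
enum class State { kIdle, kConnecting, kTransientFailure, kReady };

enum class Event {
  kCallStarted,     // the application issues an RPC
  kConnectFailed,   // a connection attempt fails
  kBackoffElapsed,  // the backoff timer fires
  kConnected,       // a connection attempt succeeds
  kNetworkLost      // the network is disconnected
};

// Returns the state the client was observed to be in after `event`.
State Next(State state, Event event) {
  switch (state) {
    case State::kIdle:
      // IDLE is left only when a call is made, whatever the network does.
      return event == Event::kCallStarted ? State::kConnecting : state;
    case State::kConnecting:
      if (event == Event::kConnected) return State::kReady;
      if (event == Event::kConnectFailed) return State::kTransientFailure;
      return state;
    case State::kTransientFailure:
      // After the backoff delay the client tries to connect again.
      return event == Event::kBackoffElapsed ? State::kConnecting : state;
    case State::kReady:
      // Reported problem: losing the network is not noticed, so the
      // client never leaves READY and never re-enters CONNECTING.
      return state;
  }
  return state;
}
```

In this model, Next(State::kReady, Event::kNetworkLost) stays in READY, which is the step where I expected a transition toward CONNECTING or TRANSIENT_FAILURE.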


Similar issue (closed) to the one I’m having:
https://github.com/grpc/grpc/issues/16974


Known fix

Create a new channel on each call.
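
A rough sketch of what this workaround looks like, written without a gRPC dependency so it compiles on its own. PerCallChannel and Invoke are hypothetical names made up for this example; in my client the factory would presumably wrap the same ::grpc::CreateChannel(server_endpoint, ::grpc::InsecureChannelCredentials()) call shown above, with Channel being grpc::Channel.

```cpp
#include <functional>
#include <memory>
#include <utility>

// Hypothetical helper: builds a fresh channel for every call, so a channel
// that is stuck in READY after a network reset is never reused.
template <typename Channel>
class PerCallChannel {
 public:
  using Factory = std::function<std::shared_ptr<Channel>()>;

  explicit PerCallChannel(Factory factory) : factory_(std::move(factory)) {}

  // Creates a new channel, runs `call` against it, and returns its result.
  template <typename Call>
  auto Invoke(Call&& call) -> decltype(call(std::declval<Channel&>())) {
    std::shared_ptr<Channel> channel = factory_();
    return call(*channel);
  }

 private:
  Factory factory_;
};
```

The obvious cost is that every call pays for a new connection setup, which is why I would prefer to keep reusing a single channel.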


Failed fix attempts

Set GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA = 0


Questions

Should the client be able to use the already created channel after a network reset?
Does the channel have to be restarted when a network is reset?


yas...@google.com

Apr 22, 2019, 3:21:32 PM
to grpc.io
Please continue the discussion on the GitHub issue. We have recently added tests that create an environment very similar to the one you have described.

A slight clarification: a DEADLINE_EXCEEDED error on a call does not result in a TRANSIENT_FAILURE on the channel. The failures that result in a TRANSIENT_FAILURE usually come from the underlying transport rather than from the application layer.

antons...@verbsurgical.com

Apr 22, 2019, 8:04:28 PM
to grpc.io

Replied on the GitHub issue; I will consolidate all further discussion there: https://github.com/grpc/grpc/issues/18554