I have a question about long-running asynchronous operations using gRPC. Please forgive me for providing a fairly lengthy description but I want to be as clear as possible about the specific issue we are seeing.
There are two kinds of RPC calls that are made from the client side (it is a server-to-server implementation).
1. Short-lived calls for user-specific operations such as creating and acting on orders. These calls typically complete in under 1 second.
2. Long-lived calls that exist to support a subscription-based scheme. That is to say, the client makes a subscribe call, and the asynchronous responses are used to send notifications in response to various system events that the client is interested in. The remote call itself remains open until the server receives an unsubscribe message from the client. This could be many hours later.
The service (simplified for the sake of this example) as defined by the proto file looks something like this:
service RemoteService {
  rpc Subscribe (SubscriptionRequest) returns (stream Notification);
  rpc Unsubscribe (SubscriptionRequest) returns (stream UnsubscribeResponse);
}
What we are seeing in practice is that after a channel has been open for an extended period of time (an hour or two), it silently loses connectivity with the server. No error is reported at the time of the apparent disconnect; instead, the next operation attempted fails with status code 14 (UNAVAILABLE).
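To illustrate the kind of detection this symptom seems to call for, here is a minimal sketch of an idle watchdog that flags a subscription as stale when nothing has arrived for too long. It is plain Java with no gRPC dependency. `IdleWatchdog`, `touch`, `isStale`, and the idle threshold are hypothetical names and values chosen for the example, not part of any gRPC API, and it assumes the server sends something (a notification or a heartbeat message) at least once per interval.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: remembers when the last message arrived on a stream. */
final class IdleWatchdog {
    private final long maxIdleMillis;
    private final AtomicLong lastSeenMillis;

    IdleWatchdog(long maxIdleMillis, long nowMillis) {
        this.maxIdleMillis = maxIdleMillis;
        this.lastSeenMillis = new AtomicLong(nowMillis);
    }

    /** Call whenever a notification (or heartbeat message) is received. */
    void touch(long nowMillis) {
        lastSeenMillis.set(nowMillis);
    }

    /** True if nothing has arrived for longer than maxIdleMillis. */
    boolean isStale(long nowMillis) {
        return nowMillis - lastSeenMillis.get() > maxIdleMillis;
    }
}
```

The time is passed in explicitly so the check can be driven by whatever clock or timer the application already has. A periodic task would call `isStale(System.currentTimeMillis())` and treat a `true` result as a suspected disconnect.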
This raises a few questions:
1. Is the gRPC stack suitable for long-lived RPC calls between servers like this, or are we using the wrong tool for the job? Do we need to implement some kind of heartbeat/keep-alive mechanism? I am wondering if something is timing out after a period of idle time.
2. If this is an acceptable approach, what is the recommended mechanism for monitoring channels and RPC calls to detect disconnects? If we do disconnect periodically (for one reason or another), then we’d want to re-subscribe so that we can actively maintain communication between the two servers.
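To make the second question concrete, the re-subscribe behaviour described there could be sketched roughly as follows. This is plain Java with no gRPC dependency, meant only to show the retry-with-backoff shape. `Resubscriber`, `UnavailableException`, `runWithRetry`, and the backoff parameters are hypothetical stand-ins; in a real client the callable would wrap the actual Subscribe call and the exception would correspond to a failure with status code 14.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a re-subscribe loop; no gRPC types are used. */
final class Resubscriber {
    /** Stand-in for a call that failed with status code 14 (UNAVAILABLE). */
    static final class UnavailableException extends Exception {
        UnavailableException(String message) { super(message); }
    }

    /**
     * Invokes subscribeCall until it returns normally, retrying with
     * exponential backoff after each UnavailableException.
     * Returns the number of attempts made; rethrows after maxAttempts failures.
     */
    static int runWithRetry(Callable<Void> subscribeCall, int maxAttempts,
                            long initialBackoffMillis, long maxBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                subscribeCall.call(); // blocks while the subscription is healthy
                return attempt;       // stream ended cleanly, e.g. after Unsubscribe
            } catch (UnavailableException e) {
                if (attempt == maxAttempts) {
                    throw e;          // give up and surface the last failure
                }
                Thread.sleep(backoff);
                backoff = Math.min(backoff * 2, maxBackoffMillis);
            }
        }
        throw new AssertionError("unreachable");
    }
}
```

Only the stand-in for "unavailable" is retried; any other exception propagates immediately, on the assumption that errors other than a lost connection should not trigger a blind re-subscribe.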
I have tried searching for recommendations or best practices on error detection for asynchronous calls, but I have not found anything that addresses this issue specifically.
Finally, in case it is important, here is the version information:
Client: Running in a Windows server environment with the C# implementation 0.5.1
Server: Running in a Linux environment using the Java implementation 0.7.1
I have been monitoring the various milestones and it seems that the C# implementation is a bit behind the Java one. It looks like 0.9.1 is available for Java and 0.7.1 for C#. Would it make sense to use a common version that is available for both, i.e. 0.7.1? Someone else initially set up this project and seemingly grabbed the newest versions that were available for each language at the time (around June/July).
Any and all help/guidance is appreciated.
Thanks,
Steve