Long-lived channels and asynchronous calls


Steven Richardson

Dec 15, 2015, 12:37:41 PM
to grpc.io
I have a question about long-running asynchronous operations using gRPC. Please forgive me for providing a fairly lengthy description but I want to be as clear as possible about the specific issue we are seeing.

There are two kinds of RPC calls that are made from the client side (it is a server-to-server implementation).

1. Short-lived calls for user-specific operations such as creating and acting on orders. These typically complete in under 1 second.

2. Long-lived calls that exist to support a subscription-based scheme. That is to say, the client makes a subscribe call, and the asynchronous responses are used to send notifications in response to various system events that the client is interested in. The remote call itself remains open until the server receives an unsubscribe message from the client. This could be many hours later.

The service (simplified for the sake of this example) as defined by the proto file looks something like this. 

service RemoteService {
    rpc Subscribe (SubscriptionRequest) returns (stream Notification);

    rpc Unsubscribe (SubscriptionRequest) returns (stream UnsubscribeResponse);
}

What we are seeing in practice is that after a channel has been open for an extended period of time (an hour or two), it silently loses connectivity with the server. No error is reported at the time of the apparent disconnect; instead, the next operation that is attempted fails with status code 14 (Unavailable).

This raises a few questions:

1. Is the gRPC stack suitable for long-lived RPC calls between servers like this or are we using the wrong tool for the job? Do we need to implement some kind of heartbeat/keep alive mechanism? I am wondering if something is timing out after a period of idle time.
2. If this is an acceptable approach, what is the desired mechanism for monitoring channels and RPC calls to detect disconnects? If we do disconnect periodically (for one reason or another), then we’d want to re-subscribe so that we can actively maintain communication between the two servers.

I have tried searching for recommendations or best-practices as far as typical use case scenarios for error detection when using asynchronous calls, but I have not found anything that addresses this issue specifically. 

Finally, in case it is important, here is the version information:
 
Client: Running in a Windows server environment with the C# implementation 0.5.1

Server: Running in a Linux environment using the Java implementation 0.7.1

I have been monitoring the various milestones and it seems that the C# implementation is a bit behind the Java one. It looks like 0.9.1 is available for Java and 0.7.1 for C#. Would it make sense to use a common version that is available for both, i.e. 0.7.1? Someone else initially set up this project and seemingly grabbed the newest versions that were available for each language at the time (around June/July).

Any and all help/guidance is appreciated.

Thanks,
Steve 

Steven Richardson

Dec 16, 2015, 11:06:40 AM
to grpc.io
TL;DR: Part of our design requires keeping async RPC calls open for extended periods of time to support a subscription scheme where notifications are sent back to the client in response to various events until the client unsubscribes. Does anyone see a problem with this approach?

Louis Ryan

Dec 16, 2015, 6:50:54 PM
to Steven Richardson, grpc.io
Steve,

gRPC is suitable for handling long-lived streams, so there's nothing wrong with the design approach. It's one we use at Google for many APIs.

TCP connections can and do die for a variety of reasons, and without more details it's hard to know what's occurring in your case. Perhaps there is more detail available in the logs. There's also a fair chance that we have a bug that needs to be resolved. We've addressed similar bugs in the past.

- Louis

Eric Anderson

Jan 11, 2016, 11:36:43 AM
to Steven Richardson, grpc.io
I'm coming back from holiday/vacation...

On Tue, Dec 15, 2015 at 9:37 AM, Steven Richardson <stevenjohnrichardson@gmail.com> wrote:
What we are seeing in practice is that after a channel has been open for an extended period of time (an hour or two), it silently loses connectivity with the server. No error is reported at the time of the apparent disconnect; instead, the next operation that is attempted fails with status code 14 (Unavailable).

That sounds like something killed the TCP connection due to inactivity (commonly a NAT). The more general approach to solving this is TCP keep-alives. It could also be possible for gRPC to use HTTP/2 pings.
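
For readers unfamiliar with TCP keep-alives, here is a minimal sketch of what enabling them looks like on a plain socket, using only Python's standard `socket` module. It is an illustration of the mechanism, not of gRPC itself: gRPC manages its own sockets, so whether and how an application can set these options depends on the implementation and version. The function name and the default timings are assumptions chosen for the example.

```python
import socket


def enable_tcp_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Turn on TCP keep-alive probes for an existing socket.

    The function name and default timings are illustrative assumptions.
    """
    # SO_KEEPALIVE asks the OS to probe an idle connection periodically.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The tuning options are platform-specific, so each one is applied
    # only if this Python build exposes it.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

The idle time would need to be shorter than whatever inactivity timeout the NAT or firewall in the path applies, which is usually something you have to find out for your own network.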

I've been slowly pushing a bit to get some sort of per-connection solution in place, mainly for mobile. There is concern about accidental DDoSing, so it isn't quite trivial. It is also not universal; inside a data center you very possibly would want it off.

This raises a few questions:

1. Is the gRPC stack suitable for long-lived RPC calls between servers like this or are we using the wrong tool for the job?

As Louis replied, you are not using the wrong tool. Long-lived streams for notifications are a very important intended use-case.

Do we need to implement some kind of heartbeat/keep alive mechanism?

Short-term, that could resolve your issue. Long-term, I hope you would no longer have need for it. A simple NOOP service that you issue a request to at some interval between 15 minutes and 2 hours would be enough. You could ignore the response of the service; you are mainly just triggering data packets to 1) notice if the connection is down and 2) prevent any NATs from dropping the port mapping due to inactivity.
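
A minimal sketch of such an application-level heartbeat, in library-agnostic Python so it runs without gRPC installed. Everything named here is an assumption for illustration: `ping` stands in for whatever issues the NOOP RPC on your channel, and `on_failure` is a hypothetical callback that would trigger your reconnect logic.

```python
import threading


class Heartbeat:
    """Calls `ping` every `interval_s` seconds on a background thread.

    `ping` is a hypothetical callable that issues the NOOP request;
    `on_failure` receives any exception the ping raises.
    """

    def __init__(self, ping, interval_s, on_failure):
        self._ping = ping
        self._interval_s = interval_s
        self._on_failure = on_failure
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Wake the worker if it is waiting, then wait for it to exit.
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout (time to ping) and True
        # once stop() has been called.
        while not self._stop.wait(self._interval_s):
            try:
                self._ping()  # response ignored; the traffic is the point
            except Exception as exc:
                self._on_failure(exc)
```

Usage would look something like `Heartbeat(ping=send_noop, interval_s=15 * 60, on_failure=reconnect).start()`, where `send_noop` and `reconnect` are your own functions. Note that this works per channel, not per TCP connection.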

Note that you would still hope for such a thing to be built-in, because ideally this would happen on a per-connection basis instead of a per-gRPC-Channel basis. In some cases a single gRPC Channel may have more than one TCP connection.

2. If this is an acceptable approach, what is the desired mechanism for monitoring channels and RPC calls to detect disconnects. If we do disconnect periodically (for one reason or another), then we’d want to re-subscribe so that we can actively maintain communication between the two servers.

I think I covered that above, but if you are sending traffic on the connection elsewhere and the TCP connection fails, then gRPC will fail all the RPCs on that connection. So your stream would get UNAVAILABLE and you simply re-issue the streaming RPC.
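
The re-issue step can be sketched as a small loop. This is library-agnostic Python and makes several assumptions: `StreamUnavailable` is a hypothetical stand-in for however your gRPC library reports status UNAVAILABLE, `subscribe` is a hypothetical callable that opens a fresh Subscribe stream and returns an iterator of notifications, and the backoff parameters are arbitrary.

```python
import time


class StreamUnavailable(Exception):
    """Hypothetical stand-in for a stream failing with UNAVAILABLE (14)."""


def consume_with_resubscribe(subscribe, handle, max_attempts=5,
                             base_delay_s=1.0, max_delay_s=60.0,
                             sleep=time.sleep):
    """Read notifications, re-issuing the stream when it fails.

    Gives up by re-raising after `max_attempts` consecutive failures.
    """
    failures = 0
    while True:
        try:
            for notification in subscribe():
                failures = 0  # progress was made, so reset the backoff
                handle(notification)
            return  # the server closed the stream normally
        except StreamUnavailable:
            failures += 1
            if failures >= max_attempts:
                raise
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay_s.
            sleep(min(max_delay_s, base_delay_s * 2 ** (failures - 1)))
```

The backoff is a design choice rather than something the thread prescribes: it keeps many clients from hammering the server in lockstep after an outage. Notifications sent while the stream was down are not recovered by this loop, so the application would need its own way to catch up if that matters.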

I have been monitoring the various milestones and it seems that the C# implementation is a bit behind the Java one. It looks like 0.9.1 is available for Java and 0.7.1 for C#.

In general, use the newest version available for a given language. Java's version numbers aren't currently aligned with the other languages.