[grpc-java] Deadline issues with broken connections


Uli Bubenheimer

Aug 11, 2017, 9:24:48 PM
to grpc.io
I am seeing issues with how Deadline currently works in the presence of failing connections. Here is what I found in my tests and research for my Android use case:

  • I am missing an option to explicitly fail a connection (signal that the connection is broken) when a Deadline expires. An expired Deadline in my Android use case would typically mean that the connection is broken, and reconnecting may help.
  • Detecting broken connections with Keep-Alives does not work well in conjunction with Deadlines. If all RPCs have failed due to expired Deadlines caused by a broken connection, Keep-Alive won't detect the broken connection unless I set NettyServerBuilder.permitKeepAliveWithoutCalls(true) and OkHttpChannelBuilder.keepAliveWithoutCalls(true), which, judging by the documentation, does not seem like the best idea (see the configuration sketch below).
Deadlines are useful, for example, when there is a user waiting for a response - I have to take action when nothing comes back after 20 or 30 seconds. Setting KeepAlive to such small values as a workaround would not be a good idea.
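
For illustration, here is a minimal sketch of the configuration involved, assuming an OkHttp client and a Netty server; the endpoint, port, and timing values are made up and only show where the relevant keep-alive knobs sit relative to each other:

import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import io.grpc.okhttp.OkHttpChannelBuilder;

public final class KeepAliveConfig {
  static ManagedChannel clientChannel() {
    // Client side (Android/OkHttp): without keepAliveWithoutCalls(true), pings stop
    // as soon as every RPC has failed with DEADLINE_EXCEEDED, which is exactly the
    // interaction described in the bullet above.
    return OkHttpChannelBuilder.forAddress("example.com", 443) // placeholder endpoint
        .keepAliveTime(30, TimeUnit.SECONDS)     // ping after 30s without reads
        .keepAliveTimeout(10, TimeUnit.SECONDS)  // treat the connection as dead if the ping ack takes >10s
        .keepAliveWithoutCalls(true)             // keep pinging even with no open calls
        .build();
  }

  static Server server() {
    // Server side (Netty): the server must explicitly permit pings without open calls,
    // otherwise it treats them as abusive and closes the connection.
    return NettyServerBuilder.forPort(8443) // placeholder port
        .permitKeepAliveTime(30, TimeUnit.SECONDS)
        .permitKeepAliveWithoutCalls(true)
        .build();
  }
}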

I started looking at creating my own LoadBalancer as a workaround, which seems less architecturally insane than recreating the Channel. I am thinking that when I sense a broken connection via an expired Deadline, I can shutdown() the old Subchannel and create a new Subchannel for the same address. I'm not sure how to signal all other open RPCs to error out as if the connection failure had been detected by Keep-Alive; I'd have that problem even with the channel recreation workaround.
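
For comparison, a rough sketch of what the channel-recreation workaround could look like; MyServiceGrpc, MyRequest, and MyReply stand in for hypothetical generated classes, and the endpoint and 20-second deadline are placeholders:

import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import io.grpc.okhttp.OkHttpChannelBuilder;

public final class ReconnectOnDeadline {
  private volatile ManagedChannel channel = newChannel();

  private static ManagedChannel newChannel() {
    return OkHttpChannelBuilder.forAddress("example.com", 443).build(); // placeholder endpoint
  }

  // MyServiceGrpc, MyRequest, MyReply are hypothetical generated classes.
  MyReply callWithReconnect(MyRequest request) {
    try {
      return MyServiceGrpc.newBlockingStub(channel)
          .withDeadlineAfter(20, TimeUnit.SECONDS) // user-facing budget
          .myMethod(request);
    } catch (StatusRuntimeException e) {
      if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
        // Pessimistically assume the connection is broken and rebuild the channel.
        // shutdownNow() also cancels every other RPC still running on the old channel.
        ManagedChannel old = channel;
        channel = newChannel();
        old.shutdownNow();
      }
      throw e;
    }
  }
}

This is racy if several calls expire at once (each may rebuild the channel), which is part of why pushing the logic into a LoadBalancer is tempting.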

Thoughts?

Carl Mastrangelo

Aug 15, 2017, 7:45:39 PM
to grpc.io
How can you tell if a connection is broken? Unless you receive a packet saying the host isn't reachable, it's possible the remote endpoint is just taking a long time. It can't be distinguished from radio silence. The deadline mechanism isn't really for connection level usage, it's for RPC level usage.

If you can listen for OS updates on Android, why not just kill the RPCs yourself when you get notified? And, even if you did do this, how can you tell if the connection has failed? For example, if the OS tells you the antenna is turned off, it may be temporary and could turn on again with neither endpoint being the wiser. The connection is still active.

Uli Bubenheimer

Aug 16, 2017, 2:47:07 AM
to grpc.io
Carl, thanks for your input.

In client-server development, especially for mobile, I can control certain things better than others. I have control over the server and can make it very reliable, so that I can mostly rule it out as a point of failure. By contrast, the network is chaotic. Network communication is going over the cellular network in my case, and I have to design for substantial connectivity problems because the app is generally used in remote locations. So usually the blame for communications problems in my case lies on the client-side network.

How can you tell if a connection is broken? Unless you receive a packet saying the host isn't reachable, it's possible the remote endpoint is just taking a long time. It can't be distinguished from radio silence.

That's the crux of the problem - I usually can't tell if the connection is broken in a timely manner when a Deadline expires. So I have to use good heuristics (make informed guesses) in a way that minimizes the impact on app users. If I take an optimistic stance and reissue calls on an existing yet broken connection, I will waste time because all I may get at the application level is silence and another expired Deadline. If instead I take a pessimistic stance and assume that the connection is broken (whether or not that actually is the case), I can try a new connection (if it appears that I am online) and minimize the downtime. Doing one retry on the old connection can be a good idea, but if it fails I am still facing the same problem.

The real strategy may be a little more complex, but in the end I still need the ability to request a reconnect when my strategy dictates it (and when the Channel has no indication of recent data received). The current Channel design is overbearing when connections are not reliable.

The deadline mechanism isn't really for connection level usage, it's for RPC level usage.
 
Agreed. From the application-level perspective, however, an expired Deadline can simultaneously be an indicator of a broken or unavailable connection, so to take action it is important to be able to signal this problem to the connection/channel level right away, without waiting for a keep-alive ping and its missing response at the regularly scheduled interval. That kind of wait may be fine on the server side, but it's not feasible for my interactive user-facing app. And expired Deadlines cancel the calls, leaving no active calls to trigger KeepAlive pings, which exacerbates the problem.

I mean, I can work around the problem on the client side, but for my use case this means potentially issuing very frequent pings and doing so regardless of the existence of open calls. It's not a great design. I wouldn't have to do either of these things if I could signal to the Channel to reconnect when the Deadline expires.

If you can listen for OS updates on Android, why not just kill the RPCs yourself when you get notified?

Good idea - doing more proactive OS listening & killing RPCs could come in useful. My question in this regard was more about how to best kill all existing RPCs in grpc-java while minimizing races, and the general feasibility of attempting to hide the messiness of Channel recreation by using Subchannels.
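
For what it's worth, one possible shape for killing in-flight RPCs from application code is an interceptor that tracks open calls; this is not a built-in grpc-java facility, and the class and method names here are made up:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ForwardingClientCall.SimpleForwardingClientCall;
import io.grpc.ForwardingClientCallListener.SimpleForwardingClientCallListener;
import io.grpc.Metadata;
import io.grpc.MethodDescriptor;
import io.grpc.Status;

/** Tracks in-flight calls so they can be cancelled when the OS reports the network is gone. */
public final class CancellableCallTracker implements ClientInterceptor {
  private final Set<ClientCall<?, ?>> openCalls = ConcurrentHashMap.newKeySet();

  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
      MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
    return new SimpleForwardingClientCall<ReqT, RespT>(next.newCall(method, callOptions)) {
      @Override
      public void start(Listener<RespT> responseListener, Metadata headers) {
        final ClientCall<ReqT, RespT> self = this;
        openCalls.add(self);
        super.start(new SimpleForwardingClientCallListener<RespT>(responseListener) {
          @Override
          public void onClose(Status status, Metadata trailers) {
            openCalls.remove(self); // stop tracking once the call completes
            super.onClose(status, trailers);
          }
        }, headers);
      }
    };
  }

  /** Called from whatever connectivity callback the app already listens to. */
  public void cancelAll() {
    for (ClientCall<?, ?> call : openCalls) {
      call.cancel("network reported unavailable", null);
    }
    openCalls.clear();
  }
}

It would be installed via OkHttpChannelBuilder.intercept(...) (or ClientInterceptors.intercept(channel, ...)), with cancelAll() wired to the OS connectivity notification; cancelled calls should then complete with a CANCELLED status, though races with calls started concurrently remain.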

And, even if you did do this, how can you tell if the connection has failed? For example, if the OS tells you the antenna is turned off, it may be temporary and could turn on again with neither endpoint being the wiser. The connection is still active.

 Very true, there is no telling if the connection broke.

So in summary I think the ability to signal to the Channel to reconnect is still needed for the Android use case. I recall that one of the official gRPC design documents (the connection backoff document, perhaps) mentions just continuing to use the old connection if it recovers before the new one is ready; the same could apply here.

There are already enhancement requests for RPC retries and reconnection backoff improvements, which I also sorely need.

Should I create an issue about the undocumented problem of expiring Deadlines interfering with KeepAlives? Maybe even just to add something to the Javadoc? While logical, I found this quite surprising.

Carl Mastrangelo

Aug 25, 2017, 10:35:19 PM
to grpc.io
Responses inline.


On Tuesday, August 15, 2017 at 11:47:07 PM UTC-7, Uli Bubenheimer wrote:
Carl, thanks for your input.

In client-server development, especially for mobile, I can control certain things better than others. I have control over the server and can make it very reliable, so that I can mostly rule it out as a point of failure. By contrast, the network is chaotic. Network communication is going over the cellular network in my case, and I have to design for substantial connectivity problems because the app is generally used in remote locations. So usually the blame for communications problems in my case lies on the client-side network.

How can you tell if a connection is broken? Unless you receive a packet saying the host isn't reachable, it's possible the remote endpoint is just taking a long time. It can't be distinguished from radio silence.

That's the crux of the problem - I usually can't tell if the connection is broken in a timely manner when a Deadline expires.

To be clear, I am not sure *anyone* can tell if a connection is gone. The best bet is getting connection-level timeouts. On NettyChannelBuilder we expose setting channel options, but I don't know if we have an equivalent for OkHttpChannelBuilder, which is likely what you are using. We can probably expose one if you file a GitHub issue. (ICMP packets are a close second, but I am not holding my breath for those.)
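
For reference, setting channel options on NettyChannelBuilder looks roughly like this; the endpoint and timeout are placeholders, and this sketch only applies to the Netty transport, not OkHttp:

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.ChannelOption;

public final class NettyChannelOptions {
  static ManagedChannel channelWithConnectTimeout() {
    // Options are passed straight through to the underlying Netty socket.
    // CONNECT_TIMEOUT_MILLIS only bounds connection establishment; noticing an already
    // established but dead connection still relies on keep-alive (or, on Linux with the
    // epoll transport, a TCP user timeout option).
    return NettyChannelBuilder.forAddress("example.com", 443) // placeholder endpoint
        .withOption(ChannelOption.CONNECT_TIMEOUT_MILLIS, 10_000)
        .build();
  }
}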

 
So I have to use good heuristics (make informed guesses) in a way that minimizes the impact on app users. If I take an optimistic stance and reissue calls on an existing yet broken connection, I will waste time because all I may get at the application level is silence and another expired Deadline. If instead I take a pessimistic stance and assume that the connection is broken (whether or not that actually is the case), I can try a new connection (if it appears that I am online) and minimize the downtime. Doing one retry on the old connection can be a good idea, but if it fails I am still facing the same problem.

The real strategy may be a little more complex, but in the end I still need the ability to request a reconnect when my strategy dictates it (and when the Channel has no indication of recent data received). The current Channel design is overbearing when connections are not reliable.

The deadline mechanism isn't really for connection level usage, it's for RPC level usage.
 
Agreed. From the application-level perspective, however, an expired Deadline can simultaneously be an indicator of a broken or unavailable connection, so to take action it is important to be able to signal this problem to the connection/channel level right away, without waiting for a keep-alive ping and its missing response at the regularly scheduled interval. That kind of wait may be fine on the server side, but it's not feasible for my interactive user-facing app. And expired Deadlines cancel the calls, leaving no active calls to trigger KeepAlive pings, which exacerbates the problem.

I mean, I can work around the problem on the client side, but for my use case this means potentially issuing very frequent pings and doing so regardless of the existence of open calls. It's not a great design. I wouldn't have to do either of these things if I could signal to the Channel to reconnect when the Deadline expires.

If you can listen for OS updates on Android, why not just kill the RPCs yourself when you get notified?

Good idea - doing more proactive OS listening & killing RPCs could come in useful. My question in this regard was more about how to best kill all existing RPCs in grpc-java while minimizing races, and the general feasibility of attempting to hide the messiness of Channel recreation by using Subchannels.

Well, I think I need to eat crow: killing RPCs is likely not a good idea. Sadly, killing the channel is not a good idea either. Your options:

* If you kill the RPC, there is no chance for automatic retries. They are an upcoming feature that you probably do want.
* To answer your specific question, calling channel.shutdownNow() is the safest way of killing all RPCs. It will also kill all connections owned by that channel, including ones that were not broken! Safest but worst option (see the sketch after this list).
* Kill the connections. I don't think such a thing exists today, because connection management is harder if we let users have direct access to them. Open to ideas on how to safely expose them.
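
A minimal sketch of that second option, with a placeholder endpoint and wait time:

import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.okhttp.OkHttpChannelBuilder;

public final class ShutdownNowSketch {
  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel = OkHttpChannelBuilder.forAddress("example.com", 443).build(); // placeholder endpoint

    // shutdownNow() fails every open RPC on the channel and tears down all of its
    // connections, healthy or not.
    channel.shutdownNow();
    channel.awaitTermination(5, TimeUnit.SECONDS);

    // A replacement channel has to be built afterwards, and stubs bound to the old
    // channel must be recreated against the new one.
    ManagedChannel replacement = OkHttpChannelBuilder.forAddress("example.com", 443).build();
    replacement.shutdownNow();
  }
}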
 


And, even if you did do this, how can you tell if the connection has failed? For example, if the OS tells you the antenna is turned off, it may be temporary and could turn on again with neither endpoint being the wiser. The connection is still active.

 Very true, there is no telling if the connection broke.

So in summary I think the ability to signal to the Channel to reconnect is still needed for the Android use case. I recall that one of the official gRPC design documents (the connection backoff document, perhaps) mentions just continuing to use the old connection if it recovers before the new one is ready; the same could apply here.

There are already enhancement requests for RPC retries and reconnection backoff improvements, which I also sorely need.

Should I create an issue about the undocumented problem of expiring Deadlines interfering with KeepAlives? Maybe even just to add something to the Javadoc? While logical, I found this quite surprising.

Yes, I think so.  It would at least result in docs being made clearer.