Detecting bad connection in java

yfe...@gmail.com

unread,

Jun 5, 2019, 1:48:38 PM6/5/19

to grpc.io

Hey folks,

I'm making calls using a `ManagedChannel` over a VPN connection. When the VPN renegotiates (~ 8 hours), a periodic call (issued every 10 seconds) that would ordinarily take 1ms times out repeatedly for about 15 min instead before the `ManagedChannel` figures out the connection is bad (fails with `RuntimeException: UNAVAILABLE`) and resets. Is there a way to make that happen faster? I've set the deadline for this call to 2 seconds which is typically more than enough time (usually comes back in under 10 ms). The logs end up looking something like

14:14:05.256 io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 4999901019ns

14:14:15.257 io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 4999901019ns

... call is made every 10 sec and fails like this for 15 min before

14:30:42.236 java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.grpc.netty.shaded.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)

Wrapped by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception

After which everything resumes normally.

How can I get `ManagedChannel` to pick up that the connection is bad and to reset earlier than 15min. I've tested outside of grpc that the VPN renegotiation takes about 2 seconds, not anywhere close to 15 min.

I'm using GRPC for java 1.18.0 with Netty

Many thanks!

Yosef

Carl Mastrangelo

unread,

Jun 5, 2019, 1:57:25 PM6/5/19

to grpc.io

Try setting the keep alive settings defined here:

https://grpc.github.io/grpc-java/javadoc/io/grpc/ManagedChannelBuilder.html#keepAliveTimeout-long-java.util.concurrent.TimeUnit-

yfe...@gmail.com

unread,

Jun 5, 2019, 2:02:34 PM6/5/19

to grpc.io

Hey Carl,

Thanks for your help with this!

Is it correct that if I am submitting calls every 10 seconds, but my keepAliveTime > 10seconds, then no keep alives would be sent? Or are keep alives sent regardless of request activity?

Is it only failed keep alive pings that will indicate a channel is down but not regular requests?

Thanks again,

Yosef

Carl Mastrangelo

unread,

Jun 5, 2019, 3:01:42 PM6/5/19

to grpc.io

You'll need to read the docs of that method I linked (and the other ones with "keepalive" in the name). It discusses what will happen with regards to active calls.

yfe...@gmail.com

unread,

Jun 5, 2019, 8:24:24 PM6/5/19

to grpc.io

So it seems like I'm getting hit with the problem addressed in https://github.com/grpc/grpc-java/pull/5599 and https://github.com/grpc/proposal/blob/master/A18-tcp-user-timeout.md. It exactly matches the 15min it takes to reset, the write heavy workload and the default linux TCP timeout.

I'm sadly unable to upgrade to the latest release (1.21.0) that has the fix at the moment. Wonder if anyone has another way around this?