Under high load, clients get StatusRuntimeException: UNKNOWN: channel closed

Ethan Cheng

Jul 29, 2020, 7:56:50 PM
to grpc.io
We had an incident earlier in which one of our clients had an improper retry policy that sent a huge wave of requests to the server. All clients then started to see `StatusRuntimeException: UNKNOWN: channel closed` on the client side.

For one client, it was:
```
@400000005f0744ef2096858c java.util.concurrent.CompletionException: io.grpc.StatusRuntimeException: UNKNOWN: channel closed
@400000005f0744ef2096858c Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
@400000005f0744ef20968974 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
@400000005f0744ef2096df64 at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
@400000005f0744ef2096df64 at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?]
@400000005f0744ef2096e34c at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
@400000005f0744ef2096e34c at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) [?:?]
@400000005f0744ef2096ef04 at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:449) [grpc-stub-1.27.1.jar:1.27.1]
@400000005f0744ef2096fabc at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426) [grpc-core-1.29.0.jar:1.29.0]
@400000005f0744ef2096fea4 at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66) [grpc-core-1.29.0.jar:1.29.0]
@400000005f0744ef2096fea4 at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689) [grpc-core-1.29.0.jar:1.29.0]
@400000005f0744ef20970674 at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577) [grpc-core-1.29.0.jar:1.29.0]
@400000005f0744ef20970a5c at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751) [grpc-core-1.29.0.jar:1.29.0]
@400000005f0744ef20970a5c at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740) [grpc-core-1.29.0.jar:1.29.0]
@400000005f0744ef2097604c at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.29.0.jar:1.29.0]
@400000005f0744ef20976434 at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) [grpc-core-1.29.0.jar:1.29.0]
@400000005f0744ef20976434 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
@400000005f0744ef20976fec at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
@400000005f0744ef20976fec at java.lang.Thread.run(Thread.java:834) [?:?]
@400000005f0744ef20976fec Caused by: io.grpc.StatusRuntimeException: UNKNOWN: channel closed
@400000005f0744ef209773d4 Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
@400000005f0744ef209777bc at io.grpc.Status.asRuntimeException(Status.java:533) ~[grpc-api-1.29.0.jar:1.29.0]
@400000005f0744ef209777bc ... 12 more
```

For other clients, it was:
```
Caused by: java.nio.channels.ClosedChannelException
at io.grpc.netty.shaded.io.grpc.netty.Utils.statusFromThrowable(Utils.java:168)
at io.grpc.netty.shaded.io.grpc.netty.NettyClientTransport$5.operationComplete(NettyClientTransport.java:267)
at io.grpc.netty.shaded.io.grpc.netty.NettyClientTransport$5.operationComplete(NettyClientTransport.java:261)
at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:485)
at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:112)

java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNKNOWN: channel closed
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_92]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: channel closed
at io.grpc.Status.asRuntimeException(Status.java:533) 
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:442)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[sync-bid-strategy.jar:?]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[sync-bid-strategy.jar:?]
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700) ~[sync-bid-strategy.jar:?]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) 
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) 
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) 
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399) 
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:507) 
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66) 
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:627) 
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:515)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:686)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:675)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) 
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_92]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_92]
at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_92]
Caused by: java.nio.channels.ClosedChannelException
at io.grpc.netty.shaded.io.grpc.netty.Utils.statusFromThrowable(Utils.java:202) ~
at io.grpc.netty.shaded.io.grpc.netty.WriteBufferingAndExceptionHandler.exceptionCaught(WriteBufferingAndExceptionHandler.java:86) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:276) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:268) 
at io.grpc.netty.shaded.io.grpc.netty.ProtocolNegotiators$ClientTlsHandler.userEventTriggered(ProtocolNegotiators.java:373)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:341)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:327) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:319) 
at io.grpc.netty.shaded.io.netty.handler.ssl.SslUtils.handleHandshakeFailure(SslUtils.java:348) 
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.setHandshakeFailure(SslHandler.java:1792) 
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1073) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236) 
at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1403) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) 
at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912) 
at io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:827) 
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) 
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405) 
at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:338) 
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) 
at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) 
at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_92]
Caused by: java.nio.channels.ClosedChannelException
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.channelInactive(...)(Unknown Source) 
```

The load is probably not very high (~40 RPS), but most of these requests are server-streaming calls that transmit a fairly large amount of data, so each request takes a considerable amount of time to complete.
Are we hitting some kind of limitation of our server? I thought it could be `MAX_CONCURRENT_STREAMS`, but I'm not sure whether `MAX_CONCURRENT_STREAMS` applies to a single client's connection to the server or to the server's connections overall.

We are using grpc-netty-shaded (1.27.1).
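
For reference, here is a minimal sketch of the kind of server setup I mean; the port, limit, and `service` are placeholders rather than our actual configuration, and if I'm reading the `NettyServerBuilder` javadoc right this knob applies per connection:
```
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;

class MaxStreamsSketch {
  // Placeholder port/limit; "service" stands in for our real service implementation.
  static Server build(BindableService service) {
    return NettyServerBuilder.forPort(8443)
        // Limits concurrent RPCs per incoming connection (the HTTP/2
        // MAX_CONCURRENT_STREAMS setting advertised to that client).
        .maxConcurrentCallsPerConnection(100)
        .addService(service)
        .build();
  }
}
```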


Eric Gribkoff

Jul 30, 2020, 1:25:28 AM
to Ethan Cheng, grpc.io
The first stack trace's cause is truncated, so I'm not sure what happened there, but the second includes "at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.setHandshakeFailure(SslHandler.java:1792)" after "io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1073)", which I believe indicates that the channel, for an unknown (possibly truncated in the stack trace?) reason, went inactive during the TLS handshake. As for why this would happen under load: just taking a guess, but is it possible your server is performing a blocking operation in the Netty event loop? I don't think a MAX_CONCURRENT_STREAMS-related capacity issue would result in this stack trace, as the connection should be more or less cleanly rejected rather than a connected channel going inactive mid-handshake as the provided stack trace suggests.
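
To make that guess concrete (this is a generic sketch, not your configuration): with `directExecutor()`, gRPC runs application callbacks directly on the Netty event loop threads, so a handler that blocks can stall everything served by that loop, including TLS handshakes for new connections.
```
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;

class DirectExecutorSketch {
  // Generic illustration with a placeholder port and service: handler code here shares
  // the event loop threads, so a blocking call in a handler (I/O, locks, slow work)
  // also blocks connection handling on that loop.
  static Server build(BindableService service) {
    return NettyServerBuilder.forPort(8443)
        .directExecutor()
        .addService(service)
        .build();
  }
}
```
If something along those lines is happening, moving the blocking work onto the server's application executor is the usual fix.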

Thanks,

Eric


Ethan Cheng

Jul 30, 2020, 4:34:08 PM
to grpc.io
Hi Eric, thanks for looking into this.
I agree that this has nothing to do with `MAX_CONCURRENT_STREAMS`, and I can see why you're thinking about blocking in Netty's event loop. Since a lot of our requests are long-running streaming requests that transmit a lot of data, is it possible that they are using most of the threads in the event loop, leading to a kind of thread starvation for the connections coming from the troublesome clients? Those connections would then stay idle past `DEFAULT_SERVER_KEEPALIVE_TIMEOUT_NANOS` and so become inactive.
If this is the situation, should we customize the thread pool in Netty?
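
For reference, these are the keepalive settings I'm referring to on the shaded Netty builder; the port and values below are purely illustrative (we run the defaults), I just want to confirm these are the right knobs:
```
import java.util.concurrent.TimeUnit;
import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;

class KeepaliveSketch {
  // Illustrative values only; setting these overrides the DEFAULT_SERVER_KEEPALIVE_* defaults.
  static NettyServerBuilder configure() {
    return NettyServerBuilder.forPort(8443)
        // idle time on a connection before the server sends a keepalive ping
        .keepAliveTime(2, TimeUnit.HOURS)
        // how long the server waits for the ping ack before closing the connection
        .keepAliveTimeout(20, TimeUnit.SECONDS);
  }
}
```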

Eric Gribkoff

Jul 31, 2020, 3:21:43 PM
to Ethan Cheng, grpc.io
Hi Ethan,

That sounds like a reasonable hypothesis (although I should caveat this with: I'm far from an expert on gRPC Netty server best practices). Did you observe any errors or warnings in the server-side logs when this was occurring? By default, a gRPC Netty server uses a cached thread pool; the general recommendation (e.g., here and here) is to specify a fixed thread pool instead via `ServerBuilder.executor`. If the cached thread pool running your server application code has spun up lots of long-running threads, it may starve other threads, leaving the Netty event loops (see https://groups.google.com/g/grpc-io/c/LrnAbWFozb0) that handle incoming connections unable to run. Based on the recommendations for using a fixed thread pool executor instead, I would suggest trying that, making sure the fixed thread pool is capped at a number of threads that won't prevent the CPU from running other threads as well.
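
A rough sketch of what I mean; the port, service, and pool size are placeholders, not a recommendation for your workload:
```
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.ServerBuilder;

class FixedPoolSketch {
  static Server start(BindableService service) throws IOException {
    // Bound the application executor so long-running handlers can't keep spawning
    // threads and starve the Netty event loops of CPU; 2 x cores is only a placeholder.
    int threads = Runtime.getRuntime().availableProcessors() * 2;
    ExecutorService appExecutor = Executors.newFixedThreadPool(threads);
    return ServerBuilder.forPort(8443)
        .executor(appExecutor)
        .addService(service)
        .build()
        .start();
  }
}
```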

Thanks,

Eric

Ethan Cheng

Aug 3, 2020, 12:57:18 PM
to grpc.io
Hi Eric,
   Thanks for following up. 

   Yes, we have been using a custom application/server thread pool, which is a fixed-size pool with a core size of 8 * CPU cores. (Do you think the 8 is too high, so that it may prevent the CPU from running other threads, as you mentioned?) And we use the default bossEventLoopGroup and workerEventLoopGroup, which I believe default to 2 * CPU cores.
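
  For reference, this is roughly what that configuration amounts to; the port and `service` are placeholders, we currently rely on gRPC's defaults for the boss/worker groups (the explicit settings below only show how we would pin them if we decide to customize), and the NIO classes are shown for portability even though our servers run the epoll transport:
```
import java.util.concurrent.Executors;
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;
import io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoopGroup;
import io.grpc.netty.shaded.io.netty.channel.socket.nio.NioServerSocketChannel;

class ThreadingSketch {
  static Server build(BindableService service) {
    int cores = Runtime.getRuntime().availableProcessors();
    return NettyServerBuilder.forPort(8443)
        // Application executor: fixed pool of 8 x cores, as in our current setup.
        .executor(Executors.newFixedThreadPool(8 * cores))
        // Explicitly sized boss/worker groups (we currently use gRPC's defaults instead).
        .bossEventLoopGroup(new NioEventLoopGroup(1))
        .workerEventLoopGroup(new NioEventLoopGroup(2 * cores))
        .channelType(NioServerSocketChannel.class)
        .addService(service)
        .build();
  }
}
```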
   
  Also, to answer your earlier question about whether we saw anything interesting in the server logs: we did not. It seems the requests coming from these clients never reached the gRPC layer; they were somehow rejected/ignored at the Netty layer already, as our metrics/logging interceptors did not catch any trace of them at all.

  We now have a rate limiter in place and a better retry policy on the clients, so this should prevent it from happening again, but the exact reason why this particular incident happened is still a bit mysterious, even with the hypothesis.