grpc-java throwing unavailable exception after ~2 billion requests


Erik Gorset

May 6, 2016, 4:49:55 AM
to grp...@googlegroups.com
Hi,

It looks like grpc-java is calling Netty's incrementAndGetNextStreamId [0], which returns an int. Does this really mean that gRPC only supports 2^31 requests per channel?

What’s the recommended way of dealing with this? I’m hoping for a better answer than “restart the channel or process often enough”. I’m happy to create a GitHub issue if this can be seen as a bug and not a known limitation. Do the other gRPC implementations have the same limitation?

The background for my question is that we had an outage caused by this limitation: multiple replicas got shut down within a short period because we had started them at roughly the same time...

2016-05-06 06:17:42,721 ERROR [pool-7-thread-13] c.g.c.u.c.UncaughtExceptionHandlers$Exiter - Caught an exception in Thread[pool-7-thread-13,5,main]. Shutting down.
io.grpc.StatusRuntimeException: UNAVAILABLE: Stream IDs have been exhausted
at io.grpc.Status.asRuntimeException(Status.java:431) ~[grpc-core.jar:0.13.2]
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:208) ~[grpc-stub.jar:0.13.2]
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:141) ~[grpc-stub.jar:0.13.2]

[0] http://netty.io/4.1/api/io/netty/handler/codec/http2/Http2Connection.Endpoint.html#incrementAndGetNextStreamId()


Erik Cysneiros Gorset

Josh Humphries

May 6, 2016, 9:53:09 AM
to Erik Gorset, grpc-io
The limit is per transport (e.g. actual socket connection), not channel. And it's a limit of the HTTP/2 spec (https://tools.ietf.org/html/rfc7540#section-5.1.1). I think this is a bug in the gRPC channel code: it should recycle the transport when it reaches (or gets close to) this limit.

Also, the limit is actually only ~1 billion streams because IDs for client-initiated streams (all streams in gRPC) must be odd.
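
For a concrete sense of that number, here is a back-of-the-envelope snippet (illustrative only, not from the thread): HTTP/2 stream IDs are at most 2^31 - 1, and client-initiated streams must use odd IDs, which leaves roughly 2^30 usable IDs per connection.

// Back-of-the-envelope check of the per-connection stream limit.
// Client-initiated streams (all streams in gRPC) use odd IDs only.
public class StreamIdLimit {
  public static void main(String[] args) {
    long maxStreamId = (1L << 31) - 1;       // 2_147_483_647
    long clientIds = (maxStreamId + 1) / 2;  // odd IDs in [1, 2^31 - 1] = 1_073_741_824
    System.out.println("Client streams per HTTP/2 connection: " + clientIds);
  }
}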


----
Josh Humphries
Payments Engineering
Atlanta, GA  |  678-400-4867



Eric Anderson

May 6, 2016, 12:25:48 PM
to Erik Gorset, grpc-io
On Fri, May 6, 2016 at 1:49 AM, Erik Gorset <erik....@cxense.com> wrote:
It looks like grpc-java is calling Netty's incrementAndGetNextStreamId [0], which returns an int. Does this really mean that gRPC only supports 2^31 requests per channel?

As Josh said, the limit is per transport, not per channel. There is already code intended to swap to a new transport, but maybe it is buggy/suboptimal.

I’m happy to create a GitHub issue if this can be seen as a bug and not a known limitation.

Please make an issue. This is a bug.

UNAVAILABLE: Stream IDs have been exhausted

That status appears to be coming from here. The behavior then seems to be that that particular RPC will fail but future RPCs should start going to a new transport. That alone is suboptimal but not too bad; a transient failure of 1 out of 2^30 RPCs should be recoverable by applications, otherwise they are probably going to have a bad time from other failures. However, it won't necessarily be only 1 RPC that fails, since it will take a small amount of time to divert traffic to a new transport, and all RPCs during that time would fail. It'd be good to address that.
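
To make the "recoverable by applications" point concrete, here is a minimal client-side retry sketch (this helper is not part of grpc-java; the name and attempt count are made up for illustration). It retries a blocking call a few times when it fails with UNAVAILABLE, for example while the channel is moving to a fresh transport, and rethrows everything else immediately.

import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.function.Supplier;

public final class UnavailableRetry {
  // Retry the given blocking RPC up to maxAttempts times, but only when it
  // fails with UNAVAILABLE; any other status is rethrown immediately.
  public static <T> T callWithRetry(Supplier<T> rpc, int maxAttempts) {
    StatusRuntimeException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return rpc.get();
      } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() != Status.Code.UNAVAILABLE) {
          throw e;
        }
        last = e;
      }
    }
    throw last;
  }
}

Usage would be something like callWithRetry(() -> stub.someMethod(request), 3), where stub and someMethod stand in for an actual generated blocking stub and method.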

However, I think the larger problem is that calling close doesn't trigger things quickly enough, especially if you have long-lived streams, since it delays until all the RPCs on that transport are complete. There is no upper-bound on how long a stream could live, so a Channel could be broken for quite some time.

The background for my question is that we had an outage caused by the limitation

If your RPCs are short-lived and my analysis is correct, I wouldn't expect an outage, but instead a temporary failure. Is the lifetime of some of your RPCs long? If so, then I think that would help confirm my theory.

Erik Gorset

May 9, 2016, 8:07:05 AM
to Eric Anderson, grpc-io

Thanks for the details!

The production system in question is a long-lived client/server with only short-lived RPCs and, thanks to its simplicity, close to zero temporary failures, so our error handling is expensive: a full restart if something unexpected happens. Most of our systems are not like this, and we generally deploy/upgrade processes often enough that we would never reach the stream-ID limit for a single channel/transport. Health checks will also ensure that broken channels are not used.
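
If recycling rather than restarting is an option, the "restart the channel often enough" idea from the first message can also be done in code. A rough sketch (the address and the recycle threshold are made up for illustration, and the swap below is not strictly race-free):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public final class RecyclingChannel {
  // Rotate channels well before any single connection could approach the
  // HTTP/2 limit of ~2^30 client-initiated streams.
  private static final long RECYCLE_AFTER_CALLS = 100_000_000L;

  private final AtomicLong calls = new AtomicLong();
  private volatile ManagedChannel channel = newChannel();

  private static ManagedChannel newChannel() {
    return ManagedChannelBuilder.forAddress("service.example.internal", 443).build();
  }

  public ManagedChannel get() {
    if (calls.incrementAndGet() % RECYCLE_AFTER_CALLS == 0) {
      ManagedChannel old = channel;
      channel = newChannel();
      old.shutdown();  // in-flight RPCs on the old channel are allowed to finish
    }
    return channel;
  }

  public void close() throws InterruptedException {
    channel.shutdown().awaitTermination(30, TimeUnit.SECONDS);
  }
}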

I wrote up a couple of suggestions in the GitHub issue - feel free to change/delete them as you see fit.

— 
Erik Cysneiros Gorset

alan....@gmail.com

Jul 26, 2017, 10:53:59 PM
to grpc.io, erik....@cxense.com
I encountered this problem twice, with a runtime exception like the one below.
The first time was with the Cloud Speech API Beta, after the system had been running for about three weeks and served roughly ten thousand requests.
The second time was with the Cloud Speech API GA, after less than an hour and about 100 requests (on exactly the same system).
Both times the system could only be recovered by restarting the application.
I suspect this issue might not need 2 billion requests to happen.
Is there a more elegant way to recover than killing the application and restarting it?

2017-07-27 08:30:19,794 [DEBUG] [r-ELG-49-2] verification of certificate failed
java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    at sun.security.validator.PKIXValidator.<init>(Unknown Source)
    at sun.security.validator.Validator.getInstance(Unknown Source)
    at sun.security.ssl.X509TrustManagerImpl.getValidator(Unknown Source)
    at sun.security.ssl.X509TrustManagerImpl.checkTrustedInit(Unknown Source)
    at sun.security.ssl.X509TrustManagerImpl.checkTrusted(Unknown Source)
    at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(Unknown Source)
    at io.netty.handler.ssl.ReferenceCountedOpenSslClientContext$ExtendedTrustManagerVerifyCallback.verify(ReferenceCountedOpenSslClientContext.java:223)
    at io.netty.handler.ssl.ReferenceCountedOpenSslContext$AbstractCertificateVerifier.verify(ReferenceCountedOpenSslContext.java:606)
    at org.apache.tomcat.jni.SSL.readFromSSL(Native Method)
    at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.readPlaintextData(ReferenceCountedOpenSslEngine.java:470)
    at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:927)
    at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1033)
    at io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:200)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1117)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1039)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:565)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:479)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    at java.security.cert.PKIXParameters.setTrustAnchors(Unknown Source)
    at java.security.cert.PKIXParameters.<init>(Unknown Source)
    at java.security.cert.PKIXBuilderParameters.<init>(Unknown Source)
    ... 32 more


and

java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: Channel closed while performing protocol negotiation




