[grpc-java] All RPCs failed with DEADLINE_EXCEEDED


bira...@gmail.com

Sep 25, 2017, 6:19:14 PM
to grpc.io
Language: java
Version: 1.5

I've run into a weird issue multiple times recently: all RPCs from one client failed with DEADLINE_EXCEEDED. From the server log, it looks like these failed requests never arrived at the server at all. Other gRPC clients worked fine during that time. The issue was fixed by restarting the client application.
The keep-alive feature is not used in the client. From my understanding, the client channel should manage the underlying connection properly even when keep-alive is off.
The issue happens occasionally and I haven't found a reliable way to reproduce it.
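
For illustration, this is roughly what enabling client-side keepalive looks like on a Netty channel (a sketch with a placeholder address and timing values, not the configuration used here), so the client can notice a dead connection even without application traffic:

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class KeepAliveChannel {
  public static ManagedChannel create() {
    return NettyChannelBuilder.forTarget("example.com:443")
        .keepAliveTime(1, TimeUnit.MINUTES)      // send a keepalive ping after 1 minute of inactivity
        .keepAliveTimeout(20, TimeUnit.SECONDS)  // treat the connection as dead if the ping is not acked in 20 seconds
        .keepAliveWithoutCalls(true)             // keep pinging even while no RPCs are active
        .build();
  }
}

As it turns out later in the thread, keepalive alone did not make the problem go away, so this only helps with detecting dead connections.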

Carl Mastrangelo

Sep 25, 2017, 7:58:36 PM
to grpc.io
What deadline are you using?  Sometimes if DNS resolution fails, it won't try to re-resolve for a minute or so, causing failures to occur.
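
For reference, the deadline here is the per-call time limit. A minimal illustration (not code from this thread; it reuses the Greeter service from the reproduction code posted later and a placeholder channel) would be:

import io.grpc.Channel;
import java.util.concurrent.TimeUnit;

public class DeadlineExample {
  // Illustration only: the deadline clock starts when withDeadlineAfter(...) is invoked,
  // so it is applied to the stub right before each call. The 10-minute value matches
  // what the original poster reports below.
  static HelloReply callWithDeadline(Channel channel) {
    return GreeterGrpc.newBlockingStub(channel)
        .withDeadlineAfter(10, TimeUnit.MINUTES)
        .sayHello(HelloRequest.getDefaultInstance());
  }
}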

bira...@gmail.com

Sep 25, 2017, 8:27:23 PM
to grpc.io
My deadline is 10 minutes, and the RPC failures can last for hours.
The IP address of the server has not changed, to my knowledge.

Jorg Heymans

Sep 27, 2017, 7:20:03 AM
to grpc.io
Happy to see this being posted. We are running into this as well; it happens once every couple of weeks and we have no idea why. Indeed, on the server side we don't see anything in the logs, as if nothing is going on. When it happens we restart the services (not the client application), and that makes the problem go away. The client does not seem to be able to recover from it on its own. It seems to be an issue only in production, presumably because the client applications and services do not get restarted as much as in test?

We have a ZooKeeper-based NameResolver, and client deadlines are set to 10 seconds. Are you suggesting we should raise this above 1 minute to get around any resolution failures?

Thanks,
Jorg

bira...@gmail.com

Sep 27, 2017, 1:06:29 PM
to grpc.io
Another thing I forgot to mention: the issue started happening after a periodic health-check gRPC call was removed, which makes me feel it is related to internal connection management.
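
For context, a rough sketch of that kind of periodic health-check call (the actual service and interval are not given in the thread; the Greeter stub below is borrowed from the reproduction code posted later):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ChannelPinger {
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

  public void start(final GreeterGrpc.GreeterBlockingStub stub) {
    scheduler.scheduleAtFixedRate(() -> {
      try {
        // Any cheap unary call works; its only purpose is to keep traffic flowing on the channel.
        stub.withDeadlineAfter(5, TimeUnit.SECONDS).sayHello(HelloRequest.getDefaultInstance());
      } catch (RuntimeException e) {
        // Swallow and let the next tick try again.
      }
    }, 0, 1, TimeUnit.MINUTES);
  }
}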

Kun Zhang

Sep 28, 2017, 7:08:22 PM
to grpc.io


On Wednesday, September 27, 2017 at 4:20:03 AM UTC-7, Jorg Heymans wrote:
Happy to see this being posted. We are running into this as well; it happens once every couple of weeks and we have no idea why. Indeed, on the server side we don't see anything in the logs, as if nothing is going on. When it happens we restart the services (not the client application), and that makes the problem go away. The client does not seem to be able to recover from it on its own. It seems to be an issue only in production, presumably because the client applications and services do not get restarted as much as in test?

Can you clarify this part? Did the client recover or not after you restarted the services (the servers?)? If not, what does "it makes the problem go away" mean?

Jorg Heymans

Sep 29, 2017, 3:15:56 AM
to grpc.io

On Friday, September 29, 2017 at 1:08:22 AM UTC+2, Kun Zhang wrote:


Can you clarify this part? Did the client recover or not after you restarted the services (the servers?)? If not, what does "it makes the problem go away" mean?

Sorry for the confusion. The client is able to successfully make gRPC calls again once the server is restarted. Without a server restart, the client stays in this error state indefinitely. Hope this is clearer.

Jorg

Jorg Heymans

Oct 10, 2017, 3:08:34 AM
to grpc.io
Hi,

Don't want to make this a bigger deal than it really is, but can we conclude from this thread that health checks are mandatory to avoid this problem? Or is there something that can be done internally with regard to retrying a failed DNS lookup?

Thanks,
Jorg

Taehyun Park

Oct 11, 2017, 10:57:28 AM
to grpc.io
This is what I did to avoid this problem in production: I wrapped all gRPC calls with RxJava and used retry to re-initialize the channel when DEADLINE_EXCEEDED is thrown.
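
Roughly like this, as a sketch of that workaround rather than the actual production code (assuming RxJava 2 and, again, the Greeter service from the reproduction code later in the thread; newChannel(), the target, and the retry count are placeholders):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import io.reactivex.Single;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class RetryingGreeterClient {
  private final AtomicReference<ManagedChannel> channelRef = new AtomicReference<>(newChannel());

  private static ManagedChannel newChannel() {
    return ManagedChannelBuilder.forTarget("localhost:12345").usePlaintext(true).build();
  }

  public Single<HelloReply> sayHello(final HelloRequest request) {
    return Single.fromCallable(() ->
            GreeterGrpc.newBlockingStub(channelRef.get())
                .withDeadlineAfter(10, TimeUnit.SECONDS)
                .sayHello(request))
        .retry(3, t -> {
          if (t instanceof StatusRuntimeException
              && ((StatusRuntimeException) t).getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
            // Throw away the (possibly wedged) channel and build a fresh one before retrying.
            ManagedChannel old = channelRef.getAndSet(newChannel());
            old.shutdownNow();
            return true;
          }
          return false;
        });
  }
}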

bira...@gmail.com

Oct 18, 2017, 6:40:19 PM
to grpc.io
The problem happened to me again, even with keep-alive enabled.
netstat suggests that the underlying TCP connection is established.
Client thread dump follows:
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f1624000038> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:572)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:120)

All server threads are idle. It is likely the server receives no requests.

Carl Mastrangelo

Oct 19, 2017, 1:31:34 PM
to grpc.io
Is that the only thread?  Also, are you using waitForReady?

Bi Ran

Oct 19, 2017, 2:09:26 PM
to Carl Mastrangelo, grpc.io
The client automatically recovered after a few hours.
This is the only thread that is relevant.
I did not explicitly set waitForReady, and I think it is disabled by default.
I can provide more thread dumps when it happens again. Let me know if any information other than the thread dump is needed.
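
For reference, wait-for-ready is indeed off by default: in the default fail-fast mode an RPC fails immediately when the channel is in a transient failure state, while with wait-for-ready it keeps waiting (up to the deadline). Enabling it, purely as an illustration with a placeholder channel, is a per-stub option:

// Illustration only, not code from this thread.
GreeterGrpc.GreeterBlockingStub waitingStub =
    GreeterGrpc.newBlockingStub(channel).withWaitForReady();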



bira...@gmail.com

Oct 30, 2017, 5:00:59 PM
to grpc.io
Some findings to share, though I'm not sure if they are related:
1. We use a custom gRPC name resolver, and the refresh() method is not overridden (the default is a no-op).
2. In the past occurrences, before DEADLINE_EXCEEDED starts happening, there is ALWAYS a 30-minute gap between the last and the second-to-last successful gRPC calls from the broken client.

30 minutes happens to be the default connection idle timeout, which is what we use. Not sure if it is related though.
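
To make finding #1 concrete, here is a trimmed example resolver (not the actual custom/ZooKeeper-based one) where refresh() re-runs resolution instead of inheriting the no-op default, which is the point Carl makes below; listener and error handling are simplified:

import io.grpc.Attributes;
import io.grpc.EquivalentAddressGroup;
import io.grpc.NameResolver;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

public class StaticListNameResolver extends NameResolver {
  private final List<InetSocketAddress> backends;
  private Listener listener;

  public StaticListNameResolver(List<InetSocketAddress> backends) {
    this.backends = backends;
  }

  @Override
  public String getServiceAuthority() {
    return "example";
  }

  @Override
  public void start(Listener listener) {
    this.listener = listener;
    resolve();
  }

  @Override
  public void refresh() {
    // The inherited default is a no-op; re-resolving here gives the channel fresh
    // addresses whenever it asks for them (for example after a transport failure).
    resolve();
  }

  private void resolve() {
    List<EquivalentAddressGroup> servers = new ArrayList<>();
    for (InetSocketAddress address : backends) {
      servers.add(new EquivalentAddressGroup(address));
    }
    listener.onAddresses(servers, Attributes.EMPTY);
  }

  @Override
  public void shutdown() {
    // Nothing to clean up in this sketch.
  }
}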

Carl Mastrangelo

Nov 1, 2017, 5:08:25 PM
to grpc.io
All the methods should be overridden. The 30-minute gap is likely the connection idle timeout.

bira...@gmail.com

Nov 2, 2017, 12:36:11 AM
to grpc.io
I think I have a theory that explains what happened.
This issue is caused by https://github.com/grpc/grpc-java/issues/3545.
In grpc 1.5 there is a race condition: when the channel goes idle, the load balancer is shut down and the sub-channels are scheduled to shut down in 5 seconds. If an RPC arrives and succeeds within these 5 seconds, a new load balancer and sub-channel are created.
However, when the old sub-channel shuts down, the old load balancer is notified about the state change and updates the channel picker with an empty list of channels.
Now, if another RPC is made before the channel goes idle again, that RPC will be buffered in the delayed transport forever.
Following code reproduces the scenario above:

public static void main(String[] args) throws Exception {
  ServerBuilder.forPort(12345)
      .addService(new GreeterImpl().bindService())
      .build()
      .start();
  Channel channel = NettyChannelBuilder.forTarget("localhost:12345")
      .idleTimeout(1, TimeUnit.SECONDS)
      .negotiationType(NegotiationType.PLAINTEXT)
      .loadBalancerFactory(RoundRobinLoadBalancerFactory.getInstance())
      .usePlaintext(true)
      .build();
  GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
  stub.sayHello(HelloRequest.getDefaultInstance());
  Thread.sleep(5500);  // idle mode timer runs after 1 sec, sub-channel will shut down after 6 sec
  stub.sayHello(HelloRequest.getDefaultInstance());  // connection re-established and the RPC succeeds
  Thread.sleep(600);  // wait for the old sub-channel to shut down; the bad channel picker gets set
  stub.sayHello(HelloRequest.getDefaultInstance());  // an RPC made before the channel goes idle again never returns
}

The bug is fixed in 1.7, because a later change protects the channel picker from being updated by the load balancer after shutdown.
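
For anyone who cannot upgrade right away, one possible stop-gap (my own suggestion, not something verified in this thread) is to keep the channel from cycling through idle mode at all, since the race is triggered around the idle transition:

import io.grpc.Channel;
import io.grpc.netty.NegotiationType;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class PreFixWorkaround {
  public static Channel create() {
    return NettyChannelBuilder.forTarget("localhost:12345")
        // A very large idle timeout so the channel (almost) never enters idle mode and the
        // shutdown race described above is not hit. Assumes the client talks to the server
        // regularly enough that holding the connection open is acceptable.
        .idleTimeout(30, TimeUnit.DAYS)
        .keepAliveTime(1, TimeUnit.MINUTES)       // keep the long-lived connection healthy
        .keepAliveWithoutCalls(true)
        .negotiationType(NegotiationType.PLAINTEXT)
        .build();
  }
}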