grpc-java DnsNameResolver behavior (Kubernetes pod failing scenario behind Kube DNS)

Yee-Ning Cheng

Jan 15, 2019, 5:36:39 PM
to grpc.io
Hi,

I have a gRPC client using the default DnsNameResolver and RoundRobinLoadBalancer, connected to gRPC servers on Kubernetes through the Kube DNS endpoint.  The servers are deployed as Kube pods and may fail.  I see that when a pod fails, onStateChange gets called and the DnsNameResolver refreshes.  The problem is that the new pod spun up in the old pod's place is not yet up at the moment the resolver refreshes, so the refreshed addresses do not include it, and the client never sees or connects to the new pod.
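For reference, the channel is built roughly like this (the service name and port are placeholders, and defaultLoadBalancingPolicy assumes a recent grpc-java):

import io.grpc.ManagedChannelBuilder

// Placeholder target: a headless Kubernetes service resolved via Kube DNS.
val channel = ManagedChannelBuilder
  .forTarget("dns:///my-service.default.svc.cluster.local:8080")
  .defaultLoadBalancingPolicy("round_robin") // RoundRobinLoadBalancer
  .usePlaintext()
  .build()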

Is there a configuration I am missing or is there a way to refresh the resolver on a scheduled timer?

Thanks,

Yee-Ning

Kun Zhang

Jan 17, 2019, 2:04:45 PM
to grpc.io
Even though the first DNS refresh is too early to notice the new address, as long as the old address is still returned, RoundRobin will continue trying to connect to the old address (subject to exponential back-off of Subchannel reconnections). Of course it will fail, but whenever it does, a new DNS refresh will be triggered. Eventually you will get the new address.
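If waiting out the back-off is too slow for your failover needs, you can short-circuit it on a timer. A rough sketch (resetConnectBackoff() is an experimental ManagedChannel API, and the 30-second period is arbitrary):

import java.util.concurrent.{Executors, TimeUnit}
import io.grpc.ManagedChannel

// Periodically cancel the subchannel back-off so failed addresses are
// retried (and, on failure, DNS re-resolved) sooner.
def scheduleBackoffReset(channel: ManagedChannel): Unit = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(
    new Runnable { def run(): Unit = channel.resetConnectBackoff() },
    30, 30, TimeUnit.SECONDS)
}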

If you have waited long enough and still not seen the new address, it may be due to the TTL of the DNS record or, more likely, the JVM's DNS caching.

Yee-Ning Cheng

Jan 17, 2019, 4:47:51 PM
to grpc.io
What is the default exponential backoff configuration?

I used dig on the DNS record and it has a 30s TTL, so that does not seem to be the issue.

Regarding the JVM DNS caching, I tried setting a variety of properties, and none of them seemed to work.


I tried setting the following property right before my main, but it does not work.

object ClientDriver {

  // Runs in the object initializer, i.e. before main's body executes:
  // cache successful DNS lookups for at most 20 seconds.
  java.security.Security.setProperty("networkaddress.cache.ttl", "20")

  def main(args: Array[String]): Unit = {
    // Code
    ....
  }
}

I even tried setting the following system property, which didn't work either:

-Dsun.net.inetaddr.ttl=20

Kun Zhang

Jan 18, 2019, 7:10:32 PM
to grpc.io
The exponential back-off is implemented in grpc-java's io.grpc.internal.ExponentialBackoffPolicy; the defaults follow the gRPC connection back-off spec (1s initial back-off, 1.6 multiplier, 0.2 jitter, 120s cap).

You can enable FINE logging for io.grpc.internal.InternalSubchannel (1.16.x or older) or FINEST logging for io.grpc.ChannelLogger (1.17.x or newer), and see if reconnection is really happening.
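One way to turn that on programmatically with java.util.logging (a sketch; a logging.properties file works just as well):

import java.util.logging.{ConsoleHandler, Level, Logger}

// Route FINEST-level channel logs to the console (logger name per 1.17+).
val handler = new ConsoleHandler()
handler.setLevel(Level.FINEST)

// Keep a strong reference to the logger so the level setting isn't lost to GC.
val grpcLogger = Logger.getLogger("io.grpc.ChannelLogger")
grpcLogger.setLevel(Level.FINEST)
grpcLogger.addHandler(handler)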

What gRPC version are you on?