Problems with DNS updates at low traffic volume

Mark Robinson

Apr 18, 2023, 1:23:22 PM
to grpc.io
Hi,

I have a problem where two services are communicating, but the client won't update its endpoint connection information when DNS changes.

Very briefly, the architecture is

http -> [AWS LB] -> [Service A] -> (grpc) [AWS ELB] -> [Service B]

When I change the DNS record for service B, with the intent of sending traffic through a physically different ELB, service A won't switch to the new IP address at low traffic volume.

If I increase the traffic volume, it switches fairly quickly, on the order of minutes. However, if I keep the traffic volume low (<2 RPS/pod), it sticks to the old connection for hours, if not forever. The only solution I've found is to restart all of service A, but that's not a great solution.

Does anyone know what might be going on here?

Mark

Antoine Tollenaere

Apr 19, 2023, 5:46:21 AM
to Mark Robinson, grpc.io
At least in Java, Go, and C, the DNS resolver refreshes when an underlying transport (the TCP connection) is closed. The transport can be closed for a variety of reasons. Most likely in your scenario, the traffic volume has an influence on whether connections are being closed on the AWS ELB side, which triggers DNS re-resolution in the high-volume case.

One way to force periodic DNS re-resolution from the client side is to set the maximum connection age. In C it's done via the channel options GRPC_ARG_MAX_CONNECTION_AGE_MS and GRPC_ARG_MAX_CONNECTION_AGE_GRACE_MS documented here:


There are equivalents for other languages. This will cause a bit of connection churn, so you probably don't want to set the maximum age too low. Another option would be to implement regular connection closing on the server side, in your AWS ELB configuration -- I'm not sure if AWS ELBs provide that as an option.
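
For what it's worth, here's a rough sketch of the Go equivalent, assuming grpc-go, where these limits are configured on the server via the keepalive package. The durations below are arbitrary placeholders, not recommendations:

// Sketch: force clients to reconnect (and re-resolve DNS) by capping
// connection age on the gRPC server. Durations are placeholders.
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionAge:      10 * time.Minute, // close connections after ~10 minutes
			MaxConnectionAgeGrace: 1 * time.Minute,  // let in-flight RPCs finish before forcing the close
		}),
	)
	// Register your services on srv here, then serve as usual.
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		panic(err)
	}
	_ = srv.Serve(lis)
}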

Hope this helps,
Antoine.

Mark Robinson

Apr 21, 2023, 2:00:57 PM
to Antoine Tollenaere, grpc.io
Thanks for the tips.

After digging into it, the problem is being caused by the nginx ingress controller. Nginx terminates the gRPC connections and then load-balances requests to the pods underneath it. So while nginx respects the server's connection timeout, it doesn't forward the close to the client to make it reset its connection, and the client just hangs on to an open connection forever. There's a (closed) bug in the ingress-nginx tracker about this exact problem: https://github.com/kubernetes/ingress-nginx/issues/4402
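
In the meantime, a lighter-weight workaround than restarting all of service A might be to periodically rebuild the client connection so the dns:/// resolver re-resolves even though nginx never closes its side. This is only a rough sketch under that assumption, not something taken from the ingress-nginx issue; the address, interval, and helper names are made up:

// Hypothetical sketch: swap in a freshly dialed *grpc.ClientConn every
// maxAge so the client re-resolves DNS, since nginx never closes the
// old connection from its side. The address and interval are placeholders.
package client

import (
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

type reconnectingClient struct {
	mu   sync.RWMutex
	conn *grpc.ClientConn
	addr string
}

func newReconnectingClient(addr string, maxAge time.Duration) (*reconnectingClient, error) {
	c := &reconnectingClient{addr: addr}
	if err := c.redial(); err != nil {
		return nil, err
	}
	go func() {
		for range time.Tick(maxAge) {
			if err := c.redial(); err != nil {
				log.Printf("redial failed: %v", err)
			}
		}
	}()
	return c, nil
}

func (c *reconnectingClient) redial() error {
	conn, err := grpc.Dial("dns:///"+c.addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	c.mu.Lock()
	old := c.conn
	c.conn = conn
	c.mu.Unlock()
	if old != nil {
		old.Close() // note: this aborts RPCs still running on the old connection
	}
	return nil
}

// Conn returns the current connection; fetch it for each stub call.
func (c *reconnectingClient) Conn() *grpc.ClientConn {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.conn
}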

Mark
