Rotating GRPC servers with DNS resolution

yun...@compass.com

Dec 9, 2016, 5:45:57 PM
to grpc.io, Alec Thomas, Luke Downey
I'm a bit unclear as to how name resolvers in GRPC work with load balancing in cases of rolling deploys. As far as I can tell, it seems like the most recently deployed server will end up with no traffic if we follow our current deploy strategy.

Our rolling deploy strategy, which we plan to adapt for GRPC, works as follows:

1. Build artifacts
2. De-register a server replica from its DNS name
3. Update the server and restart it
4. Re-register the server replica under its DNS name.

For the Java GRPC implementation, it looks like the GRPC name resolver does not refresh the list of IPs unless (1) there is an error in a previous resolve or (2) a server goes down. I believe the core implementation does the same thing, though I'm not familiar enough with C to really tell.

What I believe will happen during a rolling deploy is:

1. Before deploy: Client is talking to N nodes
2. A server is removed from DNS, nothing happens on the client
3. The server issues a GOAWAY frame to clients. The client removes the server from its list of connections, and resolves a new list of servers, finding any newly added servers
4. The server is restarted and added to the DNS
5. Repeat for all other servers in the server set
6. After deploy: Client is talking to N-1 nodes and will never attempt to look for the last server to be restarted

Is my analysis correct? And if so, what is the recommended way to make sure the client ends up talking to all N servers after a rolling deploy?
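
To make the scenario concrete, here is a minimal Python sketch of the sequence above. It is a toy model, not the actual GRPC resolver code: it assumes the client connects to every address it resolves and re-resolves only when one of its connections drops. The addresses and the function name `rolling_deploy` are made up for illustration.

```python
def rolling_deploy(servers):
    """Simulate a client that re-resolves DNS only when a connection drops."""
    dns = set(servers)          # addresses currently registered under the name
    connected = set(servers)    # addresses the client holds connections to

    for server in servers:
        dns.discard(server)         # de-register the replica from DNS
        if server in connected:     # the restart drops the client's connection...
            connected = set(dns)    # ...which triggers the only re-resolve
        dns.add(server)             # re-register; the client is not told

    return dns, connected


dns, connected = rolling_deploy(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(sorted(dns - connected))  # servers the client never rediscovers
```

Under these assumptions the last server to be restarted (10.0.0.3 here) is back in DNS but missing from the client's connection set, which is the N-1 outcome described above.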

Carl Mastrangelo

Dec 13, 2016, 7:08:31 PM
to grpc.io, al...@compass.com, lu...@compass.com
A couple things:


* Is there any way to add a new server to the pool before turning down the previous replica? That would make you slightly over capacity during the rollout.
* The load balancer in Java works differently based on the strategy you use. In RR, the LB maintains a connection to each of the replicas, so that when one goes down the next one will be used. This means that as long as you have some in the pool, it will always go to the next connection. That said, if you are doing a rolling restart, this pool will get smaller and smaller until every connection has been killed and marked unusable. At that point I believe it will refresh the list.

Kun Zhang

Dec 13, 2016, 7:28:39 PM
to grpc.io, al...@compass.com, lu...@compass.com
The generalized version of the issue is that when you add a server, the client will never pick it up until an existing server goes down.
We could add periodic refresh to our DNS NameResolver. I have filed https://github.com/grpc/grpc-java/issues/2514
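
As a rough illustration of what a periodic refresh could look like, here is a minimal Python sketch. It is not the grpc-java NameResolver API and not a proposed design for the issue: the class `PeriodicResolver`, its `on_change` callback, and the 30-second default interval are all assumptions made up for this example. Only `socket.getaddrinfo` is a real library call.

```python
import socket
import threading


def resolve(host, port):
    """Return the set of IP addresses currently registered for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


class PeriodicResolver:
    """Re-resolve a name on a fixed interval and report membership changes."""

    def __init__(self, host, port, on_change, interval=30.0, resolve_fn=resolve):
        self._host, self._port = host, port
        self._on_change = on_change      # called with (added, removed) sets
        self._interval = interval        # seconds between refreshes (assumed value)
        self._resolve_fn = resolve_fn    # injectable so it can be faked in tests
        self._known = set()
        self._stop = threading.Event()

    def refresh(self):
        """Resolve once and notify the callback if the address set changed."""
        try:
            current = self._resolve_fn(self._host, self._port)
        except OSError:
            return  # keep the last known-good list if resolution fails
        added, removed = current - self._known, self._known - current
        self._known = current
        if added or removed:
            self._on_change(added, removed)

    def run(self):
        """Refresh immediately, then once per interval until stop() is called."""
        self.refresh()
        while not self._stop.wait(self._interval):
            self.refresh()

    def stop(self):
        self._stop.set()
```

A caller would typically start it with `threading.Thread(target=resolver.run, daemon=True).start()` and open or close connections inside `on_change`. With something like this, a newly added server is picked up within one interval even if no existing connection ever fails.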

Luke Tyler Downey

Dec 13, 2016, 8:30:41 PM
to Kun Zhang, grpc.io, al...@compass.com
Even better (for our use case in particular, but in general too probably) would be to actually respect the TTL in a DNS record.

Luke
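
A TTL-respecting variant could be sketched in Python roughly as follows. Note the big assumption: Python's standard `socket.getaddrinfo` does not expose record TTLs, so the `lookup` callable here is hypothetical and stands in for whatever DNS client can return both the addresses and the TTL. The class name `TtlCache` is likewise made up for illustration.

```python
import time


class TtlCache:
    """Cache a DNS answer and re-query only after its TTL has elapsed."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup   # hypothetical: returns (addresses, ttl_seconds)
        self._clock = clock     # injectable so tests can control time
        self._addresses = frozenset()
        self._expires_at = None

    def addresses(self):
        """Return the cached address set, refreshing it if the TTL expired."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            addresses, ttl = self._lookup()
            self._addresses = frozenset(addresses)
            self._expires_at = now + ttl
        return self._addresses
```

The appeal of this approach is that the refresh rate is controlled by whoever owns the DNS record rather than by a client-side constant.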

Yunchi Luo

Dec 13, 2016, 11:49:22 PM
to Luke Tyler Downey, Kun Zhang, grpc.io, al...@compass.com
I'm very happy to get so much feedback!

To Carl,

We don't have a container cluster management system like Kubernetes or Mesos running, so spinning up an additional instance would mean starting up a new machine, and that would add significant latency to our deploys.

We use RR. In RR, does the resolver server list refresh only happen when all server connections are down, or each time a single server goes down?

To Kun,

Thanks for filing the issue. We are planning to use grpc in Python, Go, and Java. We implemented our own Go dns resolver that does support periodic updates since the Go library seems to be missing dns resolvers entirely.

Do you know how the core implementation, which Python uses, behaves? Does the periodic refresh or TTL support need to be implemented in core as well?

Supporting rolling deploys and/or adding servers to a name and having clients pick them up periodically seems like a pretty critical feature for a production system. Is there a design for these cases documented somewhere?

To Luke and Kun,

I agree TTL seems like the most natural option since it's part of DNS. I think I saw an issue in core somewhere that mentioned DNS TTL but can't seem to dig it up.

Thanks!
Yunchi

Carl Mastrangelo

Dec 14, 2016, 5:18:24 PM
to grpc.io, zhan...@google.com, al...@compass.com, lu...@compass.com
I don't think the Java API provides the TTL info unfortunately.

Carl Mastrangelo

Dec 14, 2016, 5:22:31 PM
to grpc.io, lu...@compass.com, zhan...@google.com, al...@compass.com
response inline


On Tuesday, December 13, 2016 at 8:49:22 PM UTC-8, Yunchi Luo wrote:
We use RR. In RR, does the resolver server list refresh only happen when all server connections are down, or each time a single server goes down?

I believe it happens when a single connection goes down. FTR, a goaway is not considered going down. Assuming your server stops accepting connections, and THEN sends a goaway, this should be okay. Right now, when a single connection gets a goaway, we try to reconnect. If your server sends goaway after it stops calling accept, the client will:

1. Get a goaway
2. Try to reconnect
3. Get an unreachable error

But if the server sends goaway first, and only stops calling accept once its connections are gone, the client may be stuck in a loop:

1. Get a goaway
2. Try to reconnect
3. Get another goaway
4. Go to 1

Neither of these is very good. Thus, I think there is a bug in our reconnect code. The GitHub issue would be the right place to talk about solutions.
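
The two orderings can be traced with a small Python sketch. It only models the behavior as described above (the client always tries to reconnect after a goaway) and is not taken from the grpc-java reconnect code. The function name, the `max_attempts` cap, and the trace strings are made up for illustration.

```python
def client_after_goaway(server_accepting, max_attempts=5):
    """Trace a client that always tries to reconnect after a goaway.

    server_accepting: whether the draining server is still calling accept().
    max_attempts: simulation cap so the looping case terminates.
    """
    trace = ["goaway"]
    for _ in range(max_attempts):
        trace.append("reconnect")
        if not server_accepting:
            trace.append("unreachable")  # connect fails because nothing accepts
            return trace
        trace.append("goaway")           # draining server accepts, then sends goaway again
    trace.append("gave up (simulation limit)")
    return trace


print(client_after_goaway(server_accepting=False))
print(client_after_goaway(server_accepting=True, max_attempts=2))
```

In the first case the trace ends after one failed reconnect. In the second it only ends because of the simulation cap, which corresponds to the loop in the second list.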