grpc-go streaming: how to keep connections balanced while connecting to only one server at a time


Charles Ma

Sep 11, 2018, 7:25:11 PM
to grpc.io
Hey grpc users,

I recently migrated a streaming service to gRPC and have a couple of questions.

I have a small group of servers (8) discoverable via DNS serving a lot of clients (100k+) with a streaming API. I control the client code (a fat client library) but not the clients' lifecycle (restarting). I want to keep the number of clients roughly balanced across servers while each client maintains only one connection to a server. This means I can't use the roundrobin load balancer (or any custom Picker), since it creates one SubConn per address, which opens TCP connections to all 8 servers at once.

So far, I've implemented a rendezvous hashing algorithm in a custom DNS resolver (a rough sketch is at the end of this message) and used the pick_first LB policy, which connects to one peer at a time, ordered by priority. However, I ran into two issues:
1. If I restart one of the servers, that server will never get any clients unless the clients also restart. Before gRPC, we had a stream timeout that also timed out the connection, so after a certain interval the client would kill the connection and re-establish a new one in priority order. This let clients that were kicked off their preferred peer reconnect to it after a while. This behavior no longer works now that I've migrated to gRPC, because killing the stream doesn't terminate the underlying connection. Is there any way to do this with the grpc-go library?
2. Similarly, when one of our servers comes online there's a warm-up period while it loads data, so the stream handler returns an UNAVAILABLE error until the server is ready. But this doesn't kill the underlying gRPC connection, so the client retries the same server over and over until it has warmed up. Is there any way to mark a connection as failed based on the error a stream returns, so that the ClientConn/SubConn tries a different peer?

I'm trying to avoid re-implementing the low-level dialing, backoff, and reconnect logic that a gRPC SubConn and ClientConn already provide. But reading through the code, it doesn't look like I have much control over the lifecycle of the underlying connection, so it seems the only option is to grpc.Dial servers individually and manage the DNS peer list and redials outside of the load-balancer API. Is that right?
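For reference, the priority ordering in my resolver is roughly the following (a simplified sketch; the FNV hash and client-ID handling here are illustrative, not our actual resolver code):

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// rendezvousOrder sorts addresses by highest-random-weight for this
// client, so pick_first dials them in a stable per-client priority order.
func rendezvousOrder(clientID string, addrs []string) []string {
	out := append([]string(nil), addrs...)
	sort.Slice(out, func(i, j int) bool {
		return weight(clientID, out[i]) > weight(clientID, out[j])
	})
	return out
}

// weight hashes the (clientID, addr) pair; FNV-1a is illustrative.
func weight(clientID, addr string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(clientID))
	h.Write([]byte(addr))
	return h.Sum64()
}

func main() {
	addrs := []string{"s1:443", "s2:443", "s3:443"}
	fmt.Println(rendezvousOrder("client-42", addrs))
}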

Eric Anderson

Sep 12, 2018, 2:16:24 PM
to c...@uber.com, grpc-io
On Tue, Sep 11, 2018 at 4:25 PM 'Charles Ma' via grpc.io <grp...@googlegroups.com> wrote:
1. If I restart one of the servers, that server will never get any clients unless the clients also restart. Before gRPC, we had a stream timeout that also timed out the connection, so after a certain interval the client would kill the connection and re-establish a new one in priority order. This let clients that were kicked off their preferred peer reconnect to it after a while. This behavior no longer works now that I've migrated to gRPC, because killing the stream doesn't terminate the underlying connection. Is there any way to do this with the grpc-go library?

You probably want MaxConnectionAge. This is a server-side option that causes clients to reconnect (more precisely, to stop using the current connection for new RPCs), allowing them to make another LB decision. You specify it via keepalive.ServerParameters. MaxConnectionAgeGrace configures how long to wait before killing in-flight RPCs to force the connection closed.
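A minimal server-side sketch (the durations here are placeholders to tune for your traffic):

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// After this age the server signals the client to stop issuing
		// new RPCs on this connection, forcing a fresh LB decision.
		MaxConnectionAge: 30 * time.Minute,
		// How long to let in-flight RPCs finish before force-closing.
		MaxConnectionAgeGrace: 5 * time.Minute,
	}))
	// Register your services on srv here.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}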

You should still have the client periodically close the RPC stream. See A9-server-side-conn-mgt.md for some more details.
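A rough sketch of that client-side loop (runOneStream is a placeholder for your generated streaming call, not a real API):

package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// watchForever caps each stream's lifetime so the client periodically
// tears it down and opens a fresh one; combined with MaxConnectionAge
// on the server, re-dialing gives the client a chance to land elsewhere.
func watchForever(ctx context.Context, conn *grpc.ClientConn, ttl time.Duration) {
	for ctx.Err() == nil {
		streamCtx, cancel := context.WithTimeout(ctx, ttl)
		runOneStream(streamCtx, conn)
		cancel()
	}
}

// runOneStream is a placeholder: open your generated streaming RPC on
// conn and Recv in a loop until the context is done or the stream ends.
func runOneStream(ctx context.Context, conn *grpc.ClientConn) {
	<-ctx.Done()
}

func main() {}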

2. Similarly, when one of our servers comes online there's a warm-up period while it loads data, so the stream handler returns an UNAVAILABLE error until the server is ready. But this doesn't kill the underlying gRPC connection, so the client retries the same server over and over until it has warmed up. Is there any way to mark a connection as failed based on the error a stream returns, so that the ClientConn/SubConn tries a different peer?

This is more complicated because it interacts with your server's lifecycle.

Options:
1. Have the server delay opening the gRPC port until it is ready (see the sketch after this list)
2. Have the name resolver avoid returning servers that aren't yet ready. This is easiest if the server can delay registering with your service discovery system until it is ready.
3. Have the load balancer check the backend's health before routing traffic to it. This is very difficult to do with pick-first, both now and for the foreseeable future (in the next couple of quarters we expect to work on making this easy for round robin and similar LB policies).
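For (1), a rough sketch, assuming the warm-up can happen before the listener is opened (warmUp is a placeholder for your loading step):

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	warmUp() // placeholder: load data fully before accepting connections

	// Only listen once the server can actually serve; until then,
	// clients' connection attempts fail and pick_first moves on.
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	srv := grpc.NewServer()
	// Register your services on srv here.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}

func warmUp() { /* load caches, indexes, etc. */ }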

It's unclear if (1) or (2) would be easy for you. Would they fit into your architecture?