Hey grpc users,
I recently migrated a streaming service to gRPC and have a couple of questions.
I have a small group of servers (8), discoverable via DNS, serving a large number of clients (100k+) over a streaming API. I control the client code (a fat client library) but not the clients' lifecycle (I can't restart them). I want to keep the number of clients roughly balanced across servers while each client maintains only one connection to a single server. This rules out the round_robin load balancer (or any custom Picker), since it creates one SubConn per address and so opens TCP connections to all 8 servers at once.
So far, I've implemented a rendezvous hashing algorithm in a custom DNS resolver and used the pick_first LB policy, which connects to only one peer at a time, trying addresses in priority order. However, I ran into two issues:
1. If I restart one of the servers, that server never gets any clients back unless the clients also restart. Before gRPC, we had a stream timeout that also timed out the connection, so after a certain interval the client would kill the connection and establish a new one in priority order. This allowed clients that had been kicked off their preferred peer to reconnect to it after that interval. This no longer works since I migrated to gRPC, because killing the stream doesn't terminate the underlying connection. Is there any way to do this with the grpc-go library?
2. Similarly, when a server comes online there's a warm-up period while it loads data, so the stream handler returns an UNAVAILABLE error if the server is not ready. But this doesn't kill the underlying gRPC connection, so the client retries against the same server over and over until it has warmed up. Is there any way to treat a connection as failed based on the error returned by a stream, so that the ClientConn/SubConn tries a different peer?
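For context, the rendezvous ordering mentioned above can be sketched roughly as follows. This is a minimal illustration, not the actual resolver code: the names score and rankPeers, the FNV-1a hash, and the example addresses and client ID are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// score hashes a (client, server) pair. The server with the highest
// score is that client's preferred peer. (Hypothetical helper.)
func score(clientID, addr string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(clientID))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	h.Write([]byte(addr))
	return h.Sum64()
}

// rankPeers returns a copy of addrs sorted by descending score for
// clientID. Ties are broken by address so the order is deterministic.
func rankPeers(clientID string, addrs []string) []string {
	ranked := append([]string(nil), addrs...)
	sort.Slice(ranked, func(i, j int) bool {
		si, sj := score(clientID, ranked[i]), score(clientID, ranked[j])
		if si != sj {
			return si > sj
		}
		return ranked[i] < ranked[j]
	})
	return ranked
}

func main() {
	// Example addresses only; in practice these come from the DNS lookup.
	addrs := []string{"10.0.0.1:443", "10.0.0.2:443", "10.0.0.3:443"}
	fmt.Println(rankPeers("client-42", addrs))
}
```

The ranked list is what gets handed to pick_first, so each client has a stable preferred server, and removing one server from DNS only moves the clients that preferred it.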
I'm trying to avoid re-implementing the low-level dialing, backoff, and reconnecting logic provided by gRPC's SubConn and ClientConn. But from reading the code, it doesn't look like I have much control over the lifecycle of the underlying connection, so it seems the only way to do this is to grpc.Dial each server individually and manage the DNS peer list and redials outside of the load balancer API. Is that right?