Hey folks,
I'm trying to solve a problem of even load (or at least connection) distribution between grpc clients and our backend servers.
First of all let me describe our setup:
We are using network load balancing (L4) in front of our grpc servers.
Clients see a single endpoint (the LB) and connect to it. This means that standard client-side load-balancing policies like round-robin won't help, as there will only be one subchannel for client-server communication.
One issue with this approach can be demonstrated by the following example:
Let's say we have 2 servers running and 20 clients
connect to them. At the beginning, since we go through the network
load balancer, connections will be distributed evenly (or close to that),
so we'll roughly have 50% of connections to each server.
Now let's assume these servers reboot one after another, as in a rolling deployment.
What happens is that the server that comes back up first ends up with
all 20 client connections, while the server that comes up later
has zero. This situation won't change unless clients or servers drop connections periodically, or more clients open new connections.
I've considered a few options for solving this:
1. Connection management on the client side - have each client periodically drop its connection and re-dial through the LB, so connections get reshuffled over time.
2. Connection management on the server side - drop connections periodically on the server. Downside - this approach looks less graceful than the client-side one and may impact request latency and cause request failures on the client side.
3. Use a request-based, grpc-aware L7 LB; this way clients connect to the LB, which fans out individual requests to the servers. Downside - I've been told by our infra folks that it is hard to implement in our setup due to the way we use TLS and manage certificates.
4. Expose our servers outside and use grpc-lb or client side load balancing. Downside - it seems less secure and would make it harder to protect against DDoS attacks if we go this route. I think this downside makes this approach unviable.
My bias is towards option 3, request-based load balancing, because it allows much more fine-grained control based on actual load. But since our infra can't support it at the moment, I might be forced to use option 1 or 2 in the short to mid term. Option 2 I like the least, as it might result in latency spikes and errors on the client side.
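For option 2 specifically, gRPC has built-in support that is more graceful than hard-dropping sockets: the server can set `MaxConnectionAge`, which sends an HTTP/2 GOAWAY so the client reconnects (through the LB again) while in-flight RPCs are allowed to finish within `MaxConnectionAgeGrace`. A sketch of the server options (the duration values are placeholders, not recommendations):

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer asks clients to reconnect roughly every 5 minutes
// (gRPC adds jitter to avoid synchronized reconnects) and gives
// in-flight RPCs up to 30s before the old connection is torn down.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute,  // placeholder value
		MaxConnectionAgeGrace: 30 * time.Second, // placeholder value
	}))
}
```

This is server configuration rather than a complete program; it would make option 2 behave much like option 1, but controlled from one place.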
My questions are:
1. Which approach is generally preferable?
2. Are there other options to consider?
3. Is it possible to influence the gRPC channel state in grpc-go, so that the resolver and balancer establish a new connection, similar to what enterIdle does in grpc-java? From what I can see in [clientconn.go](https://github.com/grpc/grpc-go/blob/master/clientconn.go), there is no option to move the channel to idle or trigger a reconnect some other way.
4. Is there a way to implement server-side connection management cleanly, without severely impacting clients?
Sorry for the long read,
Vitaly