Connection management and load balancing


Vitaly

Feb 18, 2021, 10:06:37 PM
to grpc.io

Hey folks,

I'm trying to solve the problem of distributing load (or at least connections) evenly between gRPC clients and our backend servers.

First of all let me describe our setup:
We are using network load balancing (L4) in front of our grpc servers.
Clients see a single endpoint (the LB) and connect to it. This means that standard client-side load balancing features such as round robin won't work, since there will be only one sub-channel for client-server communication.

One issue with this approach can be demonstrated by the following example:
Let's say we have 2 servers running and 20 clients connect to them. At the beginning, since we go through the network load balancer, connections will be distributed evenly (or close to it), so each server will have roughly 50% of the connections. Now let's assume these servers reboot one after another, as in a deployment. The server that comes up first would get all 20 connections, and the server that comes up later would have zero. This situation won't change unless a client or server drops its connection periodically or more clients connect.
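The scenario above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the `balancer` type and `simulateRollingRestart` function are hypothetical names, the L4 balancer is modeled as plain round robin over the servers that are up, and every client is assumed to reconnect immediately when its server goes down.

```go
package main

import "fmt"

// balancer is a toy model of an L4 load balancer: it hands each new
// connection to the next server that is currently up (round robin).
type balancer struct {
	up   []bool
	next int
}

// pick returns the index of the server that receives the next connection,
// or -1 if no server is up.
func (b *balancer) pick() int {
	for i := 0; i < len(b.up); i++ {
		s := (b.next + i) % len(b.up)
		if b.up[s] {
			b.next = s + 1
			return s
		}
	}
	return -1
}

// simulateRollingRestart opens one long-lived connection per client, then
// restarts the servers one after another. It returns the number of
// connections each server holds afterwards. It assumes servers >= 2, so
// that one server is always up while another restarts.
func simulateRollingRestart(clients, servers int) []int {
	b := &balancer{up: make([]bool, servers)}
	for i := range b.up {
		b.up[i] = true
	}

	conn := make([]int, clients) // conn[c] is the server client c is attached to
	for c := range conn {
		conn[c] = b.pick()
	}

	for s := 0; s < servers; s++ {
		b.up[s] = false // server s goes down and drops its connections
		for c := range conn {
			if conn[c] == s {
				conn[c] = b.pick() // the client reconnects through the LB right away
			}
		}
		b.up[s] = true // server s comes back with no connections
	}

	counts := make([]int, servers)
	for _, s := range conn {
		counts[s]++
	}
	return counts
}

func main() {
	fmt.Println(simulateRollingRestart(20, 2)) // prints [20 0]
}
```

Nothing in this model ever moves an established connection, which is why the imbalance persists until connections are recycled.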

I've considered a few options for solving this:
1. Connection management on the client side - do something to reset the channel (like [enterIdle](https://grpc.github.io/grpc-java/javadoc/io/grpc/ManagedChannel.html#enterIdle) in grpc-java). Downside - it seems that this feature was developed for Android, and I can't find similar functionality in grpc-go.
2. Connection management on the server side - drop connections periodically on the server. Downside - this approach looks less graceful than the client side one and may impact request latency and result in request failures on the client side.
3. Use a request-based, gRPC-aware L7 LB. The client would connect to the LB, which would fan out requests to the servers. Downside - I've been told by our infra guys that it is hard to implement in our setup due to the way we use TLS and manage certificates.
4. Expose our servers outside and use grpc-lb or client side load balancing. Downside - it seems less secure and would make it harder to protect against DDoS attacks if we go this route. I think this downside makes this approach unviable.

My bias is towards option 3 and request-based load balancing, because it allows much more fine-grained control based on load. But since our infra cannot support it at the moment, I might be forced to use option 1 or 2 in the short to mid term. I like option 2 the least, as it might result in latency spikes and errors on the client side.

My questions are:
1. Which approach is generally preferable?
2. Are there other options to consider?
3. Is it possible to influence the gRPC channel state in grpc-go in a way that would trigger the resolver and balancer to establish a new connection, similar to what enterIdle does in Java? From what I see in [clientconn.go](https://github.com/grpc/grpc-go/blob/master/clientconn.go), there is no option to change the channel state to idle or trigger a reconnect in some other way.
4. Is there a way to implement server side connection management cleanly without impacting client-side severely?

Here are links that I find useful for some context:


Sorry for the long read,
Vitaly

Srini Polavarapu

Feb 19, 2021, 3:50:17 PM
to grpc.io
Hi,

Option 3 is ideal, but since neither it nor option 4 is available to you, option 2 is worth exploring. Are your concerns about option 2 based on experiments you have run, or are they just a hunch? This comment has some relevant info that you could use.

Vitaly

Feb 19, 2021, 6:47:22 PM
to grpc.io
Thanks Srini,

I haven't tested option 2 yet. I would expect, though, that since the client is unaware of what is happening, we would see some request failures and latency spikes until a new connection is established. That's why I would consider it mostly for disaster prevention rather than for general connection balancing.
I'm actually now more interested in exploring option 4, as it looks like we can achieve a safe setup if we keep a proxy in front of the servers and expose a separate proxy port for each server.
Can someone recommend a good open-source grpclb implementation? I've found bsm/grpclb, which looks reasonable, but I wasn't sure whether anything else is available.

Srini Polavarapu

Feb 22, 2021, 12:34:46 PM
to grpc.io
Hi Vitaly,

Please see this post if you are planning to use gRPCLB. gRPC has moved away from the gRPCLB protocol and is adopting the xDS protocol instead. A number of xDS features, including round robin LB, are already supported in gRPC. This project might be useful to you, but I think it is blocked on this issue. This project might be useful too.

Eric Anderson

Feb 22, 2021, 2:37:21 PM
to Vitaly, grpc.io
On Thu, Feb 18, 2021 at 7:06 PM Vitaly <vitaly....@gmail.com> wrote:
1. Connection management on the client side - do something to reset the channel (like [enterIdle](https://grpc.github.io/grpc-java/javadoc/io/grpc/ManagedChannel.html#enterIdle) in grpc-java). Downside - it seems that this feature was developed for Android, and I can't find similar functionality in grpc-go.

Go doesn't go into IDLE at all today. But even so, this isn't an approach we'd encourage. enterIdle() is really for re-choosing which network to use, and using it in this case would be a hack.

2. Connection management on the server side - drop connections periodically on the server. Downside - this approach looks less graceful than the client side one and may impact request latency and result in request failures on the client side.

An L4 proxy is exactly the use case for server-side connection age (as you should have seen in gRFC A9). The impact on request latency is the connection handshake, which is no worse than if you were using HTTP/1. The shutdown should avoid races on the wire, which should prevent most errors and some latency. There are some client-side races that could cause very rare failures, but they should be well below the normal noise level of failures.

We have seen issues with GOAWAYs introducing disappointing latency, but in large part because of cold caches in the backend.

3. Use a request-based, gRPC-aware L7 LB. The client would connect to the LB, which would fan out requests to the servers. Downside - I've been told by our infra guys that it is hard to implement in our setup due to the way we use TLS and manage certificates.
4. Expose our servers outside and use grpc-lb or client side load balancing. Downside - it seems less secure and would make it harder to protect against DDoS attacks if we go this route. I think this downside makes this approach unviable.

Option 3 is the most common solution for serious load balancing across trust domains (like public internet vs data center). Option 4 depends on how much you trust your clients.

1. Which approach is generally preferable?

For a "public" service, the "normal" order of preference would be (3) L7 proxy (highest), then (2) L4 proxy + MAX_CONNECTION_AGE, then (1) manual client-side code hard-coded with magic numbers. (4) gRPC-LB/xDS could land anywhere in that ordering, depending on how much you trust your clients and what your LB needs are. (4) is the highest-performance and lowest-latency solution, although it is rarely used for public services that receive traffic from the Internet.

2. Are there other options to consider?

You could go with option (2), but expose two L4 proxy IP addresses to your clients and have the clients use round-robin. Since MAX_CONNECTION_AGE uses jitter, the connections are unlikely to both go down at the same time and so it'd hide the connection establishment latency.
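To illustrate why jitter keeps two connections from expiring together, here is a small sketch. It assumes a jitter of +/-10% around the configured age, which is what the grpc-go keepalive documentation describes for MaxConnectionAge. The function `jitteredAge` and the 5-minute base are hypothetical and are not taken from gRPC's implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredAge scales base by a factor in [1-frac, 1+frac]. The argument u is
// a uniform sample in [0, 1): u = 0 gives the shortest age, u = 0.5 leaves
// base unchanged, and u close to 1 gives the longest age.
func jitteredAge(base time.Duration, frac, u float64) time.Duration {
	factor := 1 + frac*(2*u-1)
	return time.Duration(float64(base) * factor)
}

func main() {
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	base := 5 * time.Minute // illustrative value

	// One connection through each of the two L4 proxy addresses.
	a := jitteredAge(base, 0.1, rng.Float64())
	b := jitteredAge(base, 0.1, rng.Float64())
	fmt.Printf("connection A closes after %v, connection B after %v\n", a, b)
}
```

With a 5-minute base, the two expiry times fall anywhere in a one-minute window, so a round-robin client will usually still have one live connection while the other is being re-established.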

3. Is it possible to influence the gRPC channel state in grpc-go in a way that would trigger the resolver and balancer to establish a new connection, similar to what enterIdle does in Java?

You'd have to shut down the ClientConn and replace it.
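One way to structure that replacement is a small holder that dials a fresh connection, swaps it in, and then closes the old one. This is a sketch under stated assumptions: `swappable`, `newSwappable`, and `fakeConn` are hypothetical names, and `fakeConn` stands in for `*grpc.ClientConn` (which also has a `Close() error` method) so the example runs without a server.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// swappable holds the current connection and lets callers replace it.
// C can be any closable connection type.
type swappable[C io.Closer] struct {
	mu   sync.RWMutex
	conn C
	dial func() (C, error) // hypothetical factory, e.g. a wrapper around grpc.Dial
}

// newSwappable dials the initial connection.
func newSwappable[C io.Closer](dial func() (C, error)) (*swappable[C], error) {
	c, err := dial()
	if err != nil {
		return nil, err
	}
	return &swappable[C]{conn: c, dial: dial}, nil
}

// Get returns the connection to use for the next RPC.
func (s *swappable[C]) Get() C {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.conn
}

// Replace dials a new connection, publishes it, and then closes the old one.
// If dialing fails, the old connection stays in place.
func (s *swappable[C]) Replace() error {
	fresh, err := s.dial()
	if err != nil {
		return err
	}
	s.mu.Lock()
	old := s.conn
	s.conn = fresh
	s.mu.Unlock()
	return old.Close()
}

// fakeConn stands in for a real client connection in this sketch.
type fakeConn struct {
	id     int
	closed bool
}

func (f *fakeConn) Close() error {
	f.closed = true
	return nil
}

func main() {
	next := 0
	dial := func() (*fakeConn, error) {
		next++
		return &fakeConn{id: next}, nil
	}

	s, err := newSwappable(dial)
	if err != nil {
		panic(err)
	}
	first := s.Get()
	if err := s.Replace(); err != nil {
		panic(err)
	}
	fmt.Println(first.id, first.closed, s.Get().id, s.Get().closed) // prints 1 true 2 false
}
```

Closing a real ClientConn tears down its RPCs, so a production version would want to wait for in-flight calls on the old connection to finish before calling Close.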

4. Is there a way to implement server-side connection management cleanly without severely impacting the client side?

I'd suggest giving option (2) a try and informing us if you have poor experiences. Option (2) is actually pretty common, even when using L7 proxies, as you may need to load balance the proxies.