Load balancer and resolver with Ruby


Emmanuel Delmas

Dec 3, 2020, 11:57:03 AM
to grpc.io
Hi

Question
I'm wondering how to refresh the IP list, in order to update the subchannel list, after creating a gRPC channel in Ruby using DNS resolution (which created several subchannels).

Context
I set up gRPC communication between our services in a Kubernetes environment two years ago, but we are facing issues when pods restart.

I've set up a Kubernetes headless service (in order to get all pod IPs from the DNS).
I've managed to use load balancing with the following piece of code.
stub = ExampleService::Stub.new("headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5, channel_args: {'grpc.lb_policy_name' => 'round_robin'})

But when I create new pods after the connection or a reconnection, calls are not load balanced on these new servers.
That's why I'm wondering what I should do to make the gRPC resolver refresh the list of IPs and create the expected new subchannels.

Is this achievable? Which configuration should I use?

Thanks for your help

Emmanuel Delmas 
Backend Developer
CSE Member
https://github.com/papa-cool

apo...@google.com

Dec 21, 2020, 1:42:17 PM
to grpc.io
> "But when I create new pods after the connection or a reconnection, calls are not load balanced on these new servers."

Can you elaborate a bit on what exactly is done here and the expected behavior?

One thing to note about gRPC's client channel/stub is that, in general, a client will not re-run the name resolution process unless it encounters a problem with the current connection(s) that it has. So, for example, if the following events happen:
1) client stub resolves headless-test-grpc-master.test-grpc.svc.cluster.local in DNS, to addresses 1.1.1.1, 2.2.2.2, and 3.3.3.3
2) client stub establishes connections to 1.1.1.1, 2.2.2.2, and 3.3.3.3, and begins round robining RPCs across them
3) a new host, 4.4.4.4, starts up, and is added behind the headless-test-grpc-master.test-grpc.svc.cluster.local DNS name

Then the client will continue to just round robin its RPCs across 1.1.1.1, 2.2.2.2, and 3.3.3.3 indefinitely -- so long as it doesn't encounter a problem with those connections. It will only re-query the DNS, and so learn about 4.4.4.4, if it encounters a problem.

There's some possibly interesting discussion about this behavior in https://github.com/grpc/grpc/issues/12295 and in https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md.

Emmanuel Delmas

Dec 22, 2020, 2:30:44 PM
to grpc.io
Thank you. I've set up MAX_CONNECTION_AGE and it seems to work well.
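A minimal sketch of what this can look like on a Ruby server, assuming the standard gRPC core channel argument 'grpc.max_connection_age_ms' passed through GRPC::RpcServer's server_args (ExampleServiceImpl is a hypothetical handler class):

require 'grpc'

# server_args are forwarded to gRPC core; the value is in milliseconds.
server = GRPC::RpcServer.new(
  server_args: {
    'grpc.max_connection_age_ms' => 5 * 60 * 1000  # send GOAWAY and trigger client re-resolution roughly every 5 minutes
  }
)
server.add_http2_port('0.0.0.0:50051', :this_port_is_insecure)
server.handle(ExampleServiceImpl)  # hypothetical implementation of ExampleService::Service
server.run_till_terminated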

I was looking for a way to refresh the name resolution because I'm facing another issue.
It happens that sometimes, the GOAWAY signal isn't received by the client.
In this case, I receive a bunch of DeadlineExceeded errors, with the client still sending messages to a deleted Kubernetes pod.
I wanted to trigger a refresh at that point, but I understand it is not possible.

Have you already run into this kind of issue?
Do you have any advice for handling a GOAWAY signal that is never received?

apo...@google.com

Dec 22, 2020, 3:34:32 PM
to grpc.io
> It happens that sometimes, the GOAWAY signal isn't received by the client.

Just curious, how was it determined that the GOAWAY frame wasn't received? Also, what are your values of MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE?

A guess: one possible thing to look for is if IP packets to/from the pod's address stopped forwarding, rendering the TCP connection to it a "black hole". In that case, a grpc client will, by default, realize that a connection is bad only after the TCP connection times out (typically ~15 minutes). You may set keepalive parameters to notice the brokenness of such connections faster -- see references to keepalive in https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md for more details.
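For what it's worth, a minimal sketch of what client-side keepalive could look like with the Ruby stub from earlier in this thread, assuming the standard gRPC core keepalive channel arguments (the values are illustrative only):

stub = ExampleService::Stub.new(
  "headless-test-grpc-master.test-grpc.svc.cluster.local:50051",
  :this_channel_is_insecure,
  timeout: 5,
  channel_args: {
    'grpc.lb_policy_name'                 => 'round_robin',
    'grpc.keepalive_time_ms'              => 60_000,  # ping the server every 60 seconds
    'grpc.keepalive_timeout_ms'           => 10_000,  # consider the connection dead if no ack within 10 seconds
    'grpc.keepalive_permit_without_calls' => 1        # keep pinging even when no RPCs are in flight
  }
)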

Emmanuel Delmas

Dec 23, 2020, 4:50:31 AM
to grpc.io
> Just curious, how was it determined that the GOAWAY frame wasn't received? Also, what are your values of MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE?

MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE were infinite, but this week I changed MAX_CONNECTION_AGE to 5 minutes.

I followed this documentation to display gRPC logs and see the GOAWAY signal.
To reproduce the error, I set up a channel without round-robin load balancing (only one subchannel).
ExampleService::Stub.new("headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5)
Then I repeatedly kill the server pod connected to my client. When I see in the logs that the GOAWAY signal is received, a reconnection occurs without any error in my requests. But when the reception of the GOAWAY signal is not logged, no reconnection occurs and I receive a bunch of DeadlineExceeded errors for several minutes.
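For reference, a minimal sketch of how these debug logs can be enabled, assuming the standard GRPC_VERBOSITY and GRPC_TRACE environment variables read by gRPC core:

# Set before requiring grpc so the native core picks them up at initialization.
ENV['GRPC_VERBOSITY'] = 'DEBUG'
ENV['GRPC_TRACE']     = 'http,connectivity_state'  # HTTP/2 transport frames (including GOAWAY) and channel state changes
require 'grpc'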
The error still occurs even if I create a new channel. However, if I recreate the channel adding "dns:" at the beginning of the host, it works.
ExampleService::Stub.new("dns:headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5)
The opposite is true as well. If I create the channel with "dns:" at the beginning of the host, it can lead to the same failure, and I will then be able to create a working channel by removing the "dns:" at the beginning of the host.

Have you already heard of this kind of issue? Is there some cache in the DNS resolver?

> A guess: one possible thing to look for is if IP packets to/from the pod's address stopped forwarding, rendering the TCP connection to it a "black hole". In that case, a grpc client will, by default, realize that a connection is bad only after the TCP connection times out (typically ~15 minutes). You may set keepalive parameters to notice the brokenness of such connections faster -- see references to keepalive in https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md for more details.

Yes. It is like requests go into a black hole. And as you said, it naturally fixes itself after around 15 minutes. I will add a client-side keepalive to make it shorter. But even with 1 minute instead of 15, I need to find another workaround in order to avoid degraded service for my customers.

Thank you.

Chen Song

Aug 31, 2021, 7:43:07 PM
to grpc.io
I want to follow up on this thread, as we have similar requirements (force clients to refresh server addresses from the DNS resolver as new pods are launched on K8s), but our client is in Python.

Given that the client is doing client-side LB with round_robin, is setting max_connection_age on the server side the right way to solve this problem? Will clients be able to refresh and reconnect automatically, or do we need to recreate the client (the underlying channel) periodically?
Also, the GOAWAY signal is random. Does the client implementation need to handle this in particular?

Chen

Emmanuel DELMAS

Sep 1, 2021, 8:18:39 AM
to Chen Song, grpc.io
Hi Chen

> Given that the client is doing client-side LB with round_robin, is setting max_connection_age on the server side the right way to solve this problem? Will clients be able to refresh and reconnect automatically, or do we need to recreate the client (the underlying channel) periodically?
I set max_connection_age on the server side and it works well. Nothing else to do on the client side. When max_connection_age is reached, a GOAWAY signal is sent to the client. Each time a client receives a GOAWAY signal, it automatically refreshes its DNS and creates connections to any new servers as well as a replacement for the one that was closed.

> Also, the GOAWAY signal is random. Does the client implementation need to handle this in particular?
What do you mean exactly? I'm not sure I'm able to answer this point.

Regards

Emmanuel Delmas 
Backend Developer
CSE Member


19 rue Blanche, 75009 Paris, France



Chen Song

Sep 2, 2021, 3:43:22 PM
to grpc.io
Hi Emmanuel
I have some follow-up questions.
> I set max_connection_age on the server side and it works well. Nothing else to do on the client side. When max_connection_age is reached, a GOAWAY signal is sent to the client. Each time a client receives a GOAWAY signal, it automatically refreshes its DNS and creates connections to any new servers as well as a replacement for the one that was closed.
May I ask how you monitor this? Did you verify this on the client side with gRPC debug-level logging? Or did you have your client program send gRPC requests continuously to verify this on the server side?
Also, what happens if the GOAWAY signal is received during an in-flight request (e.g., a long-lived read)? Does the read fail, or does it complete as long as max_connection_age_grace is long enough?

Best,
Chen

Emmanuel DELMAS

Sep 3, 2021, 4:39:44 AM
to Chen Song, grpc.io
Hi Chen

> May I ask how you monitor this? Did you verify this on the client side with gRPC debug-level logging? Or did you have your client program send gRPC requests continuously to verify this on the server side?
The main reason for me to ensure a DNS refresh was to get connections to all available running servers when scaling up. Indeed, new servers launched after the last DNS refresh won't receive any traffic. I tested it in my environment (Kubernetes): all my new pods started receiving traffic within a few minutes thanks to `max_connection_age`, which I set to 5 minutes.

> Also, what happens if the GOAWAY signal is received during an in-flight request (e.g., a long-lived read)? Does the read fail, or does it complete as long as max_connection_age_grace is long enough?
I didn't test it specifically, but I understand `max_connection_age_grace` is designed for that: it ensures that ongoing requests can finish. When the server sends a GOAWAY signal, it tells the client which is the last request it will process. If `max_connection_age_grace` is reached, the server sends a second signal indicating the last request that was fully processed. So no requests are lost. Furthermore, no latency is added because the new connection is established before the old one is closed.
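A minimal sketch of how the grace period can be configured alongside the age on a Ruby server, assuming the standard gRPC core channel arguments (values in milliseconds; the 30-second grace is just an illustrative value):

server = GRPC::RpcServer.new(
  server_args: {
    'grpc.max_connection_age_ms'       => 5 * 60 * 1000,  # send GOAWAY after ~5 minutes
    'grpc.max_connection_age_grace_ms' => 30 * 1000       # then allow up to 30 seconds for in-flight RPCs to finish
  }
)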

Emmanuel Delmas 
Backend Developer
CSE Member


19 rue Blanche, 75009 Paris, France

