> Just curious, how has this been determined that the GOAWAY frame wasn't
received? Also what are your values of MAX_CONNECTION_AGE and
MAX_CONNECTION_AGE_GRACE ?
MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE were both infinite, but this week I changed MAX_CONNECTION_AGE to 5 minutes.
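For reference, here is a minimal sketch of how I understand these limits map to gRPC channel args on a Ruby server (the 30-second grace value and the `ExampleServiceImpl` handler class are placeholders, not my actual setup):

```ruby
require 'grpc'

# Sketch: limit connection age so clients periodically receive a GOAWAY
# and are forced to reconnect. The grace value below is only an example.
server = GRPC::RpcServer.new(
  server_args: {
    'grpc.max_connection_age_ms'       => 5 * 60 * 1000, # MAX_CONNECTION_AGE: 5 minutes
    'grpc.max_connection_age_grace_ms' => 30 * 1000      # MAX_CONNECTION_AGE_GRACE: example value
  }
)
server.add_http2_port('0.0.0.0:50051', :this_port_is_insecure)
server.handle(ExampleServiceImpl) # hypothetical class implementing ExampleService::Service
server.run_till_terminated
```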
I followed this documentation to display gRPC logs and see the GOAWAY signal.
To reproduce the error, I set up a channel without round-robin load balancing (only one subchannel).
ExampleService::Stub.new("headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5)
Then I repeatedly kill the server pod my client is connected to. When I see in the logs that the GOAWAY signal was received, a reconnection occurs without any error in my requests. But when the reception of the GOAWAY signal is not logged, no reconnection occurs and I receive a bunch of DeadlineExceeded errors for several minutes.
The error still occurs even if I create a new channel. However, if I recreate the channel adding "dns:" at the beginning of the host, it works.
ExampleService::Stub.new("dns:headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5)
The opposite is also true: if I create the channel with "dns:" at the beginning of the host, it can lead to the same failure, and I am then able to create a working channel by removing the "dns:" from the beginning of the host.
Have you already heard of this kind of issue? Is there some cache in the DNS resolver?
> A guess: one possible thing to look for is if IP packets to/from the
pod's address stopped forwarding, rendering the TCP connection to it a
"black hole". In that case, a grpc client will, by default, realize that
a connection is bad only after the TCP connection times out (typically
~15 minutes). You may set keepalive parameters to notice the brokenness
of such connections faster -- see references to keepalive in
https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md for more details.
Yes, it is like requests go into a black hole. And as you said, it naturally fixes itself after around 15 minutes. I will add a client-side keepalive to make it shorter. But even with 1 minute instead of 15, I need to find another workaround in order to avoid degraded service for my customers.
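Concretely, the client-side keepalive I have in mind looks roughly like this (a sketch; the one-minute value is just the example above, and the exact values may need tuning against the server's ping policy):

```ruby
# Sketch: keepalive pings so a black-holed connection is noticed after roughly
# a minute instead of waiting for the ~15 minute TCP timeout.
ExampleService::Stub.new(
  'dns:headless-test-grpc-master.test-grpc.svc.cluster.local:50051',
  :this_channel_is_insecure,
  timeout: 5,
  channel_args: {
    'grpc.keepalive_time_ms'              => 60_000, # ping after 60s without activity
    'grpc.keepalive_timeout_ms'           => 10_000, # consider the connection dead if the ping is not acked within 10s
    'grpc.keepalive_permit_without_calls' => 1       # keep pinging even when no RPC is in flight
  }
)
```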
Thank you.