HTTP requests to Cluster IP in GKE seem not stable enough?


Joey Wang

Nov 2, 2021, 9:15:57 AM
to gce-discussion
App: Ruby on Rails 6.4.1
GKE Node:
Node version: v1.21.4-gke.2300
Kernel version: 5.4.120+
Container-runtime: docker://20.10.3
K8S Service: Cluster IP


Our apps talk to each other over HTTP/1.1 through Cluster IP, so a pod sometimes talks to another pod on the same node and sometimes to one on a different node.
As I understand it, this should be stable and fast enough, but since September we have been getting network errors almost every day.

I understand that HTTP/TCP is not always reliable and that we must have retries and fault tolerance at the app level.
But the errors seem too frequent to me, and it is not even a CPU issue.
As I understand it, when the CPU is too busy with incoming requests, a TCP RESET is usually sent.

Adebisi Ibirogba

Nov 4, 2021, 10:47:48 AM
to gce-discussion
Normally, performance should not be impacted if resources such as CPU and memory are adequate.

To troubleshoot the issue properly, feel free to file an issue in the public issue tracker.

Joey Wang

Nov 4, 2021, 1:51:51 PM
to gce-discussion

Thanks a lot. I will do that when I gather more information. For now, I'm fighting with another issue.

Joey Wang

Dec 1, 2021, 12:03:18 PM
to gce-discussion
Updates:

After upgrading to containerd as the runtime, based on Google's suggestion, the network problem is gone. We no longer see RESET, REFUSED, or timeout errors.

Improvements we have made to tackle this kind of problem:
1. k8s app level: a termination grace period to ensure in-flight requests are served.
2. k8s scaling (HPA): pick up surges quickly enough to boot new pods to serve them.
3. App-level optimization: retry logic in the API client, and other jobs put on a queue for Sidekiq to take care of retrying.
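
For item 3, a minimal sketch of what client-side retry logic could look like in plain Ruby. This is not our actual code: the helper name `with_retries`, the list of errors treated as transient, and the backoff defaults are all assumptions made for illustration.

```ruby
require "net/http"
require "timeout"

# Errors treated as transient here (an assumption; adjust for your client).
TRANSIENT_ERRORS = [
  Errno::ECONNRESET,    # TCP RESET from the peer
  Errno::ECONNREFUSED,  # nothing accepting on the target port
  Net::OpenTimeout,     # timed out opening the connection
  Net::ReadTimeout,     # timed out waiting for the response
  Timeout::Error
].freeze

# Hypothetical helper: runs the block, retrying on transient errors with
# exponential backoff (base_delay, 2 * base_delay, 4 * base_delay, ...).
# The sleeper is injectable so the delay can be stubbed in tests.
def with_retries(max_attempts: 3, base_delay: 0.1, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *TRANSIENT_ERRORS
    # Give up and re-raise the last error once attempts are exhausted.
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A call would wrap the HTTP request, for example `with_retries { Net::HTTP.get_response(uri) }`, where `uri` is whatever internal service URL the client targets. Retrying like this is only safe for idempotent requests; anything else is better pushed to a queue, as described in item 3.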

Assumptions about the root cause:

1. iptables NAT/netfilter in the Linux kernel. It might just be a feature switch we don't know about.
2. The web server is too busy to deal with incoming TCP requests.
3. The gap between readiness checks leaves a window in which a crashed server can still receive traffic.

Unfortunately, I don't have any evidence for these assumptions. I hope I can figure it out, or learn how to figure it out, in the future.

Derek Murphy

Dec 1, 2021, 1:22:29 PM
to gce-discussion

Hello,

I was just wondering if you feel the issue that you presented here was solved?
