After switching the container runtime to containerd, as Google suggested, the network problem is gone. We no longer see RESET, REFUSED, or Timeout errors.
Improvements we made along the way to tackle this kind of problem:
1. k8s app level: set a termination grace period so that in-flight requests are served before a pod shuts down.
2. k8s scaling (HPA): make the HPA pick up traffic surges quickly enough that new pods boot in time to serve them.
3. App-level optimization: add retry logic to the API client, and push other jobs onto a queue so that Sidekiq takes care of retrying them.
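The retry logic from point 3 can be sketched roughly like this. This is a minimal illustration, not our actual client code: the helper name `with_retries`, the `sleeper` parameter, the list of errors, and the default values are all assumptions made for the example.

```ruby
require "timeout"

# Errors that usually point to a transient network problem
# (the list is an assumption; adjust it to what your client raises).
TRANSIENT_ERRORS = [
  Errno::ECONNRESET,
  Errno::ECONNREFUSED,
  Timeout::Error
].freeze

# Runs the block and retries it on transient errors with exponential backoff.
# max_attempts and base_delay are illustrative defaults, not tuned values.
# sleeper is injectable so the wait can be skipped in tests.
def with_retries(max_attempts: 3, base_delay: 0.2, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *TRANSIENT_ERRORS
    # Give up and re-raise the original error once attempts are exhausted.
    raise if attempt >= max_attempts
    # Wait base_delay, then 2x, 4x, ... before the next attempt.
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A call would wrap the request, for example `with_retries { client.get("/orders") }`, where `client` is a hypothetical API client. Only idempotent requests should be retried this way; work that can wait is better left to a Sidekiq job, since Sidekiq retries failed jobs on its own.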
Assumptions about the root cause:
1. iptables NAT/netfilter in the Linux kernel. It might just be a feature switch we don't know about.
2. The web server was too busy to handle incoming TCP requests.
3. The gap between readiness checks leaves a window in which a crashed server still receives traffic.
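To get a feel for how large the window in assumption 3 could be, here is a rough back-of-the-envelope estimate. It assumes a pod is only marked unready after `failureThreshold` consecutive failed probes spaced `periodSeconds` apart, and it ignores the time the endpoint update needs to propagate. The function name is made up for this sketch, and the numbers are the Kubernetes probe defaults, not our real settings.

```ruby
# Rough worst-case time (in seconds) that a crashed pod can keep receiving
# traffic before the readiness probe marks it unready.
# Simplified model: the crash happens right after a successful probe, so it
# takes failure_threshold probe periods, plus one probe timeout, to notice.
def worst_case_unready_window(period_seconds:, failure_threshold:, timeout_seconds: 0)
  period_seconds * failure_threshold + timeout_seconds
end

# Kubernetes defaults: periodSeconds=10, failureThreshold=3, timeoutSeconds=1
puts worst_case_unready_window(period_seconds: 10, failure_threshold: 3, timeout_seconds: 1)
# prints 31
```

Under this model, a shorter probe period or a lower failure threshold narrows the window, at the cost of more probe traffic and a higher chance of marking a merely slow pod as unready.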
Unfortunately, I don't have any evidence for these assumptions. I hope I can figure it out, or at least learn how to investigate it, in the future.