During a 100k TPS load test, a subset of pods had errors connecting to a downstream service and we maxed out the nf_conntrack table (500k entries). That affected the rest of the pods on every node that hit the limit, which happened to be 55% of the cluster.
Besides handling this at the application level, I would like to protect the cluster as a whole so that no single deployment can affect the entire cluster in this way.
Thanks for any help.
-Jonathan
When the downstream service went south, the table rapidly grew from ~25k to 500k entries in less than a minute. I don't think there is a reasonable value we could set nf_conntrack_max to that would prevent the entire node from being affected; TPS was so high that a larger limit would only have delayed the catastrophe, not prevented it.
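For reference, both the live count and the configured maximum are exposed under /proc, so it is cheap to watch the headroom on each node. A minimal sketch (assuming a Linux node with the nf_conntrack module loaded; the 80% threshold is purely illustrative, not a recommendation):

#!/usr/bin/env python3
# Minimal sketch: compare the live conntrack entry count against the
# configured maximum so a node can alert before the table fills up.

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")

usage = count / limit
print(f"conntrack: {count}/{limit} ({usage:.1%} used)")

# 0.8 is an arbitrary illustrative threshold.
if usage > 0.8:
    print("WARNING: conntrack table is close to its limit")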
We also noticed that when this breakdown occurred, network traffic and CPU utilization on our DNS servers increased dramatically.
After installing conntrack, I dumped the list of connections by state and built a pivot table in Excel to group the connections by source and destination. The vast majority of the TCP connections were in SYN_SENT or TIME_WAIT, the source IPs were the flannel IPs of the nodes (10.x.x.0) in our cluster, and the destination IPs/ports belonged to just 2 pods. That deployment was getting crushed by connections it couldn't respond to because a downstream system was unavailable, so connections backed up as SYN_SENT and TIME_WAIT until we hit the 500k limit for that EC2 instance type (c4.4xlarge).

We are looking at some form of circuit-breaker framework, and also at limiting connections at the Spring Boot/Tomcat level. It would be nice if we could also do that as a NetworkPolicy in kube.
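For anyone wanting to reproduce that grouping without Excel, here is a minimal sketch that reads /proc/net/nf_conntrack directly and counts TCP entries by state and by original destination. It assumes the nf_conntrack module is loaded, the procfs view is compiled in, and you can read the file (typically as root); output from `conntrack -L` would need slightly different parsing.

#!/usr/bin/env python3
# Minimal sketch: group conntrack TCP entries by state and by destination.
from collections import Counter

by_state = Counter()
by_dst = Counter()

with open("/proc/net/nf_conntrack") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 6 or fields[2] != "tcp":
            continue
        state = fields[5]  # e.g. SYN_SENT, TIME_WAIT, ESTABLISHED
        # The first dst=/dport= pair describes the original direction of the flow.
        dst = dport = None
        for part in fields:
            if dst is None and part.startswith("dst="):
                dst = part.split("=", 1)[1]
            elif dport is None and part.startswith("dport="):
                dport = part.split("=", 1)[1]
        by_state[state] += 1
        by_dst[(dst, dport)] += 1

print("connections by TCP state:")
for state, n in by_state.most_common():
    print(f"  {state:12} {n}")

print("top destinations:")
for (dst, dport), n in by_dst.most_common(10):
    print(f"  {dst}:{dport}  {n}")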
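On the circuit-breaker side, in a Spring Boot/Tomcat app a library such as Resilience4j or Hystrix would be the natural fit; purely as a language-agnostic illustration of the pattern (all names and thresholds below are made up for the example), the core idea is roughly:

import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after max_failures consecutive
    failures the circuit opens and calls fail fast for reset_seconds,
    instead of piling up new connections to a dead downstream."""

    def __init__(self, max_failures=5, reset_seconds=30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

The point is to fail fast while the downstream is dead instead of letting every request open a fresh connection that ends up parked in SYN_SENT.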