We are excited to announce the Cilium 1.5 release. Cilium 1.5 is the first release where we primarily focused on scalability with respect to the number of nodes, pods and services. Our goal was to scale to 5k nodes, 20k pods and 10k services. We went well past that goal with the 1.5 release and are now officially supporting 5k nodes, 100k pods and 20k services. Along the way, we learned a lot, some expected, some unexpected; this blog post will dive into what we learned and how we improved.
Besides scalability, several significant features made their way into the release, including: BPF templating, rolling updates for transparent encryption keys, transparent encryption for direct routing, a new and improved BPF-based service load-balancer with improved fairness, BPF-based masquerading/SNAT support, Istio 1.1.3 integration, policy calculation optimizations, as well as several new Prometheus metrics to assist in operations and monitoring. For the full list of changes, see the 1.5 Release Notes.
Cilium is open source software for transparently providing and securing the network and API connectivity between application services deployed using Linux container management platforms like Kubernetes, Docker, and Mesos.
At the foundation of Cilium is a new Linux kernel technology called BPF, which enables the dynamic insertion of powerful security, visibility, and networking control logic within Linux itself. BPF is utilized to provide functionality such as multi-cluster routing, load balancing to replace kube-proxy, transparent encryption using X.509 certificates as well as network and service security. Besides providing traditional network level security, the flexibility of BPF enables security with the context of application protocols and DNS requests/responses. Cilium is tightly integrated with Envoy and provides an extension framework based on Go. Because BPF runs inside the Linux kernel, all Cilium functionality can be applied without any changes to the application code or container configuration.
Missing exponential back-off: These are typically more subtle and often come into play after the cluster has successfully scaled up slowly, followed by a sudden synchronized failure across many nodes. Without exponential back-off, the load of retries may exceed the capacity of centralized resources and a cluster may never recover again.
Missing jitter for intervals: This causes unnecessary alignment of actions towards centralized resources such as the Kubernetes apiserver. Failures due to these appear randomly and are hard to track down. While we never reproduced this type of problem, we introduced jitter into all major timer intervals to be safe.
Scaling Kubernetes watchers: The sheer size of some Kubernetes resources makes it challenging to scale watchers across large clusters without consuming significant network bandwidth. The issue itself was expected; the dimension of the problem was a surprise: we measured 400Mbit/s of network traffic when running a 2K node cluster with the standard node heartbeat intervals.
etcd panic: A bug in grpc-go can cause etcd to close an already closed Go channel, resulting in a panic. Above a certain scale, we started seeing this very frequently and it caused a lot of disruption.
Dealing with apiserver server-side rate-limiting: This is less of a problem with self-managed Kubernetes clusters, but all managed Kubernetes services we have tested with impose server-side rate-limiting of the apiserver. For most of them, the rate limiting is dependent on the cluster size. When scaling up quickly, the rate limiting can lag behind significantly.
Note: The scalability of services will depend heavily on whether and how you are running kube-proxy. Operating kube-proxy in iptables mode can cause considerable CPU consumption because it requires several individual iptables rules for each service backend. It is already possible for Cilium users to remove kube-proxy entirely if you are not relying on NodePort. Starting with Cilium 1.6, NodePort will be natively supported with BPF as well and kube-proxy can be removed entirely from all nodes.
The graph shows the min/max/avg CPU spent across all nodes (1 vCPU, 2GB RAM) as well as the sum of all kvstore operations performed by the entire cluster. The CPU consumption on individual nodes is almost flat. This is thanks to the new BPF templating support described later in this post, which avoids expensive compilation.
As you grow clusters further, additional resources are primarily required for your centralized services such as the apiserver, Prometheus and the etcd cluster used by Cilium. In this test, we have been running a dedicated 3-node etcd cluster for Cilium and the memory consumption of etcd had grown to about 700MB per instance.
The GET calls are done to retrieve pod labels of new pods for network policy evaluation. The DELETE calls are agents removing custom resources with potentially conflicting names. Doing a DELETE is cheaper than doing a GET first followed by a DELETE if needed. The POST and PATCH calls are used to create and update CiliumEndpoint custom resources. By using PATCH as available with k8s >=v1.13.0, it is no longer required to keep a local copy of these resources.
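The cost difference between the two deletion patterns can be illustrated with a toy client that counts apiserver round trips. The client and its methods here are stand-ins for illustration, not the real client-go API:

```go
package main

import "fmt"

// fakeClient counts apiserver round trips; a hypothetical stand-in for a
// real Kubernetes client.
type fakeClient struct {
	calls   int
	objects map[string]bool
}

func (c *fakeClient) Get(name string) bool { c.calls++; return c.objects[name] }
func (c *fakeClient) Delete(name string)   { c.calls++; delete(c.objects, name) }

// deleteIfExists is the expensive pattern: one GET plus a conditional DELETE.
func deleteIfExists(c *fakeClient, name string) {
	if c.Get(name) {
		c.Delete(name)
	}
}

func main() {
	a := &fakeClient{objects: map[string]bool{"pod-1": true}}
	deleteIfExists(a, "pod-1") // GET + DELETE: two round trips

	b := &fakeClient{objects: map[string]bool{"pod-1": true}}
	b.Delete("pod-1") // blind DELETE, ignoring "not found": one round trip

	fmt.Println(a.calls, b.calls) // prints "2 1"
}
```

Across thousands of agents cleaning up endpoints, halving the round trips per deletion is a meaningful reduction in apiserver load.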
It was a bit of a surprise that we were the first to hit this etcd panic as it was quite simple to reproduce above several hundred nodes. The bug was not in etcd itself but in grpc-go and has been fixed by grpc-go PR#2695. etcd PR#10624 has been opened as well to rebase on top of the fixed grpc-go.
Cilium uses the Kubernetes node resource to detect other nodes in the Kubernetes cluster, and pod resources to learn which pod is running on which node and to derive the pod labels to implement network policies.
As much as we love the standard Kubernetes resources for nodes, they do not scale very well when attempting to register a standard Kubernetes ListAndWatch from every worker node in a larger cluster. This is for two main reasons:
The node resource is used to detect stale nodes; each node will update a heartbeat field regularly, which causes the node resource to change every couple of seconds. Each of those changes is being reflected to all listeners, in this case several thousand worker nodes. The issue is being addressed in Kubernetes via KEP-009 using a new NodeLease feature. You can find further details about this in the section NodeController of the Kubernetes documentation.
As with many other resources, the node resource has grown over time. Even with Protobuf serialization enabled, transmitting the entire node resource to all worker nodes on each change of the resource results in significant load in bytes transmitted over the network and in CPU cycles spent to serialize and deserialize the large resource so often. This is made worse by the fact that only some of the fields in the resource are relevant to Cilium, so the majority of events are sent unnecessarily.
This resolves both of the previously described problems: etcd will only store a minimal version of the resource with the fields relevant to Cilium and will thus only distribute an event when significant fields have changed. It also means that the heavy-weight event is sent once to the single-instance operator and a light-weight event is distributed to the 5000 worker nodes.
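The event-suppression aspect of this design can be sketched in a few lines of Go. The struct and field names below are hypothetical, chosen only to show how a slimmed mirror of the node resource filters out heartbeat-only changes:

```go
package main

import "fmt"

// FullNode stands in for the large Kubernetes node resource; only a few
// of its many fields matter to the agents.
type FullNode struct {
	Name          string
	PodCIDR       string
	InternalIP    string
	HeartbeatTime int64 // changes every few seconds on every node
	// ... many more fields irrelevant to Cilium
}

// SlimNode carries only the fields the worker nodes actually need.
type SlimNode struct {
	Name       string
	PodCIDR    string
	InternalIP string
}

func slim(n FullNode) SlimNode {
	return SlimNode{Name: n.Name, PodCIDR: n.PodCIDR, InternalIP: n.InternalIP}
}

// shouldPropagate models the operator's decision: fan an event out to the
// workers only when the slimmed view of the node has actually changed.
func shouldPropagate(prev, cur FullNode) bool {
	return slim(prev) != slim(cur)
}

func main() {
	n := FullNode{Name: "node-1", PodCIDR: "10.0.1.0/24", InternalIP: "192.168.1.10", HeartbeatTime: 1}

	hb := n
	hb.HeartbeatTime = 2 // heartbeat-only update
	fmt.Println(shouldPropagate(n, hb)) // false: no fan-out to 5000 nodes

	moved := n
	moved.PodCIDR = "10.0.2.0/24" // a field agents care about
	fmt.Println(shouldPropagate(n, moved)) // true: distribute the event
}
```

Because only the single-instance operator watches the full resource, the per-second heartbeat churn is absorbed in one place instead of being replayed to every worker node.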
Another problem that appears at scale is the amount of memory it requires to maintain resource caches of standard resources such as nodes, services, pods, endpoints, and network policies when using the standard Kubernetes go client via interfaces such as Informers. The cache will store the full-blown resource with all fields defined in a local in-memory cache.
By introducing slimmed-down versions of various Kubernetes resources that only define the fields of relevance, the memory consumption of the agent on each node can be reduced significantly. The graph above shows the difference in memory consumption in a 2K node GKE cluster.
In order to gain visibility into how successfully Cilium interacts with the Kubernetes apiserver, we have introduced a Prometheus metric which keeps track of the number of interactions per resource as well as the return codes and latency as a histogram.
The following shows an attempt to scale up from 3 to 5K nodes without any optimizations. While cloud providers have no problem provisioning 5k nodes in parallel, 5k nodes suddenly hitting the apiserver to register themselves clearly overloads the apiserver:
The apiserver is so undersized that it gets overwhelmed immediately; calls to the apiserver simply time out and several thousand apiserver calls fail per second (yellow bars). Note that the graph only shows the interactions as performed by the apiserver; in parallel, kubelet and kube-proxy will also interact with the apiserver, adding more load to the system. This is not represented in this graph.
Shouldn't the apiserver get resized and become more powerful? Yes, but it requires the cluster to grow first, which requires kubelet to be successful in starting up. This depends on successful interactions with the apiserver. A classic chicken and egg situation. It also requires kubelet to continue being successful in updating the heartbeat timestamps of the node resource or the node will be marked stale again.