--
You received this message because you are subscribed to the Google Groups "Kubernetes user discussion and Q&A" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-use...@googlegroups.com.
To post to this group, send email to kubernet...@googlegroups.com.
Visit this group at https://groups.google.com/group/kubernetes-users.
For more options, visit https://groups.google.com/d/optout.
Currently I can work around the problem by doing the following:
Changing the GCP TCP load balancer's health check.
How it is supposed to work is described here:
https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-type-loadbalancer
However, when I created my nginx-ingress controller, the generated health check used a port that returned a 200 response on **every** node.
It should fail on the nodes that are not responsible for the load balancer.
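For reference, the behavior that tutorial describes is driven by the Service spec. A minimal sketch, with hypothetical names and ports (older releases used the `service.beta.kubernetes.io/external-traffic: OnlyLocal` annotation instead of the field):

```yaml
# Hypothetical Service: externalTrafficPolicy: Local makes kube-proxy answer
# the LB health check with 200 only on nodes that actually run a backend pod,
# and preserves the client source IP.
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-lb   # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: nginx-ingress     # hypothetical label
  ports:
  - name: http
    port: 80
    targetPort: 80
```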
To clarify: the nginx-ingress-controller pods were running on the following nodes:
gke-k8s-default-pool-e0db078d-grrf
gke-k8s-default-pool-e0db078d-grrf
gke-k8s-default-pool-01f61f68-4h5k
so it ran twice on the same node, yet the GCP TCP LB points to all three:
gke-k8s-default-pool-01f61f68-4h5k
gke-k8s-default-pool-e0db078d-grrf
gke-k8s-default-pool-87c828e9-7447
I'm also not sure how this would work with non-HTTP traffic over TCP or UDP. We currently want to run OpenVPN on top of GKE, but with this limitation traffic might be routed to the wrong node, or we would need a DaemonSet running HAProxy on every node that is behind the LB.
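The DaemonSet workaround mentioned above could look roughly like this. This is only a sketch with hypothetical names, image, and port, and it assumes HAProxy is configured separately to forward the VPN traffic:

```yaml
# Hypothetical DaemonSet: one HAProxy per node, so every node the LB points at
# has a local backend; hostPort exposes the port on the node itself.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: haproxy-lb          # hypothetical name
spec:
  selector:
    matchLabels:
      app: haproxy-lb
  template:
    metadata:
      labels:
        app: haproxy-lb
    spec:
      containers:
      - name: haproxy
        image: haproxy:2.8      # assumed image/tag
        ports:
        - containerPort: 1194   # assumed OpenVPN port
          hostPort: 1194
          protocol: UDP
```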
I reproduced this. Here's the explanation of why it is working as intended. We can discuss whether that intention is misguided or not.

When externalTrafficPolicy is "Cluster", each node acts as an LB gateway to the Service, regardless of where the backends might be running. In order to make this work, we must SNAT (which obscures the client's IP) because the traffic could cross nodes, and you can't have the final node responding directly to the client (bad 5-tuple). As we agree, this mode works but isn't what you want.

When externalTrafficPolicy is "Local", only nodes that actually have a backend for a given Service act as an LB gateway. This means we do not need to SNAT, thereby keeping the client IP. But what about nodes which do not have backends? They drop those packets. The GCE load balancer has a relatively long programming time, so we don't want to change the TargetPools every time a pod moves. The health checks are set up such that nodes that do not have a backend fail the HC - thus the LB should never route to them! Very clever.

Here's the rub - the way GCP sets up LBs is via a VIP which is a local IP on each VM. By default, access to the LB VIP from a node "behind" that VIP (all nodes in k8s) is serviced by that same VM, not by the actual LB. The assumption is that you are accessing yourself - why go through the network to do that?

Kubernetes makes an explicit provision for pods that access an LB VIP by treating them as if they accessed the internal service VIP (which is not guaranteed to stay node-local). We did not make a provision for a NODE to access the LB VIP in the same way. Maybe we could? I seem to recall an issue there, in how we distinguish traffic originating from "this VM" vs traffic we are gatewaying.

So there you see - it is doing what is intended, but maybe not what you want. Now - convince me that the use case of accessing an external LB VIP from a node in the cluster (not a pod - a node) is worth the extra complexity?
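One way to observe the "Local" health-check behavior described above is to probe each node's health-check port directly. A sketch only - the service name and node IPs are hypothetical, and it needs access to the cluster (the port and the /healthz path come from kube-proxy's per-service health check server):

```shell
# Hypothetical service name; healthCheckNodePort is allocated for Local services.
PORT=$(kubectl get svc nginx-ingress-lb \
  -o jsonpath='{.spec.healthCheckNodePort}')

# Nodes running a backend answer 200; nodes without one answer 503,
# so the GCP LB health check fails there and never routes to them.
for NODE_IP in 10.128.0.2 10.128.0.3 10.128.0.4; do   # hypothetical node IPs
  curl -s -o /dev/null -w "$NODE_IP %{http_code}\n" \
    "http://$NODE_IP:$PORT/healthz"
done
```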
One case I admit falls through the cracks is `hostNetwork` pods. They will fail.
Well, the good news is that I think this is easy to fix.

For my test case, the service hashes to NWV5X2332I4OT4T3:

iptables -t nat -I KUBE-XLB-NWV5X2332I4OT4T3 -m addrtype --src-type LOCAL -j KUBE-SVC-NWV5X2332I4OT4T3
iptables -t nat -I KUBE-XLB-NWV5X2332I4OT4T3 -m addrtype --src-type LOCAL -j KUBE-MARK-MASQ

Note these two commands are written in inverse order of evaluation because I am lazy and used -I. This will break the "only local" property for in-cluster accesses, but I think that is OK since it's explicitly not true for access from pods anyway.

The kube-proxy change here should be easy enough, but the testing is a little involved. Would you do me the favor of opening a GitHub bug so we can see if we can rally an implementor?
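To make the ordering explicit, here are the same two inserts with comments. Still only a sketch: the chain name is specific to my test case, and running this requires root on a node (-I prepends to the chain, so the commands are issued in reverse of the desired evaluation order):

```shell
# Issued first, so it ends up SECOND in the chain: for traffic originating on
# the node itself, fall through to the normal cluster-wide service chain.
iptables -t nat -I KUBE-XLB-NWV5X2332I4OT4T3 -m addrtype --src-type LOCAL -j KUBE-SVC-NWV5X2332I4OT4T3

# Issued second, so it ends up FIRST: mark that same local-origin traffic
# for SNAT, since it may now cross to another node.
iptables -t nat -I KUBE-XLB-NWV5X2332I4OT4T3 -m addrtype --src-type LOCAL -j KUBE-MARK-MASQ
```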