--
You received this message because you are subscribed to the Google Groups "Kubernetes user discussion and Q&A" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-use...@googlegroups.com.
To post to this group, send email to kubernet...@googlegroups.com.
Visit this group at https://groups.google.com/group/kubernetes-users.
For more options, visit https://groups.google.com/d/optout.
Currently I can work around the problem by doing the following:
Changing the GCP TCP load balancer's health check.
How it is supposed to work is described here:
https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-type-loadbalancer
However, when I created my nginx-ingress controller, the generated health check used a port that returned a 200 response on **every** node.
It should fail on the nodes that are not responsible for the load balancer.
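For reference, the behavior that tutorial describes is driven by the Service spec. A minimal sketch, with hypothetical names and ports (older releases used the `service.beta.kubernetes.io/external-traffic: OnlyLocal` annotation instead of the field):

```yaml
# Hypothetical Service: externalTrafficPolicy: Local makes kube-proxy answer
# the LB health check with 200 only on nodes that actually run a backend pod,
# and preserves the client source IP.
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-lb   # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: nginx-ingress     # hypothetical label
  ports:
  - name: http
    port: 80
    targetPort: 80
```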
To clarify: the nginx-ingress-controller pods were running on the following nodes:
gke-k8s-default-pool-e0db078d-grrf
gke-k8s-default-pool-e0db078d-grrf
gke-k8s-default-pool-01f61f68-4h5k
so it ran twice on the same node, yet the GCP TCP LB points to all three:
gke-k8s-default-pool-01f61f68-4h5k
gke-k8s-default-pool-e0db078d-grrf
gke-k8s-default-pool-87c828e9-7447
I'm also not sure how this would work with non-HTTP traffic over TCP or UDP. We currently want to run OpenVPN on top of GKE, but with this limitation traffic might be routed to the wrong node, or we would need a DaemonSet running HAProxy on every node that is behind the LB.
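The DaemonSet workaround mentioned above could look roughly like this. This is only a sketch with hypothetical names, image, and port, and it assumes HAProxy is configured separately to forward the VPN traffic:

```yaml
# Hypothetical DaemonSet: one HAProxy per node, so every node the LB points at
# has a local backend; hostPort exposes the port on the node itself.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: haproxy-lb          # hypothetical name
spec:
  selector:
    matchLabels:
      app: haproxy-lb
  template:
    metadata:
      labels:
        app: haproxy-lb
    spec:
      containers:
      - name: haproxy
        image: haproxy:2.8      # assumed image/tag
        ports:
        - containerPort: 1194   # assumed OpenVPN port
          hostPort: 1194
          protocol: UDP
```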
I reproduced this. Here's the explanation of why it is working as intended. We can discuss whether that intention is misguided or not.

When externalTrafficPolicy is "Cluster", each node acts as an LB gateway to the Service, regardless of where the backends might be running. In order to make this work, we must SNAT (which obscures the client's IP) because the traffic could cross nodes, and you can't have the final node responding directly to the client (bad 5-tuple). As we agree, this mode works but isn't what you want.

When externalTrafficPolicy is "Local", only nodes that actually have a backend for a given Service act as an LB gateway. This means we do not need to SNAT, thereby keeping the client IP. But what about nodes which do not have backends? They drop those packets. The GCE load balancer has a relatively long programming time, so we don't want to change the TargetPools every time a pod moves. The health checks are set up such that nodes that do not have a backend fail the HC - thus the LB should never route to them! Very clever.

Here's the rub - the way GCP sets up LBs is via a VIP which is a local IP on each VM. By default, access to the LB VIP from a node "behind" that VIP (all nodes in k8s) is serviced by that same VM, not by the actual LB. The assumption is that you are accessing yourself - why go through the network to do that?

Kubernetes makes an explicit provision for pods that access an LB VIP by treating them as if they accessed the internal service VIP (which is not guaranteed to stay node-local). We did not make a provision for a NODE to access the LB VIP in the same way. Maybe we could? I seem to recall an issue there, in how we distinguish traffic originating from "this VM" vs traffic we are gatewaying.

So there you see - it is doing what is intended, but maybe not what you want. Now - convince me that the use case of accessing an external LB VIP from a node in the cluster (not a pod - a node) is worth the extra complexity?
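One way to observe the "Local" health-check behavior described above is to probe each node's health-check port directly. A sketch only - the service name and node IPs are hypothetical, and it needs access to the cluster (the port and the /healthz path come from kube-proxy's per-service health check server):

```shell
# Hypothetical service name; healthCheckNodePort is allocated for Local services.
PORT=$(kubectl get svc nginx-ingress-lb \
  -o jsonpath='{.spec.healthCheckNodePort}')

# Nodes running a backend answer 200; nodes without one answer 503,
# so the GCP LB health check fails there and never routes to them.
for NODE_IP in 10.128.0.2 10.128.0.3 10.128.0.4; do   # hypothetical node IPs
  curl -s -o /dev/null -w "$NODE_IP %{http_code}\n" \
    "http://$NODE_IP:$PORT/healthz"
done
```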
One case I admit falls through the cracks is `hostNetwork` pods. They will fail.
Well, the good news is that I think this is easy to fix.

For my test case, the service hashes to NWV5X2332I4OT4T3:

iptables -t nat -I KUBE-XLB-NWV5X2332I4OT4T3 -m addrtype --src-type LOCAL -j KUBE-SVC-NWV5X2332I4OT4T3
iptables -t nat -I KUBE-XLB-NWV5X2332I4OT4T3 -m addrtype --src-type LOCAL -j KUBE-MARK-MASQ

Note these two commands are written in inverse order of evaluation because I am lazy and used -I. This will break the "only local" property for in-cluster accesses, but I think that is OK since it's explicitly not true for access from pods anyway.

The kube-proxy change here should be easy enough, but the testing is a little involved. Would you do me the favor of opening a GitHub bug so we can see if we can rally an implementor?
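To make the ordering explicit, here are the same two inserts with comments. Still only a sketch: the chain name is specific to my test case, and running this requires root on a node (-I prepends to the chain, so the commands are issued in reverse of the desired evaluation order):

```shell
# Issued first, so it ends up SECOND in the chain: for traffic originating on
# the node itself, fall through to the normal cluster-wide service chain.
iptables -t nat -I KUBE-XLB-NWV5X2332I4OT4T3 -m addrtype --src-type LOCAL -j KUBE-SVC-NWV5X2332I4OT4T3

# Issued second, so it ends up FIRST: mark that same local-origin traffic
# for SNAT, since it may now cross to another node.
iptables -t nat -I KUBE-XLB-NWV5X2332I4OT4T3 -m addrtype --src-type LOCAL -j KUBE-MARK-MASQ
```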