Kubernetes networking model, NAT and HostNetwork pods


antonio.o...@gmail.com

Apr 26, 2021, 4:27:03 PM
to kubernetes-sig-network
Hi,

Quoting from the web the paragraph that describes the Kubernetes networking model [1]:

"Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):

pods on a node can communicate with all pods on all nodes without NAT
agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
Note: For those platforms that support Pods running in the host network (e.g. Linux):

pods in the host network of a node can communicate with all pods on all nodes without NAT"

It is clear that if the source is a hostNetwork pod, it should be able to communicate with any other pod without NAT, but is the reverse also true?

Should "normal" pods be able to communicate with hostNetwork pods without NAT?

I realised we are not running tests to enforce the networking model. Once the definition is clear I'll add tests to verify the behaviour, and once they are stable, ask to promote them to Conformance.

[1]: https://kubernetes.io/docs/concepts/cluster-administration/networking/

[2]: https://github.com/kubernetes/kubernetes/pull/101445
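To make concrete the kind of check such a test could do, here is a rough sketch (hypothetical names, not the actual e2e code): an echo server reports the source IP it observed, and the client compares it against its own pod IP; any mismatch means something NATted the path. POD_IP is assumed to be injected via the downward API.

package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"strings"
)

// Usage: "checknat serve" on the target pod, "checknat <serverIP>" on the
// client pod.
func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checknat serve | checknat <serverIP>")
		os.Exit(2)
	}
	if os.Args[1] == "serve" {
		// Echo back the source IP we observe on each request.
		http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
			host, _, _ := net.SplitHostPort(r.RemoteAddr)
			fmt.Fprint(w, host)
		})
		http.ListenAndServe(":8080", nil)
		return
	}
	resp, err := http.Get("http://" + net.JoinHostPort(os.Args[1], "8080") + "/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	seen := strings.TrimSpace(string(body))
	if seen != os.Getenv("POD_IP") {
		fmt.Println("path was NATted: server saw", seen)
		os.Exit(1)
	}
	fmt.Println("no NAT observed")
}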


Tim Hockin

Apr 27, 2021, 1:14:36 AM
to antonio.o...@gmail.com, kubernetes-sig-network
On Mon, Apr 26, 2021 at 1:27 PM antonio.o...@gmail.com wrote:
Hi,

Quoting from the web the paragraph that describes the Kubernetes networking model [1]:

"Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):

pods on a node can communicate with all pods on all nodes without NAT
agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
Note: For those platforms that support Pods running in the host network (e.g. Linux):

pods in the host network of a node can communicate with all pods on all nodes without NAT"

It is clear that if the source is a hostNetwork pod, it should be able to communicate with any other pod without NAT, but is the reverse also true?

Should "normal" pods be able to communicate with hostNetwork pods without NAT?

Is there a model you are thinking about where it's true in one direction and not in the other? 

I realised we are not running tests to enforce the networking model. Once the definition is clear I'll add tests to verify the behaviour, and once they are stable, ask to promote them to Conformance.



Antonio Ojea

Apr 27, 2021, 3:43:22 AM
to Tim Hockin, kubernetes-sig-network
On Tue, 27 Apr 2021 at 07:14, Tim Hockin <tho...@google.com> wrote:


On Mon, Apr 26, 2021 at 1:27 PM antonio.o...@gmail.com wrote:
Hi,

Quoting from the web the paragraph that describes the Kubernetes networking model [1]:

"Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):

pods on a node can communicate with all pods on all nodes without NAT
agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
Note: For those platforms that support Pods running in the host network (e.g. Linux):

pods in the host network of a node can communicate with all pods on all nodes without NAT"

It is clear that if the source is a hostNetwork pod, it should be able to communicate with any other pod without NAT, but is the reverse also true?

Should "normal" pods be able to communicate with hostNetwork pods without NAT?

Is there a model you are thinking about where it's true in one direction and not in the other? 

Not really, but I've found two different CNI implementations masquerading "pod --> hostNetwork" traffic, so this is also a heads-up for CNI plugin implementations; I don't want to break anyone once the test merges and eventually goes to Conformance :)

I prefer direct communication both ways; that guarantees that everything will work.

Dan Winship

Apr 27, 2021, 8:28:14 AM
to Tim Hockin, antonio.o...@gmail.com, kubernetes-sig-network
On 4/27/21 1:14 AM, 'Tim Hockin' via kubernetes-sig-network wrote:
>
> On Mon, Apr 26, 2021 at 1:27 PM antonio.o...@gmail.com wrote:
>
> Hi,
>
> Quoting from the web the paragraph that describes the Kubernetes
> networking model [1]:
>
> "Kubernetes imposes the following fundamental requirements on any
> networking implementation (barring any intentional network
> segmentation policies):
>
> pods on a node can communicate with all pods on all nodes without NAT
> agents on a node (e.g. system daemons, kubelet) can communicate with
> all pods on that node
> Note: For those platforms that support Pods running in the host
> network (e.g. Linux):
>
> pods in the host network of a node can communicate with all pods on
> all nodes without NAT"
>
> It is clear that if the source is a hostNetwork pod, it should be
> able to communicate with any other pod without NAT, but is the
> reverse also true?
>
> Should "normal" pods be able to communicate with
> hostNetwork pods without NAT?
>
> Is there a model you are thinking about where it's true in one direction
> and not in the other?

That seems to be the obvious simple implementation; nodes will generally
need to know the pod network CIDR and treat traffic going to a pod IP
specially. Pods / the network plugin do not _necessarily_ need to know
the set of all valid node IPs, and so when they see traffic going from a
pod to 192.168.2.5 they don't necessarily know whether that's a node IP
or a random external host, so they NAT the traffic. Of course, a network
plugin _could_ keep track of all node IPs and not NAT traffic to them,
but that doesn't tend to happen automatically in the way that "not
NATting node-to-pod traffic" does.
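As a toy model of that egress decision (illustrative CIDR and addresses only, not any particular plugin's code): the plugin knows the pod CIDR(s) but not the set of node IPs, so anything outside the pod CIDRs gets masqueraded, including traffic to host-network pods.

package main

import (
	"fmt"
	"net"
)

func needsMasquerade(dst net.IP, podCIDRs []*net.IPNet) bool {
	for _, cidr := range podCIDRs {
		if cidr.Contains(dst) {
			return false // pod-to-pod: deliver without NAT
		}
	}
	return true // node IP or external host: indistinguishable, so NAT
}

func main() {
	_, podCIDR, _ := net.ParseCIDR("10.244.0.0/16")
	for _, ip := range []string{"10.244.1.7", "192.168.2.5"} {
		fmt.Printf("%s -> masquerade=%v\n", ip,
			needsMasquerade(net.ParseIP(ip), []*net.IPNet{podCIDR}))
	}
}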

Although also, is "pods in the host network of a node can communicate
with all pods on all nodes without NAT" supposed to imply "when a node
connects to a pod, the pod will see that node's 'primary node IP' as
the source IP"? (eg, more concretely, does it say that when a pod
receives a connection from its own node, that the source IP will be the
pod's pod.status.hostIP?) Because that's not true in many cases; the pod
will instead see the traffic as coming from the node's local IP on the
bridge associated with the pod network, or something like that.

-- Dan

>
> I realised we are not running tests to enforce the networking model.
> Once the definition is clear I'll add tests to verify the
> behaviour, and once they are stable, ask to promote them to Conformance.
>
> [1]: https://kubernetes.io/docs/concepts/cluster-administration/networking/
>
> [2]: https://github.com/kubernetes/kubernetes/pull/101445
>

Tim Hockin

Apr 27, 2021, 1:02:24 PM
to Dan Winship, antonio.o...@gmail.com, kubernetes-sig-network
A recap, to make sure I follow:

As written, the rule is "pods on a node can communicate with all pods on all nodes without NAT".  That includes hostNetwork.  Assertion made: hostNet pod -> normal pod can rely on podCIDR (but we allow multiple pod CIDRs, up to and including a disjoint range per node).  
 
Although also, is "pods in the host network of a node can communicate
with all pods on all nodes without NAT" supposed to imply "when a node
connects to a pod, the pod will see that's node's 'primary node IP' as
the source IP"? (eg, more concretely, does it say that when a pod
receives a connection from its own node, that the source IP will be the
pod's pod.status.hostIP?) Because that's not true in many cases; the pod
will instead see the traffic as coming from the node's local IP on the
bridge associated with the pod network, or something like that.

Let's go back to the original intent.  Maybe we can derive clearer requirement statements.

Assume pod P1, which is published as IP1 and serves on port 1000.
Assume pod P2, which is published as IP2 and serves on port 2000.

The main goal was to ensure that the following sort of sequence works:

1) P1 can connect to IP2 on port 2000 and reach P2
2) P2 observes the src IP of the connection as IP1
3) P2 can connect to IP1 on port 1000 and reach P1

The insistence on "without NAT" was to ensure that IP-level identity was retained, and was a reaction to Docker's default model where the only on-the-wire packets were node-IP to node-IP with SNAT on egress and DNAT on ingress.
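As a sketch of that round-trip (ports and names illustrative), the sequence in (1)-(3) amounts to something like:

package main

import (
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":2000") // P2, serving on port 2000
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept() // step 1: P1 connects to IP2:2000
		if err != nil {
			log.Fatal(err)
		}
		// Step 2: P2 observes the source IP of the connection.
		peer, _, _ := net.SplitHostPort(conn.RemoteAddr().String())
		conn.Close()
		// Step 3: if nothing NATted the path, "peer" is IP1 and this
		// reaches P1; with SNAT in the path, it is some other address
		// and the dial fails or hits the wrong endpoint.
		back, err := net.Dial("tcp", net.JoinHostPort(peer, "1000"))
		if err != nil {
			log.Printf("connect-back to %s failed: %v", peer, err)
			continue
		}
		back.Close()
	}
}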
 
Now, 6 years later (!!!) - do those same requirements hold?  Is there a better way to express them?  And is hostNetwork just an oddball case that needs to be excepted?  The case you bring up with overlays and bridge interfaces is interesting.  In (2) above, IP1 would be the node's IP.  What should P2 see as the src IP in the packets?

How would you rewrite the rules to accommodate?

Dan Winship

Apr 27, 2021, 2:48:49 PM
to Tim Hockin, antonio.o...@gmail.com, kubernetes-sig-network
Well... does it though? If "pods" is supposed to include host-network
pods, then why do we say both

pods on a node can communicate with all pods on all nodes without NAT

AND

pods in the host network of a node can communicate with all pods on
all nodes without NAT

If we had meant for the first rule to apply to host network pods too,
then why did we write the second rule? But then if "pods" alone doesn't
include host-network pods, then neither rule is saying anything about
traffic *to* host-network pods.

(Unless "pods" alone means "just pod-network pods" but "all pods" means
"both pod-network and host-network pods" !!!)

> Assertion made:
> hostNet pod -> normal pod can rely on podCIDR (but we allow multiple pod
> CIDRs, up to and including a disjoint range per node). 

So just to be clear, I wasn't arguing about what _should_ be the case.
You just asked if there was a model where masquerading would occur in
one direction but not the other, and I was pointing out that under a
very plausible set of circumstances, a developer could end up writing a
network plugin that ends up masquerading pod-to-node traffic but does
not masquerade node-to-pod traffic, and since there are no conformance
tests around this, there would be nothing to make the developer think
they had done anything wrong.

(For values of "a developer" equal to "me".)

In the case I gave, the source IP observed by P2 would be a valid IP of
the node, it just wouldn't be the canonical IP for the node. If all P2
cares about is "I can connect back to the IP that someone else connects
to me from" or "each client has a distinct source IP [unless two clients
share a network namespace]" then it's still OK. It would only run into
problems if it was more actively trying to interpret the IP. (Eg, if it
wanted to respond to health checks in a different way than other
requests, so it checks to see if the source IP is the node IP.)
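For example, something like the following (a hypothetical sketch, assuming status.hostIP is injected via the downward API as HOST_IP) would misbehave on such plugins:

package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"os"
)

func main() {
	hostIP := os.Getenv("HOST_IP") // assumed: status.hostIP via downward API

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		src, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// Fragile: with many plugins, node-local traffic arrives from the
		// bridge IP rather than status.hostIP, so this branch never fires.
		if src == hostIP {
			fmt.Fprintln(w, "ok (probe from my own node)")
			return
		}
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}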

> How would you rewrite the rules to accommodate?

Well, we need to figure out what we're saying first. The most
restrictive version (node-to-pod must use the node's canonical IP):

When a pod connects to another pod via one of its `podIPs`, the
source IP, as seen by the destination pod, must be one of the
source pod's `podIPs`. (This applies to both pod-network and
host-network pods, and thus implies by extension that neither
pod-to-node nor node-to-pod traffic is NATted, and that
node-to-pod traffic will use the same "primary" IPv4 or IPv6
address that the node would assign to a host-network pod.)

For the slightly-less restrictive version:

When a pod-network pod connects to a pod-network or host-network
pod via one of its `podIPs`, the source IP, as seen by the
destination pod, must be one of the source pod's `podIPs`.

When a host-network pod connects to a pod-network or host-network
pod via one of its `podIPs`, the source IP, as seen by the
destination pod, must be an IP owned by the source node.

(This implies that neither pod-to-node nor node-to-pod traffic is
NATted, but that node-to-pod traffic may make use of internal IP
addresses, which the pod may not necessarily be able to
associate with a specific node.)

For the even-less restrictive version:

When a pod-network pod connects to another pod-network pod
via one of its `podIPs`, the source IP, as seen by the destination
pod, must be one of the source pod's `podIPs`.

When a pod-network pod connects to a host-network pod, or a
host-network pod connects to a pod-network pod, no guarantees are
made about the source IP that will be seen by the destination pod,
and it is possible that the traffic will be NATted in a way that
prevents the destination pod from distinguishing different source
pods.

-- Dan


Antonio Ojea

Apr 28, 2021, 5:59:39 PM
to Dan Winship, Tim Hockin, kubernetes-sig-network
I was talking with Dan Winship offline about this topic, and one of my comments was that I've observed some "interesting" behaviours when implementing the tests for services with hostNetwork endpoints, so he immediately suggested raising these issues to avoid inconsistency between clusters.
There was also this recent issue with lifecycle hooks failing to reach pods on other nodes ...

It seems that Kubernetes has evolved; Tim mentioned 6 years have passed since that model was published... maybe it is a good time to revisit this topic. But since there are many combinations between pods, hostNetwork pods, kubelet probes/hooks, ... and those are hard to explain in an email, I'm going to create a gdoc with all the "networking" scenarios so we can evaluate them against the current networking model.

I will attach the document to this thread, and will try to send it next week ... no promises though :)


Antonio Ojea

May 13, 2021, 11:15:03 AM
to Dan Winship, Tim Hockin, kubernetes-sig-network
I've added this topic to today's agenda and I've created some slides to try to provide context to the topic.

lars.g...@est.tech

May 17, 2021, 7:08:10 AM
to kubernetes-sig-network
Hi,

Good overview! But...

The "wrong source" for udp packets on a multi-interface/address server is a known problem, and not a K8s problem. A good explanation is here;


I think it would be a big mistake to try to fix this problem in some way in K8s. Not only would it be very hard, it would also be unique to K8s. IMHO network programmers using UDP (which is much harder than TCP) must learn about the problems involved.

Another UDP problem is MTU. In IPv6, fragmentation is only allowed at the sender, a bit like always having Don't-Fragment set in IPv4. Basically, a UDP programmer using IPv6 must use PMTU discovery (or ensure packet size <= 1280 bytes).
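For illustration, a minimal sketch of the usual fix for the wrong-source problem on Linux (using golang.org/x/net/ipv4 control messages to pin the reply's source address to the address the request arrived on; error handling trimmed):

package main

import (
	"log"
	"net"

	"golang.org/x/net/ipv4"
)

func main() {
	// Listen on the wildcard address; on a multi-homed host the kernel may
	// otherwise pick a different source address for replies than the one
	// the client originally sent to.
	c, err := net.ListenPacket("udp4", ":5353")
	if err != nil {
		log.Fatal(err)
	}
	p := ipv4.NewPacketConn(c)
	// Ask the kernel to report which local address each datagram arrived
	// on (IP_PKTINFO on Linux).
	if err := p.SetControlMessage(ipv4.FlagDst|ipv4.FlagInterface, true); err != nil {
		log.Fatal(err)
	}
	buf := make([]byte, 1500)
	for {
		n, cm, src, err := p.ReadFrom(buf)
		if err != nil {
			log.Fatal(err)
		}
		// Pin the reply's source address to the address the request was
		// sent to, so a connected client socket doesn't drop it as coming
		// from a "wrong source".
		wcm := &ipv4.ControlMessage{Src: cm.Dst, IfIndex: cm.IfIndex}
		if _, err := p.WriteTo(buf[:n], wcm, src); err != nil {
			log.Println(err)
		}
	}
}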

Regards,
Lars Ekman

lars.g...@est.tech

May 20, 2021, 8:49:33 AM
to kubernetes-sig-network
Hi again,

A much better and more complete UDP-server example is at https://github.com/miekg/dns/blob/master/udp.go

Regards,
L Ekman

Antonio Ojea

May 28, 2021, 5:46:04 AM
to lars.g...@est.tech, kubernetes-sig-network
Hi,

I just want to point out that, for me, the importance of this topic is in not breaking users. I.e., if I have a manifest with probes or container hooks, I'd love the project to tell me whether this will work only for pods on the same node or whether I can use it to check pods on other nodes. I'm afraid that the precedent is the latter, since the existing conformance test was not assuming where the pod was scheduled: https://github.com/kubernetes/kubernetes/pull/101063

I also think that it's an error if we allow it to depend on CNI implementations, because it will generate more confusion for users: "I can use probes between nodes with CNIs X, Y and Z but not with CNI W", and it will break portability.

In addition, it was clear to me from yesterday's sig-network meeting that platform specifics matter, and I honestly have close to zero knowledge of the Windows area; there was a thread some time ago about hostNetwork pods on the mailing list, and that is all I know .... I think we should use this opportunity to try to avoid these ambiguities so that users and CNI implementations on Windows can benefit.

Tim Hockin

May 28, 2021, 5:58:06 PM
to Antonio Ojea, lars.g...@est.tech, kubernetes-sig-network
Thanks Antonio for being persistent on this.

On the question of kubelet being able to reach pods on other nodes, I
think the most relevant point is that kubelet is not NECESSARILY in
the host network the same way an arbitrary pod is. We have never
required that (some folks run it in a pod, apparently Windows does
other things). I don't think it is reasonable for us to require that
it be so, at this point.

On the question of whether pod->node traffic must be without NAT, I
acknowledge that reasonable implementations might do that, and I don't
feel good about the idea of clawing that back. In fact, I wonder if
the requirement for host-net-pod to pod-net-pod w/o NAT is too
draconian, and I suspect that if we actually started testing it, we'd
find it was actually not working (for some situations) in the first
place.

Should we relax that "without NAT" clause?

Is the rest ambiguous, then? E.g. something like:

"""
In every cluster, there exists an abstract pod-network to which pods
are connected by default, unless explicitly configured to use the
host-network (which is an optional capability).

Kubernetes imposes the following fundamental requirements on any
networking implementation (barring any intentional network
segmentation policies):
* pod-network pods on a node can communicate with all pod-network
pods on all nodes without NAT
* non-pod agents on a node (e.g. system daemons, kubelet) can
communicate with all pods on that node

For those platforms that support pods running in the host-network (e.g. Linux):
* host-network pods on a node can communicate with all pods on all
nodes, with NAT if required
"""

Does that cost us any valuable semantics? NetworkPolicy is already
challenging for host-net pods.

Dan Winship

Jun 1, 2021, 9:23:33 AM
to Antonio Ojea, kubernetes-sig-network
On 5/28/21 5:45 AM, Antonio Ojea wrote:
> Hi,
>
> I just want to point out that for me the importance of this topic is on
> not breaking users, i.e. if I have manifest with probes or container
> hooks, I'd love that the project tell me if this will work only on pods
> on the same node or if I can use it to check pods in another nodes. I'm
> afraid that the precedent is the later, since the existing conformance
> test was not assuming where the pod was scheduled
> https://github.com/kubernetes/kubernetes/pull/101063

Pod probes are between a pod and its own kubelet. A kubelet on node A
has no reason to check the liveness/readiness of a pod on node B, and if
for some reason it decided it wanted to, we have never claimed that it
is required to succeed. (In fact, if the pod on node B is isolated by
NetworkPolicy, then the connection attempt from kubelet A is expected to
fail [though not actually required to, I think].)

The problem with 101063 is that we only *require*:

1. kubelet can reach all pods on the same node
2. host-network pods (if they exist) can reach all pods on all nodes

but some code had also mistakenly assumed

*3. kubelet behaves like a host-network pod

and that's not true on Windows.

-- Dan

Antonio Ojea

Jun 2, 2021, 11:20:20 AM
to Dan Winship, jay vyas, Lachlan Evenson, kubernetes-sig-network
This sentence makes a lot of sense to me; you've totally convinced me:

"Pod probes are between a pod and its own kubelet."

Do we move forward with Tim's and Dan's comments and amend the docs accordingly?
@jay vyas @Lachlan Evenson Is that OK for Windows too?

Lazy consensus until the next sig-network meeting on Jun 10th?

Tim Hockin

Jun 4, 2021, 7:11:54 PM
to Antonio Ojea, Dan Winship, jay vyas, Lachlan Evenson, kubernetes-sig-network
Explicit Consensus +1