RFC/PoC kubelet CNI multi-network implementation based off Joji's spec


Dan Williams

Jul 19, 2017, 3:24:42 PM
to kubernetes-sig-network
Hi,

https://github.com/dcbw/kubernetes/commits/kube-cni-multi-network

This is an implementation of the proposed multi-network capability
specified in Joji's document which is here:

https://docs.google.com/document/d/1TW3P4c8auWwYy-w_5afIPDcGNLK3LZf0m14943eVfVg/edit

I hope this can help push things forward and perhaps convince those who
are less enamored of multiple networks that the changeset isn't as
large or drastic as they may think.

The top-level multi-network directory has more details and some example
files that you can use to test this out.

TLDR:

1) put some CNI JSON conf files in /etc/cni/net.d
2) if you've got a "thick" plugin, put it into /opt/cni/bin
3) kubectl create -f multi-network/network-crd.yaml
4) create some Custom Resource network objects; see
multi-network/dbnet.yaml for an example of a "thin" plugin and
multi-network/thick-plugin-example.yaml for a "thick" plugin example
5) create a pod and add the "alpha.network.k8s.io/networks" annotation,
for example:

alpha.network.k8s.io/networks: dbnet,othernet,thick-plugin-example

where you list the names of Custom Resource network objects you've just
created. All these networks will now be added to your pod and you'll
get multiple interfaces and IP addresses inside that pod.
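
For illustration, a pod using that annotation (step 5) might look roughly
like the following; the pod name, image, and command are placeholders, and
the network names must match the Custom Resource network objects created in
step 4:

apiVersion: v1
kind: Pod
metadata:
  name: multi-net-example
  annotations:
    # the first network listed becomes the pod's "default" network
    alpha.network.k8s.io/networks: dbnet,othernet,thick-plugin-example
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]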


Notes:
==============

* this implementation is backwards compatible; if the annotation does
not exist, the "default" network is used, which is exactly the same as
the network CNI currently picks, e.g. the first file in /etc/cni/net.d

* the first network in the annotation list is the "default" network and
only this network's reported IP will be used in the API. Others are
discarded.

* kubelet restart is not yet correctly handled; there is an internal
cache of networks added to the pod that is not persisted. This means
on restart we cannot reliably tear a pod down. A final implementation
would need to checkpoint these.

* it's a bit ugly to pass the kubeConfig file down to plugins, but I'm
not really sure how else to do it since the CNI driver could be remote
and we can't rely on a pre-created kube client object. That said, we
could just pass the kube client along and other things that vendor the
CNI driver (e.g. CRI-O) will just deal with it.

* if a network is removed while pods use it, the pods are left running
and the network is torn down at pod teardown time. The implementation
caches the config JSON and sends it to the plugin at teardown even if
the config file is missing from /etc/cni/net.d

* new networks are read from /etc/cni/net.d and /opt/cni/bin each time
the SyncLoop happens, which is every couple of seconds


Spec limitations
================

* Detecting and working with "thick" plugins is not great. To generate
configuration to send them, we must know what CNI versions they
support. Usually that's provided by the JSON .conf files, but we don't
have those here. So we'd have to query each plugin with the VERSION
command and find the intersection between what kube supports and what
the plugin supports and use that in the generated JSON.

Instead, what if we had /etc/cni/kubenet.d or something in which you
would create a JSON config file that had everything for your plugin
*except* the network name? Kubelet then parses that (and gets
Capabilities, CNIVersion, etc.) and adds the network name before
calling the plugin. This gets us HostPorts too.
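
As a rough sketch (the plugin type below is made up, purely to illustrate
the idea), such a file would carry everything except "name", and kubelet
would inject the network name before invoking the plugin:

{
  "cniVersion": "0.3.0",
  "type": "some-thick-plugin",
  "capabilities": { "portMappings": true }
}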

* kubelet restart; as mentioned above, we'd now have to checkpoint pod
networks somewhere, either in the CNI driver itself, or in the runtime.
Since the pod may be deleted from the apiserver by the time teardown
happens, we cannot rely on reading the annotation to figure out which
networks the pod was using. We must save it in persistent storage.

* network interface names; the implementation uses "ethX" where X is
the index of the network name in the annotation list. Probably sub-
optimal; do we want to allow a pod to pick its interface name for any
given network?

* split between pod-level details and network details. Related to the
network interface issue. I know I argued against this in the past, but
I now realize I may be wrong. There may be details that are specific
to the pod that shouldn't be put into the generic network config,
whether that's a thick or thin plugin. For example, the machine might
have a number of NICs that should be moved directly into pods; this
could use a generic CNI plugin for the move, but each pod would require
a different IP address, and we cannot encode that IP in the network
JSON.

And maybe more I'm forgetting, which I'll follow up with if I think
about them.

Dan

Kuralamudhan Ramakrishnan

Jul 20, 2017, 6:35:15 AM
to kubernetes-sig-network
+1 Dan.

Really impressive PoC work on Kubernetes multi-networking.

I think it will work for both TPR and CRD.

Is it possible to use the CRD/TPR clientset for in-cluster communication? I think we need kubeconfig only for out-of-cluster communication. We could use the kube client clientset for communication between the kubelet and the kube-apiserver.

With regards,
Kural.

cmluciano

Jul 20, 2017, 12:00:58 PM
to kubernetes-sig-network
+1 This is great work. This will really enhance some of the device plugin PoCs in the resource mgmt WG as well. 

karun chennuri

Jul 28, 2017, 3:38:10 PM
to kubernetes-sig-network, karun.c...@huawei.com

This will be a long email :) sorry about that, but it's a comprehensive report on the implementation.


Hi Dan


I spent some time going through your changes and really appreciate your putting things together. There is no question from my end on the actual use case; my concern is only with the way the implementation is done. I want to go back to two earlier threads to give my honest opinion.

https://groups.google.com/forum/#!msg/kubernetes-sig-network/8nDSQ2hF40w/RKnTe4dEBAAJ;context-place=forum/kubernetes-sig-network

https://groups.google.com/forum/#!topic/kubernetes-sig-network/k-vkTBOVJSg


(There is no specific order of priority in the points below.)

  1. The threads above show some supporting such an implementation inside k8s as a first-class object, and most against it. The general consensus was to implement it outside k8s through annotations, e.g. CNI-Genie, Multus, etc. So have we settled this yet: should it live inside k8s with API changes, or outside k8s? Your implementation is inside k8s using TPR/CRD annotations.
  2. This implementation is still annotation-based and not first-class objects. There are already existing custom external plugins like CNI-Genie which offer the same functionality without touching a line of code in Kubernetes. So to me this implementation still looks like a temporary fix to get things going, and it doesn't add much beyond what custom external CNI plugins already offer outside Kubernetes.
  3. Does the implementation currently list multiple IPs when we query through $ kubectl get pods -o wide? If we are to make code changes inside k8s, that's probably one of the first features we would expect, isn't it?
  4. One of the questions raised earlier, and a possible strong reason why we need this functionality as part of the k8s API, is being able to schedule pods on a host that provides access to the network of choice. But I don't see this happening in your implementation!?
  5. Does this implementation address how services interact with multiple networks and egress traffic? I haven't seen it yet (maybe because it's still a PoC?).
  6. Since we have listed the use cases already, do we have concrete use cases that absolutely need API changes inside k8s, without which the use case can't be addressed?
  7. Also let me refer to PR https://github.com/kubernetes/kubernetes/pull/47536, which was closed (I presume because of pt. 1 above :)

Overall to me this implementation still falls short at various levels. 


With Regards

Karun Chennuri 

Peter Zhao

Aug 4, 2017, 1:03:08 PM
to kubernetes-sig-network, karun.c...@huawei.com
+1, Dan. Thanks for the effort to push things forward. Although we might not be able to make networks first-class objects for various reasons, we DO need the feature.

Besides Karun's concerns, I also have a few cents to add.

  1. Is there a graceful way to add a new network dynamically? Say users want their pods to join a new network: should they add the new CNI conf files to EVERY node manually? I remember this was discussed in Joji's doc but there was no consensus.
  2. It'd be good to allow users to define network interface names and provide a way to pass the names to the pod, e.g. via the Downward API (see the sketch after this list).
  3. We may want to let K8S respect all the IPs which the CNI plugin returns, and allow a pod to be exposed as a service endpoint via a specific interface/IP.
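
For point 2, one sketch (not specific to this PoC) is the existing Downward
API volume, which can already expose the pod's annotations, including the
networks annotation, to the containers:

apiVersion: v1
kind: Pod
metadata:
  name: net-aware-pod
  annotations:
    alpha.network.k8s.io/networks: dbnet,othernet
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: podinfo
      mountPath: /etc/podinfo
      readOnly: true
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: annotations
        fieldRef:
          fieldPath: metadata.annotations

The container can then read /etc/podinfo/annotations to see which networks it
was attached to; interface names would need a similar convention.
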
With regards,
Peter Zhao from ZTE.

Tim Hockin

Aug 6, 2017, 7:02:12 PM
to Peter Zhao, kubernetes-sig-network, Karun Chennuri
Prepping for tomorrow.

I don't understand the thick-plugin demo. Where does it receive the
rest of its config (e.g. IPAM) ?

The kubeconfig stuff highlights (again) that maybe we should think
about more native ways of deploying these drivers. If they were
deployed as a DaemonSet, for example, they could come with all the
credentials needed to access the master.
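
Roughly, I'm picturing an installer DaemonSet that drops the plugin binary
and conf onto each host and then idles; the image name and file names below
are made up, just to sketch the shape of it:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: install-some-cni
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: install-some-cni
    spec:
      # the service account carries the credentials to reach the master
      serviceAccountName: some-cni
      containers:
      - name: install
        image: example.com/some-cni-installer:latest
        command:
        - /bin/sh
        - -c
        - cp /install/some-plugin /host/opt/cni/bin/ &&
          cp /install/some-net.conf /host/etc/cni/net.d/ &&
          sleep infinity
        volumeMounts:
        - name: cni-bin
          mountPath: /host/opt/cni/bin
        - name: cni-conf
          mountPath: /host/etc/cni/net.d
      volumes:
      - name: cni-bin
        hostPath:
          path: /opt/cni/bin
      - name: cni-conf
        hostPath:
          path: /etc/cni/net.d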

I'll go re-read Joji's doc now.

Dan Williams

Aug 7, 2017, 11:48:59 AM
to Tim Hockin, Peter Zhao, kubernetes-sig-network, Karun Chennuri
On Sun, 2017-08-06 at 16:01 -0700, 'Tim Hockin' via kubernetes-sig-
network wrote:
> Prepping for tomorrow.
>
> I don't understand the thick-plugin demo.  Where does it receive the
> rest of its config (e.g. IPAM) ?

It doesn't. "Thick" plugins are expected to handle *everything*
internally, including IPAM, because they are much more complicated,
have daemons of their own that monitor Kube state, etc.

As we'd defined it (and I don't think we should really use these terms
officially):

thin: all configuration held in the CNI .conf/conflist file; each
network should be configured in a separate file (see the example after
these definitions)

thick: the plugin itself handles everything internally; it just wants
to know the network name so it can go off and look up config in
whatever internal databases it uses.
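
For example, a "thin" network in this scheme is just an ordinary CNI conf
file like the one below; the bridge name and subnet are made up:

{
  "cniVersion": "0.3.0",
  "name": "dbnet",
  "type": "bridge",
  "bridge": "dbnet0",
  "ipam": {
    "type": "host-local",
    "subnet": "10.10.0.0/16"
  }
}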

I'm imagining that "thick" plugins would be ones like OpenStack, OVN,
whatever Mike Spreitzer is working on, etc. Plugins that keep a ton of
internal state themselves. FWIW, we currently implement the OpenShift
SDN plugin this way too, because we already have a "network"-type
object internally and all the credentials to access the apiserver.

> The kubeconfig stuff highlights (again) that maybe we should think
> about more native ways of deploying these drivers.  If they were
> deployed as a DaemonSet, for example, it could come with all the
> credentials needed to access the master.

Yeah, that would be nice. But how do we deploy something that's not
actually a pod or long-lived process?

Dan

Peter Zhao

Aug 7, 2017, 12:06:30 PM
to kubernetes-sig-network, tho...@google.com, zhao.xi...@zte.com.cn, karun.c...@huawei.com
> thick: the plugin itself handles everything internally; it just wants
> to know the network name so it can go off and look up config in
> whatever internal databases it uses.

Exactly. We at ZTE have such a "thick" plugin, which talks to K8S northbound and Neutron southbound. It gets network names from pod annotations and handles everything internally. It doesn't use config files on each node to pass network info; we have a manager on the master to manage the "real" networks.

Kaveh Shafiee

Aug 8, 2017, 6:30:44 PM
to kubernetes-sig-network, tho...@google.com, zhao.xi...@zte.com.cn, karun.c...@huawei.com
Hi everyone,

I would like to thank Dan again for initiating the meeting on multi-networking yesterday.

As I mentioned during the meeting, a few months ago we shared a design doc. with the SIG on supporting 'Service' for multi-networking.

Based on yesterday's discussions, to enable multi-networking, one of the fundamentals we need is a reasonable design for 'Service'. I summarized our previous doc. into a 2-pager and would like to share it again as a starting point:

Note that the purpose of this doc. is to find common ground on the design of 'Service'. So, any feedback is greatly appreciated.

Thanks
Kaveh 