Hi,
https://github.com/dcbw/kubernetes/commits/kube-cni-multi-network
This is an implementation of the proposed multi-network capability
specified in Joji's document, which is here:
https://docs.google.com/document/d/1TW3P4c8auWwYy-w_5afIPDcGNLK3LZf0m14943eVfVg/edit
I hope this can help push things forward, and perhaps convince those
who are less enamored of multiple networks that the changeset isn't as
large/drastic as they may think.
The top-level multi-network directory has more details and some
example files that you can use to test this out.
TLDR:
1) put some CNI JSON conf files in /etc/cni/net.d
2) if you've got a "thick" plugin, put it into /opt/cni/bin
3) kubectl create -f multi-network/network-crd.yaml
4) create some Custom Resource network objects; see
   multi-network/dbnet.yaml for a "thin" plugin example and
   multi-network/thick-plugin-example.yaml for a "thick" plugin example
5) create a pod and add the "alpha.network.k8s.io/networks" annotation,
   for example:

   alpha.network.k8s.io/networks: dbnet,othernet,thick-plugin-example

   where you list the names of the Custom Resource network objects
   you've just created. All of these networks will be added to your
   pod, and you'll get multiple interfaces and IP addresses inside the
   pod. (Sketches of a conf file and a full pod spec are just below.)
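For step 1, the conf files use the standard CNI format; a minimal
sketch (the bridge/host-local types and the subnet here are just
illustrative):

    {
        "cniVersion": "0.3.1",
        "name": "dbnet",
        "type": "bridge",
        "bridge": "cni-dbnet",
        "ipam": {
            "type": "host-local",
            "subnet": "10.10.0.0/24"
        }
    }

And step 5 with a full pod spec might look like this (also a sketch;
the pod name, image, and command are placeholders, and the annotation
is the one above):

    apiVersion: v1
    kind: Pod
    metadata:
      # placeholder pod name
      name: multi-net-pod
      annotations:
        # comma-separated names of Custom Resource network objects
        alpha.network.k8s.io/networks: dbnet,othernet,thick-plugin-example
    spec:
      containers:
      - name: test
        # placeholder long-running image/command
        image: busybox
        command: ["sleep", "3600"]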
Notes:
==============
* this implementation is backwards compatible; if the annotation does
not exist, the "default" network is used, which is exactly the same as
the current network picked by CNI (e.g., the first file in
/etc/cni/net.d)
* the first network in the annotation list is the "default" network and
only this network's reported IP will be used in the API. Others are
discarded.
* kubelet restart is not yet correctly handled; there is an internal
cache of networks added to the pod that is not persisted. This means
on restart we cannot reliably tear a pod down. A final implementation
would need to checkpoint these.
* it's a bit ugly to pass the kubeConfig file down to plugins, but I'm
not really sure how else to do it, since the CNI driver could be remote
and we can't rely on a pre-created kube client object (see the sketch
after this list). That said, we could just pass the kube client along,
and other things that vendor the CNI driver (e.g., CRI-O) would just
deal with it.
* if a network is removed while pods are using it, those pods are left
running and the network is torn down at pod teardown time. The
implementation caches the config JSON and sends it to the plugin at
teardown even if the config file is missing from /etc/cni/net.d
* new networks are read from /etc/cni/net.d and /opt/cni/bin each time
the SyncLoop runs, which is every couple of seconds
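To illustrate the kubeConfig point above: one way a driver could pass
it down is to inject the path into the plugin's config JSON before
invoking the plugin, something like the following (the "kubeconfig" key
and the path are purely illustrative, not a settled interface):

    {
        "cniVersion": "0.3.1",
        "name": "thick-plugin-example",
        "type": "thick-plugin-example",
        "kubeconfig": "/etc/kubernetes/kubeconfig"
    }

The plugin would then build its own kube client from that file, which
still works when the CNI driver is remote and no shared client object
is available.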
Spec limitations
================
* Detecting and working with "thick" plugins is not great. To generate
configuration to send them, we must know which CNI versions they
support. Usually that's provided by the JSON .conf files, but we don't
have those here. So we'd have to query each plugin with the VERSION
command, find the intersection between what kube supports and what the
plugin supports, and use that in the generated JSON.
Instead, what if we had /etc/cni/kubenet.d or something, in which you
would create a JSON config file that has everything for your plugin
*except* the network name? Kubelet then parses that (and gets
Capabilities, CNIVersion, etc.) and adds the network name before
calling the plugin (a sketch is after this list). This gets us
HostPorts too.
* kubelet restart; as mentioned above, we'd now have to checkpoint pod
networks somewhere, either in the CNI driver itself or in the runtime
(a sketch of one possible checkpoint is after this list). Since the pod
may be deleted from the apiserver by the time teardown happens, we
cannot rely on reading the annotation to figure out which networks the
pod was using; we must save it in persistent storage.
* network interface names; the implementation uses "ethX", where X is
the index of the network name in the annotation list (so with the
example annotation above, dbnet gets eth0, othernet eth1, and
thick-plugin-example eth2). That's probably suboptimal; do we want to
allow a pod to pick its interface name for any given network?
* the split between pod-level details and network-level details.
Related to the network interface issue. I know I argued against this
in the past, but I now realize I may be wrong. There may be details
that are specific to the pod that shouldn't be put into the generic
network config, whether that's a thick or thin plugin. For example,
say the machine has a number of NICs that should be moved directly
into pods; this could use a generic CNI plugin for the move, but each
pod would require a different IP address, and we cannot encode that IP
in the network JSON. (One hypothetical shape for this is sketched
after this list.)
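To make the kubenet.d idea concrete, a file in /etc/cni/kubenet.d could
look like the following sketch (the plugin type is illustrative;
"portMappings" is the standard CNI capability that HostPorts would ride
on). Kubelet would parse it, then fill in the "name" field from the
Custom Resource network object before calling the plugin:

    {
        "cniVersion": "0.3.1",
        "type": "thick-plugin-example",
        "capabilities": {
            "portMappings": true
        }
    }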
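For the restart problem, the checkpoint could be a small per-pod file
written at network setup time and read back at teardown; a sketch with
one network's cached config shown (the path and field names are made
up, and the cached config JSON from the teardown note above would be
stored per network):

    /var/lib/kubelet/multi-network/<pod-sandbox-id>.json:

    {
        "networks": ["dbnet", "othernet", "thick-plugin-example"],
        "configs": {
            "dbnet": {
                "cniVersion": "0.3.1",
                "name": "dbnet",
                "type": "bridge"
            }
        }
    }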
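And for the pod-level vs. network-level split, one hypothetical shape
is to let the annotation carry per-pod parameters next to each network
name instead of a bare name list, for example (the JSON-in-annotation
form, the "nic-passthrough" network, and the "ip" key are all a
strawman):

    alpha.network.k8s.io/networks: |
      [
        { "name": "dbnet" },
        { "name": "nic-passthrough", "ip": "192.168.1.5" }
      ]

Here "nic-passthrough" would be a network backed by the generic
NIC-moving plugin from the example above, and the per-pod IP lives
with the pod rather than in the shared network JSON.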
And maybe more that I'm forgetting, which I'll follow up on if I think
of them.
Dan