Network Plugins definition (was "Kicking off the network SIG")


Tim Hockin

Aug 20, 2015, 1:30:15 PM
to kubernetes-...@googlegroups.com
It's hard to talk about network plugins without also getting into how
we can better align with docker and rocket-native plugins, but given
the immaturity of the whole space, let's try to ignore them and think
about the overall behavior we really want.

Forking from the kickoff thread.

On Sun, Aug 16, 2015 at 10:00 AM, Gurucharan Shetty <she...@nicira.com> wrote:

> From what I understand, this hairpin flag is only needed if one uses
> the kube-proxy to do L4 load-balancing. If there are people that would
> be doing L7 load balancing or a more feature rich L4 load balancing
> (based not only on L4 of the packet but also based on the traffic load
> on a particular destination pod) and they don't want to use iptables,
> I guess, mandating the hair-pin is not needed. So would it be correct
> to say that, if a network solution intends to use kube-proxy to do L4
> load balancing then their network plugin should enable hair-pin? This
> also raises a more general question: Is kube-proxy replaceable by
> something else?

kube-proxy is supposed to be one implementation of Services, but not
the only one. In fact I have seen at least two demos of Services without
kube-proxy and know of two or three others that I have not seen
directly.

Given that, it seems to make more sense to describe the behavior that
we expect from a network plugin (ideally in the form of a test) and
let people conform to that.

> 1. Right now, there is a requirement that all the pods should also be
> able to speak to every host that hosts pods. This was clearly needed
> for the previous kube-proxy implementation where the source ip and mac
> would change to that of the src host. With the new implementation of
> kube-proxy, do we still need the host to be able to speak to the pods
> via IP? Maybe there were other reasons for this requirement?
> Thoughts?

This question touches on both directions: pod-to-node and node-to-pod.
It's a fair question - we jump through a lot of hoops to ensure that
access to services from nodes works. Is that really necessary? It's
certainly useful for debugging. If we want things like docker being
able to pull from a registry that is running as a Service, or nodes to
access DNS in the cluster, we need this access; otherwise we start doing
per-node proxies and host-ports and localhost for everything. See
recent work by one of our interns on making an in-cluster docker
registry work for a taste.

Those are the "obvious" ones. I am interested to hear what other
things people might be doing where a process on the node needs to
access a Pod or Service, or vice-versa. Simplifying this connectivity
would be a win.

> 2. The current network plugin architecture prevents the network plugin
> from changing the IP address of the pod (from the Docker generated
> one). Well, it does not prevent you from changing the IP, but things
> like 'kubectl get endpoints' would only see the docker generated IP
> address. Kubernetes currently is a single-tenant system, so it
> probably is not very important to be able to change the IP address of
> the pod. But in the future, if there are plans for multi-tenancy (dev,
> qa, prod, staging etc in single environment), then overlapping IP
> addresses and logical networks (SDN, network virtualization) may be
> needed, in which case the ability to change the IP address will become
> important. Any thoughts?

As of recently plugins can return status, which includes a different IP.


On Mon, Aug 17, 2015 at 11:01 AM, Casey Davenport
<casey.d...@metaswitch.com> wrote:

> I don't think the hairpin flag will come into play for Calico. We don't
> build on the Docker bridge, instead creating new veth pair with one end in
> the pod's namespace and one end in the host's for each new pod. I won't
> know for sure if this is true until I test it, but I plan on doing so this
> week.

So you're assuming (rightly, so far) that network plugins are
monolithic and not composable. Is it valuable or interesting to have
network plugins be composable? For example, should it be possible to
write a plugin that handles things like installing special iptables
rules and use that plugin alongside a Calico plugin? For a more
concrete example, let's look at what we do in GCE in the default
Docker bridge mode. All of this is done in kubelet, but should be
plugins (IMO).

1) set a broad MASQUERADE rule in iptables (required to egress traffic
from containers because of GCE's edge NAT).
2) configure cbr0 (our docker bridge) with the per-node CIDR
3) tweak Docker config (I think?) to use cbr0
4) soon: install hairpin config on each cbr0 interface

At least two things stand out as pretty distinct - the MASQUERADE
rules and the cbr0 stuff.

The MASQUERADE stuff is needed regardless of whether you use a docker
bridge or Calico or Weave or Flannel, but it's actually pretty
site-specific. In GCE we should basically say "anything that is not
destined for 10.0.0.0/8 needs masquerade", but that's not right. It
should be "anything not destined for an RFC1918 address". But there are
probably cases where we would want the masquerade even within RFC1918
space (I'm thinking VPNs, maybe?). Outside of GCE, the rules are
obviously totally different. Should this be something that
kubernetes understands or handles at all? Or can we punt this down to
the node-management layer (in as much as we have one)?
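To make the two rules concrete, here is a small illustrative sketch (purely hypothetical policy code, not anything kubelet actually runs) of the difference between the 10.0.0.0/8 shortcut and a real RFC1918 test:

```python
import ipaddress

# The three RFC1918 private ranges.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def needs_masquerade(dst: str) -> bool:
    """Masquerade anything not destined for an RFC1918 address.

    This is policy, not truth: a site with VPNs may still want to
    masquerade some private-range traffic, which is why this likely
    belongs in a node-management layer rather than in kubernetes.
    """
    ip = ipaddress.ip_address(dst)
    return not any(ip in net for net in RFC1918)

def needs_masquerade_shortcut(dst: str) -> bool:
    """The GCE-style shortcut: only 10.0.0.0/8 escapes masquerade."""
    return ipaddress.ip_address(dst) not in ipaddress.ip_network("10.0.0.0/8")
```

The two disagree on, e.g., 192.168.1.5: the shortcut would masquerade it even though it is a private address.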

The bridge management seems like a pretty clear case of plugin. We
could/should move all of the cbr0/docker0 management into a plugin and
make that the default, probably. This touches on another sore point -
docker itself has flags that we have historically suggested people set
(or not set) around iptables and masquerade. Those flags do things
that conflict with what we want to do, sometimes, but we don't
actively check that they are not set.

Is there any use for composable plugins?

> From my perspective, this should be handled by each individual network
> plugin - each plugin might want to handle this differently (or not at all).
> Perhaps there are other cases to be made for chaining of network plugins,
> but the hairpin case alone doesn't convince me.

I think I am coming to the same conclusion.

>> 1. Right now, there is a requirement that all the pods should also be
>> able to speak to every host that hosts pods.
>
> We've been talking about locking down host-pod communication in the general
> case as part of our thoughts on security. There are still cases where
> host->pod communication is needed (e.g. NodePort), but at the moment our
> thinking is to treat infrastructure as "outside the cluster". As far as
> security is concerned, we think the default behavior should be "allow from
> pods within my namespace". Anything beyond that can be configured using
> security policy.

See above - what about cases where the node needs to legitimately
access cluster services (the canonical case being a docker registry)?


On Tue, Aug 18, 2015 at 5:45 AM, Michael Bridgen <mic...@weave.works> wrote:

> I have been working on adding an API library to CNI[1], which is used for
> rocket's networking, but was intended as a widely-applicable plugin
> mechanism. To date, CNI consists of a specification for invoking plugins[2],
> some plugins themselves, and a handful of example scripts that drive the
> plugins. With an API in a go library, it'd be much easier to use as common
> networking plugin infrastructure for kubernetes, rocket, runc and other
> things that come along.
>
> I like CNI because it does just what is needed, while giving plugins and
> applications a fair bit of freedom of implementation. It's pretty close, and
> at the same level of abstraction, to the networking hooks added to
> Kubernetes recently.

I'm fine with folding things together - that would be great, in fact.
I have not paid attention to CNI in the last 2 months, but I had some
concerns with it last I looked. I was one of the people arguing that
CNI and CNM were too close to not fold together. I still feel that
there is not really room for more than one or MAYBE two plugin models
in this space. I don't have any particular attachment to owning one
of those, personally, but I am VERY concerned that:

a) implementors like Weave/Calico/... have to implement and maintain
Docker plugins and CNI/k8s plugins with slightly different semantics

b) users experience confusion and/or complexity about how to configure
a solution

> So far, with Eugene @ CoreOS's help, I have pulled together enough of a
> library that Rocket's networking can be ported to it[3] -- not too
> surprising, since much of the code was adapted from there -- and I've
> written a tiny command-line tool that can be used with e.g., runc.
> Meanwhile, Paul @ Calico is getting good results trying an integration with
> Kubernetes.

I'd like to see this expanded on. If we can reduce the space from 3
plugins to 2, that's a win.

> I am aware that I'm late to the party, and that Kubernetes + CNI and various
> other combinations have been discussed before. But I think things have moved
> on a bit[4], so if people don't mind some recapitulation, it'd be useful to
> hear objections and (unmet) requirements and so on. Perhaps it is needless
> to say that I would like this to become a "full community effort", if we
> find that it is a broadly acceptable approach.

I'll have to look at CNI again.


On Tue, Aug 18, 2015 at 9:40 AM, Paul Tiplady
<paul.t...@metaswitch.com> wrote:

> I like this model because it would allow Calico to provide a single CNI
> plugin for Kubernetes, and have it run for any containerizer (docker, rkt,
> runc, ...). As k8s support for different runtimes grows, this will become an
> increasingly significant issue. (Right now we can just target docker and be
> done with it).

Does CNI work with Docker?

> Of the plugin hooks, CNI maps cleanly to ADD and DELETE. It doesn't have a
> concept of daemons, so the k8s INIT action isn't covered (we don't currently
> use INIT, though we think we will eventually). To handle the functionality
> currently provided by INIT, CNI could potentially be extended to add the
> concept of a daemon, or we could leave the INIT hook as a kick to an
> arbitrary binary that is independent from the CNI mechanism. The latter is
> probably the short-term pragmatic solution, since current plugins' INIT
> hooks will remain unchanged.

k8s plugins were intended to be exec-and-terminate, but docker plugins
are assumed to be daemons. In both cases we have open issues with
startup, containerization vs not, etc.

> As for the new STATUS plugin action, I'm not sure if that's needed if we use
> CNI; the plugin returns the IP from the ADD call so we can just update the
> pod status after it's created. Was another motivation of STATUS the idea
> that a pod's IP could change? If we don't need to support that use case then
> things integrate very cleanly.

I asked the same question. I think it was "following the existing
pattern of calling docker for status". I think simply returning it
once might be OK.
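For readers following along, the exec-and-return-once flow being discussed can be sketched end-to-end. This uses a stub shell script in place of a real CNI plugin; the environment variable names and the ip4 result shape follow the CNI spec as of this writing, so treat the details as illustrative rather than normative:

```python
import json
import os
import stat
import subprocess
import tempfile

# Stub standing in for a real CNI plugin (bridge, calico, ...): a real
# one would create interfaces and allocate an address before answering.
FAKE_PLUGIN = """#!/bin/sh
cat > /dev/null   # consume the network config JSON from stdin
echo '{"ip4": {"ip": "10.22.0.5/16"}}'
"""

def cni_add(plugin_path, net_conf, container_id, netns_path):
    """Exec a CNI plugin for ADD and parse the JSON result it prints."""
    env = dict(os.environ,
               CNI_COMMAND="ADD",
               CNI_CONTAINERID=container_id,
               CNI_NETNS=netns_path,
               CNI_IFNAME="eth0",
               CNI_PATH=os.path.dirname(plugin_path))
    proc = subprocess.run([plugin_path], input=json.dumps(net_conf),
                          env=env, capture_output=True, text=True,
                          check=True)
    return json.loads(proc.stdout)

# Write out the stub and mark it executable.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(FAKE_PLUGIN)
    plugin = f.name
os.chmod(plugin, os.stat(plugin).st_mode | stat.S_IXUSR)

result = cni_add(plugin, {"name": "testnet", "type": "fake"},
                 "abc123", "/proc/1234/ns/net")
```

The orchestrator gets `result["ip4"]["ip"]` exactly once, at ADD time, and must remember it, which is why a separate STATUS call may be unnecessary.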


On Tue, Aug 18, 2015 at 10:09 AM, Paul Tiplady
<paul.t...@metaswitch.com> wrote:

> For the rate-limiting case, I can't see how you can implement this outside
> the plugin in a generic way; after the plugin has done its thing, how do you
> determine which interface is connected to the pod? For example Calico
> deletes the veth pair that docker creates, so we'd have to duplicate any tc
> config that was set anyway. IMO better to have all that logic in one place,
> where a plugin implementor can see what needs to be implemented.

If I recall, the tc logic is applied at the host interface per-CIDR,
not per-veth.
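If that recollection is right, a per-CIDR shaper at the host interface would look roughly like the commands below (generated as strings here; the htb layout is an assumption for illustration, not kubelet's actual shaper):

```python
def shaping_commands(dev, cidr_rates):
    """Build tc commands that rate-limit traffic toward each pod CIDR
    at a single host interface (e.g. cbr0), rather than per-veth."""
    cmds = [f"tc qdisc add dev {dev} root handle 1: htb default 30"]
    for i, (cidr, rate) in enumerate(cidr_rates, start=1):
        # One htb class per CIDR, plus a u32 filter matching the
        # destination CIDR so traffic lands in that class.
        cmds.append(f"tc class add dev {dev} parent 1: "
                    f"classid 1:{i} htb rate {rate}")
        cmds.append(f"tc filter add dev {dev} parent 1: protocol ip "
                    f"prio 1 u32 match ip dst {cidr} flowid 1:{i}")
    return cmds

cmds = shaping_commands("cbr0", [("10.2.3.0/24", "10mbit")])
```

Because the match is on the destination CIDR at the host interface, a plugin that replaces the veth wiring would not need to duplicate per-veth tc state.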

> Also, currently kubelet has code to setup cbr0. While that's a great
> pragmatic simplification, I don't think it quite fits with the concept of
> pluggable networking modules -- could that be handled in the plugin INIT
> step? That would make the docker-bridge plugin an equal peer to other
> plugins, which would help flush out issues with the API; if we can't
> implement docker-bridge entirely as a k8s plugin, then maybe the API isn't
> complete enough.

Yeah, I should have read the thread before responding - this is my
conclusion, too.

Tim Hockin

Aug 21, 2015, 2:40:14 AM
to kubernetes-sig-network
I re-read the CNI spec and looked at some of the code. I have a lot
of questions, which I will write up tomorrow hopefully, but it seems
viable to me.

eugene.y...@coreos.com

Aug 21, 2015, 3:10:38 PM
to kubernetes-sig-network

On Thursday, August 20, 2015 at 10:30:15 AM UTC-7, Tim Hockin wrote:
It's hard to talk about network plugins without also getting into how
we can better align with docker and rocket-native plugins, but given
the immaturity of the whole space, let's try to ignore them and think
about the overall behavior we really want.

A philosophical decision one has to make when talking about these plugins is whether the role of the plugin is to:
1) Perform some abstract task like joining a container to a network. This is both the CNI and CNM model.
or
2) Just a callout to do any old manipulation of networking resources (e.g. bridges, iptables, veths, traffic classes, etc). I think this is what Tim proposed. This model is very flexible but is harder for the user to comprehend and configure. The user has to know what works with what and in which order.

I feel like we actually have a little experience with (1) but not with (2). It may be resource-intensive, but is it worth doing a small POC for option (2)?

Tim Hockin

Aug 21, 2015, 5:25:03 PM
to Eugene Yakubovich, kubernetes-sig-network
Can you answer how, in CNI, something like Docker would work? They
want the "bridge" plugin but they want to add some per-container
iptables rules on top of it.

Should they fork the bridge plugin into their own and implement their
custom behavior? Should they make a 2nd plugin that follows "bridge"
and adds their iptables (not allowed in CNI)? Should they make a
wrapper plugin that calls bridge and then does their own work?

I think these are all viable. There's a simplicity win for admins
(especially user/admins) if the plugin is assumed monolithic, I guess.
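As a sketch of the third option (a wrapper plugin), with both the base plugin and the rule strings faked for illustration, the composition is just "delegate, then decorate":

```python
def fake_bridge_add(container_id):
    """Stand-in for the base "bridge" plugin's ADD; a real one would
    wire the container up and report the IP in its JSON result."""
    return "10.1.0.7/16"

def wrap_with_rules(base_add, rules_for):
    """Build a wrapper plugin: run the base plugin, then layer the
    caller's per-container iptables rules on top. Rules are collected
    as strings here instead of being passed to iptables."""
    applied = []
    def add(container_id):
        ip = base_add(container_id)                   # 1. delegate to base plugin
        applied.extend(rules_for(container_id, ip))   # 2. decorate with extra rules
        return ip
    return add, applied

def docker_style_rules(container_id, ip):
    addr = ip.split("/")[0]
    return [f"-A FORWARD -d {addr} -m comment "
            f"--comment cni-{container_id} -j ACCEPT"]

add, applied = wrap_with_rules(fake_bridge_add, docker_style_rules)
ip = add("abc123")
```

The ordering guarantee (base plugin first, extra rules second) is the whole contract a formalized wrapper pattern would need to pin down.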
> --
> You received this message because you are subscribed to the Google Groups
> "kubernetes-sig-network" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kubernetes-sig-ne...@googlegroups.com.
> To post to this group, send email to
> kubernetes-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/kubernetes-sig-network.
> For more options, visit https://groups.google.com/d/optout.

eugene.y...@coreos.com

Aug 21, 2015, 5:41:40 PM
to kubernetes-sig-network, eugene.y...@coreos.com

On Friday, August 21, 2015 at 2:25:03 PM UTC-7, Tim Hockin wrote:
Can you answer how, in CNI, something like Docker would work?  They
want the "bridge" plugin but they want to add some per-container
iptables rules on top of it.

Should they fork the bridge plugin into their own and implement their
custom behavior?  Should they make a 2nd plugin that follows "bridge"
and adds their iptables (not allowed in CNI)?  Should they make a
wrapper plugin that calls bridge and then does their own work?

They can either fork the bridge plugin or do a wrapper one. Ideally they
would abstract out the iptables rules into something they can contribute upstream 
to CNI's bridge plugin. 

Tim Hockin

Aug 21, 2015, 6:05:12 PM
to Eugene Yakubovich, kubernetes-sig-network
Do you really want the "base" plugins to accumulate those sorts of
features? I like the idea of wrapping other plugins - formalizing
that pattern would be interesting. Keep a handful of very stable,
reasonably configurable (but not crazy) base plugins that people can
decorate.

eugene.y...@coreos.com

Aug 21, 2015, 6:17:46 PM
to kubernetes-sig-network, eugene.y...@coreos.com
I guess I'd only want these base plugins to get the features if they're of general use.
For example, if we're talking about Docker links, then no. But if it's to
restrict cross talk between networks (which CNI does not currently do), then
sure. 

On Friday, August 21, 2015 at 3:05:12 PM UTC-7, Tim Hockin wrote:
Do you really want the "base" plugins to accumulate those sorts of
features?  I like the idea of wrapping other plugins - formalizing
that pattern would be interesting.  Keep a handful of very stable,
reasonably configurable (but not crazy) base plugins that people can
decorate.

On Fri, Aug 21, 2015 at 2:41 PM,  <eugene.y...@coreos.com> wrote:
>
> On Friday, August 21, 2015 at 2:25:03 PM UTC-7, Tim Hockin wrote:
>>
>> Can you answer how, in CNI, something like Docker would work?  They
>> want the "bridge" plugin but they want to add some per-container
>> iptables rules on top of it.
>>
>> Should they fork the bridge plugin into their own and implement their
>> custom behavior?  Should they make a 2nd plugin that follows "bridge"
>> and adds their iptables (not allowed in CNI)?  Should they make a
>> wrapper plugin that calls bridge and then does their own work?
>
>
> They can either fork the bridge plugin or do a wrapper one. Ideally they
> would abstract out the iptables rules into something they can contribute
> upstream
> to CNI's bridge plugin.
>

Paul Tiplady

Aug 24, 2015, 8:45:15 PM
to kubernetes-sig-network
On Thursday, August 20, 2015 at 10:30:15 AM UTC-7, Tim Hockin wrote:
The MASQUERADE stuff is needed regardless of whether you use a docker
bridge or Calico or Weave or Flannel, but it's actually pretty
site-specific.  In GCE we should basically say "anything that is not
destined for 10.0.0.0/8 needs masquerade", but that's not right.  It
should be "anything not destined for an RFC1918 address".  But there are
probably cases where we would want the masquerade even within RFC1918
space (I'm thinking VPNs, maybe?).  Outside of GCE, the rules are
obviously totally different.  Should this be something that
kubernetes understands or handles at all?  Or can we punt this down to
the node-management layer (in as much as we have one)?
 
As you say, this is site-specific not plugin-specific; I think it's reasonable that, if needed, NAT should be set up when configuring the node (since the cloud-specific provisioner is better placed than the plugin to understand this requirement). Depending on the datacenter, there could be NAT on the node, NAT at the gateway, or no NAT at all if your pod IPs are publicly routable. Would be a win to keep that complexity out of k8s itself.
 
>
>  We've been talking about locking down host-pod communication in the general
> case as part of our thoughts on security.  There are still cases where
> host->pod communication is needed (e.g. NodePort), but at the moment our
> thinking is to treat infrastructure as "outside the cluster". As far as
> security is concerned, we think the default behavior should be "allow from
> pods within my namespace".  Anything beyond that can be configured using
> security policy.

See above - what about cases where the node needs to legitimately
access cluster services (the canonical case being a docker registry)?

I think we'd like to have an intermediate level of access to a service for "allow from all pods and k8s infrastructure (but not from outside the datacenter)", but this gets tricky because "from k8s infrastructure" could include traffic originally from a load balancer which has been forwarded from a NodePort service on one node to a pod on a second node (i.e. indistinguishable from internal node->pod traffic). I think we can just document the security impact of the combination [nodePort service + cluster-accessible pods => globally-accessible service]. Hopefully this goes away when LBs are smart enough that we don't need nodePort (or when we can use headless services + DNS instead).

> I like this model because it would allow Calico to provide a single CNI
> plugin for Kubernetes, and have it run for any containerizer (docker, rkt,
> runc, ...). As k8s support for different runtimes grows, this will become an
> increasingly significant issue. (Right now we can just target docker and be
> done with it).

Does CNI work with Docker?

Not natively (i.e. Docker calling into CNI). Though if it finds success in k8s, then this network plugin model could nudge the eventual standardized API that the OCI settles on.

CNI can quite straightforwardly configure networking for Docker containers in Kubernetes, though. The approach I took for my prototype is to create the pod infra docker container with `--net=none`, and then have k8s call directly into CNI to set up networking. The main bit of complexity was rewiring the IP-learning for the new pod (since CNI returns the IP and expects the orchestrator to remember it, and there is no analogue to the 'docker inspect' command). Now that I've got that working correctly, it has also removed the requirement for the STATUS plugin action.
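A sketch of that wiring (names illustrative, not the prototype's actual code): with the infra container started under `--net=none`, the runtime derives the namespace path from the container's pid and hands it to CNI:

```python
def netns_path(pid):
    """Network namespace path for a container whose init has this pid
    (the pid being what `docker inspect` reports for the container)."""
    return f"/proc/{pid}/ns/net"

def cni_env(command, container_id, pid, ifname="eth0",
            plugin_dir="/opt/cni/bin"):
    """Environment a runtime would set before exec'ing a CNI plugin.

    Since the infra container was created with --net=none, the plugin
    does all the wiring inside CNI_NETNS; the runtime must record the
    IP from the plugin's ADD result itself, as there is no docker-side
    network state to inspect afterwards.
    """
    return {
        "CNI_COMMAND": command,          # "ADD" or "DEL"
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path(pid),
        "CNI_IFNAME": ifname,
        "CNI_PATH": plugin_dir,
    }

env = cni_env("ADD", "abc123", 4321)
```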

Paul Tiplady

Aug 31, 2015, 8:29:48 PM
to kubernetes-sig-network
As mentioned before, I've done some prototyping of replacing the current plugin interface with CNI. I've written up a design doc for my proposed changes.

I'd be interested to get folks' feedback on this approach; I'm going to spend the next day or two polishing my prototype so that I can let the code do the talking.

eugene.y...@coreos.com

Aug 31, 2015, 9:12:15 PM
to kubernetes-sig-network
Paul,

This is a great start. One shortcoming of CNI right now is that there's no good library. There's some code in https://github.com/appc/cni/tree/master/pkg and some still in rkt (https://github.com/coreos/rkt/tree/master/networking). It really needs to be pulled together to make using CNI easier for both the container runtime and the plugin writers (the plugin side is currently in better shape). To this end, Michael Bridgen from Weave was working on putting a library together (https://github.com/squaremo/cni/tree/libcni), but I don't know where he is at with it.

-Eugene

Tim Hockin

Sep 1, 2015, 1:16:40 AM
to Paul Tiplady, kubernetes-sig-network
Notes as I read.

The biggest problem I have with this (and it's not necessarily a
show-stopper) is that a container created with plain old 'docker run'
will not be part of the kubernetes network because we will have
orchestrated the network at a higher level. In an ideal world, we'd
teach docker itself about the plugins and then simply delegate to it
as we do today.

That said, the more I dig into Docker's networking plugins the less I
like them. Philosophically and practically a daemon-free model built
around exec is so much cleaner. It seems at least theoretically
possible to bridge libnetwork to run CNI plugins, but probably not
without mutating the CNI spec to the more prescriptive libnetwork
model.


You say you'll push the IP to the apiserver - I guess you mean in
pod.status.podIP ?


Regarding CNI network configs, I assume that over time this might even
be something we expose through kubernetes - a la Docker networks.
The advantage here is that network management is a clean abstraction
distinct from interface management.


To your questions:

1) Can we eliminate Init?

I think yes.

2) Can we eliminate Status?

I think yes.

3) Can we cut over immediately to CNI, or do we need to keep the old
plugin interface for a time? If so, how long?

I think this becomes a community decision. There are a half-dozen to
a dozen places I know of using this feature. IFF they were OK making
a jump to something like CNI, we could do a hard cutover.

4) Can we live without the vendoring naming rules? Can we establish
a convention that plugins vendor-name the binary?
mycompany.com~myplug or something? Maybe it's not a huge deal.


I'll add #5 - does this mean we have no concept of in-process plugin?
Or do we retain the facade of an in-process API like we have now.


Overall this looks plausible, but I'd like to hear from all the folks
who have plugins implemented today, especially if you have both CNI
and libnetwork experience. The drawback I listed above (plain old
'docker run') is real, but maybe something we can live with. Maybe
it's actually a feature?

As a discussion point - how much would we have to adulterate CNI to
make a bridge? It sure would be nice to use the same plugins in both
Docker and rkt - I sure as hell don't want to tweak and debug this
stuff twice.

We could have a little wrapper binary that knew about a static network
config, and anyone who asked for a new network from our plugin would
get an error, then we just feed the static config(s) to the wrapped
CNI binary. We'd have to split Add into create/join but that doesn't
seem SO bad. What else?

Paul Tiplady

Sep 1, 2015, 12:39:30 PM
to kubernetes-sig-network
Hi Eugene,

I've been talking with Michael, and the prototype integration that I built uses his libcni branch. Thus far his code has met my needs, so I think his approach is sound.

Cheers,
Paul

Paul Tiplady

Sep 1, 2015, 1:46:35 PM
to kubernetes-sig-network, paul.t...@metaswitch.com


On Monday, August 31, 2015 at 10:16:40 PM UTC-7, Tim Hockin wrote:
Notes as I read.

The biggest problem I have with this (and it's not necessarily a
show-stopper) is that a container created with plain old 'docker run'
will not be part of the kubernetes network because we will have
orchestrated the network at a higher level.  In an ideal world, we'd
teach docker itself about the plugins and then simply delegate to it
as we do today.

Agree that this is a change in workflow -- though `docker run` was already broken with the existing network plugin API for plugins which don't use the docker bridge (e.g. Calico).

I think we can get round this by using `kubectl run|exec`; now that exec has -i and --tty options, I think the main use cases are covered.


That said, the more I dig into Docker's networking plugins the less I
like them.  Philosophically and practically a daemon-free model built
around exec is so much cleaner.  It seems at least theoretically
possible to bridge libnetwork to run CNI plugins, but probably not
without mutating the CNI spec to the more prescriptive libnetwork
model.


You say you'll push the IP to the apiserver - I guess you mean in
pod.status.podIP ?

Yep 


Regarding CNI network configs, I assume that over time this might even
be something we expose through kubernetes - a la Docker networks.
The advantage here is that network management is a clean abstraction
distinct from interface management.

Good point -- this could be a nice feature: if you have one group of pods with a very different set of network requirements (e.g. latency-sensitive, or L2 vs. pure-L3), you can bundle them onto a different network. Routing between networks could be fun, though.



To your questions:

1) Can we eliminate Init?

I think yes.

2) Can we eliminate Status?

I think yes.

3) Can we cut over immediately to CNI, or do we need to keep the old
plugin interface for a time? If so, how long?

I think this becomes a community decision.  There are a half-dozen to
a dozen places I know of using this feature.  IFF they were OK making
a jump to something like CNI, we could do a hard cutover.

4) Can we live without the vendoring naming rules?  Can we establish
a convention that plugins vendor-name the binary?
mycompany.com~myplug or something?  Maybe it's not a huge deal.


I'll add #5 - does this mean we have no concept of in-process plugin?
Or do we retain the facade of an in-process API like we have now.

Added a bullet for this in the doc. 

CNI doesn't currently have the concept of an in-process plugin. It looks like, with the current API, this is only for vendors that are extending the kubernetes codebase, or am I missing something?

With Michael Bridgen's work to turn CNI into a library, in-process CNI-style plugins become a viable option. A couple possible approaches:

* Extend libcni to add the concept of an in-process plugin as a native concept. (libcni could formalize an interface to run these in-process plugins as standalone plugins as well, which would mean developers can target both in- and out-of-process plugins if they care).
* Create a hook in our plugin code where in-process code can run (consuming the CNI API objects that we created to pass to the CNI exec-plugin).

I think the latter could be done with the current proposal on a per-vendor basis, but there might be benefit in formalizing that interface.


Overall this looks plausible, but I'd like to hear from all the folks
who have plugins implemented today, especially if you have both CNI
and libnetwork experience.  The drawback I listed above (plain old
'docker run') is real, but maybe something we can live with.  Maybe
it's actually a feature?

As a discussion point - how much would we have to adulterate CNI to
make a bridge?  It sure would be nice to use the same plugins in both
Docker and rkt - I sure as hell don't want to tweak and debug this
stuff twice.

We could have a little wrapper binary that knew about a static network
config, and anyone who asked for a new network from our plugin would
get an error, then we just feed the static config(s) to the wrapped
CNI binary.  We'd have to split Add into create/join but that doesn't
seem SO bad.   What else?

I was pondering this approach -- the big stumbling block for me is that a CNM createEndpoint can occur on a different host than the joinEndpoint call, so naively we'd need a cluster-wide distributed datastore to keep track of the Create calls.

Short of breaking the spec and disallowing Create and Join from being called on different hosts, I don't see a way around that issue.

eugene.y...@coreos.com

Sep 1, 2015, 3:07:16 PM
to kubernetes-sig-network, paul.t...@metaswitch.com

On Tuesday, September 1, 2015 at 10:46:35 AM UTC-7, Paul Tiplady wrote:

I'll add #5 - does this mean we have no concept of in-process plugin?
Or do we retain the facade of an in-process API like we have now.

Added a bullet for this in the doc. 

CNI doesn't currently have the concept of an in-process plugin. It looks like, with the current API, this is only for vendors that are extending the kubernetes codebase, or am I missing something?

 
CNI doesn't have in-process plugins because that requires shared object (.so) support, and I believe that Go has problems with that (although it may be fixed in 1.5). Technically CNI is not Go-specific, but realistically much of the software in this space is written in Go. Having "in-tree" plugins doesn't require .so support, but to be honest those never pass my definition of "plugins". FWIW, I would have been quite happy to just have .so plugins, as there's no fork/exec overhead.

 
I was pondering this approach -- the big stumbling block for me is that a CNM createEndpoint can occur on a different host than the joinEndpoint call, so naively we'd need a cluster-wide distributed datastore to keep track of the Create calls.

Short of breaking the spec and disallowing Create and Join from being called on different hosts, I don't see a way around that issue.


Wow, I was not aware of that. How does it work now? CreateEndpoint creates the interface (e.g. veth pair) on the host. Join then specifies the interface names that should be moved into the sandbox. I don't really understand how Join can be called on a different host -- wouldn't there be no interface to move on that host then? 

Gurucharan Shetty

unread,
Sep 1, 2015, 3:43:42 PM9/1/15
to eugene.y...@coreos.com, kubernetes-sig-network, paul.t...@metaswitch.com
Let us not make an assumption that all plugins will be Golang based.
OpenStack Neutron currently has python libraries for clients and my
plugin that integrates containers with openstack is python based.

Fwiw, Docker's libnetwork does not mandate golang plugins. It uses
REST APIs to talk to plugins.


> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
> the interface (e.g. veth pair) on the host.
veth pairs are not mandated to be created on CreateEndpoint(). You
only need to return IP addresses, MAC addresses, gateway, etc. What
this does in theory is provide flexibility for container mobility
across hosts. So you can effectively create an endpoint from a central
location and ask a container to join that endpoint from any host.
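As a rough sketch of what that implies for a driver: CreateEndpoint can reserve and return addressing info without touching any host interface. Everything below (the helper names, the toy IPAM, the response fields) is illustrative, not the exact libnetwork wire format:

```python
import hashlib

def allocate_ip(network_id: str, endpoint_id: str) -> str:
    # Toy IPAM: hash the IDs into the last octet of a /24. A real
    # driver would consult a central IPAM store instead.
    key = (network_id + "/" + endpoint_id).encode()
    octet = int(hashlib.sha256(key).hexdigest(), 16) % 254 + 1
    return "10.1.2.%d/24" % octet

def allocate_mac(endpoint_id: str) -> str:
    # Locally administered, unicast MAC derived from the endpoint ID.
    h = hashlib.sha256(endpoint_id.encode()).hexdigest()
    return "02:" + ":".join(h[i:i + 2] for i in range(0, 10, 2))

def create_endpoint(network_id: str, endpoint_id: str) -> dict:
    # Reserve addressing only; no veth is created here. Interface
    # creation is deferred to Join, which may run on a different host.
    return {
        "Interfaces": [{
            "ID": 0,
            "Address": allocate_ip(network_id, endpoint_id),
            "MacAddress": allocate_mac(endpoint_id),
        }]
    }
```

Because the result is a pure function of the IDs, any host can reproduce it at Join time, which is what makes the endpoint portable.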


>Join then specifies the
> interface names that should be moved into the sandbox. I don't really
> understand how Join can be called on a different host -- wouldn't there be
> no interface to move on that host then?
>
> --
> You received this message because you are subscribed to the Google Groups
> "kubernetes-sig-network" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kubernetes-sig-ne...@googlegroups.com.

eugene.y...@coreos.com

unread,
Sep 1, 2015, 5:11:21 PM9/1/15
to kubernetes-sig-network, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Tuesday, September 1, 2015 at 12:43:42 PM UTC-7, Gurucharan Shetty wrote:
Let us not make an assumption that all plugins will be Golang based.
OpenStack Neutron currently has python libraries for clients and my
plugin that integrates containers with openstack is python based.

Sure, I would never assume that all plugins will be Go based. Rather I want to not
exclude Go based ones.
 
Fwiw, Docker's libnetwork does not mandate golang plugins. It uses
REST APIs to talk to plugins. 
 
Right, but REST is for out-of-process plugins.


> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
> the interface (e.g. veth pair) on the host.
veth pairs are not mandated to be created on CreateEndpoint (). You
are only to return IP addresses, MAC addresses, Gateway etc. What this
does in theory is that it provides flexibility with container mobility
across hosts. So you can effectively create an endpoint from a central
location and ask a container to join that endpoint from any host.

I think I understand what you and Paul meant by different hosts. Yes, their
model of decoupling the container from the interfaces is slick and something
that CNI can't do. However for all its slickness, I am not a fan of moving containers
or IPs around. If a container dies, start a new one. And give it a new IP -- don't equate
the IP to a service (yes, you need a service discovery). Anyway, that's all in line
with Kubernetes thinking and so does not need to be supported in a Kubernetes cluster.

I think if a user wants to have this mobility, they will not be running Kubernetes.
 


>Join then specifies the
> interface names that should be moved into the sandbox. I don't really
> understand how Join can be called on a different host -- wouldn't there be
> no interface to move on that host then?

Tim Hockin

unread,
Sep 1, 2015, 5:54:48 PM9/1/15
to Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
On Tue, Sep 1, 2015 at 12:07 PM, <eugene.y...@coreos.com> wrote:
>
> On Tuesday, September 1, 2015 at 10:46:35 AM UTC-7, Paul Tiplady wrote:
>>>
>>>
>>> I'll add #5 - does this mean we have no concept of in-process plugin?
>>> Or do we retain the facade of an in-process API like we have now.
>>>
>> Added a bullet for this in the doc.
>>
>> CNI doesn't currently have the concept of an in-process plugin. Looks like
>> with the current API this only for vendors that are extending the kubernetes
>> codebase, or am I missing something?
>>
>
> CNI doesn't have in-process plugins because that requires shared object
> (.so) support and I believe that Go has problems with that (although it
> maybe fixed in 1.5). Technically CNI is not Go specific but realistically so
> much software in this space is written in Go. Having "in-tree" plugins don't
> require .so support but to be honest those never pass my definition of
> "plugins". FWIW, I would have been quite happy to just have .so plugins as
> there's no fork/exec overhead.

I didn't mean to imply .so, though that's a way to do it too. I meant
to ask whether kubernetes/docker/rkt could have network plugins
defined in code, one of which was an exec-proxy, or whether exec was
it. I don't feel strongly that in-process is needed at this point.

>> I was pondering this approach -- the big stumbling block for me is that a
>> CNM createEndpoint can occur on a different host than the joinEndpoint call,
>> so naively we'd need a cluster-wide distributed datastore to keep track of
>> the Create calls.
>>
>> Short of breaking the spec and disallowing Create and Join from being
>> called on different hosts, I don't see a way around that issue.
>
> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
> the interface (e.g. veth pair) on the host. Join then specifies the
> interface names that should be moved into the sandbox. I don't really
> understand how Join can be called on a different host -- wouldn't there be
> no interface to move on that host then?

Yeah, where do you get that information? Not calling you wrong, just
something that was not at all clear, if it is true.

Tim Hockin

unread,
Sep 1, 2015, 5:56:46 PM9/1/15
to Gurucharan Shetty, Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
On Tue, Sep 1, 2015 at 12:43 PM, Gurucharan Shetty <she...@nicira.com> wrote:
> Let us not make an assumption that all plugins will be Golang based.
> OpenStack Neutron currently has python libraries for clients and my
> plugin that integrates containers with openstack is python based.
>
> Fwiw, Docker's libnetwork does not mandate golang plugins. It uses
> REST APIs to talk to plugins.
>
>
>> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
>> the interface (e.g. veth pair) on the host.
> veth pairs are not mandated to be created on CreateEndpoint (). You
> are only to return IP addresses, MAC addresses, Gateway etc. What this
> does in theory is that it provides flexibility with container mobility
> across hosts. So you can effectively create an endpoint from a central
> location and ask a container to join that endpoint from any host.

I feel dumb, but I don't get it. Since you seem to understand it, can
you spell it out in more detail?

Paul Tiplady

unread,
Sep 1, 2015, 6:41:35 PM9/1/15
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com
On Tuesday, September 1, 2015 at 2:56:46 PM UTC-7, Tim Hockin wrote:
I feel dumb, but I don't get it.  Since you seem to understand it, can
you spell it out in more detail?

This isn't spelled out in the libnetwork docs, though it should be since it's highly unintuitive. We had to do a lot of code/IRC spelunking to get a clear picture of this.

The best I can do from the docs is this:

"
  • One of a FAQ on endpoint join() API is that, why do we need an API to create an Endpoint and another to join the endpoint.
    • The answer is based on the fact that Endpoint represents a Service which may or may not be backed by a Container. When an Endpoint is created, it will have its resources reserved so that any container can get attached to the endpoint later and get a consistent networking behaviour.
"
Note "resources reserved", not "interfaces created".

Here's an issue that Gurucharan raised on the libnetwork repo which clarifies somewhat: https://github.com/docker/libnetwork/issues/133#issuecomment-99927188

Although see this issue, which suggests that in CNM the CreateEndpoint ("service publish") event gets broadcasted (via the docker daemon's distributed KV store) to the network plugins on every host, so it looks like it's not even possible to optimistically create a veth and then hope that the Endpoint.Join gets run on the same host.

Note that things like interface name and MAC address are assigned at CreateEndpoint time, as if you were creating the veth at that stage.

eugene.y...@coreos.com

unread,
Sep 1, 2015, 7:19:10 PM9/1/15
to kubernetes-sig-network, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Tuesday, September 1, 2015 at 2:54:48 PM UTC-7, Tim Hockin wrote:
On Tue, Sep 1, 2015 at 12:07 PM,  <eugene.y...@coreos.com> wrote:
>
> On Tuesday, September 1, 2015 at 10:46:35 AM UTC-7, Paul Tiplady wrote:
>>>
>>>
>>> I'll add #5 - does this mean we have no concept of in-process plugin?
>>> Or do we retain the facade of an in-process API like we have now.
>>>
>> Added a bullet for this in the doc.
>>
>> CNI doesn't currently have the concept of an in-process plugin. Looks like
>> with the current API this only for vendors that are extending the kubernetes
>> codebase, or am I missing something?
>>
>
> CNI doesn't have in-process plugins because that requires shared object
> (.so) support and I believe that Go has problems with that (although it
> maybe fixed in 1.5). Technically CNI is not Go specific but realistically so
> much software in this space is written in Go. Having "in-tree" plugins don't
> require .so support but to be honest those never pass my definition of
> "plugins". FWIW, I would have been quite happy to just have .so plugins as
> there's no fork/exec overhead.

I didn't mean to imply .so, though that's a way to do it too.  I meant
to ask whether kubernetes/docker/rkt could have network plugins
defined in code, one of which was an exec-proxy, or whether exec was
it.  I don't feel strongly that in-process is needed at this point.

Kubernetes/docker/rkt could certainly have "built-in types" aside from the exec based ones.
But if that built-in type is useful in general, it should be a separate executable so 
it could be re-used in other projects. And since we should strive to make these
networking plugins not tied to a container runtime, they should all be executables
by my logic.

Gurucharan Shetty

unread,
Sep 1, 2015, 7:30:05 PM9/1/15
to Tim Hockin, Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
Let me try to explain what I mean to the best of my ability with an
analogy of VMs and network virtualization. (But before that, let me
clarify that since k8 is a single-tenant orchestrator and has been
designed with VIPs and load balancers as basic building blocks, the
feature is not really useful for k8.)

With network virtualization, you can have 2 VMs belonging to 2
different tenants run on the same hypervisor with the same IP address.
The packet sent by a VM of one tenant will never reach a VM of another
tenant, even though they are connected to the same vSwitch (e.g.
openvswitch). You can apply policies (e.g. QoS, firewall) to these VM
interfaces. And then you can move one of the VMs to a different
hypervisor (vMotion). All the policies (e.g. QoS, firewall) will now
follow the VM to the new hypervisor automatically. The IP address and
MAC address follow to the new VM too. The network controller simply
reprograms the various vswitches so that packets are forwarded to the
new location.

Since you have already associated your policies (firewall, QoS etc)
with the endpoint, you can destroy the VM that the endpoint is
connected to and then create a new VM at a different hypervisor and
attach the old endpoint (with its old policies) to the new VM.

My reading of what libnetwork achieves with containers is the same as
above. i.e., you can create a network endpoint with policies applied
and then attach it to any container on any host.


PS: OpenStack Neutron has the same model. The network endpoints are
created first. An IP and MAC is provisioned to that network endpoint.
And then a VM is created asking it to attach to that network endpoint.

eugene.y...@coreos.com

unread,
Sep 1, 2015, 7:45:26 PM9/1/15
to kubernetes-sig-network, tho...@google.com, eugene.y...@coreos.com, paul.t...@metaswitch.com
Let me try to explain what I mean to the best of my ability with an
analogy of VMs and network virtualization. (But before that, let me
clarify that since k8 is a single-tenant orchestrator and has been
designed with VIPs and load balancers as basic building blocks, the
feature is not really useful for k8.)

With network virtualization, you can have 2 VMs belonging to 2
different tenants run on the same hypervisor with the same IP address.
The packet sent by a VM of one tenant will never reach a VM of another
tenant, even though they are connected to the same vSwitch (e.g.
openvswitch). You can apply policies (e.g. QoS, firewall) to these VM
interfaces. And then you can move one of the VMs to a different
hypervisor (vMotion). All the policies (e.g. QoS, firewall) will now
follow the VM to the new hypervisor automatically. The IP address and
MAC address follow to the new VM too. The network controller simply
reprograms the various vswitches so that packets are forwarded to the
new location.

Since you have already associated your policies (firewall, QoS etc)
with the endpoint, you can destroy the VM that the endpoint is
connected to and then create a new VM at a different hypervisor and
attach the old endpoint (with its old policies) to the new VM.

My reading of what libnetwork achieves with containers is the same as
above. i.e., you can create a network endpoint with policies applied
and then attach it to any container on any host.

That makes sense except for this conflation of endpoint and service.
If CreateEndpoint is really just reserving an IP for the service that
can later be backed by some container, there is really no reason to
allocate a MAC at that point (which CreateEndpoint requires, as it is
expected to call AddInterface, whose second arg is a MAC).

While I certainly appreciate having a driver type that allows this type of migration,
I don't like every driver having to support this model. For example, this won't
really work with ipvlan where the MAC address can't be generated (it's the host's MAC)
and moved around.

Considering the above, I don't want to modify CNI towards it. This still leaves
me hanging on how to change CNI enough to make libnetwork interop possible.

Tim Hockin

unread,
Sep 1, 2015, 8:18:57 PM9/1/15
to Paul Tiplady, kubernetes-sig-network, Gurucharan Shetty, Eugene Yakubovich
On Tue, Sep 1, 2015 at 3:41 PM, Paul Tiplady
<paul.t...@metaswitch.com> wrote:

> Here's an issue that Gurucharan raised on the libnetwork repo which
> clarifies somewhat:
> https://github.com/docker/libnetwork/issues/133#issuecomment-99927188

This was not answered:

"""Is your thought process that the driver can create vethnames based
on endpointuuid to make it truly portable. i.e., one can call
driver.CreateEndpoint() on one host and return back vethnames based on
eid, but not actually create the veths. Call driver.Join() on a
different host. So even though veth names are created during 'docker
service create' but veths are physically created only during 'docker
service join'. (But, vethnames can only be 15 characters long on
Linux, so there is a very very small possibility of collisions.)"""

> Although see this issue, which suggests that in CNM the CreateEndpoint
> ("service publish") event gets broadcasted (via the docker daemon's
> distributed KV store) to the network plugins on every host, so it looks like
> it's not even possible to optimistically create a veth and then hope that
> the Endpoint.Join gets run on the same host.

That sounds ludicrous to me. Can we get some confirmation from
libnetwork folks?

> Note that things like interface name and MAC address are assigned at
> CreateEndpoint time, as if you were creating the veth at that stage.

So I make up a locally-random name and expect it to be globally unique?
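The scheme in the quote above (deriving veth names deterministically from the endpoint UUID) might look like the sketch below. The prefix and hash choice are assumptions; the truncation to Linux's 15-character interface-name limit is exactly where the small collision risk comes from:

```python
import hashlib

IFNAMSIZ = 15  # usable interface-name length on Linux

def veth_name(endpoint_id: str, prefix: str = "veth") -> str:
    # Any host can recompute this at Join time from the endpoint ID
    # alone, with no shared state -- but truncating to IFNAMSIZ leaves
    # only ~11 hex characters of the digest, hence the collision risk.
    digest = hashlib.sha1(endpoint_id.encode()).hexdigest()
    return (prefix + digest)[:IFNAMSIZ]
```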

Tim Hockin

unread,
Sep 1, 2015, 8:46:25 PM9/1/15
to Gurucharan Shetty, Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
On Tue, Sep 1, 2015 at 4:30 PM, Gurucharan Shetty <she...@nicira.com> wrote:

> Since you have already associated your policies (firewall, QoS etc)
> with the endpoint, you can destroy the VM that the endpoint is
> connected to and then create a new VM at a different hypervisor and
> attach the old endpoint (with its old policies) to the new VM.
>
> My reading of what libnetwork achieves with containers is the same as
> above. i.e., you can create a network endpoint with policies applied
> and then attach it to any container on any host.

Thanks for the explanation. I understand it better. It's incredibly
complicated, isn't it? I think a main distinction between this
example with VMs and the container ethos is identity. A VM's IP
really is part of its identity, but a container is a point in time. I
know people will argue with this in both directions, but that seems to
be "generally" the way people think about things.

A VM's IP is expected to remain constant across moves and restarts. A
container's IP is less constant (not at all today).

More importantly, there are things that container networking can do
that preclude this level of migratability - ipvlan or plain old docker
bridges being good examples. What are "simple" plugins supposed to
assume?

Straw man, thanks to Prashanth here for discussion:

Write a "cni-exec" libnetwork driver.

You can not create new networks using it. When a CreateNetwork() call
is received, we check for a static config file on disk.
E.g. CreateNetwork(name="foobar") looks for
/etc/cni/networks/foobar.json, and if it does not exist or does not
match, fail.

PROBLEM: it looks like the CreateNetwork() call can not see the name
of the network. Let's assume that could be fixed.

CreateEndpoint() does just enough work to satisfy the API, and save
all of its state in memory.

PROBLEM: If docker goes down, how does this state get restored?

endpoint.Join() takes the saved info from CreateEndpoint(), massages
it into CNI-compatible data, and calls the CNI plugin.


Someone shoot this down? It's not general purpose in the sense that
docker's network CLI can't be used, but would it be good enough to
enable people to use the same CNI plugins across docker and rkt?

Gurucharan Shetty

unread,
Sep 1, 2015, 8:52:52 PM9/1/15
to Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
>> Note that things like interface name and MAC address are assigned at
>> CreateEndpoint time, as if you were creating the veth at that stage.
>
> So I make up l locally-random name and expect it to be globally unique?

Actually that is not true. From:
https://github.com/docker/libnetwork/blob/master/docs/remote.md

The veth names are returned during the network join call. And network
join is not broadcasted to all hosts. If I remember correctly, only
create network, delete network, create endpoint, and delete endpoint
are broadcasted to all nodes.

Tim Hockin

unread,
Sep 1, 2015, 8:59:44 PM9/1/15
to Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
They are broadcasted or they are posted to the KV store? In other
words, are plugins expected to watch() the KV store for new endpoints
and networks, or to lazily fetch them?

How does a "remote" plugin do this?

Gurucharan Shetty

unread,
Sep 1, 2015, 9:09:39 PM9/1/15
to Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Tue, Sep 1, 2015 at 5:59 PM, Tim Hockin <tho...@google.com> wrote:
> They are broadcasted or they are posted to the KV store? In other
> words, are plugins expected to watch() the KV store for new endpoints
> and networks, or to lazily fetch them?
Docker daemon on every host reads the kv store and sends that
information to the remote plugin drivers on that host via the plugin
API. The remote drivers are not supposed to look at Docker's kv store
but should rely only on the information received via the API.

>
> How does a "remote" plugin do this?
The current design is harsh on remote plugins (the libnetwork
developers have promised to look into it to see if they can come up
with a viable solution; see
https://github.com/docker/libnetwork/issues/313). My remote driver
(which integrates with OpenStack Neutron, but runs the containers
inside VMs) makes calls to the OpenStack Neutron database to
store/fetch the information. With the current design, a single user
request via the docker CLI gets converted into 'X' requests to the
Neutron database (where 'X' = number of hosts in the cluster), and
that is unworkable for a large number of hosts and containers. That is
one reason I like the k8 plugins.

Tim Hockin

unread,
Sep 1, 2015, 9:19:06 PM9/1/15
to Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
Guru,

Thanks for the explanations. I appreciate you being "docker by proxy" here :)

I do feel like I mostly understand the libnetwork model now, though
there are some very clear limitations of it. I also feel like I could
work around most of the limitations, but the solutions are the same as
yours - make our own side-band calls to our own API and fetch
information that Docker does not provide. We can't implement their
libkv in terms of our API server because it is not a general purpose
store, though maybe we could intercept accesses and parse the path?
Puke.

The more I learn, the less I like it. It feels incredibly convoluted
for simple drivers to do anything useful.

Gurucharan Shetty

unread,
Sep 1, 2015, 9:57:07 PM9/1/15
to Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
I think if docker daemons are started not as part of a distributed kv
store, but rather with a local-only kv store (e.g. each minion will
have a consul/etcd client running that is local only), then libnetwork
can be abused for k8 purposes for IP address reporting via 'docker
inspect'. If such a thing is done, commands like 'docker network ls',
'docker service ls', etc. will report false data, but k8 need not show
that to the user.

Prashanth B

unread,
Sep 1, 2015, 11:06:04 PM9/1/15
to Gurucharan Shetty, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
> Docker daemon on every host reads the kv store and sends that
information to the remote plugin drivers on that host via the plugin
API.

Your statement is confusing, can you clarify? It seems like the remote driver is just like a native driver that does nothing but parse its arguments into json and post them to the plugin using an http client. There is code to watch the kv store in the controller itself, but it no-ops if a store isn't provided (that's how the bridge driver works). IIUC the plugin just needs to run an HTTP server bound to a unix socket in /run/docker/plugins. 

If this is right it makes our first cut driver simpler, we can directly use the apiserver from our (hypothetical) kubelet-driver for things that need persistence, without running another database.


Gurucharan Shetty

unread,
Sep 1, 2015, 11:59:03 PM9/1/15
to Prashanth B, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Tue, Sep 1, 2015 at 8:06 PM, Prashanth B <be...@google.com> wrote:
>> Docker daemon in every host reads the kv store and send that
> information to the remote plugin drivers on that host via the plugin
> API.
>
> Your statement is confusing, can you clarify? It seems like the remote
> driver is just like a native driver that does nothing but parse its
> arguments into json and post them to the plugin using an http client. There
> is code to watch the kv store in the controller itself, but it no-ops if a
> store isn't provided (that's how the bridge driver works). IIUC the plugin
> just needs to run an HTTP server bound to a unix socket in
> /run/docker/plugins.
>
> If this is right it makes our first cut driver simpler, we can directly use
> the apiserver from our (hypothetical) kubelet-driver for things that need
> persistence, without running another database.
I may not have understood your question/assertion correctly, so I am
going to elaborate in detail.

When I say a "remote driver", I mean a server which listens for REST
API calls from docker daemon.

I have a remote driver written in Python here:
https://github.com/shettyg/ovn-docker/blob/master/ovn-docker-driver

That driver writes the line "tcp://0.0.0.0:5000" in the file
"/etc/docker/plugins/openvswitch.spec"

Now, when docker daemon starts, it will send the equivalent of:
curl -i -H 'Content-Type: application/json' -X POST
http://localhost:5000/Plugin.Activate

And my driver is suppose to respond with:

{
"Implements": ["NetworkDriver"]
}

1. User creates a network:
docker network create -d openvswitch foo

This command from the user results in my server receiving the
equivalent of the following POST request:

curl -i -H 'Content-Type: application/json' -X POST -d
'{"NetworkID":"UUID","Options":{"blah":"bar"}}'
http://localhost:5000/NetworkDriver.CreateNetwork

Now things get interesting. The above POST request gets repeated on
every host that belongs to the docker cluster.
So your driver should figure out what is a duplicate request.


2. User creates a service
docker service publish my-service.foo

The above command will call the equivalent of:

curl -i -H 'Content-Type: application/json' -X POST -d
'{"NetworkID":"UUID","EndpointID":"dummy-endpoint","Interfaces":[],"Options":{}}'
http://localhost:5000/NetworkDriver.CreateEndpoint

Again the same command gets called in every host.

I hope that answers your question?
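The walkthrough above can be condensed into a minimal, hypothetical remote driver using only the standard library. The endpoint paths and the Activate payload follow libnetwork's remote.md as described; the port, the dedup policy, and everything else are illustrative:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Networks already seen on this host. CreateNetwork is replayed on
# every host in the cluster, so the driver must tolerate duplicates.
seen_networks: set[str] = set()

class DriverHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/Plugin.Activate":
            resp = {"Implements": ["NetworkDriver"]}
        elif self.path == "/NetworkDriver.CreateNetwork":
            if body["NetworkID"] not in seen_networks:
                seen_networks.add(body["NetworkID"])
                # ...provision the network exactly once per host...
            resp = {}
        else:
            resp = {"Err": "unhandled " + self.path}
        payload = json.dumps(resp).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the sketch quiet
        pass

def serve(port: int = 5000) -> HTTPServer:
    # Docker discovers the driver via a spec file, e.g.
    # /etc/docker/plugins/openvswitch.spec containing "tcp://0.0.0.0:5000";
    # the caller then runs serve(...).serve_forever().
    return HTTPServer(("0.0.0.0", port), DriverHandler)
```

A real driver would also implement the remaining NetworkDriver endpoints (CreateEndpoint, Join, Leave, etc.).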

Tim Hockin

unread,
Sep 2, 2015, 12:35:53 AM9/2/15
to Gurucharan Shetty, Prashanth B, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
Prashanth's point, I think, was that we could have Kubelet act as a
network plugin, lookup/store network and endpoint config in our api
server, and exec CNI plugins for join.

This falls down a bit for a few reasons. First, libkv assumes an
arbitrary KV store, which our APIserver is not. Second, the fact that
Network objects can be created through Docker or kubernetes is not
cool. Third, if we only allow network objects through kubernetes we
can't see the name of the object Docker thinks it is creating.

Prashanth B

unread,
Sep 2, 2015, 1:22:23 AM9/2/15
to Tim Hockin, Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
I hope that answers your question?

Thanks for the example. So what I'm proposing is a networking model with the following limitations for the short term:
1. Only one (docker) network object, this is the kubernetes network. All endpoints must join it. 
2. Containers can only join endpoints on the same host.
3. A join execs CNI plugin with json composed from the join and endpoint create, derived from storage (physical memory, sqlite, apiserver -- as long as it's not another database it remains an implementation detail).

First, libkv assumes an arbitrary KV store, which our APIserver is not.

Doesn't look like libkv is a requirement for remote plugins. If we start docker with a plugin but without a kv store, the json will get posted to the localhost http server, but not propagated to the other hosts (untested, this from staring at code). This is ok, because there is only 1 network and no cross-host endpoint joining. If we really need cross-host consistency, we have an escape hatch via apiserver watch.

> Third, if we only allow network objects through kubernetes we can't see the name of the object Docker thinks it is creating.

We don't even have to allow this. The cluster is bootstrapped with a network object. It's readonly thereafter. Create network will noop after that. 

This would give users the ability to dump their own docker plugins into /etc/docker/plugins, start the kubelet with --manage-networking=false, and use docker's remote plugin API. At the same time, CNI should work with manage-networking=true.




Tim Hockin

unread,
Sep 2, 2015, 2:23:15 AM9/2/15
to Prashanth B, Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
We will eventually want to add something akin to multiple Networks, so
I want to be dead sure that it is viable before we choose and
implement a model.

Gurucharan Shetty

unread,
Sep 2, 2015, 10:36:08 AM9/2/15
to Prashanth B, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Tue, Sep 1, 2015 at 10:22 PM, Prashanth B <be...@google.com> wrote:
>> I hope that answers your question?
>
> Thanks for the example. So what I'm proposing is a networking model with the
> following limitations for the short term:
> 1. Only one (docker) network object, this is the kubernetes network. All
> endpoints must join it.

IMO, the "one" network object theoretically fits into the current k8
model wherein all pods can communicate with each other over L3. But
let me bring up a couple of points that provide a counter-view.

My understanding of the implementation of Docker's built-in overlay
solution is that a "network" is a broadcast domain. So if you impose
the same meaning on k8 networking, you actually end up with a
humongous broadcast domain across multiple hosts, and it won't scale.

So one could argue that the current k8 model is that each minion is
one network and all networks are connected to each other via a router.

> 2. Containers can only join endpoints on the same host.
> 3. A join execs CNI plugin with json composed from the join and endpoint
> create, derived from storage (physical memory, sqlite, apiserver -- as long
> as it's not another database it remains an implementation detail).
>
>> First, libkv assumes an arbitrary KV store, which our APIserver is not.
>
> Doesn't look like libkv is a requirement for remote plugins.
> If we start
> docker with a plugin but without a kv store, the json will get posted to the
> localhost http server, but not propogated to the other hosts (untested, this
> from staring at code). This is ok, because there is only 1 network and no
> cross host endpoint joining. If we really need cross host consistency, we
> have an escape hatch via apiserver watch.

You have to start the Docker daemon with libkv for libnetwork to work
(at least based on my observation). It does not matter whether it is a
remote driver or the native overlay solution. I will be happy if it
turns out that my assertion is wrong.

Prashanth B

unread,
Sep 2, 2015, 11:41:22 AM9/2/15
to Gurucharan Shetty, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Wed, Sep 2, 2015 at 7:36 AM, Gurucharan Shetty <she...@nicira.com> wrote:
On Tue, Sep 1, 2015 at 10:22 PM, Prashanth B <be...@google.com> wrote:
>> I hope that answers your question?
>
> Thanks for the example. So what I'm proposing is a networking model with the
> following limitations for the short term:
> 1. Only one (docker) network object, this is the kubernetes network. All
> endpoints must join it.

IMO, the "one" network object theoretically fits into the current k8
model wherein all pods can communicate with each other over L3. But
let me bring up a couple of points that provide a counter-view.

My understanding of the implementation of Docker's built-in overlay
solution is that a "network" is a broadcast domain. So if you impose
the same meaning on k8 networking, you actually end up with a
humongous broadcast domain across multiple hosts, and it won't scale.

So one could argue that the current k8 model is that each minion is
one network and all networks are connected to each other via a router.

 
I think this is do-able with the same limitations mentioned in Tim's straw man.

> 2. Containers can only join endpoints on the same host.
> 3. A join execs CNI plugin with json composed from the join and endpoint
> create, derived from storage (physical memory, sqlite, apiserver -- as long
> as it's not another database it remains an implementation detail).
>
>> First, libkv assumes an arbitrary KV store, which our APIserver is not.
>
> Doesn't look like libkv is a requirement for remote plugins.
> If we start
> docker with a plugin but without a kv store, the json will get posted to the
> localhost http server, but not propogated to the other hosts (untested, this
> from staring at code). This is ok, because there is only 1 network and no
> cross host endpoint joining. If we really need cross host consistency, we
> have an escape hatch via apiserver watch.

You have to start Docker daemon with libkv for libnetwork to work
(at least based on my observation).

You're probably right, since you've written a driver :)
I plan to dig a little and file a docker issue to get their opinions. This is probably a deal breaker. We have a CP store; it just isn't a KV store, because it has a well-defined schema enforced by the apiserver.

dc...@redhat.com

Sep 2, 2015, 4:00:37 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com
On Tuesday, September 1, 2015 at 7:46:25 PM UTC-5, Tim Hockin wrote:
On Tue, Sep 1, 2015 at 4:30 PM, Gurucharan Shetty <she...@nicira.com> wrote:

> Since you have already associated your policies (firewall, QoS etc)
> with the endpoint, you can destroy the VM that the endpoint is
> connected to and then create a new VM at a different hypervisor and
> attach the old endpoint (with its old policies) to the new VM.
>
> My reading of what libnetwork achieves with containers is the same as
> above. i.e., you can create a network endpoint with policies applied
> and then attach it to any container on any host.

Write a "cni-exec" libnetwork driver.

I started doing that a month ago.  It has some fundamental problems, some of which you've outlined and others which I'll talk about below.
 
https://github.com/dcbw/cni-docker-plugin

You can not create new networks using it.  When a CreateNetwork() call
is received we check for a static config file on disk.
E.g. CreateNetwork(name = "foobar") looks for
/etc/cni/networks/foobar.json, and if it does not exist or does not
match, fail.

With libnetwork you create the network definitions with the libnetwork API.  If you have a KV store backing libnetwork then it's persistent, but if not then all the network definitions go away when docker exits.  So my thought was that k8s would create the networks itself, but as others have mentioned it's a problem that networks can be created with 'docker network add' too.
 
The first fundamental mismatch between libnetwork/CNM and CNI is that CNM is much more granular than CNI, and it wants more information at each step that CNI isn't willing to give back until the end.

The second fundamental mismatch is that libnetwork/CNM does more than CNI does; it handles moving the interfaces into the right NetNS, setting up routes, and setting the IP address on the interfaces.  The plugin's job is simply to create the interface and allocate the addresses, and pass all that back to libnetwork.  CNI plugins currently expect to handle all this themselves.

These two issues ensure that there cannot be a direct mapping between CNM and CNI right now due to how they handle interface configuration.

Third, remote plugins are called in a blocking manner so they cannot call back into docker's API to retrieve any extra information they might need (eg, network name).

PROBLEM: it looks like the CreateNetwork() call can not see the name
of the network.  Let's assume that could be fixed.

In my implementation I just cached the network ID and started a network watch to grab the name, and all the actual CNM work was done in Join().
 
CreateEndpoint() does just enough work to satisfy the API, and save
all of its state in memory.

PROBLEM: If docker goes down, how does this state get restored?

You're expected to build libnetwork with a KV store if you want persistence.
 
endpoint.Join() takes the saved info from CreateEndpoint(), massages
it into CNI-compatible data, and calls the CNI plugin.

Yeah, and at this point in my attempt we have the network name so we can pass that on.

BUT the deal-breaker is that the CNI plugin will expect to move the interface into the right NetNS itself, configure the interface's IP address itself, and more.  CNM doesn't allow that.  CNM also doesn't expose the NetNS FD to the plugins in any way (though in-process plugins might be able to find it), so the CNI plugin has no idea what network namespace to move the interface into.  That's where I stopped with cni-docker-plugin because it just wasn't possible without some changes to CNI or CNM itself.

Someone shoot this down?  It's not general purpose in the sense that
docker's network CLI can't be used, but would it be good enough to
enable people to use the same CNI plugins across docker and rkt?

Unfortunately I can't see a way to make existing CNI plugins work with libnetwork/CNM right now due to the fundamental difference in their granularity and handling of IP addressing and network namespace management...

dc...@redhat.com

Sep 2, 2015, 4:20:45 PM
to kubernetes-sig-network, tho...@google.com, she...@nicira.com, paul.t...@metaswitch.com, eugene.y...@coreos.com
On Wednesday, September 2, 2015 at 12:22:23 AM UTC-5, Prashanth B wrote:
I hope that answers your question?

Thanks for the example. So what I'm proposing is a networking model with the following limitations for the short term:
1. Only one (docker) network object, this is the kubernetes network. All endpoints must join it. 
2. Containers can only join endpoints on the same host.
3. A join execs CNI plugin with json composed from the join and endpoint create, derived from storage (physical memory, sqlite, apiserver -- as long as it's not another database it remains an implementation detail)
First, libkv assumes an arbitrary KV store, which our APIserver is not.

Doesn't look like libkv is a requirement for remote plugins. If we start docker with a plugin but without a kv store, the json will get posted to the localhost http server, but not propagated to the other hosts (untested, this from staring at code). This is ok, because there is only 1 network and no cross host endpoint joining. If we really need cross host consistency, we have an escape hatch via apiserver watch.

It's not a requirement.  It's only a requirement for libnetwork if you want persistence of the libnetwork-defined networks.  Your plugin can do whatever it wants, but fundamentally you'll be operating with an externally defined network name and possibly configuration too (eg, 'docker network add').
 
> Third, if we only allow network objects through kubernetes we can't see the name of the object Docker thinks it is creating.

We don't even have to allow this. The cluster is bootstrapped with a network object. It's readonly thereafter. Create network will noop after that. 

We're already using multi-network functionality for our isolation features in OpenShift, and we'd like to make sure that keeps working.

So if I understand your model correctly, there would be one defined "fake" network in docker/libnetwork that was the "kubernetes" network that any docker-managed container that wanted to interoperate with k8s would need to join.  All the actual network intelligence would be in k8s and k8s would pass that information to the *actual* CNI plugin outside of the docker/libnetwork paths.  So essentially:

1) something creates a new "network" object in k8s
2) k8s wants to start a pod in that network so it tells docker to start the pause container in the 'kubernetes' (docker) network
3) docker then calls the 'kubernetes' CNM plugin, which happens to be Kube
4) Kube looks up whatever info it needs (the actual network name, permissions, whatever) and then executes a CNI plugin that handles all that

Is that more or less correct?
 
This would give users the ability to dump their own docker plugins into /etc/docker/plugins, start the kubelet with --manage-networking=false, and use docker's remote plugin API. At the same time CNI should work with manage-networking=true.

What would happen in the --manage-networking=false case?

dc...@redhat.com

Sep 2, 2015, 4:21:33 PM
to kubernetes-sig-network, be...@google.com, she...@nicira.com, paul.t...@metaswitch.com, eugene.y...@coreos.com
On Wednesday, September 2, 2015 at 1:23:15 AM UTC-5, Tim Hockin wrote:
We will eventually want to add something akin to multiple Networks, so
I want to be dead sure that it is viable before we choose and
implement a model.

We're already doing this with OpenShift and we'd like to ensure it keeps working with whatever we come up with here.  We'll help make that happen.
 

Rajat Chopra

Sep 2, 2015, 5:23:15 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com


BUT the deal-breaker is that the CNI plugin will expect to move the interface into the right NetNS itself, configure the interface's IP address itself, and more.  CNM doesn't allow that.  CNM also doesn't expose the NetNS FD to the plugins in any way (though in-process plugins might be able to find it), so the CNI plugin has no idea what network namespace to move the interface into.  That's where I stopped with cni-docker-plugin because it just wasn't possible without some changes to CNI or CNM itself.

To this point and the one below, the only alternative I see is for the CNI plugin to know whether it is being called for a CNM interface or not. If yes, then the plugin should not move the interface into the net namespace.
For the independent case, no problem. And this becomes the responsibility of the glue driver (plant an ENV var called CNM=true?).

 

Someone shoot this down?  It's not general purpose in the sense that
docker's network CLI can't be used, but would it be good enough to
enable people to use the same CNI plugins across docker and rkt?

Unfortunately I can't see a way to make existing CNI plugins work with libnetwork/CNM right now due to the fundamental difference in their granularity and handling of IP addressing and network namespace management...


For handling the granularity, how about splitting CNI's IPAM and ADD from the glue driver? IPAM is already separately definable in CNI; it's just not called separately from ADD. And we make the IPAM understand whether it is called directly by the glue code or through the ADD command (and switch behaviour accordingly).
 


dc...@redhat.com

Sep 2, 2015, 5:31:32 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

Yeah, if we can make some small changes to CNI to accommodate "doing less" that would work for us, at least.  But I'm not sure how well it would work for the others in this discussion, e.g. Calico & Weave & etc?  Would be good to get their input.

dc...@redhat.com

Sep 2, 2015, 5:40:05 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

Also, just to be clear, when creating the cni-docker-plugin I posted above, I was attempting to work with existing CNI plugins, and assumed that I could not change CNM or CNI at all.  If we can change them (and I think we probably can) then we may be able to get docker to run CNI plugins via the CNM API.

Tim Hockin

Sep 2, 2015, 5:45:34 PM
to dc...@redhat.com, Madhu Venugopal, kubernetes-sig-network, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
On Wed, Sep 2, 2015 at 1:00 PM, <dc...@redhat.com> wrote:
> On Tuesday, September 1, 2015 at 7:46:25 PM UTC-5, Tim Hockin wrote:
>
>> Write a "cni-exec" libnetwork driver.
>
>
> I started doing that a month ago. It has some fundamental problems, some of
> which you've outlined and others while I'll talk about below.
>
> https://github.com/dcbw/cni-docker-plugin
>
>> You can not create new networks using it. When a CreateNetwork() call
>> is received we check for a static config file on disk.
>> E.g.CreateNetwork(name = "foobar") looks for
>> /etc/cni/networks/foobar.json, and if it does not exist or does not
>> match, fail.
>
>
> With libnetwork you create the network definitions with the libnetwork API.
> If you have a KV store backing libnetwork then it's persistent, but if not
> then all the network definitions go away when docker exits. So my thought
> was that k8s would create the networks itself, but as others have mentioned
> it's a problem that networks can be created with 'docker network add' too.

I think we have two possible approaches - one is to build a truly
generic CNM-CNI bridge, the other is to build a kubernetes-centric CNI
bridge. Solving the former sounds great, but I'll settle for the
latter if I have to. I want to explore the generic approach first.

To build a truly generic CNM-CNI bridge, let's assume a working libkv
plugin. In the bridge driver, catch the "create network" call and
write a CNI-like config file. Assuming libkv works, every node will
see that operation and create the same file in their local
filesystems. Right?

> The first fundamental mismatch between libnetwork/CNM and CNI is that CNM is
> much more granular than CNI, and it wants more information at each step that
> CNI isn't willing to give back until the end.

What about actually doing the CNI "add" operation on CNM's "create
endpoint"? Is there a guarantee that "create endpoint" runs on only
one node? If not, this seems hard to surmount.

> The second fundamental mismatch is that libnetwork/CNM does more than CNI
> does; it handles moving the interfaces into the right NetNS, setting up
> routes, and setting the IP address on the interfaces. The plugin's job is
> simply to create the interface and allocate the addresses, and pass all that
> back to libnetwork. CNI plugins currently expect to handle all this
> themselves.

Hack: move into a tmp namespace in CNI plugins, and then move it out
in the bridge.

> Third, remote plugins are called in a blocking manner so they cannot call
> back into docker's API to retrieve any extra information they might need
> (eg, network name).

Do we understand WHY we can't have the network name?

>> PROBLEM: it looks like the CreateNetwork() call can not see the name
>> of the network. Let's assume that could be fixed.
>
> In my implementation I just cached the network ID and started a network
> watch to grab the name, and all the actual CNM work was done in Join().

Network watch of the libkv backend? Or something else?

>> CreateEndpoint() does just enough work to satisfy the API, and save
>> all of its state in memory.
>>
>> PROBLEM: If docker goes down, how does this state get restored?
>
>
> You're expected to build libnetwork with a KV store if you want persistence.

Isn't there a berkeley DB implementation of libkv that can be used for
local-only persistence?

> BUT the deal-breaker is that the CNI plugin will expect to move the
> interface into the right NetNS itself, configure the interface's IP address
> itself, and more. CNM doens't allow that. CNM also doesn't expose the
> NetNS FD to the plugins in anyway (though in-process plugins might be able
> to find it), so the CNI plugin has no idea what network namespace to move
> the interface into. That's where I stopped with cni-docker-plugin because
> it just wasn't possible without some changes to CNI or CNM itself.
>
>> Someone shoot this down? It's not general purpose in the sense that
>> docker's network CLI can't be used, but would it be good enough to
>> enable people to use the same CNI plugins across docker and rkt?
>
>
> Unfortunately I can't see a way to make existing CNI plugins work with
> libnetwork/CNM right now due to the fundamental difference in their
> granularity and handling of IP addressing and network namespace
> management...

I feel like IF we could make it work it might be worth some changes in CNI.


Now I want to think through the kubernetes-centric (less generic)
approach. The key here being that anything that affects kubernetes
needs to flow down from kubernetes, not up from docker. Just as you
can't 'docker run' and have a pod appear in kubernetes, you can't
'docker network create' and have a network appear in kubernetes.

Starting assumptions: There is one "global" network for kubernetes
today (let's call it "kubernetes"), but there will eventually be more
granular control (more networks). We will not use libkv.

Assume a docker remote plugin which is a CNM-to-k8s+CNI bridge. This
is the default docker network driver. Any calls to "create network"
for anything other than "kubernetes" will fail. When kubelet starts
up it will 'docker network create' to make sure the "kubernetes"
network exists.

When a pod is created kubelet will tell it to join the "kubernetes"
network. libnetwork will call "create endpoint". The bridge will
exec the CNI plugin and then move the resulting interface back out of
the namespace (as CNM demands). libnetwork will call "join", which
should just be a no op (I think?) for the bridge.

Now, what happens if a user wants to use the docker-included overlay
network? When kubelet starts up it will 'docker network create' to
make sure the "kubernetes" network exists. All of the existing
libnetwork stuff should work. Unfortunately this requires a libkv
implementation, which we can't satisfy easily with kubernetes, so
users have to run YET ANOTHER cluster system. Pukey, but it works for
people who want it. Maybe we could add a libkv implementation in
terms of a kubernetes config object?

This seems to add up to:
1) use libnetwork and libnetwork drivers when running docker
2) offer a slightly hacky bridge from libnetwork to CNI drivers (is it
worth the cost?)

None of this addresses the other issues in libnetwork - not wrappable,
IPAM is too baked-in, not available today, no access to network Name
field, complex model, etc. I keep hearing from people who tried to
implement libnetwork drivers that it's sort of a bad experience, and
docker doesn't seem keen to make it better (hearsay).

Are those issues really painful enough to warrant NOT using Docker's
network plugins?

The big alternative is to say "forget it", and just run all our pods
with --net=none to docker, and use CNI ourselves to set up networking.
This means (as discussed) 'docker run' can never join the kubernetes
network and that we don't take advantage of vendors who implement
docker plugins (could we bridge it the other way? A CNI binary that
drives docker remote plugins :)


I feel like a prototype is warranted, and then maybe a get-together?


I'm adding Madhu on this email.

eugene.y...@coreos.com

Sep 2, 2015, 6:11:47 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Wednesday, September 2, 2015 at 2:23:15 PM UTC-7, Rajat Chopra wrote:



For handling the granularity, how about splitting CNI's IPAM and ADD from the glue driver? IPAM is already separately definable in CNI; it's just not called separately from ADD. And we make the IPAM understand whether it is called directly by the glue code or through the ADD command (and switch behaviour accordingly).

The reason the IPAM plugin is invoked by the top-level plugin is to make something like DHCP work. You first need to create your interface (e.g. macvlan) and then use it to get the IP. Finally you need to apply the values from DHCP to the interface and things like the routing table. Therefore the IPAM plugin invocation needs to be "sandwiched" in the top level.

Actually with the DHCP example, you could call IPAM plugin after the top-level one. So suppose we have a "macvlan" plugin and IPAM plugins as "peers". We could call "macvlan" first to have it create the interface. We then call "dhcp" which uses the newly created interface to get the IP/gw/routes and applies it (or maybe returns it for libcni to apply).

But what about a solution that uses host routing? There we first need to call out to IPAM to get the IP allocated and then add a route to the host. It was for this reason that IPAM is invoked by the top-level plugin at the appropriate time.
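For context, this is the delegation being described, in CNI config terms: the top-level plugin is named by "type", and it invokes the plugin named under "ipam" at whatever point makes sense for it. The values here are purely illustrative:

```json
{
    "name": "kubernetes",
    "type": "bridge",
    "ipam": {
        "type": "host-local",
        "subnet": "10.244.1.0/24"
    }
}
```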

-Eugene

eugene.y...@coreos.com

Sep 2, 2015, 6:18:08 PM
to kubernetes-sig-network, dc...@redhat.com, ma...@docker.com, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Wednesday, September 2, 2015 at 2:45:34 PM UTC-7, Tim Hockin wrote:
The big alternative is to say "forget it", and just run all our pods
with --net=none to docker, and use CNI ourselves to set up networking.
This means (as discussed) 'docker run' can never join the kubernetes
network and that we don't take advantage of vendors who implement
docker plugins (could we bridge it the other way? A CNI binary that
drives docker remote plugins :)

CNI binary that drives docker remote plugins should be easier as CNI
is more coarse. Not sure how much value is in that.
 

I feel like a prototype is warranted, and then maybe a get-together?

I am going to hack on the prototype that uses a tmp namespace and calls
out to CNI plugin in CreateEndpoint. It then moves things out into root ns
and Join becomes a no-op. I realize that CreateEndpoint is not supposed to
be doing things like veth creation but even libnetwork's bridge plugin works this way. 

dc...@redhat.com

Sep 2, 2015, 6:22:20 PM
to kubernetes-sig-network, dc...@redhat.com, ma...@docker.com, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

Be careful with "bridge" driver :)  I thought you were talking about the docker/libnetwork "bridge" driver for a second.
 
But anyway, I don't think we need to write anything out at all?  What would that file be used for?

> The first fundamental mismatch between libnetwork/CNM and CNI is that CNM is
> much more granular than CNI, and it wants more information at each step that
> CNI isn't willing to give back until the end.

What about actually doing the CNI "add" operation on CNM's "create
endpoint"?  Is there a guarantee that "create endpoint" runs on only
one node?  If not, this seems hard to surmount.

Looking into the code, it appears that if your KV store is distributed, then *yes*, all the objects (networks, endpoints, etc.) will be replicated across all nodes.  The Controller calls store.go::initDataStore(), which reads existing networks and endpoints and appears to call into the libnetwork drivers for those operations too.  But maybe that's not the intended model?  I've only used local data stores, so I have no clue what's supposed to happen when you're using the same distributed data store with docker/libnetwork.

Having all nodes replicate all networks/endpoints/etc. seems pretty weird, since not every node would care about every network and you don't want to allocate resources until the node knows that it's going to actually run a container connected to that endpoint.
 
> The second fundamental mismatch is that libnetwork/CNM does more than CNI
> does; it handles moving the interfaces into the right NetNS, setting up
> routes, and setting the IP address on the interfaces.  The plugin's job is
> simply to create the interface and allocate the addresses, and pass all that
> back to libnetwork.  CNI plugins currently expect to handle all this
> themselves.

Hack: move into a tmp namespace in CNI plugins, and then move it out
in the bridge.

I don't think that'll work, because there's no way for libnetwork to know about the temp namespace (it only operates in the global one), and so when we pass the interface name back that interface is now invisible to libnetwork.
 
> Third, remote plugins are called in a blocking manner so they cannot call
> back into docker's API to retrieve any extra information they might need
> (eg, network name).

Do we understand WHY we can't have the network name?

No; the answer may well be "because none of the existing libnetwork plugins need it", and that upstream libnetwork would be happy to take a patch adding it.  If we proceed with this, we should do that patch.
 
>> PROBLEM: it looks like the CreateNetwork() call can not see the name
>> of the network.  Let's assume that could be fixed.
>
> In my implementation I just cached the network ID and started a network
> watch to grab the name, and all the actual CNM work was done in Join().

Network watch of the libkv backend?  Or something else?

It watches the docker API event stream actually, so it works even if you don't use a KV store with libnetwork.  docker simply godeps libnetwork and wraps the libnetwork API in the docker API, so when we talk about libnetwork I think we all mean "libnetwork as wrapped by docker"?

Dan
 

Tim Hockin

Sep 2, 2015, 6:44:24 PM
to Madhu Venugopal, dc...@redhat.com, kubernetes-sig-network, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady, Jana Radhakrishnan
Madhu,

Thanks for the clue on scope. It looks like all remote drivers are
assumed to be global.

https://github.com/docker/libnetwork/blob/master/drivers/remote/driver.go#L32



On Wed, Sep 2, 2015 at 3:28 PM, Madhu Venugopal <ma...@docker.com> wrote:
> Copying Jana as well.
> Will read & reply to this thread later today.
>
> Just a quick clarification on some misunderstanding on libkv usage.
> libkv supports local persistence (using boltdb) and libnetwork makes use of
> it for local persistence (https://github.com/docker/libnetwork/pull/466).
>
> And the questions about global vs local libnetwork events are purely a
> matter of scope of the driver.
> If the driver scope is global, endpoint & network create calls are global.
> But Join is local.
> But if driver is scoped local, then all the calls are local.
>
> Thanks,
> -Madhu

Rajat Chopra

Sep 2, 2015, 6:47:48 PM
to kubernetes-sig-network, dc...@redhat.com, ma...@docker.com, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

Join can be a no-op in actual terms but it should still return the interface names, or you will hit this:
https://github.com/docker/libnetwork/blob/master/drivers/remote/driver.go#L176


 

Madhu Venugopal

Sep 2, 2015, 6:48:43 PM
to Tim Hockin, dc...@redhat.com, kubernetes-sig-network, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady, Jana Radhakrishnan
On Sep 2, 2015, at 3:44 PM, Tim Hockin <tho...@google.com> wrote:

Madhu,

Thanks for the clue on scope.  It looks like all remote drivers are
assumed to be global.

https://github.com/docker/libnetwork/blob/master/drivers/remote/driver.go#L32

That must be fixed & we just had a discussion on the topic in #docker-network channel.
The remote network driver capability exchange must provide this detail during the 
RegisterDriver phase. Please open a PR and this can be quickly fixed.

-Madhu

Jana Radhakrishnan

Sep 2, 2015, 10:24:19 PM
to kubernetes-sig-network, dc...@redhat.com, ma...@docker.com, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com


None of this addresses the other issues in libnetwork - not wrappable,
IPAM is too baked-in, not available today, no access to network Name
field, complex model, etc.  I keep hearing from people who tried to
implement libnetwork drivers that it's sort of a bad experience, and
docker doesn't seem keen to make it better (hearsay).

I'll just provide some answers for the perceived libnetwork problems:

* It should be fairly easy to wrap a libnetwork plugin with another plugin.
* IPAM is coming out before we release. Please feel free to comment on the proposal: https://github.com/docker/libnetwork/issues/489
* Going to be available in stable release in 1.9
* Network names should not be that relevant to drivers if their only responsibility is to plumb low level stuff
* I am not too sure about complexity of the model because the model consists of just Networks and Endpoints :-)
* Implementing a libnetwork driver is all about just implementing 6 APIs, some of which can be very minimal or no-ops

On top of that there is a general perception that you need libkv for libnetwork to operate. But this is not true if the driver is available only in local scope.

-Jana
 


Tim Hockin

Sep 2, 2015, 11:52:33 PM
to Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Madhu Venugopal, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
On Wed, Sep 2, 2015 at 7:24 PM, Jana Radhakrishnan
<jana.radh...@docker.com> wrote:
>>
>>
>> None of this addresses the other issues in libnetwork - not wrappable,
>> IPAM is too baked-in, not available today, no access to network Name
>> field, complex model, etc. I keep hearing from people who tried to
>> implement libnetwork drivers that it's sort of a bad experience, and
>> docker doesn't seem keen to make it better (hearsay).
>
>
> I'll just provide some answers for the perceived libnetwork problems:
>
> * It should be fairly easy to wrap a libnetwork plugin with another plugin.

How? In CNI it's a shell script. How do I wrap a daemon?

> * IPAM is coming out before we release. Please feel free to comment on the
> proposal: https://github.com/docker/libnetwork/issues/489

Will do

> * Going to be available in stable release in 1.9

I'm anxious to see what happens with the separation of Services and
Networks. I think that conflation is part of what makes the
libnetwork model very complicated

> * Network names should not be that relevant to drivers if their only
> responsibility is to plumb low level stuff

I know you guys keep saying that, but lots of people implementing
drivers claim to need it, and now I see exactly why.

> * I am not too sure about complexity of the model because the model consists
> of just Networks and Endpoints :-)

And sandboxes. And KV stores, but optional. And IPAM. And global vs
local. And "creating endpoints" that get broadcasted across the
network. I'm sorry, the concept count on libnetwork is really high
and not at all obvious. Guru explained it up-thread in a way that was
pretty clear, but it was pretty clearly overkill.

> * Implementing a libnetwork driver is all about just implementing 6 Apis
> some of them can be very minimal or no-op
>
> On top of that there is a general perception that you need libkv for
> libnetwork to operate. But this is not true if the driver is available only
> in local scope.

...once that bug is fixed.  Do local-scope drivers have persistence?
If I create a local driver, create a Network, attach a container to
that Network, and then bounce docker daemon - do my networks come
back?

Jana Radhakrishnan

Sep 3, 2015, 12:18:01 AM
to Tim Hockin, Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Madhu Venugopal, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
On Sep 2, 2015, at 8:52 PM, Tim Hockin <tho...@google.com> wrote:

On Wed, Sep 2, 2015 at 7:24 PM, Jana Radhakrishnan
<jana.radh...@docker.com> wrote:


None of this addresses the other issues in libnetwork - not wrappable,
IPAM is too baked-in, not available today, no access to network Name
field, complex model, etc.  I keep hearing from people who tried to
implement libnetwork drivers that it's sort of a bad experience, and
docker doesn't seem keen to make it better (hearsay).


I'll just provide some answers for the perceived libnetwork problems:

* It should be fairly easy to wrap a libnetwork plugin with another plugin.

How?  In CNI it's a shell script.  How do I wrap a daemon?

If the goal is to add additional functionality to an already existing plugin you can wrap that daemon with your custom daemon and chain the calls and implement additional functionality. So the virtue of “wrappability" is not mutually exclusive with a plugin being a daemon. If you don’t want a long running process the problem is easily solved by adding an “exec” driver in libnetwork very similar to “remote” driver so that it can invoke plugins by execing them with the api as a JSON encoded argument.


* IPAM is coming out before we release. Please feel free to comment on the
proposal: https://github.com/docker/libnetwork/issues/489

Will do

* Going to be available in stable release in 1.9

I'm anxious to see what happens with the separation of Services and
Networks.  I think that conflation is part of what makes the
libnetwork model very complicated

This conflation is going away pretty soon.


* Network names should not be that relevant to drivers if their only
responsibility is to plumb low level stuff

I know you guys keep saying that, but lots of people implementing
drivers claim to need it, and now I see exactly why.

Why?


* I am not too sure about complexity of the model because the model consists
of just Networks and Endpoints :-)

And sandboxes.  And KV stores, but optional.  And IPAM.  And global vs
local.  And "creating endpoints" that get broadcasted across the
network.  I'm sorry, the concept count on libnetwork is really high
and not at all obvious.  Guru explained it up-thread in a way that was
pretty clear, but it was pretty clearly overkill.

I thought IPAM being a separate thing is a good thing. And sandboxes are not something that driver developers need to worry about in general. Some of the concepts may be overkill for k8s, because k8s itself provides such abstractions. But if you look at docker itself as an independent product, then it's not so much overkill.


* Implementing a libnetwork driver is all about just implementing 6 Apis
some of them can be very minimal or no-op

On top of that there is a general perception that you need libkv for
libnetwork to operate. But this is not true if the driver is available only
in local scope.

...once that bug is fixed. Do local-scope drivers have persistence?
If I create a local driver, create a Network, attach a container to
that Network, and then bounce the docker daemon - do my networks come
back?

Yes once this PR https://github.com/docker/libnetwork/pull/466 is merged.

-Jana

Tim Hockin

Sep 3, 2015, 12:32:17 AM
to Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Madhu Venugopal, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
On Wed, Sep 2, 2015 at 9:17 PM, Jana Radhakrishnan
<jana.radh...@docker.com> wrote:

> * It should be fairly easy to wrap a libnetwork plugin with another plugin.
>
>
> How? In CNI it's a shell script. How do I wrap a daemon?
>
>
> If the goal is to add additional functionality to an already existing plugin
> you can wrap that daemon with your custom daemon and chain the calls and
> implement additional functionality. So the virtue of “wrappability" is not
> mutually exclusive with a plugin being a daemon. If you don’t want a long
> running process the problem is easily solved by adding an “exec” driver in
> libnetwork very similar to “remote” driver so that it can invoke plugins by
> execing them with the api as a JSON encoded argument.

Exec would be a big step forward

> * IPAM is coming out before we release. Please feel free to comment on the
> proposal: https://github.com/docker/libnetwork/issues/489

There's not much detail there to comment on. An example of something
like DHCP would be interesting
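To make the request concrete, here is a toy pool-based allocator with the flavor of what an external IPAM backend would implement. The method names are illustrative, not the API from the libnetwork proposal; a real DHCP-style backend would add lease lifetimes and renewal on top of this.

```python
import ipaddress

class PoolIPAM:
    """Toy IPAM backend: hand out addresses from a pool and reclaim
    them on release, DHCP-lease style. Names are hypothetical."""
    def __init__(self, cidr):
        net = ipaddress.ip_network(cidr)
        self.free = list(net.hosts())  # usable addresses, in order
        self.leased = {}               # endpoint id -> address

    def request_address(self, endpoint_id):
        if endpoint_id in self.leased:
            return self.leased[endpoint_id]  # idempotent re-request
        addr = self.free.pop(0)
        self.leased[endpoint_id] = addr
        return addr

    def release_address(self, endpoint_id):
        addr = self.leased.pop(endpoint_id)
        self.free.append(addr)  # address becomes reusable

ipam = PoolIPAM("10.1.0.0/29")
a = ipam.request_address("ep-1")   # 10.1.0.1
b = ipam.request_address("ep-2")   # 10.1.0.2
ipam.release_address("ep-1")
c = ipam.request_address("ep-3")   # 10.1.0.3 (next free, not the released one)
```

The interesting design questions for a real driver are exactly the ones a sketch like this dodges: where the lease state lives across daemon restarts, and who answers when the pool spans multiple hosts.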

> I'm anxious to see what happens with the separation of Services and
> Networks. I think that conflation is part of what makes the
> libnetwork model very complicated
>
>
> This conflation is going away pretty soon.

That's great! Does that mean the CNM will get simpler?

> * Network names should not be that relevant to drivers if their only
> responsibility is to plumb low level stuff
>
> I know you guys keep saying that, but lots of people implementing
> drivers claim to need it, and now I see exactly why.
>
> Why?

I want to orchestrate networks in Kubernetes. Kubernetes will manage
the networks in our API server. If I have a driver plugin that
integrates with Kubernetes, I need to know the network name for a
given endpoint, so I can look up info in our own API. I do not have
the random ID, especially if this acts as a local driver (each node
has a different ID).

Madhu Venugopal

Sep 3, 2015, 12:38:24 AM
to Tim Hockin, Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady

> On Sep 2, 2015, at 9:31 PM, Tim Hockin <tho...@google.com> wrote:
>
> On Wed, Sep 2, 2015 at 9:17 PM, Jana Radhakrishnan
> <jana.radh...@docker.com> wrote:
>
>> * It should be fairly easy to wrap a libnetwork plugin with another plugin.
>>
>>
>> How? In CNI it's a shell script. How do I wrap a daemon?
>>
>>
>> If the goal is to add additional functionality to an already existing plugin
>> you can wrap that daemon with your custom daemon and chain the calls and
>> implement additional functionality. So the virtue of “wrappability" is not
>> mutually exclusive with a plugin being a daemon. If you don’t want a long
>> running process the problem is easily solved by adding an “exec” driver in
>> libnetwork very similar to “remote” driver so that it can invoke plugins by
>> execing them with the api as a JSON encoded argument.
>
> Exec would be a big step forward

PRs welcome :)

>
>> * IPAM is coming out before we release. Please feel free to comment on the
>> proposal: https://github.com/docker/libnetwork/issues/489
>
> There's not much detail there to comment on. An example of something
> like DHCP will be interesting

Please comment on the proposal so that we can give an appropriate response.

>
>> I'm anxious to see what happens with the separation of Services and
>> Networks. I think that conflation is part of what makes the
>> libnetwork model very complicated
>>
>>
>> This conflation is going away pretty soon.
>
> That's great! does that mean the CNM will get simpler?

CNM doesn’t talk about services. It is made up of Networks, Endpoints, and Sandboxes,
and that stays exactly as intended. The services discussions that we had in the
past were purely UX-centric.

>
>> * Network names should not be that relevant to drivers if their only
>> responsibility is to plumb low level stuff
>>
>> I know you guys keep saying that, but lots of people implementing
>> drivers claim to need it, and now I see exactly why.
>>
>> Why?
>
> I want to orchestrate networks in Kubernetes. Kubernetes will manage
> the networks in our API server. If I have a driver plugin that
> integrates with Kubernetes, I need to know the network name for a
> given endpoint, so I can look up info in our own API. I do not have
> the random ID, especially if this acts as a local driver (each node
> has a different ID).

Why? The network-creation API returns the ID, which you can make use of.
For a locally scoped driver, a network name on one host is NOT the same as on another host
(just as docker0 on one host is not the same as docker0 on other hosts),
and hence k8s can actually stitch these network IDs together.
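One way to read "stitch these network-IDs together" is that the orchestrator, not the driver, keeps the mapping from a logical network name to the driver-assigned local ID on each node. A sketch, with all names and IDs invented for illustration:

```python
class NetworkRegistry:
    """Sketch of an orchestrator-side registry: one logical network
    name maps to a different driver-assigned ID on every node."""
    def __init__(self):
        self.by_name = {}  # logical name -> {node: local network ID}

    def record(self, name, node, local_id):
        """Remember the ID a node's driver assigned for this network."""
        self.by_name.setdefault(name, {})[node] = local_id

    def local_id(self, name, node):
        """Resolve the logical name to the node-local network ID."""
        return self.by_name[name][node]

reg = NetworkRegistry()
reg.record("frontend", "node-a", "id-a1")
reg.record("frontend", "node-b", "id-b2")
```

Whether this is reasonable at scale is exactly what the next message questions: with 500 nodes, the registry holds 500 distinct IDs for one logical network, and every lookup must go through it.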

Tim Hockin

Sep 3, 2015, 12:43:24 AM
to Madhu Venugopal, Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
I'm not sure I understand this - what does "stitch these network-IDs
together" mean? If I have 500 nodes, each of which has a unique
network ID for the "same" network - what reasonable thing do I do with
that?

Madhu Venugopal

Sep 3, 2015, 12:46:33 AM