Network Plugins definition (was "Kicking off the network SIG")


Tim Hockin

Aug 20, 2015, 1:30:15 PM
to kubernetes-...@googlegroups.com
It's hard to talk about network plugins without also getting into how
we can better align with docker and rocket-native plugins, but given
the immaturity of the whole space, let's try to ignore them and think
about the overall behavior we really want.

Forking from the kickoff thread.

On Sun, Aug 16, 2015 at 10:00 AM, Gurucharan Shetty <she...@nicira.com> wrote:

> From what I understand, this hairpin flag is only needed if one uses
> the kube-proxy to do L4 load-balancing. If there are people that would
> be doing L7 load balancing or a more feature rich L4 load balancing
> (based not only on L4 of the packet but also based on the traffic load
> on a particular destination pod) and they don't want to use iptables,
> I guess, mandating the hair-pin is not needed. So would it be correct
> to say that, if a network solution intends to use kube-proxy to do L4
> load balancing then their network plugin should enable hair-pin? This
> also raises a more general question: Is kube-proxy replaceable by
> something else?

kube-proxy is supposed to be one implementation of Services, but not
the only one. In fact I have seen at least two demos of Services without
kube-proxy and know of two or three others that I have not seen
directly.

Given that, it seems to make more sense to describe the behavior that
we expect from a network plugin (ideally in the form of a test) and
let people conform to that.

> 1. Right now, there is a requirement that all the pods should also be
> able to speak to every host that hosts pods. This was clearly needed
> for the previous kube-proxy implementation where the source ip and mac
> would change to that of the src host. With the new implementation of
> kube-proxy, do we still need the host to be able to speak to the pods
> via IP? Maybe there were other reasons for this requirement?
> Thoughts?

This question touches on both directions: pod-to-node and node-to-pod.
It's a fair question - we jump through a lot of hoops to ensure that
access to services from nodes works. Is that really necessary? It's
certainly useful for debugging. If we want things like docker being
able to pull from a registry that is running as a Service, or nodes to
access DNS in the cluster, we need this access; otherwise we start doing
per-node proxies and host-ports and localhost for everything. See
recent work by one of our interns on making an in-cluster docker
registry work for a taste.

Those are the "obvious" ones. I am interested to hear what other
things people might be doing where a process on the node needs to
access a Pod or Service, or vice-versa. Simplifying this connectivity
would be a win.

> 2. The current network plugin architecture prevents the network plugin
> from changing the IP address of the pod (from the Docker generated
> one). Well, it does not prevent you from changing the IP, but things
> like 'kubectl get endpoints' would only see the docker generated IP
> address. Kubernetes currently is a single-tenant system, so it
> probably is not very important to be able to change the IP address of
> the pod. But in the future, if there are plans for multi-tenancy (dev,
> qa, prod, staging etc in single environment), then overlapping IP
> addresses and logical networks (SDN, network virtualization) may be
> needed, in which case the ability to change the IP address will become
> important. Any thoughts?

As of recently plugins can return status, which includes a different IP.


On Mon, Aug 17, 2015 at 11:01 AM, Casey Davenport
<casey.d...@metaswitch.com> wrote:

> I don't think the hairpin flag will come into play for Calico. We don't
> build on the Docker bridge, instead creating new veth pair with one end in
> the pod's namespace and one end in the host's for each new pod. I won't
> know for sure if this is true until I test it, but I plan on doing so this
> week.

So you're assuming (rightly, so far) that network plugins are
monolithic and not composable. Is it valuable or interesting to have
network plugins be composable? For example, should it be possible to
write a plugin that handles things like installing special iptables
rules and use that plugin alongside a Calico plugin? For a more
concrete example, let's look at what we do in GCE in the default
Docker bridge mode. All of this is done in kubelet, but should be
plugins (IMO).

1) set a broad MASQUERADE rule in iptables (required to egress traffic
from containers because of GCE's edge NAT).
2) configure cbr0 (our docker bridge) with the per-node CIDR
3) tweak Docker config (I think?) to use cbr0
4) soon: install hairpin config on each cbr0 interface

At least two things stand out as pretty distinct - the MASQUERADE
rules and the cbr0 stuff.

The MASQUERADE stuff is needed regardless of whether you use a docker
bridge or Calico or Weave or Flannel, but it's actually pretty
site-specific. In GCE we should basically say "anything that is not
destined for 10.0.0.0/8 needs masquerade", but that's not right. It
should be "anything not destined for an RFC1918 address". But there are
probably cases where we would want the masquerade even within RFC1918
space (I'm thinking VPNs, maybe?). Outside of GCE, the rules are
obviously totally different. Should this be something that
kubernetes understands or handles at all? Or can we punt this down to
the node-management layer (in as much as we have one)?
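To make the two rules concrete, here is a small illustrative sketch (purely hypothetical policy code, not anything kubelet actually runs) of the difference between the 10.0.0.0/8 shortcut and a real RFC1918 test:

```python
import ipaddress

# The three RFC1918 private ranges.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def needs_masquerade(dst: str) -> bool:
    """Masquerade anything not destined for an RFC1918 address.

    This is policy, not truth: a site with VPNs may still want to
    masquerade some private-range traffic, which is why this likely
    belongs in a node-management layer rather than in kubernetes.
    """
    ip = ipaddress.ip_address(dst)
    return not any(ip in net for net in RFC1918)

def needs_masquerade_shortcut(dst: str) -> bool:
    """The GCE-style shortcut: only 10.0.0.0/8 escapes masquerade."""
    return ipaddress.ip_address(dst) not in ipaddress.ip_network("10.0.0.0/8")
```

The two disagree on, e.g., 192.168.1.5: the shortcut would masquerade it even though it is a private address.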

The bridge management seems like a pretty clear case of plugin. We
could/should move all of the cbr0/docker0 management into a plugin and
make that the default, probably. This touches on another sore point -
docker itself has flags that we have historically suggested people set
(or not set) around iptables and masquerade. Those flags do things
that conflict with what we want to do, sometimes, but we don't
actively check that they are not set.

Is there any use for composable plugins?

> From my perspective, this should be handled by each individual network
> plugin - each plugin might want to handle this differently (or not at all).
> Perhaps there are other cases to be made for chaining of network plugins,
> but the hairpin case alone doesn't convince me.

I think I am coming to the same conclusion.

>> 1. Right now, there is a requirement that all the pods should also be
>> able to speak to every host that hosts pods.
>
> We've been talking about locking down host-pod communication in the general
> case as part of our thoughts on security. There are still cases where
> host->pod communication is needed (e.g. NodePort), but at the moment our
> thinking is to treat infrastructure as "outside the cluster". As far as
> security is concerned, we think the default behavior should be "allow from
> pods within my namespace". Anything beyond that can be configured using
> security policy.

See above - what about cases where the node needs to legitimately
access cluster services (the canonical case being a docker registry)?


On Tue, Aug 18, 2015 at 5:45 AM, Michael Bridgen <mic...@weave.works> wrote:

> I have been working on adding an API library to CNI[1], which is used for
> rocket's networking, but was intended as a widely-applicable plugin
> mechanism. To date, CNI consists of a specification for invoking plugins[2],
> some plugins themselves, and a handful of example scripts that drive the
> plugins. With an API in a go library, it'd be much easier to use as common
> networking plugin infrastructure for kubernetes, rocket, runc and other
> things that come along.
>
> I like CNI because it does just what is needed, while giving plugins and
> applications a fair bit of freedom of implementation. It's pretty close, and
> at the same level of abstraction, to the networking hooks added to
> Kubernetes recently.

I'm fine with folding things together - that would be great, in fact.
I have not paid attention to CNI in the last 2 months, but I had some
concerns with it last I looked. I was one of the people arguing that
CNI and CNM were too close to not fold together. I still feel that
there is not really room for more than one or MAYBE two plugin models
in this space. I don't have any particular attachment to owning one
of those, personally, but I am VERY concerned that:

a) implementors like Weave/Calico/... have to implement and maintain
Docker plugins and CNI/k8s plugins with slightly different semantics

b) users experience confusion and/or complexity about how to configure
a solution

> So far, with Eugene @ CoreOS's help, I have pulled together enough of a
> library that Rocket's networking can be ported to it[3] -- not too
> surprising, since much of the code was adapted from there -- and I've
> written a tiny command-line tool that can be used with e.g., runc.
> Meanwhile, Paul @ Calico is getting good results trying an integration with
> Kubernetes.

I'd like to see this expanded on. If we can reduce the space from 3
plugins to 2, that's a win.

> I am aware that I'm late to the party, and that Kubernetes + CNI and various
> other combinations have been discussed before. But I think things have moved
> on a bit[4], so if people don't mind some recapitulation, it'd be useful to
> hear objections and (unmet) requirements and so on. Perhaps it is needless
> to say that I would like this to become a "full community effort", if we
> find that it is a broadly acceptable approach.

I'll have to look at CNI again.


On Tue, Aug 18, 2015 at 9:40 AM, Paul Tiplady
<paul.t...@metaswitch.com> wrote:

> I like this model because it would allow Calico to provide a single CNI
> plugin for Kubernetes, and have it run for any containerizer (docker, rkt,
> runc, ...). As k8s support for different runtimes grows, this will become an
> increasingly significant issue. (Right now we can just target docker and be
> done with it).

Does CNI work with Docker?

> Of the plugin hooks, CNI maps cleanly to ADD and DELETE. It doesn't have a
> concept of daemons, so the k8s INIT action isn't covered (we don't currently
> use INIT, though we think we will eventually). To handle the functionality
> currently provided by INIT, CNI could potentially be extended to add the
> concept of a daemon, or we could leave the INIT hook as a kick to an
> arbitrary binary that is independent from the CNI mechanism. The latter is
> probably the short-term pragmatic solution, since current plugins' INIT
> hooks will remain unchanged.

k8s plugins were intended to be exec-and-terminate, but docker plugins
are assumed to be daemons. In both cases we have open issues with
startup, containerization vs not, etc.

> As for the new STATUS plugin action, I'm not sure if that's needed if we use
> CNI; the plugin returns the IP from the ADD call so we can just update the
> pod status after it's created. Was another motivation of STATUS the idea
> that a pod's IP could change? If we don't need to support that use case then
> things integrate very cleanly.

I asked the same question. I think it was "following the existing
pattern of calling docker for status". I think simply returning it
once might be OK.
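For readers following along, the exec-and-return-once flow being discussed can be sketched end-to-end. This uses a stub shell script in place of a real CNI plugin; the environment variable names and the ip4 result shape follow the CNI spec as of this writing, so treat the details as illustrative rather than normative:

```python
import json
import os
import stat
import subprocess
import tempfile

# Stub standing in for a real CNI plugin (bridge, calico, ...): a real
# one would create interfaces and allocate an address before answering.
FAKE_PLUGIN = """#!/bin/sh
cat > /dev/null   # consume the network config JSON from stdin
echo '{"ip4": {"ip": "10.22.0.5/16"}}'
"""

def cni_add(plugin_path, net_conf, container_id, netns_path):
    """Exec a CNI plugin for ADD and parse the JSON result it prints."""
    env = dict(os.environ,
               CNI_COMMAND="ADD",
               CNI_CONTAINERID=container_id,
               CNI_NETNS=netns_path,
               CNI_IFNAME="eth0",
               CNI_PATH=os.path.dirname(plugin_path))
    proc = subprocess.run([plugin_path], input=json.dumps(net_conf),
                          env=env, capture_output=True, text=True,
                          check=True)
    return json.loads(proc.stdout)

# Write out the stub and mark it executable.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(FAKE_PLUGIN)
    plugin = f.name
os.chmod(plugin, os.stat(plugin).st_mode | stat.S_IXUSR)

result = cni_add(plugin, {"name": "testnet", "type": "fake"},
                 "abc123", "/proc/1234/ns/net")
```

The orchestrator gets `result["ip4"]["ip"]` exactly once, at ADD time, and must remember it, which is why a separate STATUS call may be unnecessary.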


On Tue, Aug 18, 2015 at 10:09 AM, Paul Tiplady
<paul.t...@metaswitch.com> wrote:

> For the rate-limiting case, I can't see how you can implement this outside
> the plugin in a generic way; after the plugin has done its thing, how do you
> determine which interface is connected to the pod? For example Calico
> deletes the veth pair that docker creates, so we'd have to duplicate any tc
> config that was set anyway. IMO better to have all that logic in one place,
> where a plugin implementor can see what needs to be implemented.

If I recall, the tc logic is applied at the host interface per-CIDR,
not per-veth.
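If that recollection is right, a per-CIDR shaper at the host interface would look roughly like the commands below (generated as strings here; the htb layout is an assumption for illustration, not kubelet's actual shaper):

```python
def shaping_commands(dev, cidr_rates):
    """Build tc commands that rate-limit traffic toward each pod CIDR
    at a single host interface (e.g. cbr0), rather than per-veth."""
    cmds = [f"tc qdisc add dev {dev} root handle 1: htb default 30"]
    for i, (cidr, rate) in enumerate(cidr_rates, start=1):
        # One htb class per CIDR, plus a u32 filter matching the
        # destination CIDR so traffic lands in that class.
        cmds.append(f"tc class add dev {dev} parent 1: "
                    f"classid 1:{i} htb rate {rate}")
        cmds.append(f"tc filter add dev {dev} parent 1: protocol ip "
                    f"prio 1 u32 match ip dst {cidr} flowid 1:{i}")
    return cmds

cmds = shaping_commands("cbr0", [("10.2.3.0/24", "10mbit")])
```

Because the match is on the destination CIDR at the host interface, a plugin that replaces the veth wiring would not need to duplicate per-veth tc state.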

> Also, currently kubelet has code to setup cbr0. While that's a great
> pragmatic simplification, I don't think it quite fits with the concept of
> pluggable networking modules -- could that be handled in the plugin INIT
> step? That would make the docker-bridge plugin an equal peer to other
> plugins, which would help flush out issues with the API; if we can't
> implement docker-bridge entirely as a k8s plugin, then maybe the API isn't
> complete enough.

Yeah, I should have read the thread before responding - this is my
conclusion, too.

Tim Hockin

Aug 21, 2015, 2:40:14 AM
to kubernetes-sig-network
I re-read the CNI spec and looked at some of the code. I have a lot
of questions, which I will write up tomorrow hopefully, but it seems
viable to me.

eugene.y...@coreos.com

Aug 21, 2015, 3:10:38 PM
to kubernetes-sig-network

On Thursday, August 20, 2015 at 10:30:15 AM UTC-7, Tim Hockin wrote:
It's hard to talk about network plugins without also getting into how
we can better align with docker and rocket-native plugins, but given
the immaturity of the whole space, let's try to ignore them and think
about the overall behavior we really want.

A philosophical decision one has to make when talking about these plugins is whether the role of the plugin is to:
1) Perform some abstract task like joining a container to a network. This is both the CNI and CNM model.
or
2) Just a callout to do any old manipulation of networking resources (e.g. bridges, iptables, veths, traffic classes, etc). I think this is what Tim proposed. This model is very flexible but is harder for the user to comprehend and configure. The user has to know what works with what and in which order.

I feel like we actually have a little experience with (1) but not with (2). It may be resource-intensive, but is it worth doing a small POC for option (2)?

Tim Hockin

Aug 21, 2015, 5:25:03 PM
to Eugene Yakubovich, kubernetes-sig-network
Can you answer how, in CNI, something like Docker would work? They
want the "bridge" plugin but they want to add some per-container
iptables rules on top of it.

Should they fork the bridge plugin into their own and implement their
custom behavior? Should they make a 2nd plugin that follows "bridge"
and adds their iptables (not allowed in CNI)? Should they make a
wrapper plugin that calls bridge and then does their own work?

I think these are all viable. There's a simplicity win for admins
(especially user/admins) if the plugin is assumed monolithic, I guess.
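As a sketch of the third option (a wrapper plugin), with both the base plugin and the rule strings faked for illustration, the composition is just "delegate, then decorate":

```python
def fake_bridge_add(container_id):
    """Stand-in for the base "bridge" plugin's ADD; a real one would
    wire the container up and report the IP in its JSON result."""
    return "10.1.0.7/16"

def wrap_with_rules(base_add, rules_for):
    """Build a wrapper plugin: run the base plugin, then layer the
    caller's per-container iptables rules on top. Rules are collected
    as strings here instead of being passed to iptables."""
    applied = []
    def add(container_id):
        ip = base_add(container_id)                   # 1. delegate to base plugin
        applied.extend(rules_for(container_id, ip))   # 2. decorate with extra rules
        return ip
    return add, applied

def docker_style_rules(container_id, ip):
    addr = ip.split("/")[0]
    return [f"-A FORWARD -d {addr} -m comment "
            f"--comment cni-{container_id} -j ACCEPT"]

add, applied = wrap_with_rules(fake_bridge_add, docker_style_rules)
ip = add("abc123")
```

The ordering guarantee (base plugin first, extra rules second) is the whole contract a formalized wrapper pattern would need to pin down.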
> --
> You received this message because you are subscribed to the Google Groups
> "kubernetes-sig-network" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kubernetes-sig-ne...@googlegroups.com.
> To post to this group, send email to
> kubernetes-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/kubernetes-sig-network.
> For more options, visit https://groups.google.com/d/optout.

eugene.y...@coreos.com

Aug 21, 2015, 5:41:40 PM
to kubernetes-sig-network, eugene.y...@coreos.com

On Friday, August 21, 2015 at 2:25:03 PM UTC-7, Tim Hockin wrote:
Can you answer how, in CNI, something like Docker would work?  They
want the "bridge" plugin but they want to add some per-container
iptables rules on top of it.

Should they fork the bridge plugin into their own and implement their
custom behavior?  Should they make a 2nd plugin that follows "bridge"
and adds their iptables (not allowed in CNI)?  Should they make a
wrapper plugin that calls bridge and then does their own work?

They can either fork the bridge plugin or do a wrapper one. Ideally they
would abstract out the iptables rules into something they can contribute upstream 
to CNI's bridge plugin. 

Tim Hockin

Aug 21, 2015, 6:05:12 PM
to Eugene Yakubovich, kubernetes-sig-network
Do you really want the "base" plugins to accumulate those sorts of
features? I like the idea of wrapping other plugins - formalizing
that pattern would be interesting. Keep a handful of very stable,
reasonably configurable (but not crazy) base plugins that people can
decorate.

eugene.y...@coreos.com

Aug 21, 2015, 6:17:46 PM
to kubernetes-sig-network, eugene.y...@coreos.com
I guess I'd only want these base plugins to get the features if they're of general use.
For example, if we're talking about Docker links, then no. But if it's to
restrict cross talk between networks (which CNI does not currently do), then
sure. 

On Friday, August 21, 2015 at 3:05:12 PM UTC-7, Tim Hockin wrote:
Do you really want the "base" plugins to accumulate those sorts of
features?  I like the idea of wrapping other plugins - formalizing
that pattern would be interesting.  Keep a handful of very stable,
reasonably configurable (but not crazy) base plugins that people can
decorate.

On Fri, Aug 21, 2015 at 2:41 PM,  <eugene.y...@coreos.com> wrote:
>
> On Friday, August 21, 2015 at 2:25:03 PM UTC-7, Tim Hockin wrote:
>>
>> Can you answer how, in CNI, something like Docker would work?  They
>> want the "bridge" plugin but they want to add some per-container
>> iptables rules on top of it.
>>
>> Should they fork the bridge plugin into their own and implement their
>> custom behavior?  Should they make a 2nd plugin that follows "bridge"
>> and adds their iptables (not allowed in CNI)?  Should they make a
>> wrapper plugin that calls bridge and then does their own work?
>
>
> They can either fork the bridge plugin or do a wrapper one. Ideally they
> would abstract out the iptables rules into something they can contribute
> upstream
> to CNI's bridge plugin.
>

Paul Tiplady

Aug 24, 2015, 8:45:15 PM
to kubernetes-sig-network
On Thursday, August 20, 2015 at 10:30:15 AM UTC-7, Tim Hockin wrote:
The MASQUERADE stuff is needed regardless of whether you use a docker
bridge or Calico or Weave or Flannel, but it's actually pretty
site-specific.  In GCE we should basically say "anything that is not
destined for 10.0.0.0/8 needs masquerade", but that's not right.  It
should be "anything not destined for an RFC1918 address".  But there are
probably cases where we would want the masquerade even within RFC1918
space (I'm thinking VPNs, maybe?).  Outside of GCE, the rules are
obviously totally different.  Should this be something that
kubernetes understands or handles at all?  Or can we punt this down to
the node-management layer (in as much as we have one)?
 
As you say, this is site-specific not plugin-specific; I think it's reasonable that, if needed, NAT should be set up when configuring the node (since the cloud-specific provisioner is better placed than the plugin to understand this requirement). Depending on the datacenter, there could be NAT on the node, NAT at the gateway, or no NAT at all if your pod IPs are publicly routable. Would be a win to keep that complexity out of k8s itself.
 
>
>  We've been talking about locking down host-pod communication in the general
> case as part of our thoughts on security.  There are still cases where
> host->pod communication is needed (e.g. NodePort), but at the moment our
> thinking is to treat infrastructure as "outside the cluster". As far as
> security is concerned, we think the default behavior should be "allow from
> pods within my namespace".  Anything beyond that can be configured using
> security policy.

See above - what about cases where the node needs to legitimately
access cluster services (the canonical case being a docker registry)?

I think we'd like to have an intermediate level of access to a service for "allow from all pods and k8s infrastructure (but not from outside the datacenter)", but this gets tricky because "from k8s infrastructure" could include traffic originally from a load balancer which has been forwarded from a NodePort service on one node to a pod on a second node (i.e. indistinguishable from internal node->pod traffic). I think we can just document the security impact of the combination [nodePort service + cluster-accessible pods => globally-accessible service]. Hopefully this goes away when LBs are smart enough that we don't need nodePort (or when we can use headless services + DNS instead).

> I like this model because it would allow Calico to provide a single CNI
> plugin for Kubernetes, and have it run for any containerizer (docker, rkt,
> runc, ...). As k8s support for different runtimes grows, this will become an
> increasingly significant issue. (Right now we can just target docker and be
> done with it).

Does CNI work with Docker?

Not natively (i.e. Docker calling into CNI). Though if it finds success in k8s, then this network plugin model could nudge the eventual standardized API that the OCI settles on.

CNI can quite straightforwardly configure networking for Docker containers in Kubernetes, though. The approach I took for my prototype is to create the pod infra docker container with `--net=none`, and then have k8s call directly into CNI to set up networking. The main bit of complexity was rewiring the IP-learning for the new pod (since CNI returns the IP and expects the orchestrator to remember it, and there is no analogue to the 'docker inspect' command). Now that I've got that working correctly, it has also removed the requirement for the STATUS plugin action.
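A sketch of that wiring (names illustrative, not the prototype's actual code): with the infra container started under `--net=none`, the runtime derives the namespace path from the container's pid and hands it to CNI:

```python
def netns_path(pid):
    """Network namespace path for a container whose init has this pid
    (the pid being what `docker inspect` reports for the container)."""
    return f"/proc/{pid}/ns/net"

def cni_env(command, container_id, pid, ifname="eth0",
            plugin_dir="/opt/cni/bin"):
    """Environment a runtime would set before exec'ing a CNI plugin.

    Since the infra container was created with --net=none, the plugin
    does all the wiring inside CNI_NETNS; the runtime must record the
    IP from the plugin's ADD result itself, as there is no docker-side
    network state to inspect afterwards.
    """
    return {
        "CNI_COMMAND": command,          # "ADD" or "DEL"
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path(pid),
        "CNI_IFNAME": ifname,
        "CNI_PATH": plugin_dir,
    }

env = cni_env("ADD", "abc123", 4321)
```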

Paul Tiplady

Aug 31, 2015, 8:29:48 PM
to kubernetes-sig-network
As mentioned before, I've done some prototyping of replacing the current plugin interface with CNI. I've written up a design doc for my proposed changes.

I'd be interested to get folks' feedback on this approach; I'm going to spend the next day or two polishing my prototype so that I can let the code do the talking.

eugene.y...@coreos.com

Aug 31, 2015, 9:12:15 PM
to kubernetes-sig-network
Paul,

This is a great start. One shortcoming of CNI right now is that there's no good library. There's some code in https://github.com/appc/cni/tree/master/pkg and some still in rkt (https://github.com/coreos/rkt/tree/master/networking). It really needs to be pulled together to make using CNI easier for both the container runtime and the plugin writers (the plugin side is currently in better shape). To this end, Michael Bridgen from Weave was working on putting a library together (https://github.com/squaremo/cni/tree/libcni), but I don't know where he is at with it.

-Eugene

Tim Hockin

Sep 1, 2015, 1:16:40 AM
to Paul Tiplady, kubernetes-sig-network
Notes as I read.

The biggest problem I have with this (and it's not necessarily a
show-stopper) is that a container created with plain old 'docker run'
will not be part of the kubernetes network because we will have
orchestrated the network at a higher level. In an ideal world, we'd
teach docker itself about the plugins and then simply delegate to it
as we do today.

That said, the more I dig into Docker's networking plugins the less I
like them. Philosophically and practically a daemon-free model built
around exec is so much cleaner. It seems at least theoretically
possible to bridge libnetwork to run CNI plugins, but probably not
without mutating the CNI spec to the more prescriptive libnetwork
model.


You say you'll push the IP to the apiserver - I guess you mean in
pod.status.podIP ?


Regarding CNI network configs, I assume that over time this might even
be something we expose through kubernetes - a la Docker networks.
The advantage here is that network management is a clean abstraction
distinct from interface management.


To your questions:

1) Can we eliminate Init?

I think yes.

2) Can we eliminate Status?

I think yes.

3) Can we cut over immediately to CNI, or do we need to keep the old
plugin interface for a time? If so, how long?

I think this becomes a community decision. There are a half-dozen to
a dozen places I know of using this feature. IFF they were OK making
a jump to something like CNI, we could do a hard cutover.

4) Can we live without the vendoring naming rules? Can we establish
a convention that plugins vendor-name the binary?
mycompany.com~myplug or something? Maybe it's not a huge deal.


I'll add #5 - does this mean we have no concept of in-process plugin?
Or do we retain the facade of an in-process API like we have now.


Overall this looks plausible, but I'd like to hear from all the folks
who have plugins implemented today, especially if you have both CNI
and libnetwork experience. The drawback I listed above (plain old
'docker run') is real, but maybe something we can live with. Maybe
it's actually a feature?

As a discussion point - how much would we have to adulterate CNI to
make a bridge? It sure would be nice to use the same plugins in both
Docker and rkt - I sure as hell don't want to tweak and debug this
stuff twice.

We could have a little wrapper binary that knew about a static network
config, and anyone who asked for a new network from our plugin would
get an error, then we just feed the static config(s) to the wrapped
CNI binary. We'd have to split Add into create/join but that doesn't
seem SO bad. What else?

Paul Tiplady

Sep 1, 2015, 12:39:30 PM
to kubernetes-sig-network
Hi Eugene,

I've been talking with Michael, and the prototype integration that I built uses his libcni branch. Thus far his code has met my needs, so I think his approach is sound.

Cheers,
Paul

Paul Tiplady

Sep 1, 2015, 1:46:35 PM
to kubernetes-sig-network, paul.t...@metaswitch.com


On Monday, August 31, 2015 at 10:16:40 PM UTC-7, Tim Hockin wrote:
Notes as I read.

The biggest problem I have with this (and it's not necessarily a
show-stopper) is that a container created with plain old 'docker run'
will not be part of the kubernetes network because we will have
orchestrated the network at a higher level.  In an ideal world, we'd
teach docker itself about the plugins and then simply delegate to it
as we do today.

Agree that this is a change in workflow -- though `docker run` was already broken with the existing network plugin API for plugins which don't use the docker bridge (e.g. Calico).

I think we can get round this by using `kubectl run|exec`; now that exec has -i and --tty options, I think the main use cases are covered.


That said, the more I dig into Docker's networking plugins the less I
like them.  Philosophically and practically a daemon-free model built
around exec is so much cleaner.  It seems at least theoretically
possible to bridge libnetwork to run CNI plugins, but probably not
without mutating the CNI spec to the more prescriptive libnetwork
model.


You say you'll push the IP to the apiserver - I guess you mean in
pod.status.podIP ?

Yep 


Regarding CNI network configs, I assume that over time this might even
be something we expose through kubernetes - a la Docker networks.
The advantage here is that network management is a clean abstraction
distinct from interface management.

Good point -- this could be a nice feature: if you have one group of pods with a very different set of network requirements (e.g. latency-sensitive, or L2 vs. pure-L3), you can bundle them onto a different network. Routing between networks could be fun, though.



To your questions:

1) Can we eliminate Init?

I think yes.

2) Can we eliminate Status?

I think yes.

3) Can we cut over immediately to CNI, or do we need to keep the old
plugin interface for a time? If so, how long?

I think this becomes a community decision.  There are a half-dozen to
a dozen places I know of using this feature.  IFF they were OK making
a jump to something like CNI, we could do a hard cutover.

4) Can we live without the vendoring naming rules?  Can we establish
a convention that plugins vendor-name the binary?
mycompany.com~myplug or something?  Maybe it's not a huge deal.


I'll add #5 - does this mean we have no concept of in-process plugin?
Or do we retain the facade of an in-process API like we have now.

Added a bullet for this in the doc. 

CNI doesn't currently have the concept of an in-process plugin. It looks like, with the current API, this is only for vendors that are extending the kubernetes codebase, or am I missing something?

With Michael Bridgen's work to turn CNI into a library, in-process CNI-style plugins become a viable option. A couple possible approaches:

* Extend libcni to add the concept of an in-process plugin as a native concept. (libcni could formalize an interface to run these in-process plugins as standalone plugins as well, which would mean developers can target both in- and out-of-process plugins if they care).
* Create a hook in our plugin code where in-process code can run (consuming the CNI API objects that we created to pass to the CNI exec-plugin).

I think the latter could be done with the current proposal on a per-vendor basis, but there might be benefit in formalizing that interface.


Overall this looks plausible, but I'd like to hear from all the folks
who have plugins implemented today, especially if you have both CNI
and libnetwork experience.  The drawback I listed above (plain old
'docker run') is real, but maybe something we can live with.  Maybe
it's actually a feature?

As a discussion point - how much would we have to adulterate CNI to
make a bridge?  It sure would be nice to use the same plugins in both
Docker and rkt - I sure as hell don't want to tweak and debug this
stuff twice.

We could have a little wrapper binary that knew about a static network
config, and anyone who asked for a new network from our plugin would
get an error, then we just feed the static config(s) to the wrapped
CNI binary.  We'd have to split Add into create/join but that doesn't
seem SO bad.   What else?

I was pondering this approach -- the big stumbling block for me is that a CNM createEndpoint can occur on a different host than the joinEndpoint call, so naively we'd need a cluster-wide distributed datastore to keep track of the Create calls.

Short of breaking the spec and disallowing Create and Join from being called on different hosts, I don't see a way around that issue.

eugene.y...@coreos.com

Sep 1, 2015, 3:07:16 PM
to kubernetes-sig-network, paul.t...@metaswitch.com

On Tuesday, September 1, 2015 at 10:46:35 AM UTC-7, Paul Tiplady wrote:

I'll add #5 - does this mean we have no concept of in-process plugin?
Or do we retain the facade of an in-process API like we have now.

Added a bullet for this in the doc. 

CNI doesn't currently have the concept of an in-process plugin. It looks like, with the current API, this is only for vendors that are extending the kubernetes codebase, or am I missing something?

 
CNI doesn't have in-process plugins because that requires shared object (.so) support, and I believe that Go has problems with that (although it may be fixed in 1.5). Technically CNI is not Go-specific, but realistically much of the software in this space is written in Go. Having "in-tree" plugins doesn't require .so support, but to be honest those never pass my definition of "plugins". FWIW, I would have been quite happy to just have .so plugins, as there's no fork/exec overhead.

 
I was pondering this approach -- the big stumbling block for me is that a CNM createEndpoint can occur on a different host than the joinEndpoint call, so naively we'd need a cluster-wide distributed datastore to keep track of the Create calls.

Short of breaking the spec and disallowing Create and Join from being called on different hosts, I don't see a way around that issue.


Wow, I was not aware of that. How does it work now? CreateEndpoint creates the interface (e.g. veth pair) on the host. Join then specifies the interface names that should be moved into the sandbox. I don't really understand how Join can be called on a different host -- wouldn't there be no interface to move on that host then? 

Gurucharan Shetty

unread,
Sep 1, 2015, 3:43:42 PM9/1/15
to eugene.y...@coreos.com, kubernetes-sig-network, paul.t...@metaswitch.com
Let us not make an assumption that all plugins will be Golang based.
OpenStack Neutron currently has python libraries for clients and my
plugin that integrates containers with openstack is python based.

Fwiw, Docker's libnetwork does not mandate golang plugins. It uses
REST APIs to talk to plugins.


> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
> the interface (e.g. veth pair) on the host.
veth pairs are not mandated to be created on CreateEndpoint(). You
only need to return IP addresses, MAC addresses, gateway, etc. What
this does in theory is provide flexibility for container mobility
across hosts. So you can effectively create an endpoint from a central
location and ask a container to join that endpoint from any host.
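As a rough sketch of what that implies for a driver: CreateEndpoint can reserve and return addressing info without touching any host interface. Everything below (the helper names, the toy IPAM, the response fields) is illustrative, not the exact libnetwork wire format:

```python
import hashlib

def allocate_ip(network_id: str, endpoint_id: str) -> str:
    # Toy IPAM: hash the IDs into the last octet of a /24. A real
    # driver would consult a central IPAM store instead.
    key = (network_id + "/" + endpoint_id).encode()
    octet = int(hashlib.sha256(key).hexdigest(), 16) % 254 + 1
    return "10.1.2.%d/24" % octet

def allocate_mac(endpoint_id: str) -> str:
    # Locally administered, unicast MAC derived from the endpoint ID.
    h = hashlib.sha256(endpoint_id.encode()).hexdigest()
    return "02:" + ":".join(h[i:i + 2] for i in range(0, 10, 2))

def create_endpoint(network_id: str, endpoint_id: str) -> dict:
    # Reserve addressing only; no veth is created here. Interface
    # creation is deferred to Join, which may run on a different host.
    return {
        "Interfaces": [{
            "ID": 0,
            "Address": allocate_ip(network_id, endpoint_id),
            "MacAddress": allocate_mac(endpoint_id),
        }]
    }
```

Because the result is a pure function of the IDs, any host can reproduce it at Join time, which is what makes the endpoint portable.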


>Join then specifies the
> interface names that should be moved into the sandbox. I don't really
> understand how Join can be called on a different host -- wouldn't there be
> no interface to move on that host then?
>
> --
> You received this message because you are subscribed to the Google Groups
> "kubernetes-sig-network" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kubernetes-sig-ne...@googlegroups.com.

eugene.y...@coreos.com

unread,
Sep 1, 2015, 5:11:21 PM9/1/15
to kubernetes-sig-network, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Tuesday, September 1, 2015 at 12:43:42 PM UTC-7, Gurucharan Shetty wrote:
Let us not make an assumption that all plugins will be Golang based.
OpenStack Neutron currently has python libraries for clients and my
plugin that integrates containers with openstack is python based.

Sure, I would never assume that all plugins will be Go based. Rather I want to not
exclude Go based ones.
 
Fwiw, Docker's libnetwork does not mandate golang plugins. It uses
REST APIs to talk to plugins. 
 
Right, but REST is for out-of-process plugins.


> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
> the interface (e.g. veth pair) on the host.
veth pairs are not mandated to be created on CreateEndpoint (). You
are only to return IP addresses, MAC addresses, Gateway etc. What this
does in theory is that it provides flexibility with container mobility
across hosts. So you can effectively create an endpoint from a central
location and ask a container to join that endpoint from any host.

I think I understand what you and Paul meant by different hosts. Yes, their
model of decoupling the container from the interfaces is slick and something
that CNI can't do. However for all its slickness, I am not a fan of moving containers
or IPs around. If a container dies, start a new one. And give it a new IP -- don't equate
the IP to a service (yes, you need a service discovery). Anyway, that's all in line
with Kubernetes thinking and so does not need to be supported in a Kubernetes cluster.

I think if a user wants to have this mobility, they will not be running Kubernetes.
 


>Join then specifies the
> interface names that should be moved into the sandbox. I don't really
> understand how Join can be called on a different host -- wouldn't there be
> no interface to move on that host then?

Tim Hockin

unread,
Sep 1, 2015, 5:54:48 PM9/1/15
to Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
On Tue, Sep 1, 2015 at 12:07 PM, <eugene.y...@coreos.com> wrote:
>
> On Tuesday, September 1, 2015 at 10:46:35 AM UTC-7, Paul Tiplady wrote:
>>>
>>>
>>> I'll add #5 - does this mean we have no concept of in-process plugin?
>>> Or do we retain the facade of an in-process API like we have now.
>>>
>> Added a bullet for this in the doc.
>>
>> CNI doesn't currently have the concept of an in-process plugin. Looks like
>> with the current API this only for vendors that are extending the kubernetes
>> codebase, or am I missing something?
>>
>
> CNI doesn't have in-process plugins because that requires shared object
> (.so) support and I believe that Go has problems with that (although it
> maybe fixed in 1.5). Technically CNI is not Go specific but realistically so
> much software in this space is written in Go. Having "in-tree" plugins don't
> require .so support but to be honest those never pass my definition of
> "plugins". FWIW, I would have been quite happy to just have .so plugins as
> there's no fork/exec overhead.

I didn't mean to imply .so, though that's a way to do it too. I meant
to ask whether kubernetes/docker/rkt could have network plugins
defined in code, one of which was an exec-proxy, or whether exec was
it. I don't feel strongly that in-process is needed at this point.

>> I was pondering this approach -- the big stumbling block for me is that a
>> CNM createEndpoint can occur on a different host than the joinEndpoint call,
>> so naively we'd need a cluster-wide distributed datastore to keep track of
>> the Create calls.
>>
>> Short of breaking the spec and disallowing Create and Join from being
>> called on different hosts, I don't see a way around that issue.
>
> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
> the interface (e.g. veth pair) on the host. Join then specifies the
> interface names that should be moved into the sandbox. I don't really
> understand how Join can be called on a different host -- wouldn't there be
> no interface to move on that host then?

Yeah, where do you get that information? Not calling you wrong, just
something that was not at all clear, if it is true.

Tim Hockin

unread,
Sep 1, 2015, 5:56:46 PM9/1/15
to Gurucharan Shetty, Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
On Tue, Sep 1, 2015 at 12:43 PM, Gurucharan Shetty <she...@nicira.com> wrote:
> Let us not make an assumption that all plugins will be Golang based.
> OpenStack Neutron currently has python libraries for clients and my
> plugin that integrates containers with openstack is python based.
>
> Fwiw, Docker's libnetwork does not mandate golang plugins. It uses
> REST APIs to talk to plugins.
>
>
>> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
>> the interface (e.g. veth pair) on the host.
> veth pairs are not mandated to be created on CreateEndpoint (). You
> are only to return IP addresses, MAC addresses, Gateway etc. What this
> does in theory is that it provides flexibility with container mobility
> across hosts. So you can effectively create an endpoint from a central
> location and ask a container to join that endpoint from any host.

I feel dumb, but I don't get it. Since you seem to understand it, can
you spell it out in more detail?

Paul Tiplady

unread,
Sep 1, 2015, 6:41:35 PM9/1/15
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com
On Tuesday, September 1, 2015 at 2:56:46 PM UTC-7, Tim Hockin wrote:
I feel dumb, but I don't get it.  Since you seem to understand it, can
you spell it out in more detail?

This isn't spelled out in the libnetwork docs, though it should be since it's highly unintuitive. We had to do a lot of code/IRC spelunking to get a clear picture of this.

The best I can do from the docs is this:

"
  • One of a FAQ on endpoint join() API is that, why do we need an API to create an Endpoint and another to join the endpoint.
    • The answer is based on the fact that Endpoint represents a Service which may or may not be backed by a Container. When an Endpoint is created, it will have its resources reserved so that any container can get attached to the endpoint later and get a consistent networking behaviour.
"
Note "resources reserved", not "interfaces created".

Here's an issue that Gurucharan raised on the libnetwork repo which clarifies somewhat: https://github.com/docker/libnetwork/issues/133#issuecomment-99927188

Although see this issue, which suggests that in CNM the CreateEndpoint ("service publish") event gets broadcasted (via the docker daemon's distributed KV store) to the network plugins on every host, so it looks like it's not even possible to optimistically create a veth and then hope that the Endpoint.Join gets run on the same host.

Note that things like interface name and MAC address are assigned at CreateEndpoint time, as if you were creating the veth at that stage.

eugene.y...@coreos.com

unread,
Sep 1, 2015, 7:19:10 PM9/1/15
to kubernetes-sig-network, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Tuesday, September 1, 2015 at 2:54:48 PM UTC-7, Tim Hockin wrote:
On Tue, Sep 1, 2015 at 12:07 PM,  <eugene.y...@coreos.com> wrote:
>
> On Tuesday, September 1, 2015 at 10:46:35 AM UTC-7, Paul Tiplady wrote:
>>>
>>>
>>> I'll add #5 - does this mean we have no concept of in-process plugin?
>>> Or do we retain the facade of an in-process API like we have now.
>>>
>> Added a bullet for this in the doc.
>>
>> CNI doesn't currently have the concept of an in-process plugin. Looks like
>> with the current API this only for vendors that are extending the kubernetes
>> codebase, or am I missing something?
>>
>
> CNI doesn't have in-process plugins because that requires shared object
> (.so) support and I believe that Go has problems with that (although it
> maybe fixed in 1.5). Technically CNI is not Go specific but realistically so
> much software in this space is written in Go. Having "in-tree" plugins don't
> require .so support but to be honest those never pass my definition of
> "plugins". FWIW, I would have been quite happy to just have .so plugins as
> there's no fork/exec overhead.

I didn't mean to imply .so, though that's a way to do it too.  I meant
to ask whether kubernetes/docker/rkt could have network plugins
defined in code, one of which was an exec-proxy, or whether exec was
it.  I don't feel strongly that in-process is needed at this point.

Kubernetes/docker/rkt could certainly have "built-in types" aside from the exec based ones.
But if that built-in type is useful in general, it should be a separate executable so 
it could be re-used in other projects. And since we should strive to make these
networking plugins not tied to a container runtime, they should all be executables
by my logic.

Gurucharan Shetty

unread,
Sep 1, 2015, 7:30:05 PM9/1/15
to Tim Hockin, Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
Let me try to explain what I mean to the best of my ability with an
analogy of VMs and network virtualization. (But before that, let me
clarify that since k8 is a single-tenant orchestrator and has been
designed with VIPs and load balancers as basic building blocks, the
feature is not really useful for k8.)

With network virtualization, you can have 2 VMs belonging to 2
different tenants run on the same hypervisor with the same IP address.
The packet sent by a VM of one tenant will never reach a VM of another
tenant, even though they are connected to the same vSwitch (e.g.
openvswitch). You can apply policies (e.g. QoS, firewall) to these VM
interfaces. And then you can move one of the VMs to a different
hypervisor (vMotion). All the policies (e.g. QoS, firewall) will now
follow the VM to the new hypervisor automatically. The IP address and
MAC address follow to the new VM too. The network controller simply
reprograms the various vswitches so that packets are forwarded to the
new location.

Since you have already associated your policies (firewall, QoS etc)
with the endpoint, you can destroy the VM that the endpoint is
connected to and then create a new VM at a different hypervisor and
attach the old endpoint (with its old policies) to the new VM.

My reading of what libnetwork achieves with containers is the same as
above. i.e., you can create a network endpoint with policies applied
and then attach it to any container on any host.


PS: OpenStack Neutron has the same model. The network endpoints are
created first. An IP and MAC is provisioned to that network endpoint.
And then a VM is created asking it to attach to that network endpoint.

eugene.y...@coreos.com

unread,
Sep 1, 2015, 7:45:26 PM9/1/15
to kubernetes-sig-network, tho...@google.com, eugene.y...@coreos.com, paul.t...@metaswitch.com
Let me try to explain what I mean to the best of my ability with an
analogy of VMs and network virtualization. (But before that, let me
clarify that since k8 is a single-tenant orchestrator and has been
designed with VIPs and load balancers as basic building blocks, the
feature is not really useful for k8.)

With network virtualization, you can have 2 VMs belonging to 2
different tenants run on the same hypervisor with the same IP address.
The packet sent by a VM of one tenant will never reach a VM of another
tenant, even though they are connected to the same vSwitch (e.g.
openvswitch). You can apply policies (e.g. QoS, firewall) to these VM
interfaces. And then you can move one of the VMs to a different
hypervisor (vMotion). All the policies (e.g. QoS, firewall) will now
follow the VM to the new hypervisor automatically. The IP address and
MAC address follow to the new VM too. The network controller simply
reprograms the various vswitches so that packets are forwarded to the
new location.

Since you have already associated your policies (firewall, QoS etc)
with the endpoint, you can destroy the VM that the endpoint is
connected to and then create a new VM at a different hypervisor and
attach the old endpoint (with its old policies) to the new VM.

My reading of what libnetwork achieves with containers is the same as
above. i.e., you can create a network endpoint with policies applied
and then attach it to any container on any host.

That makes sense except for this conflation of endpoint and service.
If CreateEndpoint is really just reserving an IP for the service that
can later be backed by some container, there is really no reason to
allocate a MAC at that point (which CreateEndpoint requires, as it is
expected to call AddInterface, whose second arg is a MAC).

While I certainly appreciate having a driver type that allows this type of migration,
I don't like every driver having to support this model. For example, this won't
really work with ipvlan where the MAC address can't be generated (it's the host's MAC)
and moved around.

Considering the above, I don't want to modify CNI towards it. This still leaves
me hanging on how to change CNI enough to make libnetwork interop possible.

Tim Hockin

unread,
Sep 1, 2015, 8:18:57 PM9/1/15
to Paul Tiplady, kubernetes-sig-network, Gurucharan Shetty, Eugene Yakubovich
On Tue, Sep 1, 2015 at 3:41 PM, Paul Tiplady
<paul.t...@metaswitch.com> wrote:

> Here's an issue that Gurucharan raised on the libnetwork repo which
> clarifies somewhat:
> https://github.com/docker/libnetwork/issues/133#issuecomment-99927188

This was not answered:

"""Is your thought process that the driver can create vethnames based
on endpointuuid to make it truly portable. i.e., one can call
driver.CreateEndpoint() on one host and return back vethnames based on
eid, but not actually create the veths. Call driver.Join() on a
different host. So even though veth names are created during 'docker
service create' but veths are physically created only during 'docker
service join'. (But, vethnames can only be 15 characters long on
Linux, so there is a very very small possibility of collisions.)"""

> Although see this issue, which suggests that in CNM the CreateEndpoint
> ("service publish") event gets broadcasted (via the docker daemon's
> distributed KV store) to the network plugins on every host, so it looks like
> it's not even possible to optimistically create a veth and then hope that
> the Endpoint.Join gets run on the same host.

That sounds ludicrous to me. Can we get some confirmation from
libnetwork folks?

> Note that things like interface name and MAC address are assigned at
> CreateEndpoint time, as if you were creating the veth at that stage.

So I make up a locally-random name and expect it to be globally unique?
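The scheme in the quote above (deriving veth names deterministically from the endpoint UUID) might look like the sketch below. The prefix and hash choice are assumptions; the truncation to Linux's 15-character interface-name limit is exactly where the small collision risk comes from:

```python
import hashlib

IFNAMSIZ = 15  # usable interface-name length on Linux

def veth_name(endpoint_id: str, prefix: str = "veth") -> str:
    # Any host can recompute this at Join time from the endpoint ID
    # alone, with no shared state -- but truncating to IFNAMSIZ leaves
    # only ~11 hex characters of the digest, hence the collision risk.
    digest = hashlib.sha1(endpoint_id.encode()).hexdigest()
    return (prefix + digest)[:IFNAMSIZ]
```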

Tim Hockin

unread,
Sep 1, 2015, 8:46:25 PM9/1/15
to Gurucharan Shetty, Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
On Tue, Sep 1, 2015 at 4:30 PM, Gurucharan Shetty <she...@nicira.com> wrote:

> Since you have already associated your policies (firewall, QoS etc)
> with the endpoint, you can destroy the VM that the endpoint is
> connected to and then create a new VM at a different hypervisor and
> attach the old endpoint (with its old policies) to the new VM.
>
> My reading of what libnetwork achieves with containers is the same as
> above. i.e., you can create a network endpoint with policies applied
> and then attach it to any container on any host.

Thanks for the explanation. I understand it better. It's incredibly
complicated, isn't it? I think a main distinction between this
example with VMs and the container ethos is identity. A VM's IP
really is part of its identity, but a container is a point in time. I
know people will argue with this in both directions, but that seems to
be "generally" the way people think about things.

A VM's IP is expected to remain constant across moves and restarts. A
container's IP is less constant (not at all today).

More importantly, there are things that container networking can do
that preclude this level of migratability - ipvlan or plain old docker
bridges being good examples. What are "simple" plugins supposed to
assume?

Straw man, thanks to Prashanth here for discussion:

Write a "cni-exec" libnetwork driver.

You can not create new networks using it. When a CreateNetwork() call
is received, we check for a static config file on disk.
E.g. CreateNetwork(name="foobar") looks for
/etc/cni/networks/foobar.json, and if it does not exist or does not
match, fail.

PROBLEM: it looks like the CreateNetwork() call can not see the name
of the network. Let's assume that could be fixed.

CreateEndpoint() does just enough work to satisfy the API, and save
all of its state in memory.

PROBLEM: If docker goes down, how does this state get restored?

endpoint.Join() takes the saved info from CreateEndpoint(), massages
it into CNI-compatible data, and calls the CNI plugin.


Someone shoot this down? It's not general purpose in the sense that
docker's network CLI can't be used, but would it be good enough to
enable people to use the same CNI plugins across docker and rkt?

Gurucharan Shetty

unread,
Sep 1, 2015, 8:52:52 PM9/1/15
to Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
>> Note that things like interface name and MAC address are assigned at
>> CreateEndpoint time, as if you were creating the veth at that stage.
>
> So I make up l locally-random name and expect it to be globally unique?

Actually that is not true. From:
https://github.com/docker/libnetwork/blob/master/docs/remote.md

The veth names are returned during the network join call. And network
join is not broadcasted to all hosts. If I remember correctly, only
create network, delete network, create endpoint, and delete endpoint
are broadcasted to all nodes.

Tim Hockin

unread,
Sep 1, 2015, 8:59:44 PM9/1/15
to Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
They are broadcasted or they are posted to the KV store? In other
words, are plugins expected to watch() the KV store for new endpoints
and networks, or to lazily fetch them?

How does a "remote" plugin do this?

Gurucharan Shetty

unread,
Sep 1, 2015, 9:09:39 PM9/1/15
to Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Tue, Sep 1, 2015 at 5:59 PM, Tim Hockin <tho...@google.com> wrote:
> They are broadcasted or they are posted to the KV store? In other
> words, are plugins expected to watch() the KV store for new endpoints
> and networks, or to lazily fetch them?
Docker daemon on every host reads the kv store and sends that
information to the remote plugin drivers on that host via the plugin
API. The remote drivers are not supposed to look at Docker's kv store
but should rely only on the information received via the API.

>
> How does a "remote" plugin do this?
The current design is harsh on remote plugins (the libnetwork
developers have promised to look into it to see if they can come up
with a viable solution; see
https://github.com/docker/libnetwork/issues/313). My remote driver
(which integrates with OpenStack Neutron, but runs the containers
inside VMs) makes calls to the OpenStack Neutron database to
store/fetch the information. With the current design, a single user
request via the docker CLI gets converted into 'X' requests to the
Neutron database (where 'X' = number of hosts in the cluster), and
that is unworkable for a large number of hosts and containers. That is
one reason I like the k8 plugins.

Tim Hockin

unread,
Sep 1, 2015, 9:19:06 PM9/1/15
to Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
Guru,

Thanks for the explanations. I appreciate you being "docker by proxy" here :)

I do feel like I mostly understand the libnetwork model now, though
there are some very clear limitations of it. I also feel like I could
work around most of the limitations, but the solutions are the same as
yours - make our own side-band calls to our own API and fetch
information that Docker does not provide. We can't implement their
libkv in terms of our API server because it is not a general purpose
store, though maybe we could intercept accesses and parse the path?
Puke.

The more I learn, the less I like it. It feels incredibly convoluted
for simple drivers to do anything useful.

Gurucharan Shetty

unread,
Sep 1, 2015, 9:57:07 PM9/1/15
to Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
I think if docker daemons are started not as part of a distributed kv
store, but rather with a local-only kv store (e.g. each minion will
have a consul/etcd client running that is local only), then libnetwork
can be abused for k8 purposes for IP address reporting via 'docker
inspect'. If such a thing is done, commands like 'docker network ls',
'docker service ls', etc. will report false data, but k8 need not show
that to the user.

Prashanth B

unread,
Sep 1, 2015, 11:06:04 PM9/1/15
to Gurucharan Shetty, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
> Docker daemon on every host reads the kv store and sends that
information to the remote plugin drivers on that host via the plugin
API.

Your statement is confusing, can you clarify? It seems like the remote driver is just like a native driver that does nothing but parse its arguments into json and post them to the plugin using an http client. There is code to watch the kv store in the controller itself, but it no-ops if a store isn't provided (that's how the bridge driver works). IIUC the plugin just needs to run an HTTP server bound to a unix socket in /run/docker/plugins. 

If this is right it makes our first cut driver simpler, we can directly use the apiserver from our (hypothetical) kubelet-driver for things that need persistence, without running another database.


Gurucharan Shetty

unread,
Sep 1, 2015, 11:59:03 PM9/1/15
to Prashanth B, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Tue, Sep 1, 2015 at 8:06 PM, Prashanth B <be...@google.com> wrote:
>> Docker daemon in every host reads the kv store and send that
> information to the remote plugin drivers on that host via the plugin
> API.
>
> Your statement is confusing, can you clarify? It seems like the remote
> driver is just like a native driver that does nothing but parse its
> arguments into json and post them to the plugin using an http client. There
> is code to watch the kv store in the controller itself, but it no-ops if a
> store isn't provided (that's how the bridge driver works). IIUC the plugin
> just needs to run an HTTP server bound to a unix socket in
> /run/docker/plugins.
>
> If this is right it makes our first cut driver simpler, we can directly use
> the apiserver from our (hypothetical) kubelet-driver for things that need
> persistence, without running another database.
I may not have understood your question/assertion correctly, so I am
going to elaborate in detail.

When I say a "remote driver", I mean a server which listens for REST
API calls from docker daemon.

I have a remote driver written in Python here:
https://github.com/shettyg/ovn-docker/blob/master/ovn-docker-driver

That driver writes the line "tcp://0.0.0.0:5000" in the file
"/etc/docker/plugins/openvswitch.spec"

Now, when docker daemon starts, it will send the equivalent of:
curl -i -H 'Content-Type: application/json' -X POST
http://localhost:5000/Plugin.Activate

And my driver is suppose to respond with:

{
"Implements": ["NetworkDriver"]
}

1. User creates a network:
docker network create -d openvswitch foo

This command from the user results in my server receiving the
equivalent of the following POST request:

curl -i -H 'Content-Type: application/json' -X POST -d
'{"NetworkID":"UUID","Options":{"blah":"bar"}}'
http://localhost:5000/NetworkDriver.CreateNetwork

Now things get interesting. The above POST request gets repeated on
every host that belongs to the docker cluster.
So your driver should figure out what is a duplicate request.


2. User creates a service
docker service publish my-service.foo

The above command will call the equivalent of:

curl -i -H 'Content-Type: application/json' -X POST -d
'{"NetworkID":"UUID","EndpointID":"dummy-endpoint","Interfaces":[],"Options":{}}'
http://localhost:5000/NetworkDriver.CreateEndpoint

Again the same command gets called in every host.

I hope that answers your question?
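The walkthrough above can be condensed into a minimal, hypothetical remote driver using only the standard library. The endpoint paths and the Activate payload follow libnetwork's remote.md as described; the port, the dedup policy, and everything else are illustrative:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Networks already seen on this host. CreateNetwork is replayed on
# every host in the cluster, so the driver must tolerate duplicates.
seen_networks: set[str] = set()

class DriverHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/Plugin.Activate":
            resp = {"Implements": ["NetworkDriver"]}
        elif self.path == "/NetworkDriver.CreateNetwork":
            if body["NetworkID"] not in seen_networks:
                seen_networks.add(body["NetworkID"])
                # ...provision the network exactly once per host...
            resp = {}
        else:
            resp = {"Err": "unhandled " + self.path}
        payload = json.dumps(resp).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the sketch quiet
        pass

def serve(port: int = 5000) -> HTTPServer:
    # Docker discovers the driver via a spec file, e.g.
    # /etc/docker/plugins/openvswitch.spec containing "tcp://0.0.0.0:5000";
    # the caller then runs serve(...).serve_forever().
    return HTTPServer(("0.0.0.0", port), DriverHandler)
```

A real driver would also implement the remaining NetworkDriver endpoints (CreateEndpoint, Join, Leave, etc.).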

Tim Hockin

unread,
Sep 2, 2015, 12:35:53 AM9/2/15
to Gurucharan Shetty, Prashanth B, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
Prashanth's point, I think, was that we could have Kubelet act as a
network plugin, lookup/store network and endpoint config in our api
server, and exec CNI plugins for join.

This falls down a bit for a few reasons. First, libkv assumes an
arbitrary KV store, which our APIserver is not. Second, the fact that
Network objects can be created through Docker or kubernetes is not
cool. Third, if we only allow network objects through kubernetes we
can't see the name of the object Docker thinks it is creating.

Prashanth B

unread,
Sep 2, 2015, 1:22:23 AM9/2/15
to Tim Hockin, Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
I hope that answers your question?

Thanks for the example. So what I'm proposing is a networking model with the following limitations for the short term:
1. Only one (docker) network object, this is the kubernetes network. All endpoints must join it. 
2. Containers can only join endpoints on the same host.
3. A join execs CNI plugin with json composed from the join and endpoint create, derived from storage (physical memory, sqlite, apiserver -- as long as it's not another database it remains an implementation detail).

First, libkv assumes an arbitrary KV store, which our APIserver is not.

Doesn't look like libkv is a requirement for remote plugins. If we start docker with a plugin but without a kv store, the json will get posted to the localhost http server, but not propagated to the other hosts (untested, this from staring at code). This is ok, because there is only 1 network and no cross-host endpoint joining. If we really need cross-host consistency, we have an escape hatch via apiserver watch.

> Third, if we only allow network objects through kubernetes we can't see the name of the object Docker thinks it is creating.

We don't even have to allow this. The cluster is bootstrapped with a network object. It's readonly thereafter. Create network will noop after that. 

This would give users the ability to dump their own docker plugins into /etc/docker/plugins, start the kubelet with --manage-networking=false, and use docker's remote plugin API. At the same time, CNI should work with manage-networking=true.




Tim Hockin

unread,
Sep 2, 2015, 2:23:15 AM9/2/15
to Prashanth B, Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
We will eventually want to add something akin to multiple Networks, so
I want to be dead sure that it is viable before we choose and
implement a model.

Gurucharan Shetty

unread,
Sep 2, 2015, 10:36:08 AM9/2/15
to Prashanth B, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Tue, Sep 1, 2015 at 10:22 PM, Prashanth B <be...@google.com> wrote:
>> I hope that answers your question?
>
> Thanks for the example. So what I'm proposing is a networking model with the
> following limitations for the short term:
> 1. Only one (docker) network object, this is the kubernetes network. All
> endpoints must join it.

IMO, the "one" network object theoretically fits into the current k8
model wherein all pods can communicate with each other over L3. But
let me bring up a couple of points that provide a counter-view.

My understanding of the implementation of Docker's built-in overlay
solution is that a "network" is a broadcast domain. So if you impose
the same meaning on k8 networking, you actually end up with a
humongous broadcast domain across multiple hosts, and it won't scale.

So one could argue that the current k8 model is that each minion is
one network and all networks are connected to each other via a router.

> 2. Containers can only join endpoints on the same host.
> 3. A join execs CNI plugin with json composed from the join and endpoint
> create, derived from storage (physical memory, sqlite, apiserver -- as long
> as it's not another database it remains an implementation detail).
>
>> First, libkv assumes an arbitrary KV store, which our APIserver is not.
>
> Doesn't look like libkv is a requirement for remote plugins.
> If we start
> docker with a plugin but without a kv store, the json will get posted to the
> localhost http server, but not propogated to the other hosts (untested, this
> from staring at code). This is ok, because there is only 1 network and no
> cross host endpoint joining. If we really need cross host consistency, we
> have an escape hatch via apiserver watch.

You have to start the Docker daemon with libkv for libnetwork to work
(at least based on my observation). It does not matter whether it is a
remote driver or the native overlay solution. I will be happy if it
turns out that my assertion is wrong.

Prashanth B

unread,
Sep 2, 2015, 11:41:22 AM9/2/15
to Gurucharan Shetty, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Wed, Sep 2, 2015 at 7:36 AM, Gurucharan Shetty <she...@nicira.com> wrote:
On Tue, Sep 1, 2015 at 10:22 PM, Prashanth B <be...@google.com> wrote:
>> I hope that answers your question?
>
> Thanks for the example. So what I'm proposing is a networking model with the
> following limitations for the short term:
> 1. Only one (docker) network object, this is the kubernetes network. All
> endpoints must join it.

IMO, the "one" network object theoretically fits into the current k8
model wherein all pods can communicate with each other over L3. But
let me bring up a couple of points that provide a counter-view.

My understanding of the implementation of Docker's built-in overlay
solution is that a "network" is a broadcast domain. So if you impose
the same meaning on k8 networking, you actually end up with a
humongous broadcast domain across multiple hosts, and it won't scale.

So one could argue that the current k8 model is that each minion is
one network and all networks are connected to each other via a router.

 
I think this is do-able with the same limitations mentioned in Tim's straw man.

> 2. Containers can only join endpoints on the same host.
> 3. A join execs CNI plugin with json composed from the join and endpoint
> create, derived from storage (physical memory, sqlite, apiserver -- as long
> as it's not another database it remains an implementation detail).
>
>> First, libkv assumes an arbitrary KV store, which our APIserver is not.
>
> Doesn't look like libkv is a requirement for remote plugins.
> If we start
> docker with a plugin but without a kv store, the json will get posted to the
> localhost http server, but not propogated to the other hosts (untested, this
> from staring at code). This is ok, because there is only 1 network and no
> cross host endpoint joining. If we really need cross host consistency, we
> have an escape hatch via apiserver watch.

You have to start Docker daemon with libkv for libnetwork to work
(at least based on my observation).

You're probably right, since you've written a driver :)
I plan to dig a little and file a docker issue to get their opinions. This is probably a deal breaker. We have a CP store; it just isn't a KV store, because it has a well-defined schema enforced by the apiserver.

dc...@redhat.com

Sep 2, 2015, 4:00:37 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com
On Tuesday, September 1, 2015 at 7:46:25 PM UTC-5, Tim Hockin wrote:
On Tue, Sep 1, 2015 at 4:30 PM, Gurucharan Shetty <she...@nicira.com> wrote:

> Since you have already associated your policies (firewall, QoS etc)
> with the endpoint, you can destroy the VM that the endpoint is
> connected to and then create a new VM at a different hypervisor and
> attach the old endpoint (with its old policies) to the new VM.
>
> My reading of what libnetwork achieves with containers is the same as
> above. i.e., you can create a network endpoint with policies applied
> and then attach it to any container on any host.

Write a "cni-exec" libnetwork driver.

I started doing that a month ago.  It has some fundamental problems, some of which you've outlined and others which I'll talk about below.
 
https://github.com/dcbw/cni-docker-plugin

You can not create new networks using it.  When a CreateNetwork() call
is received we check for a static config file on disk.
E.g. CreateNetwork(name = "foobar") looks for
/etc/cni/networks/foobar.json, and if it does not exist or does not
match, fail.

With libnetwork you create the network definitions with the libnetwork API.  If you have a KV store backing libnetwork then it's persistent, but if not then all the network definitions go away when docker exits.  So my thought was that k8s would create the networks itself, but as others have mentioned it's a problem that networks can be created with 'docker network add' too.
 
The first fundamental mismatch between libnetwork/CNM and CNI is that CNM is much more granular than CNI, and it wants more information at each step that CNI isn't willing to give back until the end.

The second fundamental mismatch is that libnetwork/CNM does more than CNI does; it handles moving the interfaces into the right NetNS, setting up routes, and setting the IP address on the interfaces.  The plugin's job is simply to create the interface and allocate the addresses, and pass all that back to libnetwork.  CNI plugins currently expect to handle all this themselves.

These two issues ensure that there cannot be a direct mapping between CNM and CNI right now due to how they handle interface configuration.

Third, remote plugins are called in a blocking manner so they cannot call back into docker's API to retrieve any extra information they might need (eg, network name).

PROBLEM: it looks like the CreateNetwork() call can not see the name
of the network.  Let's assume that could be fixed.

In my implementation I just cached the network ID and started a network watch to grab the name, and all the actual CNM work was done in Join().
 
CreateEndpoint() does just enough work to satisfy the API, and save
all of its state in memory.

PROBLEM: If docker goes down, how does this state get restored?

You're expected to build libnetwork with a KV store if you want persistence.
 
endpoint.Join() takes the saved info from CreateEndpoint(), massages
it into CNI-compatible data, and calls the CNI plugin.

Yeah, and at this point in my attempt we have the network name so we can pass that on.

BUT the deal-breaker is that the CNI plugin will expect to move the interface into the right NetNS itself, configure the interface's IP address itself, and more.  CNM doesn't allow that.  CNM also doesn't expose the NetNS FD to the plugins in any way (though in-process plugins might be able to find it), so the CNI plugin has no idea what network namespace to move the interface into.  That's where I stopped with cni-docker-plugin because it just wasn't possible without some changes to CNI or CNM itself.

Someone shoot this down?  It's not general purpose in the sense that
docker's network CLI can't be used, but would it be good enough to
enable people to use the same CNI plugins across docker and rkt?

Unfortunately I can't see a way to make existing CNI plugins work with libnetwork/CNM right now due to the fundamental difference in their granularity and handling of IP addressing and network namespace management...

dc...@redhat.com

Sep 2, 2015, 4:20:45 PM
to kubernetes-sig-network, tho...@google.com, she...@nicira.com, paul.t...@metaswitch.com, eugene.y...@coreos.com
On Wednesday, September 2, 2015 at 12:22:23 AM UTC-5, Prashanth B wrote:
I hope that answers your question?

Thanks for the example. So what I'm proposing is a networking model with the following limitations for the short term:
1. Only one (docker) network object, this is the kubernetes network. All endpoints must join it. 
2. Containers can only join endpoints on the same host.
3. A join execs CNI plugin with json composed from the join and endpoint create, derived from storage (physical memory, sqlite, apiserver -- as long as it's not another database it remains an implementation detail)
First, libkv assumes an arbitrary KV store, which our APIserver is not.

Doesn't look like libkv is a requirement for remote plugins. If we start docker with a plugin but without a kv store, the json will get posted to the localhost http server, but not propagated to the other hosts (untested, this from staring at code). This is ok, because there is only 1 network and no cross host endpoint joining. If we really need cross host consistency, we have an escape hatch via apiserver watch.

It's not a requirement.  It's only a requirement for libnetwork if you want persistence of the libnetwork-defined networks.  Your plugin can do whatever it wants, but fundamentally you'll be operating with an externally defined network name and possibly configuration too (eg, 'docker network add').
 
> Third, if we only allow network objects through kubernetes we can't see the name of the object Docker thinks it is creating.

We don't even have to allow this. The cluster is bootstrapped with a network object. It's readonly thereafter. Create network will noop after that. 

We're already using multi-network functionality for our isolation features in OpenShift, and we'd like to make sure that keeps working.

So if I understand your model correctly, there would be one defined "fake" network in docker/libnetwork that was the "kubernetes" network that any docker-managed container that wanted to interoperate with k8s would need to join.  All the actual network intelligence would be in k8s and k8s would pass that information to the *actual* CNI plugin outside of the docker/libnetwork paths.  So essentially:

1) something creates a new "network" object in k8s
2) k8s wants to start a pod in that network so it tells docker to start the pause container in the 'kubernetes' (docker) network
3) docker then calls the 'kubernetes' CNM plugin, which happens to be Kube
4) Kube looks up whatever info it needs (the actual network name, permissions, whatever) and then executes a CNI plugin that handles all that

Is that more or less correct?
 
This would give users the ability to dump their own docker plugins into /etc/docker/plugins, start the kubelet with --manage-networking=false, and use docker's remote plugin API. At the same time CNI should work with manage-networking=true.

What would happen in the --manage-networking=false case?

dc...@redhat.com

Sep 2, 2015, 4:21:33 PM
to kubernetes-sig-network, be...@google.com, she...@nicira.com, paul.t...@metaswitch.com, eugene.y...@coreos.com
On Wednesday, September 2, 2015 at 1:23:15 AM UTC-5, Tim Hockin wrote:
We will eventually want to add something akin to multiple Networks, so
I want to be dead sure that it is viable before we choose and
implement a model.

We're already doing this with OpenShift and we'd like to ensure it keeps working with whatever we come up with here.  We'll help make that happen.
 

Rajat Chopra

Sep 2, 2015, 5:23:15 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com


BUT the deal-breaker is that the CNI plugin will expect to move the interface into the right NetNS itself, configure the interface's IP address itself, and more.  CNM doesn't allow that.  CNM also doesn't expose the NetNS FD to the plugins in any way (though in-process plugins might be able to find it), so the CNI plugin has no idea what network namespace to move the interface into.  That's where I stopped with cni-docker-plugin because it just wasn't possible without some changes to CNI or CNM itself.

To this point and the one below, the only alternative I see is for the CNI plugin to know whether it is being called for a CNM interface or not. If yes, then the plugin should not move the interface into the net namespace.
For the independent case, no problem. And this becomes the responsibility of the glue driver (plant an ENV var called CNM=true?).

 

Someone shoot this down?  It's not general purpose in the sense that
docker's network CLI can't be used, but would it be good enough to
enable people to use the same CNI plugins across docker and rkt?

Unfortunately I can't see a way to make existing CNI plugins work with libnetwork/CNM right now due to the fundamental difference in their granularity and handling of IP addressing and network namespace management...


For handling the granularity, how about splitting CNI's IPAM and ADD from the glue driver? IPAM is already separately definable in CNI; it's just not called separately from ADD. And we make the IPAM understand whether it is called directly by the glue code or through the ADD command (and switch behaviour accordingly).
 


dc...@redhat.com

Sep 2, 2015, 5:31:32 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

Yeah, if we can make some small changes to CNI to accommodate "doing less" that would work for us, at least.  But I'm not sure how well it would work for the others in this discussion, e.g. Calico & Weave & etc?  Would be good to get their input.

dc...@redhat.com

Sep 2, 2015, 5:40:05 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

Also, just to be clear, when creating the cni-docker-plugin I posted above, I was attempting to work with existing CNI plugins, and assumed that I could not change CNM or CNI at all.  If we can change them (and I think we probably can) then we may be able to get docker to run CNI plugins via the CNM API.

Tim Hockin

Sep 2, 2015, 5:45:34 PM
to dc...@redhat.com, Madhu Venugopal, kubernetes-sig-network, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
On Wed, Sep 2, 2015 at 1:00 PM, <dc...@redhat.com> wrote:
> On Tuesday, September 1, 2015 at 7:46:25 PM UTC-5, Tim Hockin wrote:
>
>> Write a "cni-exec" libnetwork driver.
>
>
> I started doing that a month ago. It has some fundamental problems, some of
> which you've outlined and others while I'll talk about below.
>
> https://github.com/dcbw/cni-docker-plugin
>
>> You can not create new networks using it. When a CreateNetwork() call
>> is received we check for a static config file on disk.
>> E.g.CreateNetwork(name = "foobar") looks for
>> /etc/cni/networks/foobar.json, and if it does not exist or does not
>> match, fail.
>
>
> With libnetwork you create the network definitions with the libnetwork API.
> If you have a KV store backing libnetwork then it's persistent, but if not
> then all the network definitions go away when docker exits. So my thought
> was that k8s would create the networks itself, but as others have mentioned
> it's a problem that networks can be created with 'docker network add' too.

I think we have two possible approaches - one is to build a truly
generic CNM-CNI bridge, the other is to build a kubernetes-centric CNI
bridge. Solving the former sounds great, but I'll settle for the
latter if I have to. I want to explore the generic approach first.

To build a truly generic CNM-CNI bridge, let's assume a working libkv
plugin. In the bridge driver, catch the "create network" call and
write a CNI-like config file. Assuming libkv works, every node will
see that operation and create the same file in their local
filesystems. Right?

> The first fundamental mismatch between libnetwork/CNM and CNI is that CNM is
> much more granular than CNI, and it wants more information at each step that
> CNI isn't willing to give back until the end.

What about actually doing the CNI "add" operation on CNM's "create
endpoint"? Is there a guarantee that "create endpoint" runs on only
one node? If not, this seems hard to surmount.

> The second fundamental mismatch is that libnetwork/CNM does more than CNI
> does; it handles moving the interfaces into the right NetNS, setting up
> routes, and setting the IP address on the interfaces. The plugin's job is
> simply to create the interface and allocate the addresses, and pass all that
> back to libnetwork. CNI plugins currently expect to handle all this
> themselves.

Hack: move into a tmp namespace in CNI plugins, and then move it out
in the bridge.

> Third, remote plugins are called in a blocking manner so they cannot call
> back into docker's API to retrieve any extra information they might need
> (eg, network name).

Do we understand WHY we can't have the network name?

>> PROBLEM: it looks like the CreateNetwork() call can not see the name
>> of the network. Let's assume that could be fixed.
>
> In my implementation I just cached the network ID and started a network
> watch to grab the name, and all the actual CNM work was done in Join().

Network watch of the libkv backend? Or something else?

>> CreateEndpoint() does just enough work to satisfy the API, and save
>> all of its state in memory.
>>
>> PROBLEM: If docker goes down, how does this state get restored?
>
>
> You're expected to build libnetwork with a KV store if you want persistence.

Isn't there a berkeley DB implementation of libkv that can be used for
local-only persistence?

> BUT the deal-breaker is that the CNI plugin will expect to move the
> interface into the right NetNS itself, configure the interface's IP address
> itself, and more. CNM doens't allow that. CNM also doesn't expose the
> NetNS FD to the plugins in anyway (though in-process plugins might be able
> to find it), so the CNI plugin has no idea what network namespace to move
> the interface into. That's where I stopped with cni-docker-plugin because
> it just wasn't possible without some changes to CNI or CNM itself.
>
>> Someone shoot this down? It's not general purpose in the sense that
>> docker's network CLI can't be used, but would it be good enough to
>> enable people to use the same CNI plugins across docker and rkt?
>
>
> Unfortunately I can't see a way to make existing CNI plugins work with
> libnetwork/CNM right now due to the fundamental difference in their
> granularity and handling of IP addressing and network namespace
> management...

I feel like IF we could make it work it might be worth some changes in CNI.


Now I want to think through the kubernetes-centric (less generic)
approach. The key here being that anything that affects kubernetes
needs to flow down from kubernetes, not up from docker. Just as you
can't 'docker run' and have a pod appear in kubernetes, you can't
'docker network create' and have a network appear in kubernetes.

Starting assumptions: There is one "global" network for kubernetes
today (let's call it "kubernetes"), but there will eventually be more
granular control (more networks). We will not use libkv.

Assume a docker remote plugin which is a CNM-to-k8s+CNI bridge. This
is the default docker network driver. Any calls to "create network"
for anything other than "kubernetes" will fail. When kubelet starts
up it will 'docker network create' to make sure the "kubernetes"
network exists.

When a pod is created kubelet will tell it to join the "kubernetes"
network. libnetwork will call "create endpoint". The bridge will
exec the CNI plugin and then move the resulting interface back out of
the namespace (as CNM demands). libnetwork will call "join", which
should just be a no op (I think?) for the bridge.

Now, what happens if a user wants to use the docker-included overlay
network? When kubelet starts up it will 'docker network create' to
make sure the "kubernetes" network exists. All of the existing
libnetwork stuff should work. Unfortunately this requires a libkv
implementation, which we can't satisfy easily with kubernetes, so
users have to run YET ANOTHER cluster system. Pukey, but it works for
people who want it. Maybe we could add a libkv implementation in
terms of a kubernetes config object?

This seems to add up to:
1) use libnetwork and libnetwork drivers when running docker
2) offer a slightly hacky bridge from libnetwork to CNI drivers (is it
worth the cost?)

None of this addresses the other issues in libnetwork - not wrappable,
IPAM is too baked-in, not available today, no access to network Name
field, complex model, etc. I keep hearing from people who tried to
implement libnetwork drivers that it's sort of a bad experience, and
docker doesn't seem keen to make it better (hearsay).

Are those issues really painful enough to warrant NOT using Docker's
network plugins?

The big alternative is to say "forget it", and just run all our pods
with --net=none to docker, and use CNI ourselves to set up networking.
This means (as discussed) 'docker run' can never join the kubernetes
network and that we don't take advantage of vendors who implement
docker plugins (could we bridge it the other way? A CNI binary that
drives docker remote plugins :)


I feel like a prototype is warranted, and then maybe a get-together?


I'm adding Madhu on this email.

eugene.y...@coreos.com

Sep 2, 2015, 6:11:47 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Wednesday, September 2, 2015 at 2:23:15 PM UTC-7, Rajat Chopra wrote:



For handling the granularity, how about splitting CNI's IPAM and ADD from the glue driver? IPAM is already separately definable in CNI; it's just not called separately from ADD. And we make the IPAM understand whether it is called directly by the glue code or through the ADD command (and switch behaviour accordingly).

The reason the IPAM plugin is invoked by the top-level plugin is to make something like DHCP work. You first need to create your interface (e.g. macvlan) and then use it to get the IP. Finally you need to apply the values from DHCP to the interface and things like the routing table. Therefore the IPAM plugin invocation needs to be "sandwiched" in the top level.

Actually with the DHCP example, you could call IPAM plugin after the top-level one. So suppose we have a "macvlan" plugin and IPAM plugins as "peers". We could call "macvlan" first to have it create the interface. We then call "dhcp" which uses the newly created interface to get the IP/gw/routes and applies it (or maybe returns it for libcni to apply).

But what about a solution that uses host routing? There we first need to call out to IPAM to get the IP allocated and then add a route to the host. It was for this reason that IPAM is invoked by the top-level plugin at the appropriate time.
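For context, this is the delegation being described, in CNI config terms: the top-level plugin is named by "type", and it invokes the plugin named under "ipam" at whatever point makes sense for it. The values here are purely illustrative:

```json
{
    "name": "kubernetes",
    "type": "bridge",
    "ipam": {
        "type": "host-local",
        "subnet": "10.244.1.0/24"
    }
}
```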

-Eugene

eugene.y...@coreos.com

Sep 2, 2015, 6:18:08 PM
to kubernetes-sig-network, dc...@redhat.com, ma...@docker.com, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Wednesday, September 2, 2015 at 2:45:34 PM UTC-7, Tim Hockin wrote:
The big alternative is to say "forget it", and just run all our pods
with --net=none to docker, and use CNI ourselves to set up networking.
This means (as discussed) 'docker run' can never join the kubernetes
network and that we don't take advantage of vendors who implement
docker plugins (could we bridge it the other way? A CNI binary that
drives docker remote plugins :)

CNI binary that drives docker remote plugins should be easier as CNI
is more coarse. Not sure how much value is in that.
 

I feel like a prototype is warranted, and then maybe a get-together?

I am going to hack on the prototype that uses a tmp namespace and calls
out to CNI plugin in CreateEndpoint. It then moves things out into root ns
and Join becomes a no-op. I realize that CreateEndpoint is not supposed to
be doing things like veth creation but even libnetwork's bridge plugin works this way. 

dc...@redhat.com

Sep 2, 2015, 6:22:20 PM
to kubernetes-sig-network, dc...@redhat.com, ma...@docker.com, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

Be careful with "bridge" driver :)  I thought you were talking about the docker/libnetwork "bridge" driver for a second.
 
But anyway, I don't think we need to write anything out at all?  What would that file be used for?

> The first fundamental mismatch between libnetwork/CNM and CNI is that CNM is
> much more granular than CNI, and it wants more information at each step that
> CNI isn't willing to give back until the end.

What about actually doing the CNI "add" operation on CNM's "create
endpoint"?  Is there a guarantee that "create endpoint" runs on only
one node?  If not, this seems hard to surmount.

Looking into the code, it appears that if your KV store is distributed, then *yes*, all the objects (networks, endpoints, etc.) will be replicated across all nodes.  The Controller calls store.go::initDataStore(), which reads existing networks and endpoints and appears to call into the libnetwork drivers for those operations too.  But maybe that's not the intended model?  I've only used local data stores, so I have no clue what's supposed to happen when you're using the same distributed data store with docker/libnetwork.

Having all nodes replicate all networks/endpoints/etc. seems pretty weird, since not every node would care about every network and you don't want to allocate resources until the node knows that it's going to actually run a container connected to that endpoint.
 
> The second fundamental mismatch is that libnetwork/CNM does more than CNI
> does; it handles moving the interfaces into the right NetNS, setting up
> routes, and setting the IP address on the interfaces.  The plugin's job is
> simply to create the interface and allocate the addresses, and pass all that
> back to libnetwork.  CNI plugins currently expect to handle all this
> themselves.

Hack: move into a tmp namespace in CNI plugins, and then move it out
in the bridge.

I don't think that'll work, because there's no way for libnetwork to know about the temp namespace (it only operates in the global one), and so when we pass the interface name back that interface is now invisible to libnetwork.
 
> Third, remote plugins are called in a blocking manner so they cannot call
> back into docker's API to retrieve any extra information they might need
> (eg, network name).

Do we understand WHY we can't have the network name?

No; the answer may well be "because none of the existing libnetwork plugins need it", and that upstream libnetwork would be happy to take a patch adding it.  If we proceed with this, we should do that patch.
 
>> PROBLEM: it looks like the CreateNetwork() call can not see the name
>> of the network.  Let's assume that could be fixed.
>
> In my implementation I just cached the network ID and started a network
> watch to grab the name, and all the actual CNM work was done in Join().

Network watch of the libkv backend?  Or something else?

It watches the docker API event stream actually, so it works even if you don't use a KV store with libnetwork.  docker simply godeps libnetwork and wraps the libnetwork API in the docker API, so when we talk about libnetwork I think we all mean "libnetwork as wrapped by docker"?

Dan
 

Tim Hockin

Sep 2, 2015, 6:44:24 PM
to Madhu Venugopal, dc...@redhat.com, kubernetes-sig-network, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady, Jana Radhakrishnan
Madhu,

Thanks for the clue on scope. It looks like all remote drivers are
assumed to be global.

https://github.com/docker/libnetwork/blob/master/drivers/remote/driver.go#L32



On Wed, Sep 2, 2015 at 3:28 PM, Madhu Venugopal <ma...@docker.com> wrote:
> Copying Jana as well.
> Will read & reply to this thread later today.
>
> Just a quick clarification on some misunderstanding on libkv usage.
> libkv supports local persistence (using boltdb) and libnetwork makes use of
> it for local persistence (https://github.com/docker/libnetwork/pull/466).
>
> And the questions about global vs local libnetwork events are purely a
> matter of scope of the driver.
> If the driver scope is global, endpoint & network create calls are global.
> But Join is local.
> But if driver is scoped local, then all the calls are local.
>
> Thanks,
> -Madhu

Rajat Chopra

Sep 2, 2015, 6:47:48 PM
to kubernetes-sig-network, dc...@redhat.com, ma...@docker.com, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com

Join can be a no-op in actual terms but it should still return the interface names, or you will hit this:
https://github.com/docker/libnetwork/blob/master/drivers/remote/driver.go#L176


 

Madhu Venugopal

Sep 2, 2015, 6:48:43 PM
to Tim Hockin, dc...@redhat.com, kubernetes-sig-network, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady, Jana Radhakrishnan
On Sep 2, 2015, at 3:44 PM, Tim Hockin <tho...@google.com> wrote:

Madhu,

Thanks for the clue on scope.  It looks like all remote drivers are
assumed to be global.

https://github.com/docker/libnetwork/blob/master/drivers/remote/driver.go#L32

That must be fixed & we just had a discussion on the topic in #docker-network channel.
The remote network driver capability exchange must provide this detail during the 
RegisterDriver phase. Please open a PR and this can be quickly fixed.

-Madhu

Jana Radhakrishnan

Sep 2, 2015, 10:24:19 PM
to kubernetes-sig-network, dc...@redhat.com, ma...@docker.com, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com


None of this addresses the other issues in libnetwork - not wrappable,
IPAM is too baked-in, not available today, no access to network Name
field, complex model, etc.  I keep hearing from people who tried to
implement libnetwork drivers that it's sort of a bad experience, and
docker doesn't seem keen to make it better (hearsay).

I'll just provide some answers for the perceived libnetwork problems:

* It should be fairly easy to wrap a libnetwork plugin with another plugin.
* IPAM is coming out before we release. Please feel free to comment on the proposal: https://github.com/docker/libnetwork/issues/489
* Going to be available in stable release in 1.9
* Network names should not be that relevant to drivers if their only responsibility is to plumb low level stuff
* I am not too sure about complexity of the model because the model consists of just Networks and Endpoints :-)
* Implementing a libnetwork driver is all about just implementing 6 APIs, some of which can be very minimal or no-ops

On top of that there is a general perception that you need libkv for libnetwork to operate. But this is not true if the driver is available only in local scope.

-Jana
 


Tim Hockin

Sep 2, 2015, 11:52:33 PM
to Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Madhu Venugopal, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
On Wed, Sep 2, 2015 at 7:24 PM, Jana Radhakrishnan
<jana.radh...@docker.com> wrote:
>>
>>
>> None of this addresses the other issues in libnetwork - not wrappable,
>> IPAM is too baked-in, not available today, no access to network Name
>> field, complex model, etc. I keep hearing from people who tried to
>> implement libnetwork drivers that it's sort of a bad experience, and
>> docker doesn't seem keen to make it better (hearsay).
>
>
> I'll just provide some answers for the perceived libnetwork problems:
>
> * It should be fairly easy to wrap a libnetwork plugin with another plugin.

How? In CNI it's a shell script. How do I wrap a daemon?

> * IPAM is coming out before we release. Please feel free to comment on the
> proposal: https://github.com/docker/libnetwork/issues/489

Will do

> * Going to be available in stable release in 1.9

I'm anxious to see what happens with the separation of Services and
Networks. I think that conflation is part of what makes the
libnetwork model very complicated

> * Network names should not be that relevant to drivers if their only
> responsibility is to plumb low level stuff

I know you guys keep saying that, but lots of people implementing
drivers claim to need it, and now I see exactly why.

> * I am not too sure about complexity of the model because the model consists
> of just Networks and Endpoints :-)

And sandboxes. And KV stores, but optional. And IPAM. And global vs
local. And "creating endpoints" that get broadcasted across the
network. I'm sorry, the concept count on libnetwork is really high
and not at all obvious. Guru explained it up-thread in a way that was
pretty clear, but it was pretty clearly overkill.

> * Implementing a libnetwork driver is all about just implementing 6 Apis
> some of them can be very minimal or no-op
>
> On top of that there is a general perception that you need libkv for
> libnetwork to operate. But this is not true if the driver is available only
> in local scope.

...once that bug is fixed.  Do local-scope drivers have persistence?
If I create a local driver, create a Network, attach a container to
that Network, and then bounce docker daemon - do my networks come
back?

Jana Radhakrishnan

Sep 3, 2015, 12:18:01 AM
to Tim Hockin, Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Madhu Venugopal, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
On Sep 2, 2015, at 8:52 PM, Tim Hockin <tho...@google.com> wrote:

On Wed, Sep 2, 2015 at 7:24 PM, Jana Radhakrishnan
<jana.radh...@docker.com> wrote:


None of this addresses the other issues in libnetwork - not wrappable,
IPAM is too baked-in, not available today, no access to network Name
field, complex model, etc.  I keep hearing from people who tried to
implement libnetwork drivers that it's sort of a bad experience, and
docker doesn't seem keen to make it better (hearsay).


I'll just provide some answers for the perceived libnetwork problems:

* It should be fairly easy to wrap a libnetwork plugin with another plugin.

How?  In CNI it's a shell script.  How do I wrap a daemon?

If the goal is to add additional functionality to an already existing plugin you can wrap that daemon with your custom daemon and chain the calls and implement additional functionality. So the virtue of “wrappability" is not mutually exclusive with a plugin being a daemon. If you don’t want a long running process the problem is easily solved by adding an “exec” driver in libnetwork very similar to “remote” driver so that it can invoke plugins by execing them with the api as a JSON encoded argument.


* IPAM is coming out before we release. Please feel free to comment on the
proposal: https://github.com/docker/libnetwork/issues/489

Will do

* Going to be available in stable release in 1.9

I'm anxious to see what happens with the separation of Services and
Networks.  I think that conflation is part of what makes the
libnetwork model very complicated

This conflation is going away pretty soon.


* Network names should not be that relevant to drivers if their only
responsibility is to plumb low level stuff

I know you guys keep saying that, but lots of people implementing
drivers claim to need it, and now I see exactly why.

Why?


* I am not too sure about complexity of the model because the model consists
of just Networks and Endpoints :-)

And sandboxes.  And KV stores, but optional.  And IPAM.  And global vs
local.  And "creating endpoints" that get broadcasted across the
network.  I'm sorry, the concept count on libnetwork is really high
and not at all obvious.  Guru explained it up-thread in a way that was
pretty clear, but it was pretty clearly overkill.

I thought IPAM being a separate thing is a good thing. And sandboxes are not something that driver developers need to worry about in general. Some of the concepts may be overkill for k8s, because k8s itself provides such abstractions. But if you look at docker itself as an independent product, then it's not so much overkill.


* Implementing a libnetwork driver is all about just implementing 6 Apis
some of them can be very minimal or no-op

On top of that there is a general perception that you need libkv for
libnetwork to operate. But this is not true if the driver is available only
in local scope.

...once that bug is fixed. Do local-scope drivers have persistence?
If I create a local driver, create a Network, attach a container to
that Network, and then bounce the docker daemon - do my networks come
back?

Yes once this PR https://github.com/docker/libnetwork/pull/466 is merged.

-Jana

Tim Hockin

Sep 3, 2015, 12:32:17 AM
to Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Madhu Venugopal, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
On Wed, Sep 2, 2015 at 9:17 PM, Jana Radhakrishnan
<jana.radh...@docker.com> wrote:

> * It should be fairly easy to wrap a libnetwork plugin with another plugin.
>
>
> How? In CNI it's a shell script. How do I wrap a daemon?
>
>
> If the goal is to add additional functionality to an already existing plugin
> you can wrap that daemon with your custom daemon and chain the calls and
> implement additional functionality. So the virtue of “wrappability" is not
> mutually exclusive with a plugin being a daemon. If you don’t want a long
> running process the problem is easily solved by adding an “exec” driver in
> libnetwork very similar to “remote” driver so that it can invoke plugins by
> execing them with the api as a JSON encoded argument.

Exec would be a big step forward

> * IPAM is coming out before we release. Please feel free to comment on the
> proposal: https://github.com/docker/libnetwork/issues/489

There's not much detail there to comment on. An example of something
like DHCP would be interesting
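To make the request concrete, here is a toy pool-based allocator with the flavor of what an external IPAM backend would implement. The method names are illustrative, not the API from the libnetwork proposal; a real DHCP-style backend would add lease lifetimes and renewal on top of this.

```python
import ipaddress

class PoolIPAM:
    """Toy IPAM backend: hand out addresses from a pool and reclaim
    them on release, DHCP-lease style. Names are hypothetical."""
    def __init__(self, cidr):
        net = ipaddress.ip_network(cidr)
        self.free = list(net.hosts())  # usable addresses, in order
        self.leased = {}               # endpoint id -> address

    def request_address(self, endpoint_id):
        if endpoint_id in self.leased:
            return self.leased[endpoint_id]  # idempotent re-request
        addr = self.free.pop(0)
        self.leased[endpoint_id] = addr
        return addr

    def release_address(self, endpoint_id):
        addr = self.leased.pop(endpoint_id)
        self.free.append(addr)  # address becomes reusable

ipam = PoolIPAM("10.1.0.0/29")
a = ipam.request_address("ep-1")   # 10.1.0.1
b = ipam.request_address("ep-2")   # 10.1.0.2
ipam.release_address("ep-1")
c = ipam.request_address("ep-3")   # 10.1.0.3 (next free, not the released one)
```

The interesting design questions for a real driver are exactly the ones a sketch like this dodges: where the lease state lives across daemon restarts, and who answers when the pool spans multiple hosts.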

> I'm anxious to see what happens with the separation of Services and
> Networks. I think that conflation is part of what makes the
> libnetwork model very complicated
>
>
> This conflation is going away pretty soon.

That's great! Does that mean the CNM will get simpler?

> * Network names should not be that relevant to drivers if their only
> responsibility is to plumb low level stuff
>
> I know you guys keep saying that, but lots of people implementing
> drivers claim to need it, and now I see exactly why.
>
> Why?

I want to orchestrate networks in Kubernetes. Kubernetes will manage
the networks in our API server. If I have a driver plugin that
integrates with Kubernetes, I need to know the network name for a
given endpoint, so I can look up info in our own API. I do not have
the random ID, especially if this acts as a local driver (each node
has a different ID).

Madhu Venugopal

Sep 3, 2015, 12:38:24 AM
to Tim Hockin, Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady

> On Sep 2, 2015, at 9:31 PM, Tim Hockin <tho...@google.com> wrote:
>
> On Wed, Sep 2, 2015 at 9:17 PM, Jana Radhakrishnan
> <jana.radh...@docker.com> wrote:
>
>> * It should be fairly easy to wrap a libnetwork plugin with another plugin.
>>
>>
>> How? In CNI it's a shell script. How do I wrap a daemon?
>>
>>
>> If the goal is to add additional functionality to an already existing plugin
>> you can wrap that daemon with your custom daemon and chain the calls and
>> implement additional functionality. So the virtue of “wrappability" is not
>> mutually exclusive with a plugin being a daemon. If you don’t want a long
>> running process the problem is easily solved by adding an “exec” driver in
>> libnetwork very similar to “remote” driver so that it can invoke plugins by
>> execing them with the api as a JSON encoded argument.
>
> Exec would be a big step forward

PRs welcome :)

>
>> * IPAM is coming out before we release. Please feel free to comment on the
>> proposal: https://github.com/docker/libnetwork/issues/489
>
> There's not much detail there to comment on. An example of something
> like DHCP will be interesting

Please comment on the proposal so that we can give an appropriate response.

>
>> I'm anxious to see what happens with the separation of Services and
>> Networks. I think that conflation is part of what makes the
>> libnetwork model very complicated
>>
>>
>> This conflation is going away pretty soon.
>
> That's great! does that mean the CNM will get simpler?

CNM doesn’t talk about services. It is made up of Networks, Endpoints, and Sandboxes,
and that stays exactly as intended. The services discussions that we had in the
past were purely UX-centric.

>
>> * Network names should not be that relevant to drivers if their only
>> responsibility is to plumb low level stuff
>>
>> I know you guys keep saying that, but lots of people implementing
>> drivers claim to need it, and now I see exactly why.
>>
>> Why?
>
> I want to orchestrate networks in Kubernetes. Kubernetes will manage
> the networks in our API server. If I have a driver plugin that
> integrates with Kubernetes, I need to know the network name for a
> given endpoint, so I can look up info in our own API. I do not have
> the random ID, especially if this acts as a local driver (each node
> has a different ID).

Why? The network-creation API returns the ID, which you can make use of.
For a locally scoped driver, a network name on one host is NOT the same as on another host
(just as docker0 on one host is not the same as docker0 on other hosts),
and hence k8s can actually stitch these network IDs together.
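One way to read "stitch these network-IDs together" is that the orchestrator, not the driver, keeps the mapping from a logical network name to the driver-assigned local ID on each node. A sketch, with all names and IDs invented for illustration:

```python
class NetworkRegistry:
    """Sketch of an orchestrator-side registry: one logical network
    name maps to a different driver-assigned ID on every node."""
    def __init__(self):
        self.by_name = {}  # logical name -> {node: local network ID}

    def record(self, name, node, local_id):
        """Remember the ID a node's driver assigned for this network."""
        self.by_name.setdefault(name, {})[node] = local_id

    def local_id(self, name, node):
        """Resolve the logical name to the node-local network ID."""
        return self.by_name[name][node]

reg = NetworkRegistry()
reg.record("frontend", "node-a", "id-a1")
reg.record("frontend", "node-b", "id-b2")
```

Whether this is reasonable at scale is exactly what the next message questions: with 500 nodes, the registry holds 500 distinct IDs for one logical network, and every lookup must go through it.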

Tim Hockin

Sep 3, 2015, 12:43:24 AM
to Madhu Venugopal, Jana Radhakrishnan, kubernetes-sig-network, dc...@redhat.com, Gurucharan Shetty, Eugene Yakubovich, Paul Tiplady
I'm not sure I understand this - what does "stitch these network-IDs
together" mean? If I have 500 nodes, each of which has a unique
network ID for the "same" network - what reasonable thing do I do with
that?

Madhu Venugopal

Sep 3, 2015, 12:46:33 AM