Network Plugins definition (was "Kicking off the network SIG")


Tim Hockin

Aug 20, 2015, 1:30:15 PM
to kubernetes-...@googlegroups.com
It's hard to talk about network plugins without also getting into how
we can better align with docker and rocket-native plugins, but given
the immaturity of the whole space, let's try to ignore them and think
about the overall behavior we really want.

Forking from the kickoff thread.

On Sun, Aug 16, 2015 at 10:00 AM, Gurucharan Shetty <she...@nicira.com> wrote:

> From what I understand, this hairpin flag is only needed if one uses
> the kube-proxy to do L4 load-balancing. If there are people that would
> be doing L7 load balancing or a more feature-rich L4 load balancing
> (based not only on L4 of the packet but also on the traffic load
> on a particular destination pod) and they don't want to use iptables,
> then I guess mandating the hairpin is not needed. So would it be correct
> to say that if a network solution intends to use kube-proxy to do L4
> load balancing, then its network plugin should enable hairpin? This
> also raises a more general question: is kube-proxy replaceable by
> something else?

kube-proxy is supposed to be one implementation of Services, but not
the only one. In fact, I have seen at least two demos of Services without
kube-proxy and know of two or three others that I have not seen
directly.

Given that, it seems to make more sense to describe the behavior that
we expect from a network plugin (ideally in the form of a test) and
let people conform to that.

> 1. Right now, there is a requirement that all the pods should also be
> able to speak to every host that hosts pods. This was clearly needed
> for the previous kube-proxy implementation where the source ip and mac
> would change to that of the src host. With the new implementation of
> kube-proxy, do we still need the host to be able to speak to the pods
> via IP? Maybe there were other reasons for this requirement?
> Thoughts?

This question touches on both directions: pod-to-node and node-to-pod.
It's a fair question - we jump through a lot of hoops to ensure that
access to services from nodes works. Is that really necessary? It's
certainly useful for debugging. If we want things like docker being
able to pull from a registry that is running as a Service, or nodes
being able to access DNS in the cluster, we need this access; otherwise
we start doing per-node proxies, host-ports, and localhost for everything.
See the recent work by one of our interns on making an in-cluster docker
registry work for a taste of what that entails.

Those are the "obvious" ones. I am interested to hear what other
things people might be doing where a process on the node needs to
access a Pod or Service, or vice-versa. Simplifying this connectivity
would be a win.

> 2. The current network plugin architecture prevents the network plugin
> from changing the IP address of the pod (from the Docker generated
> one). Well, it does not prevent you from changing the IP, but things
> like 'kubectl get endpoints' would only see the docker generated IP
> address. Kubernetes currently is a single-tenant system, so it
> probably is not very important to be able to change the IP address of
> the pod. But in the future, if there are plans for multi-tenancy (dev,
> qa, prod, staging, etc. in a single environment), then overlapping IP
> addresses and logical networks (SDN, network virtualization) may be
> needed, in which case the ability to change the IP address will become
> important. Any thoughts?

As of recently, plugins can return a status, which can include a different IP.


On Mon, Aug 17, 2015 at 11:01 AM, Casey Davenport
<casey.d...@metaswitch.com> wrote:

> I don't think the hairpin flag will come into play for Calico. We don't
> build on the Docker bridge, instead creating new veth pair with one end in
> the pod's namespace and one end in the host's for each new pod. I won't
> know for sure if this is true until I test it, but I plan on doing so this
> week.

So you're assuming (rightly, so far) that network plugins are
monolithic and not composable. Is it valuable or interesting to have
network plugins be composable? For example, should it be possible to
write a plugin that handles things like installing special iptables
rules and use that plugin alongside a Calico plugin? For a more
concrete example, let's look at what we do in GCE in the default
Docker bridge mode. All of this is done in kubelet, but should be
plugins (IMO).

1) set a broad MASQUERADE rule in iptables (required to egress traffic
from containers because of GCE's edge NAT).
2) configure cbr0 (our docker bridge) with the per-node CIDR
3) tweak Docker config (I think?) to use cbr0
4) soon: install hairpin config on each cbr0 interface
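(For concreteness, on a GCE node this boils down to roughly the following -- a sketch from memory; exact flags, device names, and CIDRs vary by setup:)

# 1) broad masquerade for traffic leaving the cluster (GCE edge NAT)
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE

# 2) create the bridge and give it the node's pod CIDR (e.g. 10.244.1.1/24)
brctl addbr cbr0
ip addr add 10.244.1.1/24 dev cbr0
ip link set cbr0 up

# 3) point docker at the bridge and turn off its own iptables/masq handling
docker daemon --bridge=cbr0 --iptables=false --ip-masq=false

# 4) soon: hairpin mode on each bridge port so a pod can reach itself via a VIP
echo 1 > /sys/class/net/cbr0/brif/<veth>/hairpin_mode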

At least two things stand out as pretty distinct - the MASQUERADE
rules and the cbr0 stuff.

The MASQUERADE stuff is needed regardless of whether you use a docker
bridge or Calico or Weave or Flannel, but it's actually pretty
site-specific. In GCE we should basically say "anything that is not
destined for 10.0.0.0/8 needs masquerade", but that's not right. It
should be "anything not destined for an RFC1918 address". But there are
probably cases where we would want the masquerade even within RFC1918
space (I'm thinking VPNs, maybe?). Outside of GCE, the rules are
obviously totally different. Should this be something that
kubernetes understands or handles at all? Or can we punt this down to
the node-management layer (in as much as we have one)?
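(Something like this is what I mean by the RFC1918 version -- illustrative only, with 10.244.0.0/16 standing in for the cluster CIDR:)

iptables -t nat -A POSTROUTING -s 10.244.0.0/16 -d 10.0.0.0/8 -j RETURN
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 -d 172.16.0.0/12 -j RETURN
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 -d 192.168.0.0/16 -j RETURN
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 -j MASQUERADE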

The bridge management seems like a pretty clear case of plugin. We
could/should move all of the cbr0/docker0 management into a plugin and
make that the default, probably. This touches on another sore point -
docker itself has flags that we have historically suggested people set
(or not set) around iptables and masquerade. Those flags sometimes do
things that conflict with what we want to do, but we don't
actively check that they are not set.

Is there any use for composable plugins?

> From my perspective, this should be handled by each individual network
> plugin - each plugin might want to handle this differently (or not at all).
> Perhaps there are other cases to be made for chaining of network plugins,
> but the hairpin case alone doesn't convince me.

I think I am coming to the same conclusion.

>> 1. Right now, there is a requirement that all the pods should also be
>> able to speak to every host that hosts pods.
>
> We've been talking about locking down host-pod communication in the general
> case as part of our thoughts on security. There are still cases where
> host->pod communication is needed (e.g. NodePort), but at the moment our
> thinking is to treat infrastructure as "outside the cluster". As far as
> security is concerned, we think the default behavior should be "allow from
> pods within my namespace". Anything beyond that can be configured using
> security policy.

See above - what about cases where the node needs to legitimately
access cluster services (the canonical case being a docker registry)?


On Tue, Aug 18, 2015 at 5:45 AM, Michael Bridgen <mic...@weave.works> wrote:

> I have been working on adding an API library to CNI[1], which is used for
> rocket's networking, but was intended as a widely-applicable plugin
> mechanism. To date, CNI consists of a specification for invoking plugins[2],
> some plugins themselves, and a handful of example scripts that drive the
> plugins. With an API in a go library, it'd be much easier to use as common
> networking plugin infrastructure for kubernetes, rocket, runc and other
> things that come along.
>
> I like CNI because it does just what is needed, while giving plugins and
> applications a fair bit of freedom of implementation. It's pretty close, and
> at the same level of abstraction, to the networking hooks added to
> Kubernetes recently.

I'm fine with folding things together - that would be great, in fact.
I have not paid attention to CNI in the last 2 months, but I had some
concerns with it, last I looked. I was one of the people arguing that
CNI and CNM were too close to not fold together. I still feel that
there is not really room for more than one or MAYBE two plugin models
in this space. I don't have any particular attachment to owning one
of those, personally, but I am VERY concerned that:

a) implementors like Weave/Calico/... have to implement and maintain
Docker plugins and CNI/k8s plugins with slightly different semantics

b) users experience confusion and/or complexity about how to configure
a solution

> So far, with Eugene @ CoreOS's help, I have pulled together enough of a
> library that Rocket's networking can be ported to it[3] -- not too
> surprising, since much of the code was adapted from there -- and I've
> written a tiny command-line tool that can be used with e.g., runc.
> Meanwhile, Paul @ Calico is getting good results trying an integration with
> Kubernetes.

I'd like to see this expanded on. If we can reduce the space from 3
plugins to 2, that's a win.

> I am aware that I'm late to the party, and that Kubernetes + CNI and various
> other combinations have been discussed before. But I think things have moved
> on a bit[4], so if people don't mind some recapitulation, it'd be useful to
> hear objections and (unmet) requirements and so on. Perhaps it is needless
> to say that I would like this to become a "full community effort", if we
> find that it is a broadly acceptable approach.

I'll have to look at CNI again.


On Tue, Aug 18, 2015 at 9:40 AM, Paul Tiplady
<paul.t...@metaswitch.com> wrote:

> I like this model because it would allow Calico to provide a single CNI
> plugin for Kubernetes, and have it run for any containerizer (docker, rkt,
> runc, ...). As k8s support for different runtimes grows, this will become an
> increasingly significant issue. (Right now we can just target docker and be
> done with it).

Does CNI work with Docker?

> Of the plugin hooks, CNI maps cleanly to ADD and DELETE. It doesn't have a
> concept of daemons, so the k8s INIT action isn't covered (we don't currently
> use INIT, though we think we will eventually). To handle the functionality
> currently provided by INIT, CNI could potentially be extended to add the
> concept of a daemon, or we could leave the INIT hook as a kick to an
> arbitrary binary that is independent from the CNI mechanism. The latter is
> probably the short-term pragmatic solution, since current plugins' INIT
> hooks will remain unchanged.

k8s plugins were intended to be exec-and-terminate, but docker plugins
are assumed to be daemons. In both cases we have open issues with
startup, containerization vs. not, etc.
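(For anyone who hasn't read the CNI spec: the exec model is literally "set some environment variables, pipe the network config JSON to the plugin binary, read a JSON result back". Roughly -- variable names from memory, paths made up:

CNI_COMMAND=ADD \
CNI_CONTAINERID=abc123 \
CNI_NETNS=/proc/1234/ns/net \
CNI_IFNAME=eth0 \
CNI_PATH=/opt/cni/bin \
  /opt/cni/bin/bridge < /etc/cni/net.d/k8s.conf

The plugin prints something like {"ip4": {"ip": "10.244.1.5/24", ...}} on stdout and exits.)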

> As for the new STATUS plugin action, I'm not sure if that's needed if we use
> CNI; the plugin returns the IP from the ADD call so we can just update the
> pod status after it's created. Was another motivation of STATUS the idea
> that a pod's IP could change? If we don't need to support that use case then
> things integrate very cleanly.

I asked the same question. I think the answer was "following the existing
pattern of calling docker for status". I think simply returning the IP
once might be OK.


On Tue, Aug 18, 2015 at 10:09 AM, Paul Tiplady
<paul.t...@metaswitch.com> wrote:

> For the rate-limiting case, I can't see how you can implement this outside
> the plugin in a generic way; after the plugin has done its thing, how do you
> determine which interface is connected to the pod? For example Calico
> deletes the veth pair that docker creates, so we'd have to duplicate any tc
> config that was set anyway. IMO better to have all that logic in one place,
> where a plugin implementor can see what needs to be implemented.

If I recall correctly, the TC logic is applied at the host interface
per-CIDR, not per-veth.
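(For reference, something along these lines -- device, rate, and CIDR are purely illustrative:

tc qdisc add dev cbr0 root handle 1: htb default 30
tc class add dev cbr0 parent 1: classid 1:1 htb rate 100mbit
tc filter add dev cbr0 parent 1: protocol ip u32 match ip dst 10.244.1.0/24 flowid 1:1

i.e. the match is on a destination CIDR at the bridge, not on a per-pod veth.)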

> Also, currently kubelet has code to setup cbr0. While that's a great
> pragmatic simplification, I don't think it quite fits with the concept of
> pluggable networking modules -- could that be handled in the plugin INIT
> step? That would make the docker-bridge plugin an equal peer to other
> plugins, which would help flush out issues with the API; if we can't
> implement docker-bridge entirely as a k8s plugin, then maybe the API isn't
> complete enough.

Yeah, I should have read the thread before responding - this is my
conclusion, too.

Tim Hockin

Aug 21, 2015, 2:40:14 AM
to kubernetes-sig-network
I re-read the CNI spec and looked at some of the code. I have a lot
of questions, which I will write up tomorrow hopefully, but it seems
viable to me.

eugene.y...@coreos.com

Aug 21, 2015, 3:10:38 PM
to kubernetes-sig-network

On Thursday, August 20, 2015 at 10:30:15 AM UTC-7, Tim Hockin wrote:
It's hard to talk about network plugins without also getting into how
we can better align with docker and rocket-native plugins, but given
the immaturity of the whole space, let's try to ignore them and think
about the overall behavior we really want.

A philosophical decision one has to make when talking about these plugins is whether the role of the plugin is to:
1) Perform some abstract task like joining a container to a network. This is both the CNI and CNM model.
or
2) Act as a callout to do any old manipulation of networking resources (e.g. bridges, iptables, veths, traffic classes, etc.). I think this is what Tim proposed. This model is very flexible but is harder for the user to comprehend and configure. The user has to know what works with what and in which order.

I feel like we actually have a little experience with (1) but none with (2). It might be resource-intensive, but is it worth doing a small POC for option 2?

Tim Hockin

Aug 21, 2015, 5:25:03 PM
to Eugene Yakubovich, kubernetes-sig-network
Can you answer how, in CNI, something like Docker would work? They
want the "bridge" plugin but they want to add some per-container
iptables rules on top of it.

Should they fork the bridge plugin into their own and implement their
custom behavior? Should they make a 2nd plugin that follows "bridge"
and adds their iptables (not allowed in CNI)? Should they make a
wrapper plugin that calls bridge and then does their own work?

I think these are all viable. There's a simplicity win for admins
(especially user/admins) if the plugin is assumed monolithic, I guess.

eugene.y...@coreos.com

Aug 21, 2015, 5:41:40 PM
to kubernetes-sig-network, eugene.y...@coreos.com

On Friday, August 21, 2015 at 2:25:03 PM UTC-7, Tim Hockin wrote:
Can you answer how, in CNI, something like Docker would work?  They
want the "bridge" plugin but they want to add some per-container
iptables rules on top of it.

Should they fork the bridge plugin into their own and implement their
custom behavior?  Should they make a 2nd plugin that follows "bridge"
and adds their iptables (not allowed in CNI)?  Should they make a
wrapper plugin that calls bridge and then does their own work?

They can either fork the bridge plugin or do a wrapper one. Ideally they
would abstract out the iptables rules into something they can contribute upstream 
to CNI's bridge plugin. 

Tim Hockin

Aug 21, 2015, 6:05:12 PM
to Eugene Yakubovich, kubernetes-sig-network
Do you really want the "base" plugins to accumulate those sorts of
features? I like the idea of wrapping other plugins - formalizing
that pattern would be interesting. Keep a handful of very stable,
reasonably configurable (but not crazy) base plugins that people can
decorate.
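Hand-wavy sketch of what a decorator could look like, just to make the idea concrete (nothing here is agreed API, and the IP parsing is deliberately crude):

#!/bin/sh
# delegate to the stock bridge plugin (env vars and stdin pass through),
# then layer extra per-container iptables rules on top of its result
result=$(/opt/cni/bin/bridge) || exit 1
if [ "$CNI_COMMAND" = "ADD" ]; then
  podip=$(echo "$result" | sed -n 's/.*"ip": *"\([0-9.]*\)\/.*/\1/p')
  iptables -A FORWARD -d "$podip" -j ACCEPT   # whatever extra policy we want
fi
echo "$result"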

eugene.y...@coreos.com

Aug 21, 2015, 6:17:46 PM
to kubernetes-sig-network, eugene.y...@coreos.com
I guess I'd only want these base plugins to get the features if they're of general use.
For example, if we're talking about Docker links, then no. But if it's to
restrict cross talk between networks (which CNI does not currently do), then
sure. 

On Friday, August 21, 2015 at 3:05:12 PM UTC-7, Tim Hockin wrote:
Do you really want the "base" plugins to accumulate those sorts of
features?  I like the idea of wrapping other plugins - formalizing
that pattern would be interesting.  Keep a handful of very stable,
reasonably configurable (but not crazy) base plugins that people can
decorate.

On Fri, Aug 21, 2015 at 2:41 PM,  <eugene.y...@coreos.com> wrote:
>
> On Friday, August 21, 2015 at 2:25:03 PM UTC-7, Tim Hockin wrote:
>>
>> Can you answer how, in CNI, something like Docker would work?  They
>> want the "bridge" plugin but they want to add some per-container
>> iptables rules on top of it.
>>
>> Should they fork the bridge plugin into their own and implement their
>> custom behavior?  Should they make a 2nd plugin that follows "bridge"
>> and adds their iptables (not allowed in CNI)?  Should they make a
>> wrapper plugin that calls bridge and then does their own work?
>
>
> They can either fork the bridge plugin or do a wrapper one. Ideally they
> would abstract out the iptables rules into something they can contribute
> upstream
> to CNI's bridge plugin.
>

Paul Tiplady

Aug 24, 2015, 8:45:15 PM
to kubernetes-sig-network
On Thursday, August 20, 2015 at 10:30:15 AM UTC-7, Tim Hockin wrote:
The MASQUERADE stuff is needed regardless of whether you use a docker
bridge or Calico or Weave or Flannel, but it's actually pretty
site-specific.  In GCE we should basically say "anything that is not
destined for 10.0.0.0/8 needs masquerade", but that's not right.  It
should be "anything not destined for an RFC1918 address".  But there are
probably cases where we would want the masquerade even within RFC1918
space (I'm thinking VPNs, maybe?).  Outside of GCE, the rules are
obviously totally different.  Should this be something that
kubernetes understands or handles at all?  Or can we punt this down to
the node-management layer (in as much as we have one)?
 
As you say, this is site-specific not plugin-specific; I think it's reasonable that, if needed, NAT should be set up when configuring the node (since the cloud-specific provisioner is better placed than the plugin to understand this requirement). Depending on the datacenter, there could be NAT on the node, NAT at the gateway, or no NAT at all if your pod IPs are publicly routable. Would be a win to keep that complexity out of k8s itself.
 
>
>  We've been talking about locking down host-pod communication in the general
> case as part of our thoughts on security.  There are still cases where
> host->pod communication is needed (e.g. NodePort), but at the moment our
> thinking is to treat infrastructure as "outside the cluster". As far as
> security is concerned, we think the default behavior should be "allow from
> pods within my namespace".  Anything beyond that can be configured using
> security policy.

See above - what about cases where the node needs to legitimately
access cluster services (the canonical case being a docker registry)?

I think we'd like to have an intermediate level of access to a service for "allow from all pods and k8s infrastructure (but not from outside the datacenter)", but this gets tricky because "from k8s infrastructure" could include traffic originally from a load balancer which has been forwarded from a NodePort service on one node to a pod on a second node (i.e. indistinguishable from internal node->pod traffic). I think we can just document the security impact of the combination [nodePort service + cluster-accessible pods => globally-accessible service]. Hopefully this goes away when LBs are smart enough that we don't need nodePort (or when we can use headless services + DNS instead).

> I like this model because it would allow Calico to provide a single CNI
> plugin for Kubernetes, and have it run for any containerizer (docker, rkt,
> runc, ...). As k8s support for different runtimes grows, this will become an
> increasingly significant issue. (Right now we can just target docker and be
> done with it).

Does CNI work with Docker?

Not natively (i.e. Docker calling into CNI). Though if it finds success in k8s, then this network plugin model could nudge the eventual standardized API that the OCI arrives upon.

CNI can quite straightforwardly configure networking for Docker containers in Kubernetes though. The approach I took for my prototype is to create the pod infra docker container with `--net=none`, and then have k8s call directly into CNI to set up networking. The main bit of complexity was rewiring the IP-learning for the new pod (since CNI returns the IP and expects the orchestrator to remember it, and there is no analogue to the 'docker inspect' command). Now that I've got that working correctly, it has also removed the requirement for the STATUS plugin action.
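In shell terms the flow is roughly this (binary and config names are placeholders, not what the prototype actually ships):

docker run -d --net=none gcr.io/google_containers/pause   # pod infra container
pid=$(docker inspect -f '{{.State.Pid}}' <container-id>)
CNI_COMMAND=ADD CNI_CONTAINERID=<container-id> \
CNI_NETNS=/proc/$pid/ns/net CNI_IFNAME=eth0 \
  /opt/cni/bin/calico < /etc/cni/net.d/10-calico.conf
# the JSON printed on stdout carries the pod IP; kubelet records it instead of
# asking 'docker inspect' for it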

Paul Tiplady

Aug 31, 2015, 8:29:48 PM
to kubernetes-sig-network
As mentioned before, I've done some prototyping of replacing the current plugin interface with CNI. I've written up a design doc for my proposed changes.

I'd be interested to get folks' feedback on this approach; I'm going to spend the next day or two polishing my prototype so that I can let the code do the talking.

eugene.y...@coreos.com

Aug 31, 2015, 9:12:15 PM
to kubernetes-sig-network
Paul,

This is a great start. One shortcoming of CNI right now is that there's no good library. There's some code in https://github.com/appc/cni/tree/master/pkg and some still in rkt (https://github.com/coreos/rkt/tree/master/networking). It really needs to be pulled together to make using CNI easier for both the container runtime and the plugin writers (the plugin side is currently in better shape). To this end, Michael Bridgen from Weave has been working on putting a library together (https://github.com/squaremo/cni/tree/libcni), but I don't know where he is with it.

-Eugene

Tim Hockin

Sep 1, 2015, 1:16:40 AM
to Paul Tiplady, kubernetes-sig-network
Notes as I read.

The biggest problem I have with this (and it's not necessarily a
show-stopper) is that a container created with plain old 'docker run'
will not be part of the kubernetes network because we will have
orchestrated the network at a higher level. In an ideal world, we'd
teach docker itself about the plugins and then simply delegate to it
as we do today.

That said, the more I dig into Docker's networking plugins the less I
like them. Philosophically and practically a daemon-free model built
around exec is so much cleaner. It seems at least theoretically
possible to bridge libnetwork to run CNI plugins, but probably not
without mutating the CNI spec toward the more prescriptive libnetwork
model.


You say you'll push the IP to the apiserver - I guess you mean in
pod.status.podIP ?


Regarding CNI network configs, I assume that over time this might even
be something we expose through kubernetes - a la Docker networks.
The advantage here is that network management is a clean abstraction
distinct from interface management.
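For anyone who hasn't looked: a CNI network config today is just a small JSON file, something like the following (fields from memory, so treat it as a sketch rather than gospel):

{
  "name": "k8s-pod-network",
  "type": "bridge",
  "bridge": "cbr0",
  "isGateway": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.1.0/24"
  }
}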


To your questions:

1) Can we eliminate Init?

I think yes.

2) Can we eliminate Status?

I think yes.

3) Can we cut over immediately to CNI, or do we need to keep the old
plugin interface for a time? If so, how long?

I think this becomes a community decision. There are a half-dozen to
a dozen places I know of using this feature. IFF they were OK making
a jump to something like CNI, we could do a hard cutover.

4) Can we live without the vendoring naming rules? Can we establish
a convention that plugins vendor-name the binary?
mycompany.com~myplug or something? Maybe it's not a huge deal.


I'll add #5 - does this mean we have no concept of in-process plugin?
Or do we retain the facade of an in-process API like we have now.


Overall this looks plausible, but I'd like to hear from all the folks
who have plugins implemented today, especially if you have both CNI
and libnetwork experience. The drawback I listed above (plain old
'docker run') is real, but maybe something we can live with. Maybe
it's actually a feature?

As a discussion point - how much would we have to adulterate CNI to
make a bridge? It sure would be nice to use the same plugins in both
Docker and rkt - I sure as hell don't want to tweak and debug this
stuff twice.

We could have a little wrapper binary that knew about a static network
config, and anyone who asked for a new network from our plugin would
get an error, then we just feed the static config(s) to the wrapped
CNI binary. We'd have to split Add into create/join but that doesn't
seem SO bad. What else?

Paul Tiplady

Sep 1, 2015, 12:39:30 PM
to kubernetes-sig-network
Hi Eugene,

I've been talking with Michael, and the prototype integration that I built uses his libcni branch. Thus far his code has met my needs, so I think his approach is sound.

Cheers,
Paul

Paul Tiplady

Sep 1, 2015, 1:46:35 PM
to kubernetes-sig-network, paul.t...@metaswitch.com


On Monday, August 31, 2015 at 10:16:40 PM UTC-7, Tim Hockin wrote:
Notes as I read.

The biggest problem I have with this (and it's not necessarily a
show-stopper) is that a container created with plain old 'docker run'
will not be part of the kubernetes network because we will have
orchestrated the network at a higher level.  In an ideal world, we'd
teach docker itself about the plugins and then simply delegate to it
as we do today.

Agree that this is a change in workflow -- though `docker run` was already broken with the existing network plugin API for plugins which don't use the docker bridge (e.g. Calico).

I think we can get round this by using `kubectl run|exec`; now that exec has -i and --tty options, I think the main use cases are covered.


That said, the more I dig into Docker's networking plugins the less I
like them.  Philosophically and practically a daemon-free model built
around exec is so much cleaner.  It seems at least theoretically
possible to bridge libnetwork to run CNI plugins, but probably not
without mutating the CNI spec toward the more prescriptive libnetwork
model.


You say you'll push the IP to the apiserver - I guess you mean in
pod.status.podIP ?

Yep 


Regarding CNI network configs, I assume that over time this might even
be something we expose through kubernetes - a la Docker networks.
The advantage here is that network management is a clean abstraction
distinct from interface management.

Good point -- this could be a nice feature: if you have one group of pods with a very different set of network requirements (e.g. latency-sensitive, or L2 vs. pure L3), you can bundle them onto a different network. Routing between networks could be fun, though.



To your questions:

1) Can we eliminate Init?

I think yes.

2) Can we eliminate Status?

I think yes.

3) Can we cut over immediately to CNI, or do we need to keep the old
plugin interface for a time? If so, how long?

I think this becomes a community decision.  There are a half-dozen to
a dozen places I know of using this feature.  IFF they were OK making
a jump to something like CNI, we could do a hard cutover.

4) Can we live without the vendoring naming rules?  Can we establish
a convention that plugins vendor-name the binary?
mycompany.com~myplug or something?  Maybe it's not a huge deal.


I'll add #5 - does this mean we have no concept of in-process plugin?
Or do we retain the facade of an in-process API like we have now.

Added a bullet for this in the doc. 

CNI doesn't currently have the concept of an in-process plugin. Looks like with the current API this is only for vendors that are extending the kubernetes codebase, or am I missing something?

With Michael Bridgen's work to turn CNI into a library, in-process CNI-style plugins become a viable option. A couple possible approaches:

* Extend libcni to add the concept of an in-process plugin as a native concept. (libcni could formalize an interface to run these in-process plugins as standalone plugins as well, which would mean developers can target both in- and out-of-process plugins if they care).
* Create a hook in our plugin code where in-process code can run (consuming the CNI API objects that we created to pass to the CNI exec-plugin).

I think the latter could be done with the current proposal on a per-vendor basis, but there might be benefit in formalizing that interface.


Overall this looks plausible, but I'd like to hear from all the folks
who have plugins implemented today, especially if you have both CNI
and libnetwork experience.  The drawback I listed above (plain old
'docker run') is real, but maybe something we can live with.  Maybe
it's actually a feature?

As a discussion point - how much would we have to adulterate CNI to
make a bridge?  It sure would be nice to use the same plugins in both
Docker and rkt - I sure as hell don't want to tweak and debug this
stuff twice.

We could have a little wrapper binary that knew about a static network
config, and anyone who asked for a new network from our plugin would
get an error, then we just feed the static config(s) to the wrapped
CNI binary.  We'd have to split Add into create/join but that doesn't
seem SO bad.   What else?

I was pondering this approach -- the big stumbling block for me is that a CNM createEndpoint can occur on a different host than the joinEndpoint call, so naively we'd need a cluster-wide distributed datastore to keep track of the Create calls.

Short of breaking the spec and disallowing Create and Join from being called on different hosts, I don't see a way around that issue.

eugene.y...@coreos.com

Sep 1, 2015, 3:07:16 PM
to kubernetes-sig-network, paul.t...@metaswitch.com

On Tuesday, September 1, 2015 at 10:46:35 AM UTC-7, Paul Tiplady wrote:

I'll add #5 - does this mean we have no concept of in-process plugin?
Or do we retain the facade of an in-process API like we have now.

Added a bullet for this in the doc. 

CNI doesn't currently have the concept of an in-process plugin. Looks like with the current API this is only for vendors that are extending the kubernetes codebase, or am I missing something?

 
CNI doesn't have in-process plugins because that requires shared object (.so) support, and I believe that Go has problems with that (although that may be fixed in 1.5). Technically CNI is not Go-specific, but realistically much of the software in this space is written in Go. Having "in-tree" plugins doesn't require .so support, but to be honest those never pass my definition of "plugins". FWIW, I would have been quite happy to just have .so plugins, as there's no fork/exec overhead.

 
I was pondering this approach -- the big stumbling block for me is that a CNM createEndpoint can occur on a different host than the joinEndpoint call, so naively we'd need a cluster-wide distributed datastore to keep track of the Create calls.

Short of breaking the spec and disallowing Create and Join from being called on different hosts, I don't see a way around that issue.


Wow, I was not aware of that. How does it work now? CreateEndpoint creates the interface (e.g. veth pair) on the host. Join then specifies the interface names that should be moved into the sandbox. I don't really understand how Join can be called on a different host -- wouldn't there be no interface to move on that host then? 

Gurucharan Shetty

Sep 1, 2015, 3:43:42 PM
to eugene.y...@coreos.com, kubernetes-sig-network, paul.t...@metaswitch.com
Let us not make an assumption that all plugins will be Golang based.
OpenStack Neutron currently has python libraries for clients and my
plugin that integrates containers with openstack is python based.

Fwiw, Docker's libnetwork does not mandate golang plugins. It uses
REST APIs to talk to plugins.


> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
> the interface (e.g. veth pair) on the host.
veth pairs are not mandated to be created on CreateEndpoint(). You
only need to return IP addresses, MAC addresses, gateway, etc. What this
does in theory is provide flexibility for container mobility
across hosts. So you can effectively create an endpoint from a central
location and ask a container to join that endpoint from any host.


>Join then specifies the
> interface names that should be moved into the sandbox. I don't really
> understand how Join can be called on a different host -- wouldn't there be
> no interface to move on that host then?
>

eugene.y...@coreos.com

Sep 1, 2015, 5:11:21 PM
to kubernetes-sig-network, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Tuesday, September 1, 2015 at 12:43:42 PM UTC-7, Gurucharan Shetty wrote:
Let us not make an assumption that all plugins will be Golang based.
OpenStack Neutron currently has python libraries for clients and my
plugin that integrates containers with openstack is python based.

Sure, I would never assume that all plugins will be Go based. Rather I want to not
exclude Go based ones.
 
Fwiw, Docker's libnetwork does not mandate golang plugins. It uses
REST APIs to talk to plugins. 
 
Right, but REST is for out-of-process plugins.


> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
> the interface (e.g. veth pair) on the host.
veth pairs are not mandated to be created on CreateEndpoint (). You
are only to return IP addresses, MAC addresses, Gateway etc. What this
does in theory is that it provides flexibility with container mobility
across hosts. So you can effectively create an endpoint from a central
location and ask a container to join that endpoint from any host.

I think I understand what you and Paul meant by different hosts. Yes, their
model of decoupling the container from the interfaces is slick and something
that CNI can't do. However, for all its slickness, I am not a fan of moving containers
or IPs around. If a container dies, start a new one. And give it a new IP -- don't equate
the IP to a service (yes, you need service discovery). Anyway, that's all in line
with Kubernetes thinking and so does not need to be supported in a Kubernetes cluster.

I think if a user wants to have this mobility, they will not be running Kubernetes.
 


>Join then specifies the
> interface names that should be moved into the sandbox. I don't really
> understand how Join can be called on a different host -- wouldn't there be
> no interface to move on that host then?
>

Tim Hockin

Sep 1, 2015, 5:54:48 PM
to Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
On Tue, Sep 1, 2015 at 12:07 PM, <eugene.y...@coreos.com> wrote:
>
> On Tuesday, September 1, 2015 at 10:46:35 AM UTC-7, Paul Tiplady wrote:
>>>
>>>
>>> I'll add #5 - does this mean we have no concept of in-process plugin?
>>> Or do we retain the facade of an in-process API like we have now.
>>>
>> Added a bullet for this in the doc.
>>
>> CNI doesn't currently have the concept of an in-process plugin. Looks like
>> with the current API this only for vendors that are extending the kubernetes
>> codebase, or am I missing something?
>>
>
> CNI doesn't have in-process plugins because that requires shared object
> (.so) support and I believe that Go has problems with that (although it
> maybe fixed in 1.5). Technically CNI is not Go specific but realistically so
> much software in this space is written in Go. Having "in-tree" plugins don't
> require .so support but to be honest those never pass my definition of
> "plugins". FWIW, I would have been quite happy to just have .so plugins as
> there's no fork/exec overhead.

I didn't mean to imply .so, though that's a way to do it too. I meant
to ask whether kubernetes/docker/rkt could have network plugins
defined in code, one of which was an exec-proxy, or whether exec was
it. I don't feel strongly that in-process is needed at this point.

>> I was pondering this approach -- the big stumbling block for me is that a
>> CNM createEndpoint can occur on a different host than the joinEndpoint call,
>> so naively we'd need a cluster-wide distributed datastore to keep track of
>> the Create calls.
>>
>> Short of breaking the spec and disallowing Create and Join from being
>> called on different hosts, I don't see a way around that issue.
>
> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
> the interface (e.g. veth pair) on the host. Join then specifies the
> interface names that should be moved into the sandbox. I don't really
> understand how Join can be called on a different host -- wouldn't there be
> no interface to move on that host then?

Yeah, where do you get that information? Not calling you wrong, just
something that was not at all clear, if it is true.

Tim Hockin

Sep 1, 2015, 5:56:46 PM
to Gurucharan Shetty, Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
On Tue, Sep 1, 2015 at 12:43 PM, Gurucharan Shetty <she...@nicira.com> wrote:
> Let us not make an assumption that all plugins will be Golang based.
> OpenStack Neutron currently has python libraries for clients and my
> plugin that integrates containers with openstack is python based.
>
> Fwiw, Docker's libnetwork does not mandate golang plugins. It uses
> REST APIs to talk to plugins.
>
>
>> Wow, I was not aware of that. How does it work now? CreateEndpoint creates
>> the interface (e.g. veth pair) on the host.
> veth pairs are not mandated to be created on CreateEndpoint (). You
> are only to return IP addresses, MAC addresses, Gateway etc. What this
> does in theory is that it provides flexibility with container mobility
> across hosts. So you can effectively create an endpoint from a central
> location and ask a container to join that endpoint from any host.

I feel dumb, but I don't get it. Since you seem to understand it, can
you spell it out in more detail?

Paul Tiplady

Sep 1, 2015, 6:41:35 PM
to kubernetes-sig-network, she...@nicira.com, eugene.y...@coreos.com, paul.t...@metaswitch.com
On Tuesday, September 1, 2015 at 2:56:46 PM UTC-7, Tim Hockin wrote:
I feel dumb, but I don't get it.  Since you seem to understand it, can
you spell it out in more detail?

This isn't spelled out in the libnetwork docs, though it should be since it's highly unintuitive. We had to do a lot of code/IRC spelunking to get a clear picture of this.

The best I can do from the docs is this:

"
  • One of a FAQ on endpoint join() API is that, why do we need an API to create an Endpoint and another to join the endpoint.
    • The answer is based on the fact that Endpoint represents a Service which may or may not be backed by a Container. When an Endpoint is created, it will have its resources reserved so that any container can get attached to the endpoint later and get a consistent networking behaviour.
"
Note "resources reserved", not "interfaces created".

Here's an issue that Gurucharan raised on the libnetwork repo which clarifies somewhat: https://github.com/docker/libnetwork/issues/133#issuecomment-99927188

Although see this issue, which suggests that in CNM the CreateEndpoint ("service publish") event gets broadcasted (via the docker daemon's distributed KV store) to the network plugins on every host, so it looks like it's not even possible to optimistically create a veth and then hope that the Endpoint.Join gets run on the same host.

Note that things like interface name and MAC address are assigned at CreateEndpoint time, as if you were creating the veth at that stage.

eugene.y...@coreos.com

Sep 1, 2015, 7:19:10 PM
to kubernetes-sig-network, eugene.y...@coreos.com, paul.t...@metaswitch.com

On Tuesday, September 1, 2015 at 2:54:48 PM UTC-7, Tim Hockin wrote:
On Tue, Sep 1, 2015 at 12:07 PM,  <eugene.y...@coreos.com> wrote:
>
> On Tuesday, September 1, 2015 at 10:46:35 AM UTC-7, Paul Tiplady wrote:
>>>
>>>
>>> I'll add #5 - does this mean we have no concept of in-process plugin?
>>> Or do we retain the facade of an in-process API like we have now.
>>>
>> Added a bullet for this in the doc.
>>
>> CNI doesn't currently have the concept of an in-process plugin. Looks like
>> with the current API this only for vendors that are extending the kubernetes
>> codebase, or am I missing something?
>>
>
> CNI doesn't have in-process plugins because that requires shared object
> (.so) support and I believe that Go has problems with that (although it
> maybe fixed in 1.5). Technically CNI is not Go specific but realistically so
> much software in this space is written in Go. Having "in-tree" plugins don't
> require .so support but to be honest those never pass my definition of
> "plugins". FWIW, I would have been quite happy to just have .so plugins as
> there's no fork/exec overhead.

I didn't mean to imply .so, though that's a way to do it too.  I meant
to ask whether kubernetes/docker/rkt could have network plugins
defined in code, one of which was an exec-proxy, or whether exec was
it.  I don't feel strongly that in-process is needed at this point.

Kubernetes/docker/rkt could certainly have "built-in types" aside from the exec based ones.
But if that built-in type is useful in general, it should be a separate executable so 
it could be re-used in other projects. And since we should strive to make these
networking plugins not tied to a container runtime, they should all be executables
by my logic.

Gurucharan Shetty

Sep 1, 2015, 7:30:05 PM
to Tim Hockin, Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
Let me try to explain what I mean to the best of my ability with an
analogy to VMs and network virtualization. (But before that, let me
clarify that since k8 is a single-tenant orchestrator and has been
designed with VIPs and load balancers as a basic building block, the
feature is not really useful for k8.)

With Network Virtualization, you can have 2 VMs belonging to 2
different tenants run on the same hypervisor with the same IP address.
The packet sent by VM of one tenant will never reach the VM of another
tenant, even though they are connected to the same vSwitch (e.g
openvswitch). You can apply policies to these VM interfaces (e.g.
QoS, firewall), etc. And then you can move one of the VMs to a different
hypervisor (vMotion). All the policies (e.g. QoS, firewall) will now
follow the VM to the new hypervisor automatically. The IP address and
MAC address follow to the new VM as well. The network controller simply
reprograms the various vswitches so that the packet forwarding happens
to the new location.

Since you have already associated your policies (firewall, QoS etc)
with the endpoint, you can destroy the VM that the endpoint is
connected to and then create a new VM at a different hypervisor and
attach the old endpoint (with its old policies) to the new VM.

My reading of what libnetwork achieves with containers is the same as
above. i.e., you can create a network endpoint with policies applied
and then attach it to any container on any host.


PS: OpenStack Neutron has the same model. The network endpoints are
created first. An IP and MAC is provisioned to that network endpoint.
And then a VM is created asking it to attach to that network endpoint.

eugene.y...@coreos.com

Sep 1, 2015, 7:45:26 PM
to kubernetes-sig-network, tho...@google.com, eugene.y...@coreos.com, paul.t...@metaswitch.com
Let me try to explain what I mean to the best of my ability with an
analogy of VMs and Network Virtualization (But before that let me
clarify that since k8 is single tenant orchestrator and has been
designed with VIP and load balancers as a basic building block, the
feature is not really useful for k8. )

With Network Virtualization, you can have 2 VMs belonging to 2
different tenants run on the same hypervisor with the same IP address.
The packet sent by VM of one tenant will never reach the VM of another
tenant, even though they are connected to the same vSwitch (e.g
openvswitch).  You can apply policies to these VM interfaces (e.g.
QoS, Firewall) etc. And then you can move one of the VM to a different
hypervisor (vMotion). All the policies (e.g QoS, firewall) will now
follow the VM to the new hypervisor automatically. The IP address and
MAC address follows too to the new VM. The network controller simply
reprograms the various vswitches so that the packet forwarding happens
to the new location.

Since you have already associated your policies (firewall, QoS etc)
with the endpoint, you can destroy the VM that the endpoint is
connected to and then create a new VM at a different hypervisor and
attach the old endpoint (with its old policies) to the new VM.

My reading of what libnetwork achieves with containers is the same as
above. i.e., you can create a network endpoint with policies applied
and then attach it to any container on any host.

That makes sense except for this conflation of endpoint and service.
If the CreateEndpoint is really just reserving an IP  for the service that
can later be backed by some container, there is really no reason to
allocate a MAC at that point (which CreateEndpoint requires as it is expected
that it will call AddInterface whose second arg is a MAC).

While I certainly appreciate having a driver type that allows this type of migration,
I don't like every driver having to support this model. For example, this won't
really work with ipvlan where the MAC address can't be generated (it's the host's MAC)
and moved around.

Considering the above, I don't want to modify CNI towards that model. This still leaves
me hanging on how to change CNI enough to make the libnetwork interop possible.

Tim Hockin

Sep 1, 2015, 8:18:57 PM
to Paul Tiplady, kubernetes-sig-network, Gurucharan Shetty, Eugene Yakubovich
On Tue, Sep 1, 2015 at 3:41 PM, Paul Tiplady
<paul.t...@metaswitch.com> wrote:

> Here's an issue that Gurucharan raised on the libnetwork repo which
> clarifies somewhat:
> https://github.com/docker/libnetwork/issues/133#issuecomment-99927188

This was not answered:

"""Is your thought process that the driver can create vethnames based
on endpointuuid to make it truly portable. i.e., one can call
driver.CreateEndpoint() on one host and return back vethnames based on
eid, but not actually create the veths. Call driver.Join() on a
different host. So even though veth names are created during 'docker
service create' but veths are physically created only during 'docker
service join'. (But, vethnames can only be 15 characters long on
Linux, so there is a very very small possibility of collisions.)"""

> Although see this issue, which suggests that in CNM the CreateEndpoint
> ("service publish") event gets broadcasted (via the docker daemon's
> distributed KV store) to the network plugins on every host, so it looks like
> it's not even possible to optimistically create a veth and then hope that
> the Endpoint.Join gets run on the same host.

That sounds ludicrous to me. Can we get some confirmation from
libnetwork folks?

> Note that things like interface name and MAC address are assigned at
> CreateEndpoint time, as if you were creating the veth at that stage.

So I make up a locally-random name and expect it to be globally unique?

Tim Hockin

Sep 1, 2015, 8:46:25 PM
to Gurucharan Shetty, Eugene Yakubovich, kubernetes-sig-network, Paul Tiplady
On Tue, Sep 1, 2015 at 4:30 PM, Gurucharan Shetty <she...@nicira.com> wrote:

> Since you have already associated your policies (firewall, QoS etc)
> with the endpoint, you can destroy the VM that the endpoint is
> connected to and then create a new VM at a different hypervisor and
> attach the old endpoint (with its old policies) to the new VM.
>
> My reading of what libnetwork achieves with containers is the same as
> above. i.e., you can create a network endpoint with policies applied
> and then attach it to any container on any host.

Thanks for the explanation. I understand it better. It's incredibly
complicated, isn't it? I think a main distinction between this
example with VMs and the container ethos is identity. A VM's IP
really is part of its identity, but a container is a point in time. I
know people will argue with this in both directions, but that seems to
be "generally" the way people think about things.

A VM's IP is expected to remain constant across moves and restarts. A
container's IP is less constant (not at all today).

More importantly, there are things that container networking can do
that preclude this level of migratability - ipvlan or plain old docker
bridges being good examples. What are "simple" plugins supposed to
assume?

Straw man, thanks to Prashanth here for discussion:

Write a "cni-exec" libnetwork driver.

You cannot create new networks using it. When a CreateNetwork() call
is received, we check for a static config file on disk.
E.g. CreateNetwork(name = "foobar") looks for
/etc/cni/networks/foobar.json, and if it does not exist or does not
match, fail.

PROBLEM: it looks like the CreateNetwork() call cannot see the name
of the network. Let's assume that could be fixed.

CreateEndpoint() does just enough work to satisfy the API, and save
all of its state in memory.

PROBLEM: If docker goes down, how does this state get restored?

endpoint.Join() takes the saved info from CreateEndpoint(), massages
it into CNI-compatible data, and calls the CNI plugin.


Someone shoot this down? It's not general purpose in the sense that
docker's network CLI can't be used, but would it be good enough to
enable people to use the same CNI plugins across docker and rkt?
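(The static config mentioned above would just be a canned CNI network config keyed by name, e.g. /etc/cni/networks/foobar.json containing something like the following -- illustrative only:

{
  "name": "foobar",
  "type": "bridge",
  "ipam": { "type": "host-local", "subnet": "10.10.0.0/16" }
}

and CreateNetwork("foobar") succeeds only if that file exists and matches.)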

Gurucharan Shetty

Sep 1, 2015, 8:52:52 PM
to Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
>> Note that things like interface name and MAC address are assigned at
>> CreateEndpoint time, as if you were creating the veth at that stage.
>
> So I make up l locally-random name and expect it to be globally unique?

Actually that is not true. From:
https://github.com/docker/libnetwork/blob/master/docs/remote.md

The veth names are returned during the network join call. And network
join is not broadcast to all hosts. If I remember correctly, only
create network, delete network, create endpoint, and delete endpoint are
broadcast to all nodes.

Tim Hockin

Sep 1, 2015, 8:59:44 PM
to Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
Are they broadcast, or are they posted to the KV store? In other
words, are plugins expected to watch() the KV store for new endpoints
and networks, or to lazily fetch them?

How does a "remote" plugin do this?

Gurucharan Shetty

Sep 1, 2015, 9:09:39 PM
to Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Tue, Sep 1, 2015 at 5:59 PM, Tim Hockin <tho...@google.com> wrote:
> They are broadcasted or they are posted to the KV store? In other
> words, are plugins expected to watch() the KV store for new endpoints
> and networks, or to lazily fetch them?
The Docker daemon on every host reads the kv store and sends that
information to the remote plugin drivers on that host via the plugin
API. The remote drivers are not supposed to look at Docker's kv store,
but only rely on the information received via the API.

>
> How does a "remote" plugin do this?
The current design is harsh on remote plugins (the libnetwork
developers have promised to look into it to see if they can come up
with a viable solution. See
https://github.com/docker/libnetwork/issues/313). My remote driver
(which integrates with OpenStack neutron, but runs the containers
inside the VMs) makes calls to the OpenStack Neutron database to
store/fetch the information. With the current design, a single user
request via the docker CLI gets converted into 'X' requests to the Neutron
database (where 'X' = the number of hosts in the cluster), and that is
unworkable for large numbers of hosts and containers.
That is one reason I like the k8 plugins.

Tim Hockin

Sep 1, 2015, 9:19:06 PM
to Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
Guru,

Thanks for the explanations. I appreciate you being "docker by proxy" here :)

I do feel like I mostly understand the libnetwork model now, though
there are some very clear limitations of it. I also feel like I could
work around most of the limitations, but the solutions are the same as
yours - make our own side-band calls to our own API and fetch
information that Docker does not provide. We can't implement their
libkv in terms of our API server because it is not a general purpose
store, though maybe we could intercept accesses and parse the path?
Puke.

The more I learn, the less I like it. It feels incredibly convoluted
for simple drivers to do anything useful.

Gurucharan Shetty

Sep 1, 2015, 9:57:07 PM
to Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
I think if docker daemons are not started with a distributed kv
store, but rather with a local-only kv store (e.g. each
minion will have a consul/etcd client running that is local only),
then libnetwork can be abused for k8 purposes for IP address reporting
via 'docker inspect'. If such a thing is done, any commands like
'docker network ls', 'docker service ls' etc will report false data,
but k8 need not show that to the user.

Prashanth B

Sep 1, 2015, 11:06:04 PM
to Gurucharan Shetty, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
> Docker daemon in every host reads the kv store and send that
information to the remote plugin drivers on that host via the plugin
API.

Your statement is confusing, can you clarify? It seems like the remote driver is just like a native driver that does nothing but parse its arguments into json and post them to the plugin using an http client. There is code to watch the kv store in the controller itself, but it no-ops if a store isn't provided (that's how the bridge driver works). IIUC the plugin just needs to run an HTTP server bound to a unix socket in /run/docker/plugins. 

If this is right it makes our first cut driver simpler, we can directly use the apiserver from our (hypothetical) kubelet-driver for things that need persistence, without running another database.
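(For hand-testing, curl >= 7.40 can talk straight to such a socket -- the socket name here is made up:

curl --unix-socket /run/docker/plugins/kubelet.sock \
  -H 'Content-Type: application/json' -X POST -d '{}' \
  http://localhost/Plugin.Activate
)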


Gurucharan Shetty

Sep 1, 2015, 11:59:03 PM
to Prashanth B, Tim Hockin, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
On Tue, Sep 1, 2015 at 8:06 PM, Prashanth B <be...@google.com> wrote:
>> Docker daemon in every host reads the kv store and send that
> information to the remote plugin drivers on that host via the plugin
> API.
>
> Your statement is confusing, can you clarify? It seems like the remote
> driver is just like a native driver that does nothing but parse its
> arguments into json and post them to the plugin using an http client. There
> is code to watch the kv store in the controller itself, but it no-ops if a
> store isn't provided (that's how the bridge driver works). IIUC the plugin
> just needs to run an HTTP server bound to a unix socket in
> /run/docker/plugins.
>
> If this is right it makes our first cut driver simpler, we can directly use
> the apiserver from our (hypothetical) kubelet-driver for things that need
> persistence, without running another database.
I'm not sure I understood your question/assertion correctly, so I am going
to elaborate in detail.

When I say a "remote driver", I mean a server which listens for REST
API calls from docker daemon.

I have a remote driver written in Python here:
https://github.com/shettyg/ovn-docker/blob/master/ovn-docker-driver

That driver writes the line "tcp://0.0.0.0:5000" in the file
"/etc/docker/plugins/openvswitch.spec"

Now, when docker daemon starts, it will send the equivalent of:
curl -i -H 'Content-Type: application/json' -X POST
http://localhost:5000/Plugin.Activate

And my driver is supposed to respond with:

{
"Implements": ["NetworkDriver"]
}

1. User creates a network:
docker network create -d openvswitch foo

This command from the user results in my server receiving the
equivalent of the following POST request:

curl -i -H 'Content-Type: application/json' -X POST -d
'{"NetworkID":"UUID","Options":{"blah":"bar"}}'
http://localhost:5000/NetworkDriver.CreateNetwork

Now things get interesting. The above POST request gets repeated on
every host that belongs to the docker cluster.
So your driver has to figure out which requests are duplicates.


2. User creates a service
docker service publish my-service.foo

The above command will call the equivalent of:

curl -i -H 'Content-Type: application/json' -X POST -d
'{"NetworkID":"UUID","EndpointID":"dummy-endpoint","Interfaces":[],"Options":{}}'
http://localhost:5000/NetworkDriver.CreateEndpoint

Again the same command gets called in every host.

I hope that answers your question?

Tim Hockin

Sep 2, 2015, 12:35:53 AM
to Gurucharan Shetty, Prashanth B, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
Prashanth's point, I think, was that we could have Kubelet act as a
network plugin, lookup/store network and endpoint config in our api
server, and exec CNI plugins for join.

This falls down a bit for a few reasons. First, libkv assumes an
arbitrary KV store, which our APIserver is not. Second, the fact that
Network objects can be created through Docker or kubernetes is not
cool. Third, if we only allow network objects through kubernetes we
can't see the name of the object Docker thinks it is creating.

Prashanth B

Sep 2, 2015, 1:22:23 AM
to Tim Hockin, Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
I hope that answers your question?

Thanks for the example. So what I'm proposing is a networking model with the following limitations for the short term:
1. Only one (docker) network object, this is the kubernetes network. All endpoints must join it. 
2. Containers can only join endpoints on the same host.
3. A join execs the CNI plugin with JSON composed from the join and the endpoint create, derived from storage (memory, sqlite, apiserver -- as long as it's not another database it remains an implementation detail).

First, libkv assumes an arbitrary KV store, which our APIserver is not.

Doesn't look like libkv is a requirement for remote plugins. If we start docker with a plugin but without a kv store, the json will get posted to the localhost http server, but not propagated to the other hosts (untested, this is from staring at the code). This is ok, because there is only 1 network and no cross-host endpoint joining. If we really need cross-host consistency, we have an escape hatch via apiserver watch.

> Third, if we only allow network objects through kubernetes we can't see the name of the object Docker thinks it is creating.

We don't even have to allow this. The cluster is bootstrapped with a network object. It's read-only thereafter. Create network will be a no-op after that.

This would give users the ability to dump their own docker plugins into /etc/docker/plugins, start the kubelet with --manage-networking=false, and use Docker's remote plugin API. At the same time CNI should work with --manage-networking=true.
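In other words, something like this on each node (sketch; --manage-networking is the hypothetical flag from this thread, and the .spec file mechanics are as Guru described earlier):

echo "tcp://0.0.0.0:5000" > /etc/docker/plugins/myplugin.spec   # or a .sock under /run/docker/plugins
kubelet --manage-networking=false ...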




Tim Hockin

Sep 2, 2015, 2:23:15 AM
to Prashanth B, Gurucharan Shetty, Paul Tiplady, kubernetes-sig-network, Eugene Yakubovich
We will eventually want to add something akin to multiple Networks, so
I want to be dead sure that it is viable before we choose and
implement a model.

Gurucharan Shetty

Sep 2, 2015, 10:36:08