windows, CNI Check, VNICs, and containerd


jay vyas

Jan 28, 2021, 10:17:24 PM
to kubernetes-sig-network
Hi folks !!!! 

We recently hit a codepath difference in how CNI is invoked by containerd on Windows, wherein the IP address of a pod is created (by a CNI provider) but not really ready yet, because of how the containerd -> CNI call is ordered relative to the CRI sandbox setup that creates vNICs on Windows.

In any case, we were wondering whether anyone is using the CNI CHECK API call to check in periodically on the health of IP addresses, and potentially use that info to kill off containers that wound up with a corrupted device attachment.

In the Windows scenario, this might be the only way to build large-scale CNIs that support containerd for k8s, so it would be a pretty cool paradigm to potentially adopt.

So, just looking for any initial thoughts on how folks use the CNI CHECK API, and whether this might not be too crazy an idea to try out?

Maybe one of the Caseys has a thought here :)

This seems to affect a lot of CNIs, and so, as we move off dockershim, it will get more relevant for us... at least in the Windows world :)



--
jay vyas

Dan Williams

Jan 29, 2021, 12:09:32 PM
to jay vyas, kubernetes-sig-network
On Thu, 2021-01-28 at 22:17 -0500, jay vyas wrote:
> Hi folks !!!!
>
> We recently hit a codepath difference in how CNI is leveraged in
> containerd, for windows, wherein the IP address of a pod is created
> (by a cni provider) but not really ready yet (because of the timing
> of how the containerd->cni vs cri setup and ordering for windows
> sets up "Vnics").

We would expect the CNI plugin not to return a Result to the runtime
until the container has connectivity. It's more complicated than that
(depending on how much visibility the CNI plugin has into the full
network), but insofar as the plugin is reasonably able to determine
container network readiness, this is what it should do.

> In any case, we were wondering wether or not theres anyone using
> the cni check api call, to checkin periodically on the health of IP
> addresses, to potentially use that info to kill off containers which
> wound up getting a corrupted device attachment.

CRI-O used to do this, based on kubelet's ~30-second PodSandboxStatus
CRI requests. These days, however, it just caches the network result
and returns that instead of calling CHECK. I think that's a mistake,
and I'm working with the CRI-O team to change that behavior back to
periodically calling CNI CHECK.

Your understanding above is more or less the intention of CHECK.
It's not intended to replace the Kubernetes application-level health
check, but simply to determine (to the best extent possible) if the
container's networking is alive.

If a pod sandbox restart will solve the problem, then CHECK should
return an error. If it wouldn't solve the problem, then CHECK shouldn't
do anything.

eg, if the node's load balancer isn't working, a random container
restart won't solve that problem, so CHECK probably shouldn't poke the
node LB for liveness.

> In the windows scenario, this might be the only way to build large
> scale CNIs that support Containerd for k8s, so it would be a pretty
> cool paradigm to potentially adopt.

Why is that? Are we not able to reliably determine the container's
specific network resources are allocated and set up on Windows?

> So, just looking for any initial thoughts on how folks use the CNI
> check API, and wether this might not be too-crazy of an idea to try
> out ?

CHECK should return an error to the runtime when the container's
network is unhealthy, and a container restart can do something about
that unhealthiness.

It's up to the runtime to plumb CHECK through in some useful way, and
most of them have done that via the PodSandboxStatus CRI call.

Dan

Antonin Bas

Jan 29, 2021, 12:36:22 PM
to Dan Williams, jay vyas, kubernetes-sig-network
Hi Dan,

Please see inline

Le ven. 29 janv. 2021 à 09:09, Dan Williams <dc...@redhat.com> a écrit :
On Thu, 2021-01-28 at 22:17 -0500, jay vyas wrote:
> Hi folks !!!!
>
> We recently hit a codepath difference in how CNI is leveraged in
> containerd, for windows, wherein  the IP address of a pod is created
> (by a cni provider) but not really ready yet (because of the timing
> of how the containerd->cni vs cri setup and ordering for  windows
> sets up "Vnics"). 

We would expect the CNI plugin not to return a Result to the runtime
until the container has connectivity. It's more complicated than that
(depending on how much visibility the CNI plugin has into the full
network), but insofar as the plugin is reasonably able to determine
container network readiness, this is what it should do.

We have a situation on Windows with containerd where the CNI ADD command is invoked before the sandbox container is created. The CNI plugin can create the HNSEndpoint successfully using the Windows APIs, but because there is no container yet, the vNIC / network adapter will not be created and visible to the plugin before it returns the Result and containerd proceeds to create the container. This is an issue if the CNI plugin needs to perform some operations involving the network adapter in order to provide connectivity. The solution we are looking at is to have the CNI plugin spawn a goroutine which will asynchronously wait for the network adapter and perform the necessary operations. But if any of this results in an error, there is no way to report it to the container runtime and to kubelet. Most errors should be transient and we can have a retry mechanism, but there is always the possibility of something being broken and requiring a pod restart. We are looking at CHECK as a way to report errors that we cannot resolve to the container runtime.
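
For concreteness, here is a rough Go sketch of that approach. It assumes the plugin delegates to a long-running agent (as Antrea does), so state survives across ADD and CHECK invocations; waitForNetworkAdapter and configureAdapter are hypothetical stand-ins for the Windows-specific operations, not real APIs.

package cniwin

import (
	"fmt"
	"sync"
	"time"
)

// Unresolved asynchronous setup failures, keyed by CNI container ID.
// A later CHECK call surfaces them to the container runtime.
var (
	mu          sync.Mutex
	setupErrors = map[string]error{}
)

// cmdAdd creates the HNSEndpoint synchronously (omitted here) and returns,
// then finishes the vNIC-dependent configuration in the background, since
// the adapter only appears once containerd creates the sandbox.
func cmdAdd(containerID, endpointName string) error {
	go func() {
		if err := waitForNetworkAdapter(endpointName, 2*time.Minute); err != nil {
			recordError(containerID, err)
			return
		}
		if err := configureAdapter(endpointName); err != nil {
			recordError(containerID, err)
		}
	}()
	return nil
}

func recordError(containerID string, err error) {
	mu.Lock()
	defer mu.Unlock()
	setupErrors[containerID] = err
}

// cmdCheck reports any asynchronous failure that a retry cannot fix.
func cmdCheck(containerID string) error {
	mu.Lock()
	defer mu.Unlock()
	if err := setupErrors[containerID]; err != nil {
		return fmt.Errorf("network setup failed for %s: %w", containerID, err)
	}
	return nil
}

// Hypothetical helpers for the Windows-specific operations.
func waitForNetworkAdapter(name string, timeout time.Duration) error { return nil }
func configureAdapter(name string) error                             { return nil }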
 

> In any case, we were wondering wether or not theres anyone using
> the cni check api call, to checkin periodically on the health of IP
> addresses, to potentially use that info to kill off containers which
> wound up getting a corrupted device attachment.

CRI-O used to do this, based on kubelet's ~30-second PodSandboxStatus
CRI requests. These days, however, it just caches the network result
and returns that instead of calling CHECK. I think that's a mistake,
and I'm working with the CRI-O team to change that behavior back to
periodically calling CNI CHECK.

Your understanding above is more or less the intention of CHECK.
It's not intended to replace the Kubernetes application-level health
check, but simply to determine (to the best extent possible) if the
container's networking is alive.

If a pod sandbox restart will solve the problem, then CHECK should
return an error. If it wouldn't solve the problem, then CHECK shouldn't
do anything.

eg, if the node's load balancer isn't working, a random container
restart won't solve that problem, so CHECK probably shouldn't poke the
node LB for liveness.

> In the windows scenario, this might be the only way to build large
> scale CNIs that support Containerd for k8s, so it would be a pretty
> cool paradigm to potentially adopt.

Why is that? Are we not able to reliably determine the container's
specific network resources are allocated and set up on Windows?

I think Jay is referring to the issue I described above, where some resources are created asynchronously.
I will not claim that this issue affects all CNIs as it probably depends on your datapath implementation.
And the sequence of operations was different for dockershim (sandbox container created before the CNI ADD call), which means this issue didn't exist there. But containerd went a different route and uses HCNNamespace objects.
 

> So, just looking for any initial thoughts on how folks use the CNI
> check API, and wether this might not be too-crazy of an idea to try
> out ?

CHECK should return an error to the runtime when the container's
network is unhealthy, and a container restart can do something about
that unhealthiness.

It's up to the runtime to plumb CHECK through in some useful way, and
most of them have done that via the PodSandboxStatus CRI call.

That was my understanding as well, but you mention that CRI-O is not using CHECK currently. It seems that containerd is also caching the Result from CNI ADD instead of calling CHECK. And it doesn't seem that dockershim was using CHECK either. So is there any runtime leveraging CNI CHECK today?

Antonin
 

Dan

> Maybe one of the Casey's has a thought here :)
>
> This seems to effect alot of CNIs, and so, as we move off dockershim,
> it will get more relevant for us... at least in the windows world :)
>
> The main issue is here
> https://github.com/containerd/containerd/issues/4851,
>
> - vmware-tanzu/antrea#1581
> - Calico with plain containerd on Windows does not
> projectcalico/calico#4334
>


Dan Williams

Feb 1, 2021, 1:42:32 PM
to Antonin Bas, jay vyas, kubernetes-sig-network
So with containerd on Windows, is the sequence:

Kube -> containerd: RunPodSandbox()
containerd: CNI ADD
<block>
containerd: create pod sandbox
containerd -> kube: RunPodSandbox result

or is the CNI ADD asynchronous within containerd? eg, does containerd
block waiting for the ADD to return, or does it just kick off the ADD
and then continue to create the sandbox resources in parallel with the
ADD?

The answer to that question would help determine the best path forward.

Dan

Antonin Bas

Feb 1, 2021, 2:36:16 PM
to Dan Williams, jay vyas, kubernetes-sig-network
Yes, I believe this is the sequence (blocking ADD then Pod sandbox creation) that we observe.
This can be confirmed by looking at the containerd code: https://github.com/containerd/containerd/blob/49c5c1487951307761e84de52477d640c8b0c84c/pkg/cri/server/sandbox_run.go#L150. The setupPodNetwork method is called first: it will invoke the CNI plugin, wait for the CNI Result and consume it, before returning.
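
In simplified form (not the real sandbox_run.go, just the shape of the ordering, with hypothetical stand-in functions), what we observe is:

package sandbox

import "context"

// Simplified view of the current containerd ordering on Windows: the CNI
// ADD result is fully consumed before the sandbox container exists.
func runPodSandboxToday(ctx context.Context) error {
	if err := setupPodNetwork(ctx); err != nil { // blocking CNI ADD
		return err
	}
	return createSandboxContainer(ctx) // the vNIC only materializes here
}

// Hypothetical stand-ins for the containerd internals referenced above.
func setupPodNetwork(ctx context.Context) error        { return nil }
func createSandboxContainer(ctx context.Context) error { return nil }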

Dan Williams

Feb 2, 2021, 10:54:33 AM
to Antonin Bas, jay vyas, kubernetes-sig-network
On Mon, 2021-02-01 at 11:36 -0800, Antonin Bas wrote:
> Yes, I believe this is the sequence (blocking ADD then Pod sandbox
> creation) that we observe.
> This can be confirmed by looking at the containerd code:
> https://github.com/containerd/containerd/blob/49c5c1487951307761e84de52477d640c8b0c84c/pkg/cri/server/sandbox_run.go#L150
> . The setupPodNetwork method is called first: it will invoke the CNI
> plugin, wait for the CNI Result and consume it, before returning.

Any idea why it was done that way?

Dan

Jay Vyas

Feb 2, 2021, 10:12:50 PM
to Dan Williams, Antonin Bas, kubernetes-sig-network
Out of my depth, but IIRC it's related to the requirements of the Windows HCSShim and vNIC creation, which I guess requires an IP before it can run but still needs to do some other stuff afterwards (https://github.com/microsoft/windows-container-networking/blob/45dfaf42358c2710159a7df0bde4807bc4a13a52/network/manager.go#L120) before the IP can be routable, in contrast to Linux where I guess veth pairs are cheap and easy?


Dan Williams

Feb 18, 2021, 5:16:45 PM
to Jay Vyas, Antonin Bas, kubernetes-sig-network
Jay/Antonin,

We talked about this recently in the CNI maintainers' meeting. The
TL;DR is that yes, you should be able to use CHECK for this as you
suggest. That's not the originally intended spirit of CHECK, but it's
not out of scope to use it this way.

I'd suggest adding the CHECK call in the same function/flow in
containerd, e.g. in sandbox_run.go the flow should be:

1) call CNI ADD
2) set up the container's isolation domains (e.g. Linux namespaces or
the Windows container resources that depend on the CNI ADD result)
3) call CHECK to make sure the network is alive, perhaps in a loop
with a timeout until ready. Should work for Linux too (rough sketch
below).

Note that CHECK is only present in CNI spec 0.4.0 and later, so the
plugin must advertise that version, the CNI JSON config needs to
specify 0.4.0 or later, and obviously it must implement CHECK :)
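
A minimal sketch of that flow, written against libcni rather than containerd's actual go-cni wrapper; createSandbox and the one-minute deadline are placeholders, not containerd's real code:

package sandbox

import (
	"context"
	"fmt"
	"time"

	"github.com/containernetworking/cni/libcni"
)

// runPodSandbox follows the ADD -> create sandbox -> CHECK-until-ready order.
func runPodSandbox(ctx context.Context, cni *libcni.CNIConfig, list *libcni.NetworkConfigList, rt *libcni.RuntimeConf) error {
	// 1) CNI ADD: allocates the endpoint/IP (the HNSEndpoint on Windows).
	if _, err := cni.AddNetworkList(ctx, list, rt); err != nil {
		return fmt.Errorf("CNI ADD failed: %w", err)
	}

	// 2) Create the isolation domain the result attaches to: Linux namespaces,
	//    or the Windows sandbox that actually materializes the vNIC.
	if err := createSandbox(ctx, rt.ContainerID); err != nil {
		return err
	}

	// 3) CNI CHECK in a loop until the network is alive or we give up.
	//    The conflist must declare cniVersion 0.4.0 or later and the plugin
	//    must implement CHECK for this to work.
	deadline := time.Now().Add(1 * time.Minute)
	for {
		err := cni.CheckNetworkList(ctx, list, rt)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("pod network never became ready: %w", err)
		}
		time.Sleep(2 * time.Second)
	}
}

// Hypothetical placeholder for the runtime's real sandbox creation.
func createSandbox(ctx context.Context, containerID string) error { return nil }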

Dan


Antonin Bas

Feb 18, 2021, 5:40:44 PM
to Dan Williams, Jay Vyas, kubernetes-sig-network
Thanks for following up on this, Dan. I'll sync up with Jay, and I think we can check with the containerd maintainers whether this is a change they would be open to (at least for Windows).

Antonin

Casey Callendrello

Feb 22, 2021, 10:38:07 AM
to Antonin Bas, Dan Williams, Jay Vyas, kubernetes-sig-network
Also, as a follow-up, CNI actually has an error code equivalent to "EAGAIN".


Arguably, the "most correct" behavior for ContainerD would be to bubble-up the CHECK failure when the error code isn't 11 (or a certain number of retries have passed).

-- Casey "the other Casey" Callendrello

Antonin Bas

Feb 23, 2021, 7:33:06 PM
to Casey Callendrello, Dan Williams, Jay Vyas, kubernetes-sig-network
Thanks Casey and Dan,

There was a meeting today with several folks (sig-windows, MSFT containers, RedHat, VMware) to discuss this proposal.
There seems to be consensus that calling CNI CHECK at the end of the containerd RunPodSandbox implementation and blocking is something we want to explore.
I'm aware that this is more in the realm of CNI and not so much in the realm of sig-network, but since we already have this thread, I figured I would follow up here with this question. During the meeting today we found this sentence in the CNI spec document:
> The plugin must handle CHECK being called immediately after an ADD, and therefore should allow a reasonable convergence delay for any asynchronous resources.
I interpret it as: it is ok for CHECK to block for a reasonable amount of time to wait for resources (e.g. network routes) whose creation was initiated by the CNI ADD call.
Given that, there was a suggestion that instead of calling CHECK in a loop, it would be ok for a plugin to block for a while and return when all resources are ready or a timeout expires. This means that 1) we don't need to call CHECK in a loop and spawn a process each time, 2) we get notified as soon as resources are ready, as opposed to having to wait for the next iteration, and 3) plugins can define their own "reasonable" timeout instead of containerd (or another container runtime) having to define an umbrella one for all plugins.
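
As a rough illustration of that alternative (a plugin-chosen deadline inside CHECK itself), under the usual caveats: adapterConfigured is a hypothetical probe and the 30-second delay is arbitrary.

package blockingcheck

import (
	"fmt"
	"time"

	"github.com/containernetworking/cni/pkg/types"
)

// cmdCheck blocks, up to a plugin-chosen convergence delay, for the
// asynchronous resources created by ADD, instead of relying on the
// runtime to poll CHECK in a loop.
func cmdCheck(endpointName string) error {
	const convergenceDelay = 30 * time.Second
	deadline := time.Now().Add(convergenceDelay)
	for {
		ready, err := adapterConfigured(endpointName)
		if err != nil {
			// Unrecoverable: let the runtime decide to restart the sandbox.
			return fmt.Errorf("network setup failed: %w", err)
		}
		if ready {
			return nil
		}
		if time.Now().After(deadline) {
			// Or return the "try again later" code if the runtime retries.
			return types.NewError(11, "network not converged", "timed out waiting for vNIC configuration")
		}
		time.Sleep(time.Second)
	}
}

// Hypothetical probe for whether the vNIC-dependent setup has completed.
func adapterConfigured(name string) (bool, error) { return true, nil }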

Another interesting topic came up during the discussion about CNI CHECK: the ability to prevent the Pod / workload from accessing the Pod network until the network plugin can ensure that a certain baseline of NetworkPolicies has been enforced. I know that this security concern has come up in the past (https://github.com/kubernetes/enhancements/pull/1329), and is related to the ongoing effort to introduce admin-level policies (ClusterNetworkPolicies) with support for default-deny.

But the CNI API and the K8s policy API(s) are orthogonal. Yet they are typically implemented by the same plugins, and I would believe that the implementations themselves may sometimes not be completely orthogonal, e.g. it would be possible for the CNI ADD implementation to communicate a Pod's IP address to the module in charge of implementing policies before the Pod Status is updated in the K8s API. Has there ever been an interest / need in having the CNI be able to "guarantee" to the container runtime / kubelet that a required set of baseline policies has been enforced on the Pod's network endpoint, before the Pod containers are actually run? I apologize in advance if this seems like a complete layer violation :)

Thanks,

Antonin

Dan Williams

Feb 23, 2021, 8:43:30 PM
to Antonin Bas, Casey Callendrello, Jay Vyas, kubernetes-sig-network
On Tue, 2021-02-23 at 16:32 -0800, Antonin Bas wrote:
> Thanks Casey and Dan,
>
> There was a meeting today with several folks (sig-windows, MSFT
> containers,
> RedHat, VMware) to discuss this proposal.
> There seems to be consensus that calling CNI CHECK at the end of the
> containerd RunPodSandbox implementation and blocking is something we
> want
> to explore.
> I'm aware that this is more in the realm of CNI and not so much in
> the
> realm of sig-network, but since we already have this thread, I
> figured I
> would follow-up here with this question. During the meeting today we
> found
> this sentence in the CNI spec document:
> > The plugin must handle CHECK being called immediately after an
> > ADD, and therefore should allow a reasonable convergence delay
> > for any asynchronous resources.
> I interpret it as: it is ok for CHECK to block for a reasonable
> amount of
> time to wait for resources (e.g. network routes) whose creation was
> initiated by the CNI ADD call.
> Given that, there was a suggestion that instead of calling CHECK in a
> loop,
> it would be ok for a plugin to block for a while and return when all
> resources are ready or a timeout expires. This means that 1) we don't
> need
> to call CHECK in a loop and spawn a process each time, 2) we get
> notified
> as soon as resources are ready, as opposed to having to wait for the
> next
> iteration, and 3) plugins can define their own "reasonable" timeout
> instead
> of containerd (or another container runtime) having to define an
> umbrella
> one for all plugins.

Yes, it seems reasonable to block for some time in CHECK. Note that the
runtime will likely kill you after some amount of time (~2m depending
on implementation) so you can't block forever. Kubelet might also bound
the time ADD can take, though I forget if we just discussed that or
actually implemented it.

Note that you need to make sure the plugin (whether yours, or the
reference CNI win-bridge/win-overlay) actually implements CHECK. None
of the Windows reference plugins currently do.

I don't see that as a layer violation. The network plugin may well be
responsible for NetworkPolicy, and I think it's definitely reasonable
to block ADD until the network is "ready", however the plugin wants to
define that.

But remember that NetworkPolicy can change at any time, including
during the ADD call. So maybe you take a snapshot of NetworkPolicy that
applies to the pod at some point during ADD and consider application of
that snapshot to be "ready".

There's certainly interest in guaranteeing that policy is applied to
the pod before returning from ADD, but that's usually a decision up to
the plugin during ADD (or subsequent CHECK I guess). I'm not sure it's
particularly useful to decouple the two from a Kubernetes point of
view.

There has also been discussion of Pod Readiness++ which tries to
formalize a mechanism for communicating when a pod is ready, IIRC
mostly for Service/Endpoint usage. NetworkPolicy could definitely feed
into that, but again would be an implementation detail of the plugin's
support for this.

Dan


Casey Callendrello

Feb 24, 2021, 8:12:59 AM
to Antonin Bas, Dan Williams, Jay Vyas, kubernetes-sig-network
> The plugin must handle CHECK being called immediately after an ADD, and therefore should allow a reasonable convergence delay for any asynchronous resources.

Yup, that's in the spec. However, we can change the spec. We are this close to cutting CNI 1.0, so if there's consensus that that line isn't helpful, it can be changed.

In other words, we could change it to something like "If the plugin determines that the container's networking is correctly configured, but is waiting for asynchronous resources to be created, it MAY return error code 11: try again later".

If this sounds reasonable, can you file an issue on https://github.com/containernetworking/cni/ ?

Antonin Bas

Feb 25, 2021, 4:09:29 PM
to Casey Callendrello, Dan Williams, Jay Vyas, kubernetes-sig-network
Hi Casey,

From what I could tell, folks were happy with the current spec and the blocking behavior (for the reasons I listed in my last email). So I am not sure there is a strong reason to change it.
My interpretation of the current spec is that in theory the plugin is already allowed to return EAGAIN for a CHECK call and that the container runtime should handle that case anyway.
I can open an issue to track this regardless; it may be time to move that part of the conversation out of the sig-network mailing list anyway :)

Thanks,

Antonin

jay vyas

Mar 9, 2021, 12:38:32 PM
to kubernetes-sig-network
FYI markdown notes... https://hackmd.io/z5SH3_w8Qp-M2V0Vkw4xqw 

Casey Callendrello

Mar 10, 2021, 4:13:03 PM
to Antonin Bas, Dan Williams, Jay Vyas, kubernetes-sig-network
Hi Antonin,

The intention of the spec, as written, was to disallow the previously described behavior. However, I think we all agree the spec is too restrictive in this case, so we're going to change it for 1.0.

-- Casey