Kubelet Node Addresses and External Cloud Providers


Antonio Ojea

Sep 28, 2023, 3:01:02 PM
to kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig...@googlegroups.com, dev
Hi all,

signal boosting this problem as it impacts multiple SIGs.

The kubelet has different heuristics for assigning addresses to the Node object, but with the recent push to remove the in-tree cloud providers, we found a problem in the kubelet's existing logic for assigning node addresses when using "--cloud-provider=external".

In this mode, when the kubelet starts, if the Node object does not have any node.status.addresses, the kubelet temporarily uses the same logic as if no cloud provider were configured to assign addresses, and then allows the external cloud provider to override them. There is no problem if the external cloud provider and the kubelet agree on the addresses; however, this is not always the case, as we can see in https://github.com/kubernetes/kubernetes/issues/120720.
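To illustrate the shortcut, here is a minimal sketch (not the actual kubelet source; detectLocalAddresses is a hypothetical stand-in for the kubelet's local heuristics such as --node-ip, hostname lookup, or the default interface):

```
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// detectLocalAddresses is a hypothetical stand-in for the kubelet's
// local address heuristics (--node-ip, hostname lookup, default
// interface).
func detectLocalAddresses() []corev1.NodeAddress {
	return []corev1.NodeAddress{
		{Type: corev1.NodeInternalIP, Address: "10.0.0.1"},
	}
}

// syncNodeAddresses sketches the racy shortcut: even with
// --cloud-provider=external, the kubelet fills in addresses itself
// when the Node has none, and the external cloud provider later
// overrides them, possibly with different values.
func syncNodeAddresses(node *corev1.Node) {
	if len(node.Status.Addresses) == 0 {
		node.Status.Addresses = detectLocalAddresses()
	}
}
```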

It is important to mention that this is especially problematic because hostNetwork pods use the node addresses to populate the HostIPs and pod.status.PodIPs, with all the implications that has: Endpoints, Downward API, ... Another important detail is that we decided the kubelet will not mutate the HostIPs; it has to be restarted (https://github.com/kubernetes/enhancements/issues/2999), and this will cause the race to happen again.

My proposal is to avoid this shortcut in the kubelet and wait for the external cloud provider to assign the node addresses. I also gated this change behind a feature gate, to be safe and to have a way to roll back the decision if needed: https://github.com/kubernetes/kubernetes/pull/120751.

As a consequence, this behavior presents an additional problem: the node will be ready but may not have addresses until the external cloud provider assigns them. To solve this race, I think we should make node.status.addresses part of node readiness (https://github.com/kubernetes/kubernetes/pull/120753), so no pod will be scheduled until the node has addresses.
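Conceptually, the readiness change is along these lines (an illustrative sketch, not the code in the PR):

```
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// nodeReady sketches the proposed extra readiness condition: a Node
// without addresses reports NotReady, so the scheduler will not
// place pods on it until the external cloud provider has populated
// node.status.addresses.
func nodeReady(node *corev1.Node) bool {
	if len(node.Status.Addresses) == 0 {
		return false // still waiting for the external cloud provider
	}
	// ... the existing readiness checks (runtime, network, ...) run here
	return true
}
```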

Regards,
Antonio Ojea




jay vyas

Sep 28, 2023, 6:37:30 PM
to Antonio Ojea, kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig...@googlegroups.com, dev
Theoretically sounds good, curious though...

will DIY or edge clusters that aren't properly running CPIs be DOA due to network metadata starvation that could easily be guessed?


Aaron U'Ren

Sep 28, 2023, 7:13:24 PM
to jay vyas, Antonio Ojea, kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig...@googlegroups.com, dev
I'm relatively new here, so forgive me if I'm missing something obvious, but if the node is marked as not ready, might this introduce a circular dependency if the external provider needs to run on the node in order to give it addresses so that it can become ready?

Specifically, I'm thinking about brand new clusters that don't yet have any nodes in a ready status. If the cloud provider runs as a pod in the cluster, then the cloud provider's pods wouldn't schedule because of the lack of ready nodes, right?

-Aaron


antonio.o...@gmail.com

Sep 30, 2023, 6:02:52 AM
to kubernetes-sig-network
That is a good point, Aaron; let me dig deeper into that problem and document it. I think that using tolerations will solve the problem of scheduling to a not-ready node, but I want to double-check that it works fine.
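For reference, a CCM pod needs tolerations along these lines to land on a node that is still tainted (a sketch using the well-known taint keys; the exact set a given CCM needs should be verified against its manifests):

```
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// ccmTolerations sketches the tolerations that let a CCM pod schedule
// onto a node that is tainted as uninitialized by the external cloud
// provider machinery, or as not ready by the node controller.
var ccmTolerations = []corev1.Toleration{
	{
		Key:    "node.cloudprovider.kubernetes.io/uninitialized",
		Value:  "true",
		Effect: corev1.TaintEffectNoSchedule,
	},
	{
		Key:      "node.kubernetes.io/not-ready",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	},
}
```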

Dan Winship

Sep 30, 2023, 10:47:34 AM
to Antonio Ojea, kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig...@googlegroups.com, dev
On 9/28/23 15:00, Antonio Ojea wrote:
> It is important to mention that this is especially problematic
> because hostNetwork pods use the node addresses to populate the
> HostIPs and pod.status.PodIPs, with all the implications that has:
> Endpoints, Downward API, ... Another important detail is that we
> decided the kubelet will not mutate the HostIPs; it has to be
> restarted (https://github.com/kubernetes/enhancements/issues/2999),
> and this will cause the race to happen again.

Mmm, no, the race only happens the *first* time kubelet starts, when
there is no existing Node object. If you restart kubelet after the Node
exists and the cloud provider has filled in .Status.Addresses, then when
kubelet starts up again it will see that there are already addresses
there and so it won't run its own address-detecting code again.

-- Dan

Antonio Ojea

Sep 30, 2023, 4:07:59 PM
to Dan Winship, kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig...@googlegroups.com, dev
Yes, that's correct, thanks for catching that.

Antonio Ojea

Oct 6, 2023, 10:14:22 AM
to Dan Winship, kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig...@googlegroups.com, dev
I think I've found a solution that solves the problem without breaking compatibility: https://github.com/kubernetes/kubernetes/pull/121028


Please review if you are impacted by this problem. This release we are moving KEP-2395 (Removing In-Tree Cloud Providers) to beta, and it is critical that we fix this behavior ASAP.

Thanks 

Davanum Srinivas

Oct 6, 2023, 5:26:42 PM
to Antonio Ojea, Dan Winship, kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig...@googlegroups.com, dev
Thanks Antonio. I will merge the PR as it has consensus now. Let's see if we spot any hiccups in the CI over the next few days.


Antonio Ojea

Oct 9, 2023, 5:50:26 AM
to Joel Speed, Dan Winship, kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig...@googlegroups.com, dev
Good question, Joel, thanks for bringing this up. It is not a breaking change; the previous behavior was clearly a bug, because it assumed that the external cloud provider and the kubelet autodiscovery would be in sync, and we heavily discussed the problems of mutating PodIPs on hostNetwork pods.

You don't have to specify the node-ip flag; pods will be deployed as always and will run normally. Only for hostNetwork pods will the PodIPs be updated later... If your pod startup logic somehow depends on the PodIPs, e.g. using the Downward API to get the value, then of course you'll have to wait, but it would be wrong for a CCM that is deployed in a pod to depend on the PodIPs, since it is the one that has to provide them.
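For reference, this is what consuming the host IP via the Downward API looks like, expressed with the Go API types (status.hostIP is the standard fieldPath):

```
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// hostIPEnv exposes the pod's status.hostIP as an environment
// variable; for a hostNetwork pod this is exactly the value that may
// only be filled in after the cloud provider sets the node addresses.
var hostIPEnv = corev1.EnvVar{
	Name: "HOST_IP",
	ValueFrom: &corev1.EnvVarSource{
		FieldRef: &corev1.ObjectFieldSelector{
			FieldPath: "status.hostIP",
		},
	},
}
```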

The example below shows how a pod of a daemonset with hostNetwork: true, kindnet-nzgbk, runs on a node, "external-worker", where the kubelet uses an external cloud provider and no node-ip flag is set:

```
$ kubectl get pods -A -o wide
NAMESPACE            NAME                                             READY   STATUS    RESTARTS   AGE     IP            NODE                     NOMINATED NODE   READINESS GATES
kube-system          kindnet-4m67t                                    1/1     Running   0          4m1s    192.168.8.7   external-control-plane   <none>           <none>
kube-system          kindnet-5bn5q                                    1/1     Running   0          3m57s   192.168.8.8   external-worker2         <none>           <none>
kube-system          kindnet-nzgbk                                    1/1     Running   0          3m57s   <none>        external-worker          <none>           <none>

```

On Mon, 9 Oct 2023 at 10:14, Joel Speed <jsp...@redhat.com> wrote:
Hey Antonio,

With your solution, IIUC, if a topology relies on the CCM being a hostNetwork pod on the control plane nodes, this means that control plane Nodes will always need to specify `--node-ip`, is that correct?
And is that not considered to be a breaking change?

I expect that at the moment there are a number of deployments of the CCM where there is an expectation that the CCM can run on your first, uninitialized control plane node. If these deployments haven't set the `--node-ip` flag up to this point, when they are upgraded to 1.29 they will be required to change their cluster installation process to set the `--node-ip` flag going forward.

Thanks,
Joel



Joel Speed

Oct 9, 2023, 11:06:45 AM
to Antonio Ojea, Dan Winship, kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig...@googlegroups.com, dev
Hey Antonio,

With your solution, IIUC, if a topology relies on the CCM being a hostNetwork pod on the control plane nodes, this means that control plane Nodes will always need to specify `--node-ip`, is that correct?
And is that not considered to be a breaking change?

I expect that at the moment there are a number of deployments of the CCM where there is an expectation that the CCM can run on your first, uninitialized control plane node. If these deployments haven't set the `--node-ip` flag up to this point, when they are upgraded to 1.29 they will be required to change their cluster installation process to set the `--node-ip` flag going forward.

Thanks,
Joel

Joel Speed

He/Him/His

Principal Software Engineer, OpenShift

Red Hat



Antonio Ojea

Jun 16, 2024, 12:04:08 PM
to kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig-cloud-provider, kubernetes-sig-auth, dev
Hi,

Following up on this topic, as the existing behavior has broken some deployments: https://github.com/kubernetes/kubernetes/issues/125348 (I apologize for this :( )

It seems there is consensus on using https://github.com/kubernetes/kubernetes/pull/125337 to fix it.

In addition, due to how brittle this area of the codebase is, I've created the following document to try to consolidate and document the existing problems: https://docs.google.com/document/d/1mqdVLQHIYjrzjy8Hq-FMysthHRL7ht0guk2uWI0FxKM/edit

I encourage all the SIG leads and people working on downstream Kubernetes deployments that depend on external cloud providers to chime in.

Regards,
Antonio Ojea

Antonio Ojea

Jun 30, 2024, 5:41:44 PM
to Michael McCune, kubernetes-sig-network, kubernetes-sig-node, kubernetes-sig-cloud-provider, kubernetes-sig-auth, dev


On Tue, 18 Jun 2024 at 10:52, Michael McCune <elm...@redhat.com> wrote:


On Sun, Jun 16, 2024 at 6:04 PM Antonio Ojea <antonio.o...@gmail.com> wrote:
[...]

Excellent summary in the doc, Antonio; thank you for putting this together.

I like the notion of extending the "0.0.0.0" or "::" discovery behavior to work in a similar manner with external cloud providers, although I guess this would also put pressure on the CCM to choose the proper interface through configuration (which I would like) or some other mechanism. But given the discussion on https://github.com/kubernetes/kubernetes/pull/125300, it makes me think there will need to be some changes in the cloud-provider framework, and potentially the CCMs, to allow the discovery behavior. Is that accurate? Have I missed some details?

I don't fully understand this question. The unspecified address tells the kubelet to "use the IP address associated with the interface used by the default gateway", and the kubelet adds it to node.status.addresses when it creates the Node object. Once the Node object is created, it is processed by the external cloud provider, and it is up to the cloud provider implementation to decide what node addresses it reports in status.
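The route lookup behind that is implemented by a helper in apimachinery; roughly (an illustration of the behavior, not the exact kubelet call site):

```
package main

import (
	"fmt"

	utilnet "k8s.io/apimachinery/pkg/util/net"
)

// ChooseHostInterface walks the routing table and returns the IP of
// the interface that holds the default route, which is what the
// kubelet resolves 0.0.0.0 / :: to when creating the Node object.
func main() {
	ip, err := utilnet.ChooseHostInterface()
	if err != nil {
		panic(err)
	}
	fmt.Println("default-route address:", ip)
}
```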
 
