KEP-2593: Enhanced NodeIPAM to support Discontiguous Cluster CIDR


Antonio Ojea

Sep 28, 2023, 1:53:00 PM
to kubernetes-sig-network
Hi all,

Per the discussion in today's sig-network meeting about KEP-2593 https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2593-multiple-cluster-cidrs, we have decided not to move forward with the existing plan, i.e. we are not going to implement it in-tree by extending the kube-controller-manager node-ipam controller and adding a new ClusterCIDR object under the networking.k8s.io group.

The new plan is to create a new node-ipam controller, with its own CRDs, in a separate repository under the sig-network umbrella, so it can still be used independently or consumed by other projects.

As a follow-up, I will update the KEP to reflect the current status and request a new repository for this new project.

This will also be a great opportunity for growing new contributors in sig-network, because it is a relatively small, well-defined project, so if you are interested in collaborating please let me know.

Regards,
Antonio Ojea

Dan Winship

Oct 1, 2023, 10:55:47 AM
to Antonio Ojea, kubernetes-sig-network
I want to clarify my objections to this KEP since I'm not sure they were
coming across clearly in the meeting. (TL;DR: a "NodeIPAM
reconfiguration" feature is not a "dynamic pod network scaling" feature,
and pretending otherwise would eventually bite us.)

The issue is that the KEP provides a "NodeIPAM reconfiguration" feature,
but no user is actually saying "I wish I had a NodeIPAM reconfiguration
feature". What users want is a "dynamic pod network scaling" feature.
And in *some* environments (notably GKE), it happens that reconfiguring
the NodeIPAM controller is the only thing that you need to be able to do if
you want to scale the pod network, so giving users a NodeIPAM
reconfiguration feature scratches their pod network scaling itch. But in
other (many? most?) clusters, reconfiguring NodeIPAM is just one part of
the "scaling the pod network" problem (or in some cases, isn't part of
the problem at all because the cluster doesn't use kcm NodeIPAM). But
the KEP completely ignored those kinds of clusters, and as a result, the
API it proposed wasn't necessarily even the right first step on the path
to a more generic "dynamic pod network scaling" API, and would, at best,
lead to us having redundant/incompatible configuration APIs in the future.

As an example of how it fails, if you are using flannel, then in
addition to having the NodeIPAM controller know what pod CIDRs to use,
flannel itself needs to also know what the pod CIDRs are, because it
creates iptables rules that need to distinguish pod IPs from non-pod
IPs. As a result, if you were to change the NodeIPAM configuration to
make it start using an additional CIDR for pod IPs, the network would
start to break in various ways as pods began getting IPs that flannel
would miscategorize as non-pod IPs.
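
(For illustration only, and not flannel's actual code: a minimal Go
sketch, assuming the component keeps a configured list of cluster CIDRs,
of the kind of pod-vs-non-pod classification that silently breaks when
NodeIPAM starts handing out addresses from a CIDR the component doesn't
know about.)

package main

import (
	"fmt"
	"net/netip"
)

// podCIDRs is the set of cluster CIDRs this component was configured with
// at startup. It has no way to learn about CIDRs added later.
var podCIDRs = []netip.Prefix{
	netip.MustParsePrefix("10.244.0.0/16"),
}

// isPodIP decides whether traffic from this address should be treated as
// pod traffic (e.g. not masqueraded). Any pod IP allocated from a CIDR
// missing from podCIDRs is miscategorized as non-pod.
func isPodIP(ip netip.Addr) bool {
	for _, cidr := range podCIDRs {
		if cidr.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isPodIP(netip.MustParseAddr("10.244.1.5"))) // true
	// If NodeIPAM were reconfigured to also assign subnets out of
	// 10.96.0.0/12, pods there would be treated as non-pod traffic:
	fmt.Println(isPodIP(netip.MustParseAddr("10.96.12.3"))) // false
}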

This is actually pretty common with many network implementations (maybe
even GKE-with-Cilium?), but the reason I mention flannel in particular
is because someone submitted a PR to make flannel also watch the
ClusterCIDR objects defined by KEP-2593
(https://github.com/flannel-io/flannel/pull/1658), so that clusters with
flannel can then "support" the reconfiguration feature. But the problem
with this is that KEP-2593 doesn't consider the possibility of other
components using its objects at all, and even ignoring the "procedural"
aspects of that (e.g., the KEP PRR makes no attempt to work out the ways
that multiple components using the feature might interact), it ends up
that the patch wouldn't work right anyway, because if you delete a
ClusterCIDR, flannel would immediately "forget" about that CIDR and
delete the rules related to it, even though there might still be
existing pods using IPs from that CIDR. (The semantics of deleting a
ClusterCIDR according to KEP-2593 are just that NodeIPAM should stop
assigning *new* subnets out of it.)

And while it would be possible to tweak the KEP to fix that particular
problem (e.g., having some "status" field somewhere that indicates
inactive-but-maybe-still-present CIDRs), that doesn't fix the larger
issue: this problem exists precisely because the KEP explicitly
disavowed trying to solve any part of the "dynamic pod network scaling"
problem other than the NodeIPAM part.


While Antonio and I were discussing this, I wrote up the start of what I
imagine a "dynamic pod network scaling" KEP might look like,
https://github.com/danwinship/enhancements/tree/pod-network-scaling/keps/sig-network/2593bis-pod-network-scaling.

The basic idea is that the pod network configuration is owned by the
"pod network implementation" (ie, flannel, Calico, ovn-kubernetes, etc;
the thing people call "the CNI", but shouldn't), and that includes
things like "whether kcm NodeIPAM is being used" (and "whether
kube-proxy is being used"). So admins shouldn't be reconfiguring
NodeIPAM directly; there should be an API they can use to tell the pod
network implementation what they want to do, and then the pod network
implementation can reconfigure itself and its internal subcomponents
(NodeIPAM, kube-proxy, etc) as it sees fit.

But that proto-KEP is only half-written and full of UNRESOLVED sections
and "well, actually, what I just said won't really work", etc. It would
need a lot of work.

(But moving NodeIPAM out of kube-controller-manager probably helps with
that, because that suggests a model where the decision of whether (and
how) to deploy the NodeIPAM controller is more explicitly tied to the
configuration of the network implementation, in the same way that the
decision about whether to deploy kube-proxy is now.)

-- Dan

PS - My proto-KEP talks about an object called "PodNetwork" and points
out that the Multi-Network KEP is also using an object by that name...
And I thought it was interesting that, in the slides Maciej presented on
Thursday, there's one showing a PodNetwork having multiple different
kinds of configuration parameters
(https://docs.google.com/presentation/d/1E6rfXD8dSGTpMyTjCKmOZPxyLf_qn_gI-_qs-RRAdbM/edit#slide=id.g2478fafae35_0_328).
Perhaps a "default" PodNetwork object containing one parameterRefs entry
pointing to a "FlannelConfig" / "CalicoConfig" / "OVNKubernetesConfig",
etc., and a second parameterRefs entry pointing to a
"networkingv1.ClusterCIDR" / "networkingv1.NodeIPAMConfig", etc.?


Antonio Ojea

Oct 1, 2023, 4:19:40 PM
to Dan Winship, kubernetes-sig-network
One clarification: the source of truth for pod networks is the set of CIDRs assigned in node.spec.PodCIDRs, not the ClusterCIDRs, for all the reasons that you explain.
The implementations that want to know the Pod CIDRs assigned must watch the Nodes; those fields are immutable and, as we discussed extensively, this only works for network plugins/implementations that use node.spec.PodCIDR, something that we cannot generalize. Personally, that is the fundamental problem and the main reason why this should not be core/in-tree.
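
(As an illustration, and not taken from any existing implementation: a minimal client-go sketch of that pattern, watching Nodes and reading node.spec.PodCIDRs; the kubeconfig handling is just an assumption to keep the example self-contained.)

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node := obj.(*corev1.Node)
			// node.Spec.PodCIDRs is immutable once set; it is the
			// per-node record of the subnets actually assigned.
			fmt.Printf("node %s pod CIDRs: %v\n", node.Name, node.Spec.PodCIDRs)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep watching until killed
}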

Tim Hockin

Oct 2, 2023, 2:10:03 PM
to Dan Winship, Antonio Ojea, kubernetes-sig-network
Just small points, I think we're violently agreeing :)

On Sun, Oct 1, 2023 at 7:55 AM Dan Winship <dwin...@redhat.com> wrote:

> The issue is that the KEP provides a "NodeIPAM reconfiguration" feature,
> but no user is actually saying "I wish I had a NodeIPAM reconfiguration
> feature". What users want is a "dynamic pod network scaling" feature.

I'm not sure it's "no user". Where we violently agree is that the
KEP, as proposed, was a partial solution in most cases. Growing your
cluster requires:
1) More IP space is made available; AND
2) Someone is empowered to consume that IP space

The KEP solves (2) but not (1). Solving (1) is not something we can
generalize without knowing a whole lot more about the underlying
network. That does not mean that solving (2) is useless, but it is
insufficient.

> And in *some* environments (notably GKE), it happens that reconfiguring
> the NodeIPAM controller is the only thing that you need to be able to if
> you want to scale the pod network, so giving users a NodeIPAM
> reconfiguration feature scratches their pod network scaling itch.

I don't think you are finger-pointing, but I don't want this to be
misinterpreted. Any environment COULD use this KEP, but most would do
so in concert with a solution to (1) above. GKE, for example, can
only use it if the VPC network has IP space, meaning (1) is not
strictly a no-op; it just happens that many cases have already done
that step.

> other (many? most?) clusters, reconfiguring NodeIPAM is just one part of
> the "scaling the pod network" problem (or in some cases, isn't part of
> the problem at all because the cluster doesn't use kcm NodeIPAM).

Agree. While I don't really have "a problem" with k8s core offering
APIs that are not used in every environment, it is sort of a
pattern-smell. The big mistake here (IMO) is that this API *appears*
to be authoritative, when it is not. That's a trap we should not set.

> As an example of how it fails, if you are using flannel, then in
> addition to having the NodeIPAM controller know what pod CIDRs to use,
> flannel itself needs to also know what the pod CIDRs are, because it
> creates iptables rules that need to distinguish pod IPs from non-pod
> IPs. As a result, if you were to change the NodeIPAM configuration to
> make it start using an additional CIDR for pod IPs, the network would
> start to break in various ways as pods began getting IPs that flannel
> would miscategorize as non-pod IPs.

Great example, thanks. What we don't have is a single place where
things like Flannel and IPAM systems (and others) can learn the
complete list of CIDRs which cover the whole cluster. That doesn't
exist because a) doing that authoritatively (IOW adding a CIDR here
makes it happen) is a pretty deep integration point with a lot of
complexity; and b) doing that reflectively (non-authoritative, just
description) is likely to be incomplete in the face of a broad
ecosystem.

Should we produce such a "canonical-but-not-authoritative" API? Maybe.

Should we do it in the core of k8s? No (IMO).

Is this KEP destined to be that API? I don't know, but it's not
sufficient as-is (see Dan's analysis up-thread).

This is interesting work, which springs from real problems experienced
by real users. Doing it out-of-core should make it easier to explore.

Tim

Dan Winship

Oct 2, 2023, 4:04:03 PM
to Tim Hockin, Antonio Ojea, kubernetes-sig-network
On 10/2/23 14:09, 'Tim Hockin' via kubernetes-sig-network wrote:
> Just small points, I think we're violently agreeing :)
>
> On Sun, Oct 1, 2023 at 7:55 AM Dan Winship <dwin...@redhat.com> wrote:
>
>> The issue is that the KEP provides a "NodeIPAM reconfiguration" feature,
>> but no user is actually saying "I wish I had a NodeIPAM reconfiguration
>> feature". What users want is a "dynamic pod network scaling" feature.
>
> I'm not sure it's "no user". Where we violently agree is that the
> KEP, as proposed, was a partial solution in most cases. Growing your
> cluster requires:
> 1) More IP space is made available; AND
> 2) Someone is empowered to consume that IP space

and

3) Additional components in the cluster that aren't the ones *consuming*
the IP space, but which still need to know what IP space is being used,
have some way of knowing it.

> The KEP solves (2) but not (1).

And not (3).

>> And in *some* environments (notably GKE), it happens that reconfiguring
>> the NodeIPAM controller is the only thing that you need to be able to if
>> you want to scale the pod network, so giving users a NodeIPAM
>> reconfiguration feature scratches their pod network scaling itch.
>
> I don't think you are finger-pointing, but I don't want this to be
> misinterpreted. Any environment COULD use this KEP, but most would do
> so in concert with a solution to (1) above. GKE, for example, can
> only use it if the VPC network has IP space, meaning (1) is not
> strictly a no-op, it just happens that many cases have already done
> that step.

Actually, no, that's not at all the point I was trying to make. :-) (I
wasn't considering the situation "outside" of Kubernetes at all.)

What I was trying to say is that, in GKE, if you examine all of the
different components *inside the Kubernetes cluster* that are involved
in pod networking (kube-proxy, NodeIPAM, the CNI plugin, the
NetworkPolicy controller, etc), then (apparently) you will find that the
NodeIPAM controller is the only one of those components that needs to
know the full set of CIDRs that are available to be used for pod IPs;
kube-proxy doesn't care, the CNI plugin doesn't care, the NetworkPolicy
implementation doesn't care, etc. And so when KEP-2593 lets you
reconfigure the NodeIPAM controller, it is effectively letting you
reconfigure the pod network.

But this is not the case in most(?) other environments. E.g., as in the
example I gave, in a simple cluster using flannel (whether on GCE or
Azure or bare metal or wherever), there are *two* components that need
to know the full set of CIDRs that are available to be used for pod IPs
(NodeIPAM and flannel itself). I am pretty sure Calico and Cilium behave
similarly (and both are also able to use their own internal node IPAM
implementations instead of kcm's). In an OpenShift cluster, there are at
least four components that want to know the full set of cluster CIDRs in
use, and none of them is kcm. And KEP-2593 completely ignores these
possibilities. In these clusters, when you use KEP-2593 to reconfigure
the NodeIPAM controller, it either breaks pod networking, or it has no
effect.

So, yeah, I really was trying to say "the KEP solves the problem for GKE
but probably not for most other people". But not in a finger-pointy sort
of way, just in a "hey, you forgot about this other part of the
problem", this-is-why-we-have-KEPs sort of way. (And I'm sorry I didn't
catch this earlier... when I read the KEP initially, I pointed out that
it wasn't useful to people who didn't use NodeIPAM, but I had thought it
at least worked for everyone who *did* use NodeIPAM.)

(This was also why I suggested before
[https://groups.google.com/g/kubernetes-sig-network/c/Ga9SWGs00k4] that
we should create a wiki where everyone can describe how different k8s
networking implementations work, so that we can hopefully all get a
better sense of "things that would work in my network that wouldn't work
in other people's networks".)

-- Dan

Nick Young

Oct 3, 2023, 2:13:30 AM
to Dan Winship, Tim Hockin, Antonio Ojea, kubernetes-sig-network
It sounds to me like exploring this out-of-tree is a great idea, but for anyone who _does_ start working on this out-of-tree, _please_ try and talk to Rob, Shane, and/or me about what we've learned about out-of-tree development on Gateway API before you go too far. I guarantee you there are problems you won't think of until they bite you.

Coincidentally, Rob and I are giving a talk about this at the Contrib Summit in Chicago as well, plug plug. (https://kcsna2023.sched.com/event/1Sp9u/lessons-learned-building-a-ga-api-with-crds)

But seriously, there are a heap of easy-to-avoid-once-you-know things that will really speed things up for whoever is implementing this.

Nick


antonio.o...@gmail.com

Nov 15, 2023, 4:51:13 AM
to kubernetes-sig-network
> This will also be a great opportunity for growing new contributors in sig-network, because it is a relatively small sized and well defined project, so if you are interested in collaborating please let me know.

I'm positively surprised by the number of requests I've received from people wanting to help and collaborate on this effort.

> The new plan is to create a new node-ipam controller with its own CRDs in a separate repository under the sig-network umbrella, so it still can be used independently or consumed by other projects.

Max Neverov reached out to me a month ago and has already started to work on this in his personal repo https://github.com/mneverov/cluster-cidr-controller/

I think the best way to proceed is to have a working prototype that we can demo in one of the sig-network meetings, and then request an official subproject under sig-network.

Since there are multiple people interested in collaborating on this, they can sync in the sig-network Slack to decide the best way to work together, or just work async via GitHub issues or on the WIP PR https://github.com/mneverov/cluster-cidr-controller/pull/7

In case you feel I can help, I usually have some time slots free on Fridays: https://calendar.app.google/bQzXfGVHg4CAvWgW9

Antonio Ojea

Dec 6, 2023, 5:41:39 PM
to kubernetes-sig-network
Hi all,

I think the project implementing ClusterCIDR as a CRD has reached the minimum quality necessary to be part of SIG Network.
There is also a committed group of people behind it to make it sustainable.

I've added this topic to tomorrow's agenda so Max can present a demo, we can discuss any outstanding issues and clarify doubts, and, if everything is OK, request a repository in kubernetes-sigs/ and discuss the API-review process for using the .x-k8s.io namespace, though that last one should not be much of an issue since the existing CRD is the same as in the KEP.

Regards,
A.Ojea
