I want to clarify my objections to this KEP since I'm not sure it was
coming across clearly in the meeting. (TL;DR: a "NodeIPAM
reconfiguration" feature is not a "dynamic pod network scaling" feature,
and pretending otherwise would eventually bite us.)
The issue is that the KEP provides a "NodeIPAM reconfiguration" feature,
but no user is actually saying "I wish I had a NodeIPAM reconfiguration
feature". What users want is a "dynamic pod network scaling" feature.
And in *some* environments (notably GKE), it happens that reconfiguring
the NodeIPAM controller is the only thing you need to be able to do if
you want to scale the pod network, so giving users a NodeIPAM
reconfiguration feature scratches their pod network scaling itch. But in
other (many? most?) clusters, reconfiguring NodeIPAM is just one part of
the "scaling the pod network" problem (or in some cases, isn't part of
the problem at all, because the cluster doesn't use kcm NodeIPAM). The
KEP completely ignored those kinds of clusters, and as a result, the
API it proposed wasn't necessarily even the right first step on the path
to a more generic "dynamic pod network scaling" API, and would, at best,
lead to us having redundant/incompatible configuration APIs in the future.
As an example of how it fails, if you are using flannel, then in
addition to having the NodeIPAM controller know what pod CIDRs to use,
flannel itself needs to also know what the pod CIDRs are, because it
creates iptables rules that need to distinguish pod IPs from non-pod
IPs. As a result, if you were to change the NodeIPAM configuration to
make it start using an additional CIDR for pod IPs, the network would
start to break in various ways as pods began getting IPs that flannel
would miscategorize as non-pod IPs.
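To make that concrete: flannel's masquerade rules are keyed on the
cluster CIDR, so anything matching it is treated as pod traffic and
anything else is not. Paraphrasing roughly (the exact chains and flags
vary by flannel version, and 10.244.0.0/16 is just the conventional
default, not anything from the KEP):

```
# don't NAT pod-to-pod traffic
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
# masquerade pod traffic leaving the cluster (except multicast)
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
# masquerade external traffic headed into the pod network
iptables -t nat -A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
```

If NodeIPAM starts handing out node subnets from a range flannel doesn't
know about, pods in that range fall on the wrong side of every one of
these rules.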
This is actually pretty common with many network implementations (maybe
even GKE-with-Cilium?), but the reason I mention flannel in particular
is because someone submitted a PR to make flannel also watch the
ClusterCIDR objects defined by KEP-2593
(https://github.com/flannel-io/flannel/pull/1658), so that clusters
with flannel can "support" the reconfiguration feature. But the problem
with this is that KEP-2593 doesn't consider the possibility of other
components using its objects at all, and even ignoring the "procedural"
aspects of that (e.g., the KEP PRR makes no attempt to work out the ways
that multiple components using the feature might interact), it ends up
that the patch wouldn't work right anyway, because if you delete a
ClusterCIDR, flannel would immediately "forget" about that CIDR and
delete the rules related to it, even though there might still be
existing pods using IPs from that CIDR. (The semantics of deleting a
ClusterCIDR according to KEP-2593 are just that NodeIPAM should stop
assigning *new* subnets out of it.)
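For reference, a KEP-2593 ClusterCIDR looks roughly like this (field
names are from the v1alpha1 API as implemented; the CIDR values are made
up). Note that nothing in the object records which already-allocated
subnets came out of it, which is why deleting one loses information:

```yaml
apiVersion: networking.k8s.io/v1alpha1
kind: ClusterCIDR
metadata:
  name: extra-pod-cidr
spec:
  perNodeHostBits: 8
  ipv4: "10.248.0.0/16"
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
```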
And while it would be possible to tweak the KEP to fix that particular
problem (e.g., having some "status" field somewhere that indicates
inactive-but-maybe-still-present CIDRs), that doesn't fix the larger
issue, which is that the reason this problem was there is because the
KEP had explicitly disavowed trying to solve any part of the "dynamic
pod network scaling" problem other than the NodeIPAM part.
While Antonio and I were discussing this, I wrote up the start of what I
imagine a "dynamic pod network scaling" KEP might look like,
https://github.com/danwinship/enhancements/tree/pod-network-scaling/keps/sig-network/2593bis-pod-network-scaling.
The basic idea is that the pod network configuration is owned by the
"pod network implementation" (i.e., flannel, Calico, ovn-kubernetes, etc.;
the thing people call "the CNI", but shouldn't), and that includes
things like "whether kcm NodeIPAM is being used" (and "whether
kube-proxy is being used"). So admins shouldn't be reconfiguring
NodeIPAM directly; there should be an API they can use to tell the pod
network implementation what they want to do, and then the pod network
implementation can reconfigure itself and its internal subcomponents
(NodeIPAM, kube-proxy, etc) as it sees fit.
But that proto-KEP is only half-written and full of UNRESOLVED sections
and "well, actually, what I just said won't really work", etc. It would
need a lot of work.
(But moving NodeIPAM out of kube-controller-manager probably helps with
that, because that suggests a model where the decision of whether (and
how) to deploy the NodeIPAM controller is more explicitly tied to the
configuration of the network implementation, in the same way that the
decision about whether to deploy kube-proxy is now.)
-- Dan
PS - My proto-KEP talks about an object called "PodNetwork" and points
out that the Multi-Network KEP is also using an object by that name...
And I thought it was interesting that, in the slides Maciej presented on
Thursday, there's one showing a PodNetwork having multiple different
kinds of configuration parameters
(https://docs.google.com/presentation/d/1E6rfXD8dSGTpMyTjCKmOZPxyLf_qn_gI-_qs-RRAdbM/edit#slide=id.g2478fafae35_0_328).
Perhaps a "default" PodNetwork object containing one parameterRefs
pointing to a "FlannelConfig" / "CalicoConfig" / "OVNKubernetesConfig",
etc and a second parameterRefs pointing to a "networkingv1.ClusterCIDR"
/ "networkingv1.NodeIPAMConfig", etc?
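Something like this, maybe (all of these names, groups, and versions are
hypothetical; the parameterRefs pattern is the same
group/kind/name-reference idea Gateway API uses for implementation-
specific config):

```yaml
apiVersion: example.k8s.io/v1alpha1   # hypothetical group/version
kind: PodNetwork
metadata:
  name: default
spec:
  parameterRefs:
  # implementation-specific configuration
  - group: flannel.io                 # hypothetical
    kind: FlannelConfig
    name: default
  # generic IPAM configuration, usable by any implementation
  - group: networking.k8s.io
    kind: ClusterCIDR
    name: default
```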
On 9/28/23 13:52, Antonio Ojea wrote:
> Hi all,
>
> Per discussion on today's sig-network meeting about KEP-2593
> (https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2593-multiple-cluster-cidrs),
> we have decided that we are not going to move forward with the existing
> plan, so we are not going to implement it in-tree, extending the
> kube-controller-manager node-ipam controller and adding a new
> ClusterCIDR object under the networking.k8s.io group.
>
> The new plan is to create a new node-ipam controller with its own CRDs
> in a separate repository under the sig-network umbrella, so it still can
> be used independently or consumed by other projects.
>
> As a follow up I will update the KEP to reflect the current status and
> request a new repository for this new project.
>
> This will also be a great opportunity for growing new contributors in
> sig-network, because it is a relatively small sized and well defined
> project, so if you are interested in collaborating please let me know.
>
> Regards,
> Antonio Ojea
>