Re: [k8s-sig-net] Re: Considering the retirement of the KPNG project


Mikaël Cluseau

Apr 13, 2026, 5:48:05 PM
to sig-n...@kubernetes.io
Hi all,

I wanted to come back to this thread, but honestly found it hard to do so without some emotional interference — this project meant a lot to me, even if it's far back now. So I used an AI assistant to help me put this into words more clearly than I probably would have on my own (or even could).

The design intent of kpng was never primarily about performance. The diff+apply engine emerged from a specific architectural concern: kube-proxy, in all its backends, conflates two very different responsibilities — reasoning about what a service means on a given node (topology, endpoint state, policy), and actually programming the local dataplane. Every backend reimplements that reasoning independently, which means semantics drift, and every new backend author has to get it right from scratch.

The goal was for Kubernetes to own that first layer — the translation from cluster-level configuration to a resolved, node-local view — and hand backends something simple and already correct to program against. Not a performance optimization, but a correctness and ownership boundary.
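
To make that boundary concrete, here is a rough Go sketch of what "something simple and already correct" could look like. The types and names are hypothetical, not kpng's actual gRPC API: the core does all the resolving, and a backend only implements apply callbacks.

    package main

    import "fmt"

    // NodeLocalService is the resolved, node-local view a backend receives.
    // Topology, endpoint selection, and policy have already been applied by
    // the core, so the backend only has to program the local dataplane.
    type NodeLocalService struct {
        Name      string
        ClusterIP string
        Port      int32
        Endpoints []string // endpoint IPs, already filtered for this node
    }

    // Backend is the diff+apply contract: the core calls these with changes
    // only, never with raw cluster-level objects.
    type Backend interface {
        SetService(svc NodeLocalService) // create or update
        DeleteService(name string)       // remove
        Sync()                           // flush pending changes to the dataplane
    }

    // loggingBackend is a trivial Backend that just prints the diffs; a real
    // backend would program nftables, ipvs, eBPF, and so on.
    type loggingBackend struct{}

    func (loggingBackend) SetService(svc NodeLocalService) {
        fmt.Printf("apply %s -> %s:%d endpoints=%v\n",
            svc.Name, svc.ClusterIP, svc.Port, svc.Endpoints)
    }

    func (loggingBackend) DeleteService(name string) { fmt.Println("delete", name) }

    func (loggingBackend) Sync() { fmt.Println("sync") }

    func main() {
        var b Backend = loggingBackend{}
        b.SetService(NodeLocalService{
            Name:      "default/demo",
            ClusterIP: "10.96.0.10",
            Port:      80,
            Endpoints: []string{"10.244.1.5", "10.244.2.7"},
        })
        b.Sync()
    }

The point of the sketch is that semantic correctness lives above the interface, once, instead of being reimplemented inside every backend.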

On the process critiques: I think Dan's and Antonio's points are fair. The KEP wasn't merged, and without that formal consensus the WG was building something the SIG hadn't fully committed to. That's a real failure mode and I don't dispute it.

On the test flakiness specifically: the flakes Antonio flagged were specific to GitHub Actions. The same tests ran cleanly on dedicated hardware. What I meant by "we needed testgrid's conditions" was exactly that — we needed access to stable CI infrastructure to produce results the SIG could trust in either direction. That was a resource access problem, not a test quality problem. In hindsight I should have been clearer about that distinction at the time.

The diversity of backends that emerged — nftables, eBPF, ipvs, userspace, Windows — wasn't a sign of chaos to me, it was a sign that the abstraction worked. And the underlying question remains open: how do SDN vendors today ensure feature parity and semantic correctness with upstream? I don't think it's really answered.

Shane's retrospective idea sounds right. I just wanted to make sure the original design intent is part of that conversation, not just the execution.

As a footnote: the ideas didn't stop with kpng. I've been quietly working on a successor, knls (Kubernetes Node-Local Services — https://github.com/mcluseau/knls), as a personal out-of-band project. One more species in the Cambrian explosion. It takes the same core idea — a single process watching the apiserver and resolving cluster state down to a clean node-local view — but rewritten in Rust, with no gRPC intermediate representation, and expanded in scope to cover not just the proxy (nftables) but also authoritative DNS and pod connectivity via WireGuard. In practice it uses between 4.7 and 25 MiB of anonymous memory. It's not a proposal for upstream, just a hint that the design space is enjoyable to explore — and that it can be done with a nice small footprint.

That's my story here, how it ends, how it continues.

Thanks to everyone who was part of it.


On Mon, Jul 15, 2024 at 06:39, jay vyas <jayunit1...@gmail.com> wrote:
Hi folks!

So as I read through this whole thread, a few things come to mind.... Somewhat unrelated -

- The history of the KPNG project is documented in the notes here: https://docs.google.com/document/d/1yW3AUp5rYDLYCAtZc6e4zeLbP5HPLXdvuEFeVESOTic/edit. There's a lot of interesting history there, and it's fun to read and look back on. Like the early debates on diffstore, grpc, and so on. Or the time Mikael figured out that we needed proto.MarshalOptions{Deterministic: true} to get the hashing of the endpoints fixed... If anyone wants to know what it's like to try to make a battle-hardened kube-proxy, that's a good starting point :)
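
For anyone curious what that fix was about, here's a rough standalone sketch (using structpb as a stand-in for the real kpng endpoint messages, so the message type is my assumption): protobuf map fields marshal in unspecified order by default, so hashing the raw bytes gives you phantom diffs for identical data.

    package main

    import (
        "crypto/sha256"
        "fmt"

        "google.golang.org/protobuf/proto"
        "google.golang.org/protobuf/types/known/structpb"
    )

    func main() {
        // Map fields serialize in unspecified order by default, so the same
        // logical message can produce different bytes (and a different hash)
        // on every run. Deterministic mode sorts map keys.
        // structpb is just a stand-in for the real endpoint messages here.
        msg, err := structpb.NewStruct(map[string]interface{}{
            "ip":    "10.0.0.1",
            "port":  8080,
            "ready": true,
        })
        if err != nil {
            panic(err)
        }

        b, err := proto.MarshalOptions{Deterministic: true}.Marshal(msg)
        if err != nil {
            panic(err)
        }

        // Stable bytes -> stable hash, so unchanged endpoints diff as "no change".
        fmt.Printf("endpoint hash: %x\n", sha256.Sum256(b))
    }

One caveat: deterministic mode only promises stable output within a given protobuf library version, so it works for in-process diffing but isn't a canonical cross-version encoding.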

Amim and I shot nerf darts at Ricardo during KPNGCON https://youtu.be/GT_p2mkbn2E?t=252. That slowed the project down by at least a few months while Ricardo recovered (the wounds were emotional, as well as physical)

- I think about Shane's recent post on K8s certification - if that had been around at the time we were doing KPNG, there could have been a good alternative reality where we ended up as a "certifiable" kube-proxy that passed all 250 or so k8s sig-net tests + Conformance. Then the question of in-tree or out-of-tree would be irrelevant. Of course, the effort of making such a certification might not be worth it, given that... well... how many people are actually trying to rewrite the kube-proxy? I'm assuming it's handfuls, but it's not like there are going to be 100s of companies in that business. Most folks are happy with the stock in-tree proxy.

- The idea of making it "in-tree" was always confusing to me. I think the goal was always (in my mind) to have, like Dan said, kind of our parallel mirror universe of sig-network and see how far we could go. The reality is that not having a large corporate sponsor made it virtually impossible for that universe to continue to co-exist forever. People came and went. Nobody was paid to work on KPNG. It was ultimately destined to lack the consistency and polish that other initiatives in this area would have.


Antonio Ojea

Apr 14, 2026, 5:37:46 AM
to sig-network, mikael....@gmail.com

Hi Mikael,

Thanks for coming back to this and sharing your perspective. I really appreciate the transparency about the emotional side of it; KPNG was a huge effort, and it's clear a lot of heart went into it. As a SIG lead, I also want to offer a perspective from the maintenance side to help put things into context.

I’m a big fan of hacking and new ideas; I always encourage people to start new projects. However, my primary commitment is the stability of the ecosystem. For those who were at the last SIG Network meeting, the fact that we had zero open bugs was proof of the incredibly high standards we maintain for our in-tree codebase.

The main friction with KPNG started when the goal shifted toward moving the code in-tree. There is a common misconception that "in-tree" solves adoption or maintenance. In reality, it doesn't always attract more contributors; often, it just shifts the heavy lifting—CVEs, kernel regressions, dependency upgrades—onto the core maintainers who remain for the long haul. As Solomon Hykes famously said: 'Rule #1 of open-source: no is temporary, yes is forever.' We have to be very careful about what we say 'yes' to.

Regarding the flakiness: as someone who has developed a 'weird skill' for hunting regressions by fixing flakes, I've learned these small signals are almost always the tip of a very deep iceberg. I saw those same jobs running cleanly in other projects (kindnet, Cilium, etc.), which is why I pushed for that same level of evidence for KPNG. I just caught a perfect example this week: two regressions in the mdlayher/netlink library (mdlayher/netlink#283 and mdlayher/netlink#280), found only because of GitHub Actions errors in the kube-network-policies repo.

To give you an idea of the "deep regressions" I'm talking about that often start as "just a flake":

  • IPv6 UDP regressions: packets larger than the MTU returned EMSGSIZE instead of fragmenting (Issue #133361) (kernel regression).

  • Netlink library breaks: impacting Cilium, Calico, and OVN GitHub Actions (Netlink PR #925) (golang netlink library regression).

  • Race conditions in net.InterfaceAddrs: causing NodePort Services to become inaccessible (Issue #129146) (golang standard library bug).

I’m super happy to hear the ideas are continuing with knls. Being out-of-tree allows you to move fast and experiment without the weight of millions of production clusters on your shoulders, and we can always have the conversation again later about the benefits for the project.

Let’s keep the conversation positive. To me the main goal of the retrospective is to learn how to better support innovation without compromising the core.

Best,

Antonio
