Hello,
I wanted to share some interesting discussions that happened around a bug we faced in Kueue.
The bug:
- Kueue implements webhooks for resources that don't belong to the project, namely k8s (Jobs, Pods), kubeflow, kuberay. And it offers libraries to integrate arbitrary CRDs with Kueue, which can be deployed in a separate binary.
- Kueue uses controller-runtime to build webhooks. controller-runtime's CustomDefaulter builds json patches from the diff of the raw object in the request and a marshaled object modified by the defaulter code.
- Result: when a new version of the external resource definition adds new fields, Kueue's webhook patches drop fields, because it doesn't have those fields in its go types.
Searching for a solution:
- RoundTripDefaulter: The "obvious" solution was to, instead of building the json diff from the raw object in the request, do a roundtrip (unmarshal+marshal) of the raw object. This means that both objects (original and modified) no longer have the unknown fields, so there wouldn't be a jsondiff for them.
 However, this turned out to be not bug-free, as pointed out by @sbueringer. Go types could add or remove omitempty from one version to the other, which means that a roundtrip could potentially add fields that aren't in the raw object, leading to invalid patches, containing replace instead of add operations.
 
- Requiring webhooks written for Kueue to directly build jsondiffs: This just shifts the burden to developers that might not be experienced with all these nuances and problems can be easily missed in code reviews.
- LosslessDefaulter: I proposed a simpler solution that exploits one fact about Kueue webhooks: they don't remove any fields. So LosslessDefaulter takes the output of controller-runtime's CustomDefaulter and drops any remove operations.
 This will probably work for us for a long time, but we probably should be migrating to something like RoundTripDefaulter once we have ensured that the fields that we modify in Kueue aren't lacking omitempty.
So, this email is mostly a heads up for API reviewers. And if SIG API Machinery can point to us any details that we missed or that could help simplify the problem, please do let us know.
And, for anyone looking to contribute, maybe this and other things in the API Conventions could eventually be written to a linter?