exception request: dynamic resource allocation

Patrick Ohly

Nov 9, 2022, 7:26:07 AM
to releas...@kubernetes.io, kubernetes-...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com, Kevin Klues, Aldo Culquicondor, Tim Hockin
Hello!

I would like to ask for an exception that allows
https://github.com/kubernetes/kubernetes/pull/111023 to be merged into
1.26 after the code freeze.

Enhancement name: dynamic resource allocation
Enhancement status: alpha
SIG: Node, with Scheduling as participating SIG
k/enhancements repo issue #: #3063
PR #’s: #111023
Additional time needed (in days): 3 (= till Friday this week)

This is needed to give various code owners time to add their final
approval. Key stakeholders (Aldo for scheduling, Tim for architecture
and API) were essentially ready to approve yesterday, before the code
freeze [1, 2], but some recent smaller changes still needed another
look and other reviewers need more time. Those small changes have now
been addressed, and we don't anticipate any major churn from the
current state of the code. One package might optionally get moved
into a staging repository.

Reason this enhancement is critical for this milestone:

There is a lot of interest and momentum behind this feature right
now. Hardware vendors are ready to showcase it to customers, but
that will be harder if using the feature depends on building a fork
of Kubernetes from source. Not merging it now risks losing this
momentum.

Merging the feature won't get easier in 1.27 either. It is fresh in
reviewers' minds now; delaying until then would mean they have to
familiarize themselves with it all over again.

Risks from adding code late:

All the new code is behind a feature gate, so with the gate disabled
the risk to the stability of other features is low.
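
For context, the gate check in the new code paths follows the standard
Kubernetes pattern, roughly like the sketch below (illustrative only;
the gate is assumed to be called DynamicResourceAllocation, and the
function name is made up):

    package example

    import (
        utilfeature "k8s.io/apiserver/pkg/util/feature"
        "k8s.io/kubernetes/pkg/features"
    )

    // handleResourceClaims stands in for any of the new code paths.
    func handleResourceClaims() {
        if !utilfeature.DefaultFeatureGate.Enabled(features.DynamicResourceAllocation) {
            // Gate disabled: skip the new logic entirely, so existing
            // behavior is unchanged.
            return
        }
        // ... dynamic resource allocation logic runs only when the gate is on ...
    }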

The core API gets changed. The new fields are also feature gated, so
without the feature gate they won't be visible and won't affect
clients. This part in particular has been heavily scrutinized, and
changes were made to support future extensions (for example, using a
struct instead of a plain string in one place), so the risk of an
undesirable API change is low to medium.
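
Concretely, the gating of the new fields follows the usual
drop-disabled-fields pattern in the API strategies, along the lines of
the sketch below (illustrative only; the field name ResourceClaims is
an assumption here, and the actual PR may structure this differently):

    package example

    import (
        utilfeature "k8s.io/apiserver/pkg/util/feature"
        api "k8s.io/kubernetes/pkg/apis/core"
        "k8s.io/kubernetes/pkg/features"
    )

    // dropDisabledDynamicResourceFields clears the feature-gated fields
    // when the gate is off, so they are never persisted or returned to
    // clients unless the cluster opts in.
    func dropDisabledDynamicResourceFields(podSpec, oldPodSpec *api.PodSpec) {
        if utilfeature.DefaultFeatureGate.Enabled(features.DynamicResourceAllocation) {
            return // gate enabled: keep the new fields
        }
        if oldPodSpec != nil && len(oldPodSpec.ResourceClaims) > 0 {
            return // already in use on the existing object: don't strip on update
        }
        podSpec.ResourceClaims = nil
    }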

All new code has unit tests that are passing. Despite being alpha,
test coverage is getting close to that of comparable GA code: the GA
volumebinding scheduler plugin has 82% statement coverage, while the
new plugin has 70%. No flakes have been observed in these new unit
tests.

E2E tests also exist and are passing in a new, optional Prow
pre-merge job that runs only for PRs touching this code; they do not
run in other existing jobs.

Overall, the risk for testing stability is low.

Risks from cutting enhancement:

The biggest risk is that if we don't merge now, we lose momentum and
then also won't get the feature merged in the future. This affects
various customer use cases where the currently possible workarounds
don't fully solve the problems that customers are having.

For example, Kevin Klues said in [3]:
"the ability to share resources is one of the main features that
NVIDIA is excited about finally being able to support with DRA. We
already have an out-of-tree DRA driver running with this
functionality and have been communicating to customers that this is
the (long-awaited) path towards finally doing GPU-sharing "the right
way" in Kubernetes."

[1] https://github.com/kubernetes/kubernetes/pull/111023#issuecomment-1307790335
[2] https://github.com/kubernetes/kubernetes/pull/111023#pullrequestreview-1172925263
[3] https://github.com/kubernetes/kubernetes/pull/111023#discussion_r1017545708

--
Best Regards

Patrick Ohly
Cloud Software Architect