DRA "layering" in kube 1.31

Tim Hockin

May 21, 2024, 12:51:34 PM
to wg-device-management
Hi all,

As the clock keeps ticking, I have spent a lot of time thinking about 1.31 and what is plausible to deliver.  I feel like we have 2 goals here that we are mixing together.  First is the "transport layer" for how devices are described, requested, and assigned.  It needs to be pretty general-purpose and to cover a lot of use-cases.  Second is a "user-experience layer" which mitigates some of the complexity from the generality of the transport layer.  My own mind keeps flipping between the two, and I feel like I am running in circles.

What if we thought of them more as distinct problems?  In some ways, it could be moving closer to the 1.30 model.  More directly, the discussions so far have given me high confidence we can do a (somewhat complex but directionally correct) transport layer which lands in 1.31 without much focus on UX, while I have diminishing confidence that we can do a good UX AND a transport layer in 1.31, without ANOTHER major shift in 1.32.

So, what I think I am proposing is that we focus on the expressivity of the capacity and claim aspects - make sure that the things we really need to allow are allowed and that scheduling and auto scaling can be made to work.  I'm not saying to ignore UX, but maybe it's OK to assume that it will come on top of the transport, perhaps in the form of more domain-centric CRDs, which I know I previously argued against, but maybe I am coming around to.  I think we need more time to really pin that down.

This would mean that 1.31 might be harder to use for some cases, but with more confidence that we won't radically reboot it again.

Thoughts?

Tim

Byonggon Chun

May 21, 2024, 2:12:26 PM
to Tim Hockin, wg-device-management

Hi, this is bg.

I believe that letting end users use opaque parameters is one way to mitigate complexity from a UX perspective.

By converting opaque parameters to structured parameters internally within the DRA driver, we can achieve cluster autoscaler integration, although implementing CEL expressions and the conversion process might be a burden for DRA driver developers.

If having end users use opaque parameters does not reduce complexity enough, then, as you mentioned, domain-centric CRDs that abstract over the transport layer will mitigate the complexity.

Thank you.

--
You received this message because you are subscribed to the Google Groups "wg-device-management" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wg-device-manage...@kubernetes.io.
To view this discussion on the web visit https://groups.google.com/a/kubernetes.io/d/msgid/wg-device-management/40cc3a0c-a6f0-4b60-a744-44822888bef1n%40kubernetes.io.

Patrick Ohly

May 21, 2024, 3:16:49 PM
to Tim Hockin, wg-device-management
"'Tim Hockin' via wg-device-management"
<wg-device-...@kubernetes.io> writes:
> As the clock keeps ticking, I have spent a lot of time thinking about 1.31
> and what is plausible to deliver. I feel like we have 2 goals here that we
> are mixing together. First is the "transport layer" for how devices are
> described, requested, and assigned. It needs to be pretty general-purpose
> and to cover a lot of use-cases. Second is a "user-experience layer" which
> mitigates some of the complexity from the generality of the transport
> layer. My own mind keeps flipping between the two, and I feel like I am
> running in circles.
>
> What if we thought of them more as distinct problems? In some ways, it
> could be moving closer to the 1.30 model. More directly, the discussions
> so far have given me high confidence we can do a (somewhat complex but
> directionally correct) transport layer which lands in 1.31 without much
> focus on UX, while I have diminishing confidence that we can do a good UX
> AND a transport layer in 1.31, without ANOTHER major shift in 1.32.

A lot of the discussions around
https://github.com/kubernetes-sigs/wg-device-management/tree/main/k8srm-prototype
and now https://github.com/kubernetes-sigs/wg-device-management/pull/14
have indeed been about UX. Because we were so focused on that, we
didn't discuss some potentially useful functionality like scoring.

With https://github.com/kubernetes-sigs/wg-device-management/pull/14, I
was leaning more towards "let's be flexible" than "let's make this
nice", both when it comes to functionality (for example, allowing claims
to reference multiple classes instead of just one) and future extensions
(using one-of structs to have a safe way of adding new functionality
that isn't silently ignored by old components).
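
To illustrate the one-of idea in general terms, here is a minimal
sketch. It is not taken from the PR: the `Requirement` type and the
`selector` and `match_attribute` field names are hypothetical. The point
is only that an old component which decodes a one-of and finds none of
its known fields set can reject the object instead of silently treating
it as empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Requirement:
    """Hypothetical one-of struct: exactly one field is expected to be set."""
    selector: Optional[str] = None  # the only variant this old component knows
    # A newer API version might add another variant, e.g. `match_attribute`.


def decode(raw: dict) -> Requirement:
    """Decode the way an old component would: unknown fields are dropped."""
    return Requirement(selector=raw.get("selector"))


def validate(req: Requirement) -> None:
    """Reject a one-of with no known variant set rather than ignoring it."""
    if req.selector is None:
        raise ValueError("one-of has no known field set; newer variant in use?")
```

With this shape, `validate(decode({"match_attribute": "model"}))` raises
an error in the old component, while a plain optional field would have
been dropped without notice.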

Perhaps I have gone too far with the flexibility (multi-inheritance is
contentious), but I am feeling confident that it can be implemented
despite that flexibility if we agree on that API at least in principle
soonish (this week or next).

> So, what I think I am proposing is that we focus on the expressivity of the
> capacity and claim aspects - make sure that the things we really need to
> allow are allowed and that scheduling and auto scaling can be made to
> work. I'm not saying to ignore UX, but maybe it's OK to assume that it will come
> on top of the transport, perhaps in the form of more domain-centric CRDs,
> which I know I previously argued against, but maybe I am coming around
> to.

Sounds good to me.

--
Best Regards

Patrick Ohly
Cloud Software Architect

abhishek malvankar

May 22, 2024, 10:55:42 AM
to Patrick Ohly, Tim Hockin, wg-device-management
Agree.

Can I ask for more information about the current scoring logic that has been implemented?

Abhishek

Patrick Ohly

May 22, 2024, 11:37:47 AM
to abhishek malvankar, Tim Hockin, wg-device-management
abhishek malvankar <abhishekm...@gmail.com> writes:
> Can I ask for more information about the current scoring logic that has
> been implemented?

The code walks through each claim and each requested device and picks
the first candidate for each device that matches. There isn't any
backtracking, so it can happen that it doesn't even find any solution
(first device request is for "small GPU", gets a big one, then second
request for "big GPU" only finds small GPUs and cannot be satisfied).
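
A minimal sketch of that first-fit behaviour, to make the failure case
concrete. This is not the actual allocator code: the device tuples, the
"size" attribute, and the `first_fit` helper are made up for
illustration.

```python
# Hypothetical model: a device is (name, size); a request is (label, predicate).
devices = [("gpu-0", "big"), ("gpu-1", "small")]

requests = [
    ("small GPU", lambda d: True),             # a big GPU also satisfies this
    ("big GPU", lambda d: d[1] == "big"),
]


def first_fit(requests, devices):
    """Give each request the first free matching device; never revisit a choice."""
    free = list(devices)
    result = {}
    for label, matches in requests:
        chosen = next((d for d in free if matches(d)), None)
        if chosen is None:
            return None  # no backtracking, so the whole allocation fails
        free.remove(chosen)
        result[label] = chosen[0]
    return result
```

Here `first_fit(requests, devices)` fails because the "small GPU"
request takes gpu-0, although assigning gpu-1 to it would have left
gpu-0 for the "big GPU" request.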

For 1.31, I intended to add backtracking, but still without any scoring.
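
Roughly what I mean by backtracking, as a sketch under the same made-up
model (device tuples with a "size" attribute, requests as predicates;
the `allocate` helper is hypothetical and not the planned
implementation). It still does no scoring: it returns the first complete
assignment it finds.

```python
# Hypothetical model: a device is (name, size); a request is (label, predicate).
devices = [("gpu-0", "big"), ("gpu-1", "small")]

requests = [
    ("small GPU", lambda d: True),             # a big GPU also satisfies this
    ("big GPU", lambda d: d[1] == "big"),
]


def allocate(requests, devices):
    """Depth-first search: try each free matching device, undo on dead ends."""

    def solve(i, free):
        if i == len(requests):
            return {}  # every request has a device
        label, matches = requests[i]
        for d in free:
            if not matches(d):
                continue
            # Tentatively assign d, then try to satisfy the remaining requests.
            rest = solve(i + 1, [x for x in free if x is not d])
            if rest is not None:
                return {label: d[0], **rest}
        return None  # no candidate works here; the caller tries its next one

    return solve(0, list(devices))
```

For the example above this yields gpu-1 for "small GPU" and gpu-0 for
"big GPU". The worst case is exponential in the number of requests, so a
real implementation would likely need some bound on the search.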