[Proposal] Offload pods in Kubernetes


Klaus Ma

Mar 10, 2024, 10:11:50 AM
to kubernetes-sig-architecture

Hi team,

 

I'd like to start this thread to discuss how to support offload pods in Kubernetes. With the draft proposal below, I'd like to get your suggestions on how to fill the gaps, e.g. storage and network, and I'm looking forward to more discussion in the community 🙂

 

Motivation:

 

As you may know, a new infrastructure element, often referred to as a Data Processing Unit (DPU) or Infrastructure Processing Unit (IPU), takes the form of a server-hosted PCIe add-in card or on-board chip(s) containing one or more ASICs or FPGAs, usually anchored around a single powerful SoC. DPU/IPU-like devices (unified here as xPU) have their roots in the evolution of SmartNIC devices but separate themselves from that legacy in several important ways. While a SmartNIC is clearly part of its host node's compute system and exists to closely interact with and offload node-hosted applications, the DPU/IPU dispenses with this secondary role.

 

To simplify offloading applications to the DPU/IPU, it would be great to use the same infrastructure to manage both the application on the host and the offloaded application. For example, when running MPI (host) with UCC (xPU), it's better to submit them together as one job to make sure the resource assignments are aligned between host and xPU.

 

Proposal (draft):

 

Currently, I'm building an example, named Chariot, by leveraging runtime classes (with RuntimeClassInImageCriApi); here's the overall architecture:

 

[Image: overall architecture]

 

The chariot-shim connects to two containerds: one on the host, the other on the xPU. Thanks to containerd's remote CRI support, chariot-shim can redirect CRI gRPC requests over a TCP connection. It redirects each CRI request based on the runtime class: the default runtime class maps to the containerd on the host, and requests for the xpu runtime class are redirected to the containerd on the xPU. I'm matching on the runtime class name right now; it would be great to add a CRI endpoint field to RuntimeClass to avoid conflicts, and perhaps other enhancements.
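To make the routing concrete, here is a minimal sketch, assuming a hypothetical runtime class and handler both named xpu (Chariot's actual names may differ) and a placeholder image:

```yaml
# Hypothetical RuntimeClass; chariot-shim would map its handler to the xPU containerd.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: xpu
handler: xpu   # CRI requests with this handler get redirected to the xPU's containerd
---
# A pod that lands on the xPU runtime via the shim.
apiVersion: v1
kind: Pod
metadata:
  name: ucc-offload
spec:
  runtimeClassName: xpu   # pods without a runtimeClassName stay on the host containerd
  containers:
  - name: ucc
    image: example.com/ucc-offload:latest   # placeholder image
```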

 

The demo uses the host network and no storage. I'm working on identifying the gaps for network and storage.

  

User Scenarios:

  • As a Kubernetes administrator, I’d like to re-configure the DPU after getting instances from BMaaS, e.g. SR-IOV, SFs, BFB.
  • As a Kubernetes administrator, I’d like to deploy common offload services onto the DPU for all applications, e.g. an Envoy per node, storage.
  • As an application developer, I’d like to offload part of my application onto the DPU to speed it up, e.g. UCC + MPI.
  • As an application developer, I’d like to keep the offload service as simple as possible, e.g. no multi-tenancy support.
  • As an application developer, I’d like Kubernetes to schedule jobs based on DPU/GPU/CPU resources; it would also be great to consider network topology (e.g. DPU, Spectrum-X) to speed up jobs (see the sketch after this list).
  • As an AI infrastructure (Kubernetes) administrator, it would be great to have a solution/suggestion for managing DPU/GPU/CPU for an AI/ML platform.
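As a rough illustration of the scheduling scenario above, a pod could request the xPU alongside CPU as an extended resource; this sketch assumes a hypothetical example.com/dpu resource advertised by a device plugin, plus a placeholder image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: offload-worker
spec:
  containers:
  - name: worker
    image: example.com/mpi-worker:latest   # placeholder image
    resources:
      limits:
        cpu: "8"
        example.com/dpu: "1"   # hypothetical extended resource for one xPU
```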

 

Reference:

  1. Chariot: https://github.com/openbce/chariot
  2. Offloading collective operations:

-- Klaus

Clayton

Mar 10, 2024, 11:17:03 AM
to Klaus Ma, kubernetes-sig-architecture
This assumes the host machine is in control; what about scenarios where the kubelet runs on the DPU (to prevent compromise of host tenants)? Is this proposal symmetric with respect to whether the host or the DPU runs the kubelet?


Tim Hockin

Mar 10, 2024, 1:14:10 PM
to Clayton, Klaus Ma, kubernetes-sig-architecture
What are the criteria used to decide what should go in the offload vs the host?  Is it security?  Cost?  IP protection?  Is the xPU hardware tuned for some particular operations (e.g. packet switching) or is it just another CPU behind a different API?  I assume that the offload capacity is significantly smaller than the host machine, is that true?  What is the order of magnitude?  Is there always a matching host-side Pod with the xPU pod?  Or can we schedule xPU pods alone?

Klaus Ma

Mar 11, 2024, 3:10:16 AM
to kubernetes-sig-architecture
This proposal focuses on a DPU without a kubelet; the DPU is considered a kind of cluster resource belonging to the host, whose runtime unit is the pod/container.

Klaus Ma

Mar 11, 2024, 3:26:19 AM
to kubernetes-sig-architecture
> What are the criteria used to decide what should go in the offload vs the host?  Is it security?  Cost?  IP protection?  Is the xPU hardware tuned for some particular operations (e.g. packet switching) or is it just another CPU behind a different API?  I assume that the offload capacity is significantly smaller than the host machine, is that true?  What is the order of magnitude?

[K]: AFAIK, it's more about data processing, e.g. IPsec, UCC (an accelerator for MPI), storage clients. By default, the xPU (including its SDK) is hardware- and software-tuned for those operations. In addition, there is usually an SDK for developers to optimize their applications via offload (pod/container).

> Is there always a matching host-side Pod with the xPU pod?  Or can we schedule xPU pods alone?

[K]: The lifecycles of host pods and xPU pods may be different, e.g. storage, monitoring; so we should support scheduling xPU pods alone.

Tim Hockin

Mar 11, 2024, 8:25:21 PM
to Klaus Ma, kubernetes-sig-architecture
If we support scheduling xPU pods alone, does that just make them a whole new node?

Klaus Ma

Mar 11, 2024, 9:46:30 PM
to kubernetes-sig-architecture
Scheduling xPU pods alone is one of the cases: for services, e.g. an Envoy per node or storage, scheduling xPU pods alone is reasonable; but for job accelerators, e.g. MPI + UCC, it's better to schedule them together.

As you may know, the xPU is not as powerful as the host; for example, BlueField-3 has 16 cores and 32 GB of memory, and there can be multiple xPUs in a single host, e.g. 8 xPUs per node. If we make each of them a new node, we have to spend a lot of resources running kubelets, one kubelet per xPU; and scalability is also a challenge, e.g. 1k hosts with 8k xPUs (with multi-cluster, scheduling a job together with its accelerator would be another challenge).

-- Klaus

Clayton

Mar 11, 2024, 11:49:15 PM
to Klaus Ma, kubernetes-sig-architecture
If the xPU is a different architecture, that implies the scheduler and kubelet now have to be multi-architecture, which probably has a lot of implications.

The DPU feels more like a different Node than the host, regardless of whether you have one kubelet offering both or two kubelets. 


Tim Hockin

Mar 12, 2024, 12:48:18 AM
to Clayton, Klaus Ma, kubernetes-sig-architecture
I wonder how much it (multi-arch) will really matter.  It's the runtime itself which does things like image pulls, right? 

Clayton

Mar 12, 2024, 1:13:59 AM
to Tim Hockin, Klaus Ma, kubernetes-sig-architecture
Well, we have an official label on the Node, part of our public API, that says what architecture the node is… :)


Tim Hockin

Mar 12, 2024, 1:24:10 AM
to Clayton, Klaus Ma, kubernetes-sig-architecture
Ehhhhhh, true. 

Klaus Ma

Mar 12, 2024, 2:10:39 AM
to kubernetes-sig-architecture
For the separate-Node approach, is there any idea on how to describe the relationship between the host and its multiple xPUs?
In addition, we may also want to allocate CPUs to work with dedicated xPUs to speed up the application, for example for distributed workloads.

Evan Anderson

Mar 12, 2024, 9:44:03 AM
to Klaus Ma, kubernetes-sig-architecture
Would describing the relationships between xPUs and the host system be a job for topology constraints and inter-pod affinity?
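For illustration, a minimal sketch of that idea, reusing the hypothetical xpu runtime class from the Chariot sketch above and an assumed app: mpi-launcher label on the host-side pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ucc-buddy
spec:
  runtimeClassName: xpu          # hypothetical runtime class from the Chariot sketch
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: mpi-launcher    # assumed label on the host-side "buddy" pod
        topologyKey: kubernetes.io/hostname   # pin to wherever the host pod landed
  containers:
  - name: ucc
    image: example.com/ucc-offload:latest    # placeholder image
```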

Clayton

Mar 12, 2024, 11:34:29 AM
to Evan Anderson, Klaus Ma, kubernetes-sig-architecture
So many of the constructs on Node deal with a single system (allocatable CPU, arch labels, resource limits) that if we want to target scheduling pods to xPUs, we really only have two primary options:

1. Two Node objects
2. Model the xPU as a device and make the pod responsible for describing the work the xPU must do (like GPUs)

If we chose the former, it’s not dissimilar to the GPU NVLink situation (these arbitrary nodes have very high interconnect, and pods that must schedule onto them need some mechanism to gain affinity). At KubeCon, the unconference on accelerators probably needs to discuss similar modelling challenges.
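For option 1, one plausible (entirely hypothetical) convention would be to label each xPU Node with its hosting machine so affinity can be expressed against it; apart from kubernetes.io/arch, none of these labels exist today:

```yaml
# Hypothetical labels on an xPU Node under the "two Node objects" option.
apiVersion: v1
kind: Node
metadata:
  name: node-17-xpu-0
  labels:
    kubernetes.io/arch: arm64        # e.g. a BlueField's Arm cores
    example.com/xpu: "true"          # hypothetical marker for xPU Nodes
    example.com/host-node: node-17   # hypothetical link to the hosting Node
```

Pods could then use nodeSelector or nodeAffinity over these labels, much like the NVLink-affinity case mentioned above.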


Tim Hockin

Mar 12, 2024, 11:50:22 AM
to Clayton, Evan Anderson, Klaus Ma, kubernetes-sig-architecture
Worth calling out that the GPU/accelerator effort is working on a new way to describe resources in a node, and this might fit into that.

What I want to see clarified:
* If a user wants to run a pod which has a "buddy pod" for the xPU, what does that look like?
* If a user wants to run an xPU pod without a host pod, what does that look like?


John Belamaric

Mar 12, 2024, 12:35:57 PM
to Tim Hockin, Clayton, Evan Anderson, Klaus Ma, kubernetes-sig-architecture
This is a really interesting use case, and I think it ties directly into the discussion some of us are proposing for next week. That discussion grew out of the work Patrick and Kevin have been doing on DRA. As I see it, the use cases that are coming up are pushing our current scheduling and resource management architecture beyond its limits. I think we need a more comprehensive plan to address all of these. I am working on a document and prototype that I plan to publish before the summit, which I hope will help frame the discussion and demonstrate one possible way forward.

The document and prototype will go into this more, but at the high level, I think it's helpful to break the problem into a few different areas:

- Representing the available resources and their relationships to one another. Today we do this with a very simple model in the node status, capturing CPU, memory, huge pages, extended resources and a few other things. But we don't capture the *relationships* (e.g., topology) in the control plane, which means higher level controllers like the scheduler and cluster autoscaler cannot reason about them.
- Representing the workloads' demands for those resources, and the constraints (e.g., topology) on those demands. Today we do this with Pod resource requests and node selectors (to target nodes pre-configured with, for example, specific topology policies). This requires tight coordination between the workload authors and the infrastructure teams putting together the sets of nodes, so that the appropriate pre-configured nodes are available and have the right labels. This can work when you have a single topology dimension - for example NUMA alignment. But we are seeing more complex topologies, and multiple topologies in use cases such as yours and with networking technologies like NVLink. Each of these topologies is independent or semi-independent. Trying to create node pools to satisfy all the combinations and then track them is painful, makes workloads non-portable (node selectors will differ across clusters), and also results in very "lumpy", inefficient consumption of the infrastructure. Even today, for example, you need to configure the topology manager scope as "pod" or "container" at the node level, rather than it being something the workload author can specify. That means you need two different node pools, one with each of those policies defined.
- Representing the results of the resource allocations. That is, tracking what resources have been allocated so that future requests take those into account. This also serves as input to actuating the workloads.
- Representing the infrastructure team's constraints on how workload authors consume the infrastructure. This is captured today in Resource Quota, and embedded in the way node pools are constructed and how taints and tolerations are applied. But this offers very limited control.
- Actuation of the workloads. This is what kubelet, container runtimes, CNI drivers, CSI drivers, DRA drivers, etc. do.

The prototype will provide some initial ideas on how to do the first three of these. I suspect I won't have time to include your use case, but I will try to give it some thought as to how we might approach it.

John
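To make the first bullet concrete: a rough sketch of how a workload might express a DPU demand under the in-progress DRA API (resource.k8s.io/v1alpha2 at the time of this thread; the class name, driver, and image are hypothetical):

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: dpu-claim
spec:
  resourceClassName: dpu.example.com   # hypothetical class served by a DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: offload-worker
spec:
  resourceClaims:
  - name: dpu
    source:
      resourceClaimName: dpu-claim     # bind the pod to the claim above
  containers:
  - name: worker
    image: example.com/mpi-worker:latest   # placeholder image
    resources:
      claims:
      - name: dpu                      # this container consumes the claimed DPU
```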

Klaus Ma

Mar 12, 2024, 10:08:45 PM
to kubernetes-sig-architecture
Honestly, both are ok to me.
In this proposal, I'm thinking a host with multiple devices is more straightforward to manage, e.g. for scheduling and monitoring.
For those devices, the runtime may be different: the xPU uses CRI, and the GPU uses CUDA.

Brendan Burns

Mar 12, 2024, 10:16:36 PM
to Klaus Ma, kubernetes-sig-architecture
I would strongly encourage people to look at virtual kubelet and other efforts that I think match up with this notion of a kubelet on one infrastructure and the runtime on something distinct from that infrastructure (e.g. serverless containers or WebAssembly).

I don't think these are new ideas and there's been lots of effort and exploration in this space over the years.

--brendan 



Klaus Ma

Mar 12, 2024, 10:30:40 PM
to kubernetes-sig-architecture
>  Worth calling out that the GPU/accelerator effort is working on a new way to describe resources in a node, and this might fit into that.

[K]: Not only the resources, but also the runtime; for example CRI vs. kubelet :)

> What I want to see clarified:
> * If a user wants to run a pod which has a "buddy pod" for the xPU, what does that look like?
> * If a user wants to run an xPU pod without a host pod, what does that look like?

[K]: The major difference between these two cases is the lifecycle of the xPU pod. In the first case (the "buddy pod"), the xPU pods come and go with the host pod; in the second case, the xPU pods are long-running pods.


For the second case, I did not find the presentation, but I think HBN is a good example. HBN is a long-running service (pods) on the xPU that helps configure the xPU's network, e.g. creating VFs/SFs for the host and enabling BGP on the xPU via FRR.
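A long-running xPU service like HBN would most naturally be a DaemonSet pinned to the xPU runtime; a minimal sketch, reusing the hypothetical xpu runtime class and a placeholder image:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hbn
spec:
  selector:
    matchLabels:
      app: hbn
  template:
    metadata:
      labels:
        app: hbn
    spec:
      runtimeClassName: xpu       # hypothetical: routes to the xPU containerd
      hostNetwork: true           # the current demo uses host networking
      containers:
      - name: hbn
        image: example.com/hbn:latest   # placeholder image
```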

Klaus Ma

Mar 13, 2024, 1:39:35 AM
to kubernetes-sig-architecture
In this proposal, I'd like to use remote CRI to handle the xPU case, so we can manage all resources belonging to the host, including CPU, GPU, and DPU.
The only difference is the runtime: CRI, CUDA, or remote CRI with the xPU SDK.

Klaus Ma

Mar 13, 2024, 4:49:22 AM
to kubernetes-sig-architecture
Just to clarify: by CRI, I mean CRI + RuntimeClass + a backend, e.g. containerd or CRI-O.