Hi team,
I'd like to start this thread to discuss how to support offload pods in Kubernetes. With the draft proposal below, I'd like to get your suggestions on how to fill the gaps, e.g. storage and network, and I'm looking forward to more discussion in the community 🙂
Motivation:
As you know, a new infrastructure element, often referred to as a Data Processing Unit (DPU) or Infrastructure Processing Unit (IPU), takes the form of a server-hosted PCIe add-in card or on-board chip(s) containing one or more ASICs or FPGAs, usually anchored around a single powerful SoC device. DPU/IPU-like devices (unified here as xPU) have their roots in the evolution of SmartNIC devices but separate themselves from that legacy in several important ways. While a SmartNIC is clearly part of its host node's compute system and exists to closely interact with and offload node-hosted applications, the DPU/IPU dispenses with this secondary role.
To simplify offloading applications to the DPU/IPU, it would be great to use the same infrastructure to manage both the application on the host and the offloaded application. For example, when running MPI (on the host) with UCC (on the xPU), it's better to submit them together as a single job to make sure resource assignment is aligned between host and xPU.
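For illustration, a co-submitted host + offload pair could look like the following sketch, assuming an xpu RuntimeClass exists on the cluster; the names (mpi-worker, ucc-offload) and images are placeholders, not part of the proposal:

```yaml
# Host-side MPI pod: no runtimeClassName, so it uses the default runtime.
apiVersion: v1
kind: Pod
metadata:
  name: mpi-worker                     # hypothetical name
spec:
  containers:
    - name: mpi
      image: example.com/mpi:latest    # placeholder image
---
# Offload pod: the xpu runtime class directs it to the containerd on the xPU.
apiVersion: v1
kind: Pod
metadata:
  name: ucc-offload                    # hypothetical name
spec:
  runtimeClassName: xpu                # the xpu runtime class from the proposal
  containers:
    - name: ucc
      image: example.com/ucc:latest    # placeholder image
```

Today these two pods would be scheduled independently; aligning their placement is exactly the co-scheduling gap discussed below.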
Proposal (draft):
Currently, I'm building an example, named Chariot, that leverages runtime class (with RuntimeClassInImageCriApi); here's the overall architecture:
The chariot-shim connects to two containerds: one on the host, the other on the xPU. Thanks to remote CRI support in containerd, the chariot-shim can redirect CRI gRPC requests over a TCP connection. The chariot-shim redirects each CRI request based on runtime class: requests for the default runtime class go to the containerd on the host, while requests for the xpu runtime class are redirected to the containerd on the xPU. I'm matching on the runtime class name right now; it would be great to add a cri-endpoint field to RuntimeClass to avoid conflicts, and perhaps other enhancements.
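As a minimal sketch of the redirection logic (not the actual chariot-shim code; the endpoint addresses and the handler name "xpu" are assumptions for illustration), the routing decision boils down to picking a backend containerd per request:

```go
package main

import "fmt"

// Hypothetical backend endpoints; the real ones come from configuration.
const (
	hostEndpoint = "unix:///run/containerd/containerd.sock" // containerd on the host
	xpuEndpoint  = "tcp://xpu0:10010"                       // remote containerd on the xPU
)

// routeCRI mirrors chariot-shim's dispatch: the runtime handler carried in the
// CRI request (e.g. RunPodSandbox's RuntimeHandler field) selects which
// containerd receives the forwarded gRPC call.
func routeCRI(runtimeHandler string) string {
	if runtimeHandler == "xpu" {
		return xpuEndpoint // xpu runtime class -> containerd on the xPU, over TCP
	}
	return hostEndpoint // default runtime class -> containerd on the host
}

func main() {
	fmt.Println(routeCRI(""))    // default runtime class stays on the host
	fmt.Println(routeCRI("xpu")) // offload pods go to the xPU
}
```

A real shim would hold two CRI gRPC clients and forward every RuntimeService/ImageService call through this decision; keying on a cri-endpoint field in RuntimeClass (rather than the class name) would make the mapping explicit.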
The demo uses host networking and no storage; I'm working on identifying the gaps for network and storage.
User Scenarios:
Reference:
On Mar 10, 2024, at 10:11 AM, Klaus Ma <klaus1...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "kubernetes-sig-architecture" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-sig-arch...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-sig-architecture/8d1fa93d-5a6d-4f10-937b-b6cfcd480b32n%40googlegroups.com.
On Mar 11, 2024, at 9:46 PM, Klaus Ma <klaus1...@gmail.com> wrote:
Scheduling xPU pods alone is one of the cases: for per-node services, e.g. an envoy per node or storage, scheduling xPU pods alone is reasonable; but for job accelerators, e.g. MPI + UCC, it's better to schedule them together.
On Mar 12, 2024, at 12:48 AM, Tim Hockin <tho...@google.com> wrote:
On Mar 12, 2024, at 9:44 AM, Evan Anderson <evan.k....@gmail.com> wrote: