At today's wg-serving meeting, Jiaxin and I will present a proposal for a project that lets multiple LLM use cases (prompts, LoRA adapters) safely share base models and model servers for higher density and better operational control. We believe an LLM gateway-centric API with an Envoy OSS implementation is the best way to achieve these benefits, and that the approach is complementary both to the higher-level LLM gateway use cases Dan presented a few weeks ago and to many existing projects in the LLM serving ecosystem.

We also plan to demo the PoC, showing some of the most immediate benefits.
See you at wg-serving!