[PROPOSAL][KEP] OptimizationJob CRD for Hyperparameter Optimization


Aniket Shaha

May 18, 2026, 4:15:45 AM
to kubeflow-discuss

Hello Kubeflow Community,

We are excited to introduce a new proposal and invite feedback on the OptimizationJob CRD design.

Why a New CRD?

While Katib's legacy Experiment CRD is flexible enough to cover a broad range of use cases (such as Neural Architecture Search and arbitrary Kubernetes workloads), it often requires users to write verbose YAML and to rely on brittle regex-based string substitution to inject parameters into manifests.

With the introduction of the unified Kubeflow Python SDK (KEP-46), there is a clear need for a strongly typed, iterative orchestration layer built natively for TrainJobs.

Phase 1 MVP Focus

To ensure a stable, maintainable, and easily reviewable initial release, we are targeting a highly focused Phase 1 MVP designed to simplify the core reconciliation loop and minimize external dependencies:

  • Native TrainJob Templating: Uses runtime.RawExtension for embedding templates, allowing users to freely inject flexible metadata, labels, and annotations for both the underlying Trial and the resulting TrainJob.

  • Type-Safe Parameter Injection: Replaces legacy regex string substitution with native Kubernetes environment variable injection, using the standard $(VAR) expansion syntax.

  • Simplified Metrics Flow: Pulls metrics directly from the TrainJob Progress API, completely avoiding an initial dependency on the Katib DB or DB Manager for the MVP phase.

  • Reduced Latency & Overhead: Runs stateless algorithms (like Random and Grid search) in-process directly within the controller, eliminating suggestion pod startup latency and cluster overhead.
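To illustrate the parameter injection point above, here is a rough sketch of how $(VAR) expansion behaves. This is a simplified Python approximation of the Kubernetes expansion rules (a reference resolves to the variable's value, an unresolved reference is left unchanged, and "$$" collapses to a literal "$"). It is not code from the KEP, and the parameter names LEARNING_RATE and NUM_LAYERS are hypothetical.

```python
import re

# Matches either an escaped "$$" or a "$(NAME)" reference.
# Simplification: a name is any non-empty run of characters up to ")".
_REF = re.compile(r"\$\$|\$\(([^)]+)\)")


def expand(text, env):
    """Expand $(VAR) references in text using the env mapping.

    Unknown variables are left as-is; "$$" becomes a literal "$".
    """
    def _sub(match):
        if match.group(0) == "$$":
            return "$"
        # Fall back to the original reference when the name is not defined.
        return env.get(match.group(1), match.group(0))

    return _REF.sub(_sub, text)


# Hypothetical trial assignment, exposed to the container as env vars.
trial_env = {"LEARNING_RATE": "0.01", "NUM_LAYERS": "4"}
args = ["--lr=$(LEARNING_RATE)", "--layers=$(NUM_LAYERS)", "--note=$$(LITERAL)"]
expanded = [expand(a, trial_env) for a in args]
```

Because the values travel as ordinary environment variables, the training command never has to be rewritten by string substitution on the manifest itself; only the env block changes from trial to trial.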

Advanced features, such as the SharedInitializer plugin, stateful algorithms (e.g., Bayesian/TPE via transient One-Shot Jobs), and early-stopping schedulers, have been explicitly deferred to Phase 2 and Phase 3.
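For context on why Random and Grid search can run in-process: each suggestion depends only on the search space (plus a seed for Random), never on the results of earlier trials, so there is no state to persist between reconciliations. A minimal sketch, assuming a purely discrete search space; the function names and the space layout are illustrative and are not taken from the KEP:

```python
import itertools
import random


def grid_search(space):
    """Yield every combination of the discrete values in space."""
    names = list(space)
    for combo in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


def random_search(space, count, seed=None):
    """Return count independent samples, one value per parameter."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(count)
    ]


# Hypothetical search space: two parameters with discrete candidate values.
space = {"lr": [0.1, 0.01], "batch_size": [16, 32, 64]}
grid_trials = list(grid_search(space))        # 2 * 3 = 6 combinations
random_trials = random_search(space, count=4, seed=7)
```

Stateful algorithms such as Bayesian optimization or TPE need the observed metrics of previous trials to propose the next one, which is a plausible reason for them to be handled separately in later phases.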

Request for Feedback

We hope to gather community input and review the KEP asynchronously before discussing it on the upcoming community calls.

Please take a look at the Pull Request and leave your thoughts.

Looking forward to your valuable insights!

Best regards,

Aniket Shaha

GitHub: @aniket2405
