[PROPOSAL] Kubeflow Data Cache built on Apache Arrow and DataFusion

Andrey Velichkevich

Jun 5, 2025, 1:57:02 PM

to kubeflow-discuss

Hi Kubeflow Community,

We are excited to share a new in-memory caching solution we've been developing, designed to optimize data loading for distributed AI workloads - especially those involving tabular data.

Built on Apache Arrow and DataFusion, this solution enables:

✅ In-memory storage of Apache Iceberg tables.
✅ Efficient sharding across distributed nodes.
✅ High-throughput streaming to GPU-based AI workloads.

We've prepared a KEP and would love your feedback: https://github.com/kubeflow/community/pull/864

Our team also presented this solution at the recent KubeCon + CloudNativeCon Europe in London: https://youtu.be/s4KAe7AtN7s

Regards,
Andrey

Andrey Velichkevich

Jul 1, 2025, 4:41:09 PM

to kubeflow-discuss

Hi Kubeflow Community,

We will be presenting the final overview of the Arrow Data Cache KEP during the upcoming Kubeflow Community Call on July 8th at 8:00 AM PST.

Join us to:
✅ Learn about the latest updates on the Arrow Cache implementation.
✅ Discuss remaining open questions and gather community feedback.

Don’t miss this chance to engage with the contributors and explore this powerful new feature!

Kubeflow Community call: http://bit.ly/kf-meeting-notes
KEP link: https://github.com/kubeflow/community/pull/864

Regards,
Andrey

Andrey Velichkevich

Aug 14, 2025, 3:53:25 PM

to kubeflow-discuss

Hi Folks,

I’m thrilled to share that Kubeflow Data Cache, powered by Apache Arrow and Apache DataFusion, has officially been accepted by the Kubeflow community 🎉
We’re working on open-sourcing this solution as part of the Kubeflow Trainer project.

You can follow the KEP progress in this tracking issue: https://github.com/kubeflow/trainer/issues/2655
The Rust-based cache implementation PR is live and ready for review: https://github.com/kubeflow/trainer/pull/2755

Your reviews, thoughts, and feedback would be incredibly valuable as we move this forward.
Feel free to join the #kubeflow-trainer Slack channel to ask any questions!

Thanks,
Andrey
