[PROPOSAL] Kubeflow Data Cache built on Apache Arrow and DataFusion

Andrey Velichkevich

Jun 5, 2025, 1:57:02 PM
to kubeflow-discuss

Hi Kubeflow Community,


We are excited to share a new in-memory caching solution we've been developing, designed to optimize data loading for distributed AI workloads - especially those involving tabular data.


Built on Apache Arrow and DataFusion, this solution enables:

✅ In-memory storage of Apache Iceberg tables.
✅ Efficient sharding across distributed nodes.
✅ High-throughput streaming to GPU-based AI workloads.


We've prepared a KEP and would love your feedback: https://github.com/kubeflow/community/pull/864



Our team also presented this solution at the recent KubeCon + CloudNativeCon Europe in London: https://youtu.be/s4KAe7AtN7s


Regards,
Andrey


Andrey Velichkevich

Jul 1, 2025, 4:41:09 PM
to kubeflow-discuss

Hi Kubeflow Community,

We will be presenting the final overview of the Arrow Data Cache KEP during the upcoming Kubeflow Community Call on July 8th at 8:00 AM PST.

Join us to:
✅ Learn about the latest updates on the Arrow Cache implementation.
✅ Discuss remaining open questions and gather community feedback.

Don’t miss this chance to engage with the contributors and explore this powerful new feature!

Kubeflow Community call: http://bit.ly/kf-meeting-notes
KEP link: https://github.com/kubeflow/community/pull/864

Regards,
Andrey

Andrey Velichkevich

Aug 14, 2025, 3:53:25 PM
to kubeflow-discuss

Hi Folks,

I’m thrilled to share that Kubeflow Data Cache, powered by Apache Arrow and Apache DataFusion, has officially been accepted by the Kubeflow community 🎉
We’re working on open-sourcing this solution as part of the Kubeflow Trainer project.

You can follow the KEP progress in this tracking issue: https://github.com/kubeflow/trainer/issues/2655
The Rust-based cache implementation PR is live and ready for review: https://github.com/kubeflow/trainer/pull/2755

Your reviews, thoughts, and feedback would be incredibly valuable as we move this forward.
Feel free to join the #kubeflow-trainer Slack channel with any questions!


Thanks,
Andrey