Milin Bhade
Oct 14, 2025, 10:44:11 PM
to OpenXLA Discuss
Hi — I’m working on a PJRT plugin for a custom accelerator and want to enable availability-aware, cost-driven partitioning of an XLA/HLO module across GPU, CPU, and the custom accelerator:
If only the CPU and the accelerator are available, run on those.
If a GPU is present and used, automatically identify HLO subgraphs that are better offloaded to the accelerator and compile/run them there alongside the GPU. (A minimal availability-probing sketch follows below.)
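For context on the availability part, this is roughly what I have in mind on the client side. A minimal JAX sketch; "myaccel" is a placeholder for my plugin's platform name, not a real backend:

```python
import jax

def available_backends():
    """Probe which PJRT backends are actually present in this process."""
    found = {}
    # "myaccel" is a hypothetical plugin/platform name for illustration.
    for name in ("cpu", "gpu", "myaccel"):
        try:
            found[name] = jax.devices(name)  # raises RuntimeError if the backend is absent
        except RuntimeError:
            pass
    return found

backends = available_backends()
print({name: len(devs) for name, devs in backends.items()})
```

Detecting what is present is easy enough; the hard part is the partitioning decision itself, which is what the questions below are about.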
Questions:
Does XLA currently support multi-backend HLO partitioning/placement (i.e., splitting one HLO module across different backend types)?
Can a PJRT plugin expose device cost/constraints or otherwise influence partitioning during HLO-level compilation?
If not, what’s the recommended approach: implement an XLA pass to consume cost info, or build an orchestration layer that partitions the model and invokes multiple PJRT clients/executables? Which option is more realistic today?
I can prototype either an XLA/HLO pass or an external orchestrator (a rough sketch of the orchestrator direction is below) and am looking for pointers, existing examples, or caveats.
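To make the orchestration option concrete, here is the kind of manual split I can already do today with committed arrays and explicit transfers. This is only a sketch under the assumption that the plugin registers as "myaccel"; the gpu_part/accel_part split is hand-picked rather than cost-driven, which is exactly the part I would like to automate:

```python
import jax
import jax.numpy as jnp

@jax.jit
def gpu_part(x):
    # Hypothetical subgraph that is a good fit for the GPU (dense matmul).
    return jnp.tanh(x @ x.T)

@jax.jit
def accel_part(h):
    # Hypothetical subgraph we believe runs better on the accelerator.
    return jnp.sum(h, axis=0)

gpu = jax.devices("gpu")[0]        # assumes a GPU is present
accel = jax.devices("myaccel")[0]  # "myaccel" is a placeholder plugin name

x = jax.device_put(jnp.ones((128, 128)), gpu)
h = gpu_part(x)               # jit executes on the device x is committed to
h = jax.device_put(h, accel)  # explicit cross-backend transfer (via host)
out = accel_part(h)           # executes on the accelerator
```

This works, but each jitted function compiles against a single backend, so the subgraph boundaries, transfers, and cost trade-offs all live in my Python code rather than in XLA.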
Thanks.