Is there any plan for XLA extensions?


Wenxi Zhu

Jun 3, 2020, 7:45:30 AM
to XLA development
Hi.

Is there any way to have a plugin mechanism so that users can define and run customized ops in XLA? Something like its counterpart in regular (non-XLA) TensorFlow: https://www.tensorflow.org/guide/create_op.
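For reference, the kind of mechanism I mean is what the create_op guide describes for regular TensorFlow; a minimal sketch (the "ZeroOut" op is just the guide's example, not my actual op):

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"

using namespace tensorflow;

// Declare the op's interface: inputs, outputs, and a shape function.
REGISTER_OP("ZeroOut")
    .Input("to_zero: int32")
    .Output("zeroed: int32")
    .SetShapeFn([](shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return Status::OK();
    });

// Provide a kernel implementation and register it for CPU.
class ZeroOutOp : public OpKernel {
 public:
  explicit ZeroOutOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    const Tensor& input_tensor = ctx->input(0);
    auto input = input_tensor.flat<int32>();
    Tensor* output_tensor = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, input_tensor.shape(), &output_tensor));
    auto output = output_tensor->flat<int32>();
    for (int i = 1; i < input.size(); ++i) output(i) = 0;  // zero all but the first element
    if (input.size() > 0) output(0) = input(0);
  }
};

REGISTER_KERNEL_BUILDER(Name("ZeroOut").Device(DEVICE_CPU), ZeroOutOp);

Such an op can be compiled into a shared object separately from TensorFlow and loaded with tf.load_op_library() at runtime; that is the kind of workflow I am hoping for on the XLA side.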

I suppose APIs such as "REGISTER_XLA_OP()" and the "Thunk"-related structures could be exported to help users define the HLO codegen and thunk implementations for their customized ops. I'm wondering if there is a plan, because currently XLA itself is not even included in libtensorflow_framework.so.

And if the answer is yes, would you accept a pull request with the work?


Thank you!
Wenxi

Jacques Pienaar

Jun 3, 2020, 10:38:49 AM
to Wenxi Zhu, XLA development
Hey,

With customized ops, do you mean TF ops, or do you want to add a custom XLA HLO op? If the former, you should already be able to register a kernel that does symbolic expansion today; if the latter, then no, not in XLA.
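For the former case, registering a symbolic-expansion kernel looks roughly like this (a sketch; "MyAddOne" is a made-up TF op that is assumed to be registered already):

#include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
#include "tensorflow/compiler/xla/client/lib/constants.h"
#include "tensorflow/compiler/xla/client/xla_builder.h"

namespace tensorflow {

// Instead of computing values, the kernel expands the op symbolically by
// emitting HLO through the XlaBuilder.
class MyAddOneXlaOp : public XlaOpKernel {
 public:
  explicit MyAddOneXlaOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}

  void Compile(XlaOpKernelContext* ctx) override {
    xla::XlaOp x = ctx->Input(0);
    ctx->SetOutput(0, xla::Add(x, xla::ScalarLike(x, 1)));
  }
};

// Makes the existing TF op "MyAddOne" compilable by XLA.
REGISTER_XLA_OP(Name("MyAddOne"), MyAddOneXlaOp);

}  // namespace tensorflow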

Best,

Jacques


Sanjoy Das

Jun 3, 2020, 2:38:02 PM
to Jacques Pienaar, George Karpenkov, Wenxi Zhu, XLA development
On Wed, Jun 3, 2020 at 7:38 AM 'Jacques Pienaar' via XLA development <xla...@googlegroups.com> wrote:
Hey,

With customized ops, do you mean TF ops, or do you want to add a custom XLA HLO op? If the former, you should already be able to register a kernel that does symbolic expansion today; if the latter, then no, not in XLA.

XLA does have a CustomCall HLO though that you can use to plug in your own CPU/GPU implementation.
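Roughly, following the custom-call documentation (this is a toy sketch, not your op; on the GPU backend the registered function instead takes a CUstream plus a buffers array, and the target is registered for "CUDA" rather than "Host"):

#include "tensorflow/compiler/xla/client/xla_builder.h"
#include "tensorflow/compiler/xla/service/custom_call_target_registry.h"
#include "tensorflow/compiler/xla/shape_util.h"

// CPU-side implementation: the runtime hands you the output buffer first,
// then the array of input buffers.
void do_custom_call(void* out, const void** in) {
  float* out_buf = reinterpret_cast<float*>(out);
  const float* in_buf = reinterpret_cast<const float*>(in[0]);
  for (int i = 0; i < 128; ++i) out_buf[i] = in_buf[i] + 1.0f;  // assumes f32[128]
}
XLA_REGISTER_CUSTOM_CALL_TARGET(do_custom_call, "Host");

// Builder side: emit a CustomCall HLO that names the registered target.
xla::XlaOp EmitMyOp(xla::XlaBuilder* b, xla::XlaOp operand) {
  return xla::CustomCall(b, "do_custom_call", /*operands=*/{operand},
                         /*shape=*/xla::ShapeUtil::MakeShape(xla::F32, {128}));
}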

Can you share some more details on what functionality you need?


-- Sanjoy
 


朱文熙

Jun 4, 2020, 4:59:58 AM
to Sanjoy Das, Jacques Pienaar, George Karpenkov, XLA development
Sanjoy, thank you for the reply!

I've read through the CustomCall documentation; however, I don't think it can fulfill my requirements. Currently it looks to me more like a hack than a fully functional plugin mechanism, because:
1. The custom-call code that users write has to be added to the TensorFlow source tree and compiled with TensorFlow, while official TensorFlow plugins can be compiled separately from TensorFlow and loaded by TensorFlow at runtime. (I haven't tried writing a custom call yet, so please correct me if I'm wrong.)
2. "Custom call doesn't know the dimensions of the buffers it operates over"; that's what I read on its doc page. It looks like the custom-call thunk is not a "first-class citizen" thunk, so it lacks some basic functionality? Are there any other differences between the custom-call thunk and those "first-class citizen" thunks? If so, there will probably be some trouble when writing a serious custom-call implementation, but I don't know.

I'm working on enabling a custom GPU op (kind of like Horovod's allreduce/allgather operation, but with some major differences) to run in XLA. The op is used in a model from Tencent AI Lab and will be deployed in Tencent's datacenters for large-scale multi-machine training. I definitely don't want to do any hacking of existing XLA, because maintaining a modified version of TensorFlow that has diverged from the master branch is high-maintenance, especially in a datacenter. So I'm looking for a plugin mechanism in XLA, but with no luck so far.

I've already described the scenario in my replies to Jacques; let me post it here:
    >>>
    Actually, what I mean is a plugin mechanism, such as exporting "REGISTER_XLA_OP()" to users so they can register a customized op with XLA and decide which HLO instruction (or set of instructions) to generate; and something like "REGISTER_HLO_THUNK()" (an API I imagined, not yet in XLA) should also be exported, to let users provide their own thunk implementations for existing HLO instructions.

    Let me use the popular "horovod" TensorFlow plugin as an example:
REGISTER_OP("HorovodAllreduce")... // Register "HorovodAllreduce" as a TF op
REGISTER_KERNEL_BUILDER(Name("HorovodAllreduce").Device(DEVICE_CPU)... // Register a kernel implementation for "HorovodAllreduce"

    The above is existing Horovod source code. If we provided an XLA plugin mechanism along the following lines:
REGISTER_XLA_OP("HorovodAllreduce", HorovodAllreduceOp) // Register "HorovodAllreduce" as an XLA op, which generates a "kAllReduce" HLO instruction during "ExecuteGraph()"
REGISTER_HLO_THUNK("kAllReduce", HorovodAllreduceThunk) // Register HorovodAllreduceThunk as an implementation of the "kAllReduce" HLO instruction
>>>

From my perspective, such a plugin mechanism would be quite useful and would benefit a broad spectrum of developers.

On Thu, Jun 4, 2020 at 2:38 AM Sanjoy Das <san...@google.com> wrote:

Bairen YI

Jun 4, 2020, 6:04:51 AM
to 朱文熙, Sanjoy Das, Jacques Pienaar, George Karpenkov, XLA development
I would suggest you take a look at nccl_all_reduce_thunk if you are targeting something similar to that.


Best,
Bairen

On 4 Jun 2020, at 17:00, 朱文熙 <zhuwen...@gmail.com> wrote:



朱文熙

Jun 4, 2020, 8:09:55 AM
to Bairen YI, Sanjoy Das, Jacques Pienaar, George Karpenkov, XLA development
Bairen.

I noticed there is an nccl_all_reduce_thunk, and I know it has pretty much the same functionality as HorovodAllreduce. But still, it cannot fulfill my requirements. I think an XLA plugin mechanism would be much more suitable and flexible for my situation.


Thanks
Wenxi 

On Thu, Jun 4, 2020 at 6:04 PM Bairen YI <b...@connect.ust.hk> wrote:

Sanjoy Das

Jun 5, 2020, 3:11:28 AM
to 朱文熙, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
On Thu, Jun 4, 2020 at 1:59 AM 朱文熙 <zhuwen...@gmail.com> wrote:
Sanjoy, thank you for the reply!

I've read through the CustomCall documentation; however, I don't think it can fulfill my requirements. Currently it looks to me more like a hack than a fully functional plugin mechanism, because:
1. The custom-call code that users write has to be added to the TensorFlow source tree and compiled with TensorFlow, while official TensorFlow plugins can be compiled separately from TensorFlow and loaded by TensorFlow at runtime. (I haven't tried writing a custom call yet, so please correct me if I'm wrong.)

Yes, this is correct.
 
2. "Custom call doesn't know the dimensions of buffers it operates over", that's what I read from its doc page.

You get to decide the parameters to the custom call and you can choose to pass this information in via these parameters.  I don't know for sure, but this could just be an efficiency concern -- it may be that this information is redundant for you (the custom call "hard codes" the dimensions) so materializing the shape is extra work that isn't needed.  By making the shapes explicit args you don't pay for what you don't use.
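For example (a sketch only; the names are made up), you can pass the relevant dimensions in as an ordinary operand and read them back inside your implementation:

// Builder side: add an i32 operand carrying the sizes the implementation needs.
xla::XlaOp EmitMyCollective(xla::XlaBuilder* b, xla::XlaOp data, int32_t num_elements) {
  xla::XlaOp dims = xla::ConstantR1<int32_t>(b, {num_elements});
  return xla::CustomCall(b, "my_collective", /*operands=*/{data, dims},
                         /*shape=*/b->GetShape(data).ValueOrDie());
}

// CPU-side implementation: in[0] is the data buffer, in[1] is the dims operand.
void my_collective(void* out, const void** in) {
  const float* data = reinterpret_cast<const float*>(in[0]);
  const int32_t* dims = reinterpret_cast<const int32_t*>(in[1]);
  float* result = reinterpret_cast<float*>(out);
  for (int32_t i = 0; i < dims[0]; ++i) result[i] = data[i];  // placeholder body
}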
 
It looks like the custom-call thunk is not a "first-class citizen" thunk, so it lacks some basic functionality?

Yes, custom calls are not the same as HLOs: the compiler understands the semantics of each HLO, while it does not deeply understand custom calls. So in that sense custom calls are not "first class" like HLOs.
 
Are there any other differences between the custom-call thunk and those "first-class citizen" thunks? If so, there will probably be some trouble when writing a serious custom-call implementation, but I don't know.

I'm working on enabling a custom GPU op (kind of like Horovod's allreduce/allgather operation, but with some major differences) to run in XLA. The op is used in a model from Tencent AI Lab and will be deployed in Tencent's datacenters for large-scale multi-machine training. I definitely don't want to do any hacking of existing XLA, because maintaining a modified version of TensorFlow that has diverged from the master branch is high-maintenance, especially in a datacenter. So I'm looking for a plugin mechanism in XLA, but with no luck so far.

Another (IMO better) alternative is to contribute the change to XLA.  Is that something you'd be willing to do?  If you mainly need to introduce a new HLO and add a corresponding xla::Thunk to the backend, the change will be fairly mechanical.


-- Sanjoy

Wenxi Zhu

Jun 8, 2020, 7:14:00 AM
to Sanjoy Das, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
Sanjoy.

Thanks for the explanations! That totally makes sense to me, especially the "custom call" part.

Just curious, is there any specific reason why XLA can't have a plugin mechanism? The ordinary TensorFlow distribution already provides "REGISTER_OP()" and "REGISTER_KERNEL_BUILDER()" to let users create their own op and a corresponding kernel implementation, so it seems to me that XLA could also export similar APIs for creating the HLO codegen and thunk implementation for a specific user-customized op.

Coming to my situation, my worry is that the op I'm implementing is not general and mature enough to be added into XLA. I understand there wouldn't be too much trouble in adding a new HLO and a corresponding xla::Thunk, but after that, every time our DL scientists change the op's behavior, I would have to change the HLO/thunk implementations and upstream them to the community, which is troublesome in my opinion.

That's why I'm calling for a plugin or extension mechanism through which users can provide their own HLO codegen & thunk implementations for a customized op. With this approach, there would be no need to change the XLA source code and recompile it every time; the plugin would be compiled and maintained separately from the main TensorFlow/XLA source code and loaded at runtime. It is a much more flexible and productive way of working, from my perspective, and I believe it would also benefit a broader spectrum of users.

I'm willing to put in the effort to introduce a plugin/extension mechanism to XLA. Actually, I'm about to create an RFC for further discussion in the community (if it isn't against XLA's roadmap, that is). Would you like to be my sponsor, or connect me with anyone who is suitable for that?


Thanks
Wenxi



On Fri, Jun 5, 2020 at 3:11 PM Sanjoy Das <san...@google.com> wrote:

Sanjoy Das

Jun 8, 2020, 2:38:17 PM
to Wenxi Zhu, Mehdi Amini, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
On Mon, Jun 8, 2020 at 4:14 AM Wenxi Zhu <zhuwen...@gmail.com> wrote:
Sanjoy.

Thanks for the explanations! That totally makes sense to me, especially the "custom call" part.

Just curious, is there any specific reason why XLA can't have a plugin mechanism? The ordinary TensorFlow distribution already provides "REGISTER_OP()" and "REGISTER_KERNEL_BUILDER()" to let users create their own op and a corresponding kernel implementation, so it seems to me that XLA could also export similar APIs for creating the HLO codegen and thunk implementation for a specific user-customized op.

That's exactly the custom-call mechanism (but, as you said, there may be room for improvement).

Coming to my situation, my worry is that the op I'm implementing is not general and mature enough to be added into XLA. I understand there wouldn't be too much trouble in adding a new HLO and a corresponding xla::Thunk, but after that, every time our DL scientists change the op's behavior, I would have to change the HLO/thunk implementations and upstream them to the community, which is troublesome in my opinion.

That's why I'm calling for a plugin or extension mechanism through which users can provide their own HLO codegen & thunk implementations for a customized op. With this approach, there would be no need to change the XLA source code and recompile it every time; the plugin would be compiled and maintained separately from the main TensorFlow/XLA source code and loaded at runtime. It is a much more flexible and productive way of working, from my perspective, and I believe it would also benefit a broader spectrum of users.

I'm willing to put in the effort to introduce a plugin/extension mechanism to XLA. Actually, I'm about to create an RFC for further discussion in the community (if it isn't against XLA's roadmap, that is). Would you like to be my sponsor, or connect me with anyone who is suitable for that?

We are in the process of incrementally porting parts of XLA to the MLIR compiler infra, so this partly depends on your timeline. If you need something in the next 1-2 quarters, then IMO improving XLA's custom-call support makes sense. We can also use your input about what doesn't work well with XLA's custom-call HLO to inform our choices as we move to MLIR.


-- Sanjoy

Mehdi Amini

Jun 8, 2020, 3:19:18 PM
to Sanjoy Das, Wenxi Zhu, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
On Mon, Jun 8, 2020 at 11:38 AM Sanjoy Das <san...@google.com> wrote:



We are in the process of incrementally porting parts of XLA to the MLIR compiler infra, so this partly depends on your timeline. If you need something in the next 1-2 quarters, then IMO improving XLA's custom-call support makes sense. We can also use your input about what doesn't work well with XLA's custom-call HLO to inform our choices as we move to MLIR.

Yeah: coincidentally, we were discussing the topic of exposing custom ops to the compiler just last week. This is something that we intend to make a first-class concept as part of migrating to MLIR dialects to power the compiler integration here.

Wenxi Zhu

Jun 8, 2020, 11:35:51 PM
to Sanjoy Das, Mehdi Amini, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
OK, got it. For now I'm going to start from either (1) creating a new HLO IR and thunk or (2) creating a new custom-call implementation. I will keep you posted and provide feedback from my perspective.


Thanks
Wenxi

On Tue, Jun 9, 2020 at 2:38 AM Sanjoy Das <san...@google.com> wrote:

Wenxi Zhu

Jun 8, 2020, 11:39:22 PM
to Mehdi Amini, Sanjoy Das, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
Great! A real plugin/extension mechanism is always tempting to me in the long term.

Is there a discussion thread that I can get involved in? I'm really interested in the decisions and process you're making around custom ops in MLIR/XLA.



Thanks
Wenxi

On Tue, Jun 9, 2020 at 3:19 AM Mehdi Amini <ami...@google.com> wrote:

Chris Leary

Jun 9, 2020, 12:02:28 AM
to Wenxi Zhu, Mehdi Amini, Sanjoy Das, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
Hi Wenxi,

I can probably speak a bit to the original reasoning.

We really wanted to make sure XLA programs worked functionally correctly across all platforms, with the same set of supported operations. Part of the difficulty in a TensorFlow style model is fragmentation -- you don't easily know on which devices which operations are supported. Just as an example, even for the core set of TensorFlow operations, some implicitly would run on the CPU instead of the GPU device, and it wasn't clear there would ever be an implementation to run specifically on the GPU.

If you could create a "basis" set of operations that are a) supported on all platforms and b) can be composed to make more interesting operations, you can guarantee true portability from one virtual "XLA device" to another. This is what we tried to do.

If you add a custom call in TensorFlow, you're usually not adding a high performance {GPU implementation, CPU implementation, TPU implementation, IPU implementation, WSE implementation, ...} for all possible devices. That means the program you write will be non-portable when it uses this extension, which is what we intended to avoid. By contrast, one can hope all those devices implement the core "basis" set of HLOs.

So, the suggested approach for XLA is hypothetically to:

1. add a custom op in TensorFlow land
2. lower it to an XLA custom call in the tf2xla bridge
3. define a "canonical lowering" of the XLA custom call in terms of the "basis set" of HLO operations, so that other platforms that implement the basis set can still run your op portably
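Concretely, step 2 of that recipe might look like the sketch below ("MyAllReduce" is a made-up op name, not code from the actual bridge); the registration shape is the same REGISTER_XLA_OP pattern mentioned earlier in the thread:

#include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
#include "tensorflow/compiler/xla/client/xla_builder.h"

namespace tensorflow {

class MyAllReduceXlaOp : public XlaOpKernel {
 public:
  explicit MyAllReduceXlaOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}

  void Compile(XlaOpKernelContext* ctx) override {
    xla::XlaOp input = ctx->Input(0);
    auto shape_or = ctx->builder()->GetShape(input);
    OP_REQUIRES_OK(ctx, shape_or.status());
    // Lower the TF op to a CustomCall HLO; each backend then decides how
    // "my_all_reduce" actually runs (or falls back to a core-HLO expansion).
    ctx->SetOutput(0, xla::CustomCall(ctx->builder(), "my_all_reduce",
                                      /*operands=*/{input},
                                      /*shape=*/shape_or.ValueOrDie()));
  }
};

REGISTER_XLA_OP(Name("MyAllReduce"), MyAllReduceXlaOp);

}  // namespace tensorflow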

There /are/ folks who care only about a single platform, but we were truly trying for platform portability while retaining performance, in part because we always wanted to make sure it was possible to evaluate the best hardware for the job, for a given workload, sans accidental lock-in.

The hope from there is that HLO provided a good set for defining fundamental compute and network operations, and that we would "pave the cowpaths" as people came up with new things that were important that seemed like they should be canonical. This process is purposefully human-designers-in-the-loop and sometimes referred to as "curation".

I do think this approach has worked out reasonably well overall -- XLA has a small surface area that's been possible to maintain and really optimize and iterate on with a relatively small team, and is capable of being implemented by a bunch of hardware providers. The strength/weight ratio of HLO has worked out fairly well for use cases most people have talked to us about -- if your HLO is a variant of a currently implemented HLO (e.g. a HorovodAllReduce instead of our normal AllReduce) it is possible you could swap out the registered GPU compiler, since compilers are actually plugins, and replace it with one that emits a different kind of Thunk (which are also open for extension as they're entirely virtual). So pluggability and customization is possible at the registered-compiler-and-how-it-lowers-to-hardware level, I believe.
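The compiler-plugin hook I'm referring to is roughly the factory registration below (MyGpuCompiler is hypothetical, and the exact headers and types vary between TensorFlow versions; note too that only one factory can be registered per platform, so in practice you would replace the stock registration rather than add a second one):

#include "absl/memory/memory.h"
#include "tensorflow/compiler/xla/service/compiler.h"
#include "tensorflow/compiler/xla/service/gpu/nvptx_compiler.h"
#include "tensorflow/stream_executor/cuda/cuda_platform_id.h"

namespace {

// Hypothetical compiler that reuses the stock NVPTX GPU compiler but overrides
// the pieces that decide which Thunk gets emitted for the HLO(s) you care about.
class MyGpuCompiler : public xla::gpu::NVPTXCompiler {
  // ... override the relevant virtual hooks here ...
};

bool RegisterMyGpuCompiler() {
  xla::Compiler::RegisterCompilerFactory(
      stream_executor::cuda::kCudaPlatformId,
      []() { return absl::make_unique<MyGpuCompiler>(); });
  return true;
}

bool registered = RegisterMyGpuCompiler();

}  // namespace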

Hope that helps provide some perspective, and that you find the best path to get your job done! There are definitely some design decisions and intent rolled up in here that I think have worked out pretty well in practice, but like all design tradeoffs they optimize for something potentially at the cost of something else (like you say, TensorFlow has extremely easy pluggability), and because it is all code, nothing is necessarily set in stone (we save setting things in stone for the hardware design process ;-)

Cheers,

- Leary

Wenxi Zhu

Jun 9, 2020, 6:06:12 AM
to Chris Leary, Mehdi Amini, Sanjoy Das, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
Thank you, Chris. Very helpful.

You mentioned the design choices about portability and consistency across devices; I totally agree with that. That's why I don't think adding a new HLO/thunk is the best solution for my situation, although it's probably the fastest way to get my work done.

That's also the reason why I believe a plugin mechanism is suitable and necessary for XLA. It is a non-invasive approach to extending XLA's capability: there would be no need for developers to hack the existing XLA source code and thus have to maintain a modified TensorFlow distribution themselves; And plugin development would be much easier, without needing too much consideration of portability or consistency; a custom op with only one device implementation (such as GPU) would be appropriate. Because users of these plugins know exactly what device/platform they're running on, they just select the appropriate plugin to install. That's my thinking about the plugin mechanism.

If I understand correctly, the design of the XLA/MLIR plugin mechanism you're working on is still at a very early stage, probably purely proof-of-concept (you mentioned it's all code), with no working prototype yet? But there's definitely a plan, and we're marching toward the target, right?


Thanks
Wenxi 

On Tue, Jun 9, 2020 at 12:02 PM Chris Leary <le...@google.com> wrote:

Chris Leary

Jun 9, 2020, 12:48:58 PM
to Wenxi Zhu, Mehdi Amini, Sanjoy Das, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
On Tue, Jun 9, 2020 at 3:06 AM Wenxi Zhu <zhuwen...@gmail.com> wrote:
Thank you, Chris. Very helpful.

You mentioned the design choices about portability and consistency across devices; I totally agree with that. That's why I don't think adding a new HLO/thunk is the best solution for my situation, although it's probably the fastest way to get my work done.

That's also the reason why I believe a plugin mechanism is suitable and necessary for XLA. It is a non-invasive approach to extending XLA's capability: there would be no need for developers to hack the existing XLA source code and thus have to maintain a modified TensorFlow distribution themselves;
 
Just keep in mind it's important to have an implementation of every extension in terms of the core set of HLOs, so that the code remains functionally portable, even if there are faster implementations for single devices.

And plugin development would be much easier, without needing too much consideration of portability or consistency; a custom op with only one device implementation (such as GPU) would be appropriate.

But is it clear that, at scale, single-device implementations cause exactly the fragmentation and non-portability / accidental device lock-in problem that I mentioned we were trying to avoid? This is why I emphasize above that there should be a fallback definition in terms of core HLOs.
 
Because users of these plugins know exactly what device/platform they're running on, they just select the appropriate plugin to install. That's my thinking about the plugin mechanism.

If I understand correctly, the design of the XLA/MLIR plugin mechanism you're working on is still at a very early stage, probably purely proof-of-concept (you mentioned it's all code), with no working prototype yet? But there's definitely a plan, and we're marching toward the target, right?

This shows a custom call coming from TF land in the tf2xla bridge: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/tf2xla/kernels/softmax_op.cc#L45 It's then up to the compiler plugin to handle those ops, e.g. https://github.com/tensorflow/tensorflow/blob/37aaafb0c1baa7acd0607748326cc12faf556277/tensorflow/compiler/xla/service/cpu/ir_emitter.cc#L2461 We've used this as sparingly as we can, because, as we discussed, it tends to make code less portable than using core HLOs and moving the definition of core HLOs forward as needed.

To be clear, I'm not actively working on XLA at the moment so can't help make a decision around this, just trying to provide some historical context / reasoning, so the current state of things makes sense.

Cheers,

- Leary

Wenxi Zhu

unread,
Jun 9, 2020, 11:30:27 PM6/9/20
to Chris Leary, Mehdi Amini, Sanjoy Das, Thomas Joerg, Tim Shen, Jacques Pienaar, George Karpenkov, XLA development
OK, I see. Thank you for the answer, Chris.


Wenxi

On Wed, Jun 10, 2020 at 12:48 AM Chris Leary <le...@google.com> wrote: