Discussion of XLA's best targeting scenarios


Jun Yang

Jan 24, 2018, 7:51:28 PM
to XLA development
Folks,

Recently we have been busy deploying XLA into our production environment. We have also made several technical investigations into other deep-learning compiler optimization stacks, such as NNVM/TVM and TensorRT (yes, in my understanding TensorRT can be regarded as a compiler of sorts, with limited pattern-matching capabilities).

As we dig deeper into these different stacks, I keep coming back to one question: which scenarios are the best fit for an XLA-like solution? I do not think there is any silver bullet here, so let me elaborate my thinking in detail below.

1. XLA behaves most like a classical compiler: it has its own IR, its own IR-level optimization passes, an LLVM-based target codegen backend, and so on.
For mature target platforms, a lot of the functionality XLA provides (mostly around the LLVM backend) overlaps significantly with the platform's own compiler, such as Intel's gcc/icc or NVIDIA's nvcc. I am not sure this is the best approach for every platform, especially mature ones such as NVIDIA and Intel.
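For concreteness, this is the path I mean: once the JIT is switched on (a rough TF 1.x sketch below; the option names are from recent 1.x releases and may differ in other versions), clustered ops are compiled through XLA's HLO passes and its LLVM/PTX backend rather than through nvcc.

import tensorflow as tf

# Ask TensorFlow to JIT-compile compatible op clusters through XLA.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

x = tf.placeholder(tf.float32, shape=[None, 1024])
y = tf.nn.relu(tf.matmul(x, tf.ones([1024, 1024])) + 1.0)

with tf.Session(config=config) as sess:
    print(sess.run(y, feed_dict={x: [[0.5] * 1024]}).shape)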

2. NNVM/TVM behaves most like a domain-specific compiler, i.e. a deep-learning compiler. By design it tries to reuse existing compilers where applicable. For example, on the NVIDIA platform it generates a .cu file rather than emitting PTX directly the way XLA does. TVM also has the flexibility to use LLVM as a codegen backend and already does so for the AMD GPU platform.
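As an illustration of that codegen style, here is a small sketch using the pre-0.6 tvm.* Python API (newer releases move these functions under tvm.te) to build a vector add for the CUDA target and print the generated CUDA C source; it assumes a TVM build with CUDA enabled.

import tvm

# Describe a vector add, give it a simple GPU schedule, and build for CUDA.
n = tvm.var("n")
A = tvm.placeholder((n,), name="A")
B = tvm.placeholder((n,), name="B")
C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")

s = tvm.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, tvm.thread_axis("blockIdx.x"))
s[C].bind(tx, tvm.thread_axis("threadIdx.x"))

vadd = tvm.build(s, [A, B, C], target="cuda", name="vadd")
# The device part is CUDA C source handed to the CUDA toolchain,
# not PTX emitted directly.
print(vadd.imported_modules[0].get_source())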

3. TensorRT is a catalog-based optimization tool. Within TensorRT, NVIDIA enumerates the common graph patterns that are most time-consuming, such as Convolution+Bias+ReLU or Inception's FireModule pattern, and hand-writes highly optimized kernels for those patterns. At execution time, TensorRT simply replaces the original complex patterns with the highly optimized versions, along with some straightforward fusion work.
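Just to illustrate what I mean by "catalog-based", here is a deliberately toy, framework-free sketch in plain Python: a graph is flattened to a list of op names, the catalog maps known sequences to hand-optimized fused kernels, and matching spans are rewritten. TensorRT's real implementation is of course far more sophisticated, and none of these names come from its API.

# Toy catalog of patterns -> hand-written fused kernels (hypothetical names).
CATALOG = {
    ("Conv2D", "BiasAdd", "Relu"): "FusedConvBiasRelu",
}

def replace_patterns(ops, catalog=CATALOG):
    out, i = [], 0
    while i < len(ops):
        for pattern, fused in catalog.items():
            if tuple(ops[i:i + len(pattern)]) == pattern:
                out.append(fused)      # replace the matched span with the fused kernel
                i += len(pattern)
                break
        else:
            out.append(ops[i])         # no pattern starts here; keep the op as-is
            i += 1
    return out

print(replace_patterns(["Conv2D", "BiasAdd", "Relu", "MaxPool"]))
# ['FusedConvBiasRelu', 'MaxPool']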

4. TensorFlow has its own graph optimization passes outside XLA, such as GraphOptimizationPass, Grappler, and core/common_runtime/GraphOptimizer. Within those passes we can perform op-graph-level optimizations such as CSE, dead-node elimination, constant propagation, loop-invariant node motion, and so on. Optimizations performed at this granularity sit at a higher level than those performed on XLA's HLO.
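For example, several of these Grappler passes can be toggled from the TF 1.x session config, roughly as in the sketch below; the exact RewriterConfig field names vary across TensorFlow versions, so treat this as an illustration.

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Toggle a couple of Grappler's graph-level rewriters via the session config.
config = tf.ConfigProto()
rewrite = config.graph_options.rewrite_options
rewrite.constant_folding = rewriter_config_pb2.RewriterConfig.ON
rewrite.arithmetic_optimization = rewriter_config_pb2.RewriterConfig.OFF

with tf.Session(config=config) as sess:
    a = tf.constant(2.0) * tf.constant(3.0)  # folded to 6.0 during graph optimization
    print(sess.run(a))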

Based on the above observations and investigations, I think the compiler optimization framework in TensorFlow could potentially be organized as follows:

1. Coarse-grained graph optimizations are best implemented as TF graph optimization passes, for example loop-invariant node motion (see https://github.com/tensorflow/tensorflow/pull/16306).
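(For anyone unfamiliar with the term, the toy Python sketch below shows the idea behind loop-invariant motion; it is plain Python rather than a TF graph rewrite, but the transformation on the op graph is analogous.)

# Before: the invariant product a * b is recomputed on every iteration.
def scaled_sum(xs, a, b):
    total = 0.0
    for x in xs:
        total += x * (a * b)
    return total

# After loop-invariant motion: the invariant computation is hoisted out.
def scaled_sum_hoisted(xs, a, b):
    ab = a * b
    total = 0.0
    for x in xs:
        total += x * ab
    return total

assert scaled_sum([1.0, 2.0], 3.0, 4.0) == scaled_sum_hoisted([1.0, 2.0], 3.0, 4.0)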

2. Fusions that are used very frequently and whose patterns are not very complex, such as fusing Conv + Bias + ReLU or fusing a primitive-op-based LSTM cell into a FusedLSTMCell, could also be done in the TF graph optimization phase, since most of the time a hand-written macro-op will give better performance (see the sketch after the next paragraph). I think this may be one of the design motivations for TensorFlow Lite choosing macro-op-style rather than compiler-based optimization. We have to admit that op fusion done this way is like TensorRT's catalog-based solution and is not flexible enough, so I would emphasize that it is suitable for those "highly frequent" computation patterns.

One thing we are doing internally is adding a general op-fusion engine to the TF graph optimization passes to do this kind of fusion in a more principled way; after finishing the implementation we would like to share the details with the community.
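To make the macro-op idea in point 2 concrete, here is a minimal TF 1.x sketch contrasting a primitive-op LSTM with the fused contrib cell. The names come from the 1.x contrib namespace and may differ in other releases; the snippet only builds the graph.

import tensorflow as tf

time_steps, batch, input_size, num_units = 20, 8, 32, 64
inputs = tf.placeholder(tf.float32, [time_steps, batch, input_size])  # time-major

# Primitive-op version: unrolled into many small matmul/add/sigmoid/tanh ops.
cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
outputs_a, _ = tf.nn.dynamic_rnn(cell, inputs, time_major=True, dtype=tf.float32)

# Macro-op version: a single hand-optimized fused kernel per time step.
fused_cell = tf.contrib.rnn.LSTMBlockFusedCell(num_units)
outputs_b, _ = fused_cell(inputs, dtype=tf.float32)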

3. TensorRT could be integrated into TensorFlow as a special sub-graph, again as pattern-matching-driven op fusion. NVIDIA is already submitting a related PR (https://github.com/tensorflow/tensorflow/pull/16253), and we are working on the same task internally. TensorRT is just a concrete example here; what is more interesting is that, based on sub-graph pattern replacement (XLA is actually supported in much the same way, just with different implementation details), other third-party libraries and frameworks could be integrated easily.
This is why I like the computation-graph-based design: it leaves enough room for extensibility.
One essential point to keep in mind is that third-party integration may introduce context-switching overhead, so careful performance benchmarking is necessary for this kind of integration.

4. When we want to add support for a new hardware backend (such as an FPGA-based DL accelerator or a new NPU chip), XLA is currently the most suitable choice, since its HLO IR is a good bridge between the high-level computation description and the low-level target implementation. A lot of low-level graph optimization can be built on top of this IR. Also, a new hardware platform usually has no mature existing compiler, so XLA's "native compiler" nature is a good fit there.

5. In line with point 2, for those "infrequent" computation patterns that have fusion potential, it is unwise to add fusion in a catalog-based way, since that would require tedious, repetitive manual work. XLA is currently one choice for supporting such "infrequent" pattern fusion in a principled way, but I would suggest that we could also support it another way.
For example, internally we have found TVM flexible and productive enough for generating CUDA kernels in some ad-hoc scenarios (such as special-shape matmul/conv or fusions), so I think we could add TVM as a new backend for TensorFlow. We could either add TVM support directly to the TF graph optimization phase, letting that phase emit the TVM DSL for the fused patterns and then having TVM generate the corresponding .cu kernels to leverage NVCC; or we could add TVM as a new backend of XLA, adding an HLO2TVM emitter phase (this would require changes to XLA's IREmitter, and we may also need an LLVMIR2TVM backend) to transform HLO into the TVM DSL, again leveraging NVCC. In either design we delegate most of the codegen work to TVM, so we can focus on customizing and optimizing TVM itself.
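To give a flavor of the kind of ad-hoc fused kernel we have in mind, below is a small sketch that fuses bias-add and ReLU into one CUDA kernel using the pre-0.6 tvm.* API (again assuming a CUDA-enabled TVM build); the HLO2TVM or graph-level emitter that would produce such code automatically is, of course, the part that remains to be built.

import tvm

# Fuse bias-add and ReLU into a single elementwise CUDA kernel.
n = tvm.var("n")
x = tvm.placeholder((n,), name="x")
bias = tvm.placeholder((n,), name="bias")
y = tvm.compute((n,), lambda i: tvm.max(x[i] + bias[i], tvm.const(0, "float32")),
                name="bias_relu")

s = tvm.create_schedule(y.op)
bx, tx = s[y].split(y.op.axis[0], factor=128)
s[y].bind(bx, tvm.thread_axis("blockIdx.x"))
s[y].bind(tx, tvm.thread_axis("threadIdx.x"))

kernel = tvm.build(s, [x, bias, y], target="cuda", name="fused_bias_relu")
print(kernel.imported_modules[0].get_source())  # generated CUDA C, compiled via the CUDA toolchain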

Any suggestions or feedback would be highly welcome.





Bjarke Roune

Jan 25, 2018, 4:42:34 PM
to XLA development


On Wednesday, January 24, 2018 at 4:51:28 PM UTC-8, Jun Yang wrote:

1. XLA behaves most like a classical compiler: it has its own IR, its own IR-level optimization passes, an LLVM-based target codegen backend, and so on.
For mature target platforms, a lot of the functionality XLA provides (mostly around the LLVM backend) overlaps significantly with the platform's own compiler, such as Intel's gcc/icc or NVIDIA's nvcc. I am not sure this is the best approach for every platform, especially mature ones such as NVIDIA and Intel.

I think XLA is actually a hybrid of this and all the other approaches here. E.g. on GPU we use cuDNN for convolutions (reusing existing infrastructure), and on TPU our convolution, matrix multiplication, and other ops are implemented by hand, just in a way with extension points that allow substantial fusion even into a hand-written kernel. E.g. if you express a convolution as a decomposed set of mathematical operations, it will work, but you'll lose a lot of efficiency because you miss out on the power of our hand-coded implementations. I think as long as XLA's IR fits what you're trying to do, there shouldn't be any reason to avoid using it, since XLA is able to embody all of these approaches and we do do that - though maybe I'm biased by working on XLA. :)

Note that we don't see a speed-up from hand-coding kernels in every case. We tried an experiment with hand-coding a fused batch norm, and it turned out that our general, infrastructure-derived batch norm is just as good. As each particular backend, and XLA in general, matures, this should hold in more and more cases.

Bjarke Roune

Jan 25, 2018, 4:44:23 PM
to XLA development


Oh, this is not to say that trying other approaches is bad, or that another approach couldn't be better than XLA-all-the-time-for-all-the-things; I'm just self-interestedly pointing out that it's not a given that XLA couldn't do all of this with some improvements. :)

Andy Davis

Jan 25, 2018, 7:06:03 PM
to XLA development
I agree with Bjarke here; I feel like XLA does combine many of these approaches (but that should not discourage experimentation with other systems). If there are benchmarks of models that run better with TVM or TensorRT, please send them our way and we can work on improving XLA. In addition, if there are usability issues with XLA versus other frameworks, please send us the use cases. Thanks!