The purpose of operation fusion using TensorFlow XLA-JIT on CPU backend

Martin Z

Nov 23, 2017, 11:34:08 PM

to XLA development

Can anyone give me any hints as to why XLA-JIT has better performance on the CPU backend?

I tried TensorFlow without and with XLA-JIT (manual mode) on the MNIST benchmark on a single CPU. Using XLA-JIT achieves a 13.6x speedup over TensorFlow without XLA-JIT, which is quite significant, so I decided to find out what is under the hood.

As operation fusion is often mentioned when talking about the advantages of XLA-JIT, I naturally thought this technique might be the reason behind it, so I studied the source code and found that the fusion procedure is roughly like this (please correct me if anything is wrong):

Check whether any operations in an HloComputation (CompOld) can be fused;
If so, a new Fusion instruction is added to CompOld, and the fused operations are removed from CompOld;
Then a new HloComputation (CompNew) is created, consisting of the fused operations. The Fusion instruction added to CompOld holds a pointer to CompNew.
When it comes to the backend, LLVM IR is emitted independently for CompOld and CompNew.
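
The steps above can be sketched with a toy IR in plain Python. This is only an illustration of the procedure as described in this post: the `Instr` and `Computation` classes, the `ELEMENTWISE` set, and `fuse_elementwise` are hypothetical stand-ins, not XLA's real data structures or its actual fusion heuristics. It also assumes the elementwise ops form a single chain whose only result is the last op.

```python
from dataclasses import dataclass, field

# Hypothetical toy IR; these are NOT XLA's real classes.
@dataclass
class Instr:
    name: str
    opcode: str                        # e.g. "parameter", "add", "fusion"
    operands: list = field(default_factory=list)
    fused_computation: object = None   # set only on "fusion" instructions

@dataclass
class Computation:
    name: str
    instructions: list = field(default_factory=list)

# Assumed set of fusible opcodes, chosen for illustration only.
ELEMENTWISE = {"add", "multiply", "exp"}

def fuse_elementwise(comp_old):
    """Move the elementwise ops of comp_old into a new computation and
    replace them in comp_old with a single fusion instruction."""
    # Step 1: check whether there is anything worth fusing.
    fusible = [i for i in comp_old.instructions if i.opcode in ELEMENTWISE]
    if len(fusible) < 2:
        return None
    fused_names = {i.name for i in fusible}

    # Step 3: the fused ops live in a new computation (CompNew).
    comp_new = Computation("fused_computation", fusible)

    # Operands produced outside the fused region become fusion operands.
    external, seen = [], set()
    for instr in fusible:
        for op in instr.operands:
            if op.name not in fused_names and op.name not in seen:
                seen.add(op.name)
                external.append(op)

    # Step 2: add the fusion instruction to CompOld, drop the fused ops.
    fusion = Instr("fusion", "fusion", external, comp_new)
    comp_old.instructions = [
        i for i in comp_old.instructions if i.name not in fused_names
    ] + [fusion]
    return fusion

# Toy graph computing exp((p0 + p1) * p0).
p0 = Instr("p0", "parameter")
p1 = Instr("p1", "parameter")
add = Instr("add", "add", [p0, p1])
mul = Instr("mul", "multiply", [add, p0])
exp = Instr("exp", "exp", [mul])
comp_old = Computation("entry", [p0, p1, add, mul, exp])

fusion = fuse_elementwise(comp_old)
old_names = [i.name for i in comp_old.instructions]
new_names = [i.name for i in fusion.fused_computation.instructions]
fusion_operand_names = [op.name for op in fusion.operands]
```

After the call, `comp_old` holds only the two parameters and the fusion instruction, while the add, multiply, and exp ops sit in the computation that the fusion instruction points to.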

Considering the significant performance improvement, I think there must be something more that I am missing or mistaken about. May I have your advice?

Justin Lebar

Nov 23, 2017, 11:51:30 PM

to Martin Z, XLA development

Fusion allows us to do operations without materializing them to memory. This can dramatically increase our "arithmetic intensity" -- the amount of work we do divided by the number of loads and stores we do. Fusion can also allow us to elide entirely operations that merely shuffle things around in memory (e.g. reshapes).
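
A minimal sketch of the arithmetic-intensity argument, in plain Python rather than anything XLA emits. It computes exp(a * b + c) two ways and counts memory operations under a deliberately simple model (every array element read is one load, every array element written is one store; caches and registers are ignored). The function names and the counting model are assumptions made for illustration.

```python
import math

def unfused(a, b, c):
    """One loop per operation; every intermediate is a full array in memory."""
    n = len(a)
    mem_ops = 0
    t1 = [x * y for x, y in zip(a, b)]   # loads a, b; stores t1
    mem_ops += 2 * n + n
    t2 = [x + y for x, y in zip(t1, c)]  # loads t1, c; stores t2
    mem_ops += 2 * n + n
    out = [math.exp(x) for x in t2]      # loads t2; stores out
    mem_ops += n + n
    return out, mem_ops

def fused(a, b, c):
    """A single loop; intermediates stay in local variables."""
    n = len(a)
    out = []
    for x, y, z in zip(a, b, c):         # loads a, b, c
        out.append(math.exp(x * y + z))  # stores out
    mem_ops = 3 * n + n
    return out, mem_ops

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.5, 0.5, 0.5]
c = [0.0, 1.0, 0.0, 1.0]

out_unfused, mem_unfused = unfused(a, b, c)
out_fused, mem_fused = fused(a, b, c)

flops = 3 * len(a)  # one multiply, one add, one exp per element
intensity_unfused = flops / mem_unfused
intensity_fused = flops / mem_fused
```

Under this model both versions do the same three operations per element, but the fused loop performs half as many loads and stores, so its arithmetic intensity is twice as high.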

For more information about this technique, see e.g. https://arxiv.org/pdf/1601.05400.pdf (random paper I found by googling "operation fusion programming").

--
You received this message because you are subscribed to the Google Groups "XLA development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xla-dev+u...@googlegroups.com.
To post to this group, send email to xla...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/xla-dev/3d63f1f6-81f8-4f69-8101-56932cdfe2c0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Martin Z

Nov 24, 2017, 12:00:19 AM

to XLA development

Thank you for your quick reply. I will learn more about this technique. Btw, I noticed there are many other optimization passes at the HLO IR level. Do you think the fusion pass plays the most important role in achieving such speedups? Thanks.

On Friday, November 24, 2017 at 3:51:30 PM UTC+11, Justin Lebar wrote:

Justin Lebar

Nov 24, 2017, 12:20:24 PM

to Martin Z, XLA development

> I noticed there are many other optimization passes at the HLO IR level. Do you think the fusion pass plays the most important role in achieving such speedups?

The two main structural advantages that XLA has over TensorFlow Classic are (a) the ability to form novel fusions (TF Classic does have fusions, but they're all identified by humans ahead of time) and (b) knowledge of all shapes' sizes, statically.

In my estimation fusion is probably the more important one of these two, although knowing shape sizes can help a lot e.g. on the GPU, where it lets us avoid integer divisions and shorten some indices to 32 bits.
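
To make the static-shape point concrete, here is a small plain-Python sketch. It is not how XLA's emitters work; it only illustrates why a compile-time-known dimension helps. The names `unravel_dynamic`, `unravel_static`, `COLS`, and `fits_in_int32` are hypothetical, and the column count is assumed to be a power of two so the division can be written as a shift.

```python
INT32_MAX = 2**31 - 1

def unravel_dynamic(i, cols):
    """Column count known only at run time: needs a real division."""
    return i // cols, i % cols

COLS = 8  # assumed to be known at compile time, and a power of two

def unravel_static(i):
    """Column count is the constant 8: division becomes shift and mask."""
    return i >> 3, i & 7

def fits_in_int32(shape):
    """True if every linear index into an array of this shape fits in 32 bits."""
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    return num_elements - 1 <= INT32_MAX

# The two index computations agree for every element of an 8x8 array.
same = all(unravel_dynamic(i, COLS) == unravel_static(i) for i in range(64))

small_ok = fits_in_int32((1024, 1024))    # about one million elements
large_ok = fits_in_int32((65536, 65536))  # 2**32 elements
```

With shapes known statically, both decisions (replace the division, use the narrower index type) can be made once at compile time instead of being paid for on every element at run time.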
