Difference between loop fusion, input fusion and output fusion

sdama...@gmail.com

Feb 18, 2019, 10:54:57 AM

to XLA development

From what I understand:

- Loop fusion refers to the case where two ops have conformable loops with element-wise operations that can be fused together so that the intervening results are reused on device

- Input fusion appears to refer to fusion into a reduction or scatter op. Is this because of the change in data size?

- Output fusion appears to be specific to fusion of a dot operation into its consumer. I do not understand why this is treated differently from a reduction operation.

Sanjoy Das

Feb 18, 2019, 1:18:09 PM

to sdama...@gmail.com, XLA development

I think you're basically correct, see
https://github.com/tensorflow/tensorflow/blob/6adb75f80028ad6c919e37f399c63b0a2b5fedf1/tensorflow/compiler/xla/service/hlo_instruction.h#L280
for details.

sdama...@gmail.com

Feb 18, 2019, 1:48:22 PM

to XLA development

Thanks for the link, Sanjoy! I have a few additional questions based on the code comments.

1. Input Fusion:

Generate code for all inputs and then generate code for the root of the computation tree within the fusion computation. Write results to memory.

Why is this restricted to reduction/scatter ops?

2. Output Fusion:

Primary node is not the root of the computation tree. Output must be forwarded to consumer and not written to memory. Output must alias one of the inputs (i.e. A = A op B).

Why does this need special handling?

3. Loop Fusion:

The comment says that loop fusion requires special codegen.

Can you provide a high-level example for how this is different from input fusion?

Thanks,

Sana

Sanjoy Das

Feb 18, 2019, 4:26:01 PM

to sdama...@gmail.com, XLA development

On Mon, Feb 18, 2019 at 10:48 AM <sdama...@gmail.com> wrote:
> 1. Input Fusion:
> Generate code for all inputs and then generate code for the root of the computation tree within the fusion computation. Write results to memory.
> Why is this restricted to reduction/scatter ops?
>

> 3. Loop Fusion:
> The comment says that loop fusion requires special codegen.
> Can you provide a high-level example for how this is different from input fusion?

I think (1) and (3) are related.

The in-practice definition of kLoop fusion is that it chains together
N elemental IR emitters. Elemental IR emitters codegen *one* element
in the output of an HLO given a) its index and b) the elemental IR
emitters for all of the operands to the HLO. E.g. the elemental IR
emitter for tanh is `tanh(Idx) =
llvm.tanh(ElementalIrEmitterForOperand(Idx))` and the elemental IR
emitter for transpose is `transpose({I,J}) =
ElementalIrEmitterForOperand({J,I})`. Most operations have an
associated elemental IR emitter and can thus participate in kLoop
fusions, but for some operations it can be tricky to implement an
elemental IR emitter:

- We have tiled codegen for reductions which does not produce the
output one element at a time. So it is difficult to write an
elemental IR emitter since it needs to (efficiently) produce *one*
element in the output, given that element's index.

- Scatter has a "table" that maps an input index to an output index.
So given an output index we can't efficiently tell which input index
that output "comes from" (this would require a linear scan over the
table).

So we special-case these ops as input fusions. The emitters for
reduce and scatter get elemental IR emitters for each of their inputs
and they can invoke these elemental IR emitters in some non-trivial
way to produce their outputs -- we don't enforce the elemental
IR emitter structure (given an output index produce only that output
element) on the scatter and reduce emitters.
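
A rough Python sketch of the input-fusion idea described above. This is
not XLA code: the function names, the tile size, and the choice of
addition as the scatter combiner are all assumptions made for
illustration. The root op (reduce or scatter) owns the loop structure
and calls the fused producer, modelled as a plain callable from index
to value, in whatever order suits it.

```python
# Hypothetical names throughout; a sketch, not XLA's emitters.

def reduce_rows_input_fusion(operand, rows, cols, tile=2):
    """Row-sum whose input is an elemental emitter, consumed tile by tile."""
    out = [0.0] * rows
    for j0 in range(0, cols, tile):            # walk the reduced dimension in tiles
        for i in range(rows):
            for j in range(j0, min(j0 + tile, cols)):
                out[i] += operand((i, j))      # fused producer, evaluated inline
    return out

def scatter_add_input_fusion(updates, indices, out_size):
    """Scatter-add that iterates over the *input*; `indices` is the "table"."""
    out = [0.0] * out_size
    for k, dst in enumerate(indices):          # one pass over the table, no inverse lookup
        out[dst] += updates((k,))
    return out

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
squared = lambda idx: x[idx[0]][idx[1]] ** 2   # elementwise producer fused into the reduce
row_sums = reduce_rows_input_fusion(squared, rows=2, cols=3)

u = [10.0, 20.0, 30.0]
doubled = lambda idx: 2.0 * u[idx[0]]          # elementwise producer fused into the scatter
scattered = scatter_add_input_fusion(doubled, indices=[2, 0, 2], out_size=4)
```

Neither function can be asked for a single output element by index.
That is the sense in which they fall outside the elemental emitter
structure, even though their inputs are elemental emitters.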

Elemental IR emitters are defined in
tensorflow/compiler/xla/service/elemental_ir_emitter.cc
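
To make the chaining concrete, here is a minimal Python sketch of the
tanh and transpose examples given earlier in this message. It models an
elemental emitter as a closure from an output index to one value; the
real emitters generate LLVM IR instead of computing values, and every
name below is hypothetical.

```python
import math

# Hypothetical names; this models the idea only and is not XLA's API.

def parameter(array):
    """Leaf emitter: reads one element of a 2-D nested list."""
    return lambda idx: array[idx[0]][idx[1]]

def tanh(operand):
    """tanh(idx) = math.tanh(operand(idx))"""
    return lambda idx: math.tanh(operand(idx))

def transpose(operand):
    """transpose((i, j)) = operand((j, i))"""
    return lambda idx: operand((idx[1], idx[0]))

def run_loop_fusion(root, shape):
    """One loop nest over the output; each element is computed on demand."""
    rows, cols = shape
    return [[root((i, j)) for j in range(cols)] for i in range(rows)]

x = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]                      # shape (2, 3)
fused = tanh(transpose(parameter(x)))      # chain of three elemental emitters
out = run_loop_fusion(fused, (3, 2))       # output shape (3, 2)
```

No intermediate array for the transpose is ever materialized: the single
loop nest asks the root for each output element, and the request is
forwarded through the chain down to a read of `x`.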

> 2. Output Fusion:
> Primary node is not the root of the computation tree. Output must be forwarded to consumer and not written to memory. Output must alias one of the inputs (i.e. A = A op B).
> Why does this need special handling?

kOutput is the dual of kInput fusion -- instead of allowing arbitrary
elemental operations to happen "before" the operation with complex
tiling, it allows arbitrary (in principle, not in practice!) elemental
operations to happen "after" the operation with complex tiling.

E.g. on CPU we fuse the bias-add in "Ax+B" by creating a kOutput
fusion for the GEMM. The GEMM produces its output using some
complicated tiling scheme and does the bias-add before writing the
result to memory. Again, we don't want an elemental implementation
for the GEMM because that would force us to produce the output one
element at a time, which would be inefficient.
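
A toy Python sketch of the bias-add case, under stated assumptions: the
tiling scheme, the function name, and the `epilogue` callback are
invented for illustration and do not reflect how XLA's CPU GEMM is
actually implemented. The point is only that the matmul keeps control
of the loop structure and applies the fused consumer to each tile just
before the single store to memory.

```python
# Hypothetical sketch, not XLA's GEMM emitter.

def tiled_matmul_output_fusion(a, b, epilogue, tile=2):
    """C = epilogue(A @ B), with the epilogue applied per tile before the store."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            i_hi, j_hi = min(i0 + tile, m), min(j0 + tile, n)
            # Accumulate one output tile in a local buffer.
            acc = [[0.0] * (j_hi - j0) for _ in range(i_hi - i0)]
            for p in range(k):
                for i in range(i0, i_hi):
                    for j in range(j0, j_hi):
                        acc[i - i0][j - j0] += a[i][p] * b[p][j]
            # Fused consumer runs on the tile, followed by the only write to c.
            for i in range(i0, i_hi):
                for j in range(j0, j_hi):
                    c[i][j] = epilogue(acc[i - i0][j - j0], (i, j))
    return c

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3x2
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]     # 2x3
bias = [10.0, 20.0, 30.0]                  # one bias value per output column
result = tiled_matmul_output_fusion(a, b, lambda v, idx: v + bias[idx[1]])
```

Without the fusion, the un-biased product would be written to memory
and then read back by a separate elementwise pass.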

-- Sanjoy
