On Mon, Feb 18, 2019 at 10:48 AM <sdama...@gmail.com> wrote:
> 1. Input Fusion:
> Generate code for all inputs and then generate code for the root of the computation tree within the fusion computation. Write results to memory.
> Why is this restricted to reduction/scatter ops?
>
> 3. Loop Fusion:
> The comment says that loop fusion requires special codegen.
> Can you provide a high-level example for how this is different from input fusion?
I think (1) and (3) are related.
The in-practice definition of kLoop fusion is that it chains together
N elemental IR emitters. Elemental IR emitters codegen *one* element
in the output of an HLO given a) its index and b) the elemental IR
emitters for all of the operands to the HLO. E.g. the elemental IR
emitter for tanh is `tanh(Idx) =
llvm.tanh(ElementalIrEmitterForOperand(Idx))` and the elemental IR
emitter for transpose is `transpose({I,J}) =
ElementalIrEmitterForOperand({J,I})`. Most operations have an
associated elemental IR emitter and can thus participate in kLoop
fusions, but for some operations it can be tricky to implement an
elemental IR emitter:
- We have tiled codegen for reductions which does not produce the
output one element at a time. So it is difficult to write an
elemental IR emitter since it needs to (efficiently) produce *one*
element in the output, given that element's index.
- Scatter has a "table" that maps an input index to an output index.
So given an output index we can't efficiently tell which input index
that output "comes from" (this would require a linear scan over the
table).
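To make the chaining concrete, here is a minimal Python sketch of the idea. It is not XLA code: the helper names (`parameter_emitter`, `tanh_emitter`, `transpose_emitter`, `run_loop_fusion`) are hypothetical, and the "emitters" compute values directly instead of emitting LLVM IR. Each emitter only knows how to produce one output element from an index plus the emitter of its operand, and the loop is nothing more than a walk over the output indices.

```python
import math

# Hypothetical model: an "elemental emitter" is a function that maps one
# output index (a tuple) to the value of that single output element.

def parameter_emitter(array):
    """Leaf emitter: reads one element of a 2-D nested list."""
    def emit(idx):
        i, j = idx
        return array[i][j]
    return emit

def tanh_emitter(operand):
    """tanh(Idx) = tanh(operand(Idx))"""
    def emit(idx):
        return math.tanh(operand(idx))
    return emit

def transpose_emitter(operand):
    """transpose({I,J}) = operand({J,I})"""
    def emit(idx):
        i, j = idx
        return operand((j, i))
    return emit

def run_loop_fusion(root, shape):
    """Loop over every output index and ask the root emitter for that element."""
    rows, cols = shape
    return [[root((i, j)) for j in range(cols)] for i in range(rows)]

x = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]  # shape 2x3

# Chain: tanh(transpose(x)); no intermediate array is ever materialized.
fused = tanh_emitter(transpose_emitter(parameter_emitter(x)))
result = run_loop_fusion(fused, (3, 2))  # transposed shape is 3x2
```

Every op in the chain answers the same question ("what is the element at this output index?"), which is exactly the question that is awkward to answer for a tiled reduction or a scatter.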
So we special-case these ops as input fusions. The emitters for
reduce and scatter get elemental IR emitters for each of their inputs
and can invoke them in whatever non-trivial way they need to produce
their outputs -- we don't enforce the elemental IR emitter structure
(given an output index, produce only that output element) on the
scatter and reduce emitters.
Elemental IR emitters are defined in
tensorflow/compiler/xla/service/elemental_ir_emitter.cc
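As a rough illustration of the input-fusion shape described above, the following Python sketch models a 1-D scatter at the root of a fusion. The function name `scatter_input_fusion` and its parameters are hypothetical, and it simply overwrites the target element, which is a simplifying assumption made for this sketch. The point is that the root iterates over the *input* (update) index space and calls the fused producer's elemental emitter per element, instead of being asked for one output index at a time.

```python
def scatter_input_fusion(operand, indices, updates_emitter, num_updates):
    """Hypothetical 1-D scatter acting as the root of an input fusion.

    `updates_emitter` is the elemental emitter of the fused producer of
    the updates; the scatter decides when and how to call it.
    """
    out = list(operand)
    for k in range(num_updates):            # walk the input index space
        target = indices[k]                 # "table": input index -> output index
        out[target] = updates_emitter((k,)) # overwrite semantics assumed here
    return out

raw_updates = [1.0, 2.0, 3.0]

# Fused elementwise producer: negate each update before it is scattered.
def negate(idx):
    return -raw_updates[idx[0]]

out = scatter_input_fusion([0.0] * 5, [4, 0, 2], negate, 3)
```

Going the other way (given output index 2, find which update wrote it) would need a scan over `indices`, which is why scatter does not fit the elemental structure.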
> 2. Output Fusion:
> Primary node is not the root of the computation tree. Output must be forwarded to consumer and not written to memory. Output must alias one of the inputs (i.e. A = A op B).
> Why does this need special handling?
kOutput is the dual of kInput fusion -- instead of allowing arbitrary
elemental operations to happen "before" the operation with complex
tiling, it allows arbitrary (in principle, not in practice!) elemental
operations to happen "after" the operation with complex tiling.
E.g. on CPU we fuse the bias-add in "Ax+B" by creating a kOutput
fusion for the GEMM. The GEMM produces its output using some
complicated tiling scheme and does the bias-add before writing the
result to memory. Again, we don't want an elemental implementation
for the GEMM because that would force us to produce the output one
element at a time, which would be inefficient.
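A toy Python sketch of the GEMM-plus-bias case, under stated assumptions: `tiled_matmul_with_epilogue` and its `epilogue` callback are hypothetical names, the tiling is a naive 2x2 blocking chosen only for illustration, and nothing here reflects how the CPU backend actually tiles a GEMM. It shows the GEMM owning the loop structure while the fused elementwise op is applied to each result just before it is stored.

```python
def tiled_matmul_with_epilogue(a, x, epilogue, tile=2):
    """Compute a @ x tile by tile, applying `epilogue` before each store."""
    m, k, n = len(a), len(x), len(x[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):            # the GEMM chooses the iteration order
        for j0 in range(0, n, tile):
            for i in range(i0, min(i0 + tile, m)):
                for j in range(j0, min(j0 + tile, n)):
                    acc = 0.0
                    for p in range(k):
                        acc += a[i][p] * x[p][j]
                    # Fused "after" op runs on the value before it hits memory.
                    out[i][j] = epilogue((i, j), acc)
    return out

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [[1.0, 0.0], [0.0, 1.0]]            # identity, so a @ x == a
b = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]]

def bias_add(idx, value):
    return value + b[idx[0]][idx[1]]

y = tiled_matmul_with_epilogue(a, x, bias_add)
```

The un-biased product is never written out as a separate array, which is the saving the fusion is after.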
-- Sanjoy