Fusion vs Call


dav...@graphcore.ai

Aug 13, 2017, 5:17:58 AM
to XLA development
I thought that this conversation should have its own thread.

The theory is that fusion is not intended for finding clusters of ops that map to specific backend-supported operations (e.g. 'average pool').  So I am going to migrate our backend to use Call instead of Fusion. As previously discussed, I am not using CustomCall because it doesn't allow the op to find the replaced instructions safely.

Call is not as easy to use as I first thought it would be.

1) While Call allows me to get back to the original instructions, it doesn't have anywhere to store the general type of the replacement.  I have used the name of the sub-computation. That works, although the name is mangled by a uniquifier.  I can work around this, but I am now assuming that the uniquifier will only append a '.' followed by arbitrary data, and that no other mangling of the name prior to the '.' will occur.  I don't think this is a big issue.
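Under that assumption, recovering the original tag from a uniquified sub-computation name is a one-line split. A minimal sketch (the function name is mine, not an XLA API):

```python
# Assumes the uniquifier only appends '.' plus arbitrary data to the name.
def original_tag(uniquified_name):
    # "avg_pool.37" -> "avg_pool"; a name without a '.' is returned unchanged
    return uniquified_name.split('.', 1)[0]
```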

2) Unfortunately the code that extracts ops and replaces them with a Call isn't as sophisticated as the code that extracts ops and replaces them with a Fusion.  Specifically, the fusion code can take a sub-graph in which some nodes are used by ops outside the sub-graph and do the right thing: duplicate the node into the fusion computation and leave the original in the main graph.  Consider extracting the const+add from this graph:

const       param
 \            /
  --- add ----
  |    |
  |    |
  \    /
    sub
     |

OutlineExpressionFromComputation cannot do it, but CreateFusionInstruction can. The error from OutlineExpressionFromComputation is "The subcomputation to outline has multiple outputs:", because it doesn't allow the const to remain outside the fusion while being duplicated within it.

   Main comp              Fusion comp

const      param         const   param
 |           |             \       /
 |         fusion           --add--
 \           /                 |
  \         /
   -- sub --
       |

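A toy model of that duplicating behaviour (nodes as dicts with an "operands" list; this is my sketch, not the XLA API): a const in the outlined subset that also has users outside it is cloned into the fused computation and kept in the main graph.

```python
def fuse(nodes, subset):
    """Split `nodes` into (fused, remaining). A const in `subset` that is
    also used outside it appears in both lists, mirroring what
    CreateFusionInstruction does (operand rewiring is omitted here)."""
    def users(n):
        return [m for m in nodes if n in m["operands"]]
    fused, remaining = [], []
    for n in nodes:
        if n in subset:
            fused.append(n)
            if n["op"] == "const" and any(u not in subset for u in users(n)):
                remaining.append(n)  # duplicated: stays in the main graph too
        else:
            remaining.append(n)
    return fused, remaining
```

With const/param/add/sub wired as in the diagram above, fusing {const, add} leaves the const in both computations, which is exactly the shape the outliner refuses to produce.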
The solutions seem to be:

1) use a common code base for OutlineExpressionFromComputation and CreateFusionInstruction.
2) keep using CreateFusionInstruction, without extending the enumeration, but add general annotations to the HloInstruction.

Option 1 seems like the better solution if you intend to eliminate fusion.  Would it be OK for me to make the OutlineExpressionFromComputation code use the fusion code for extraction?


David

dav...@graphcore.ai

Aug 13, 2017, 1:55:26 PM
to XLA development
I have looked at the code of OutlineExpressionFromComputation.  The output-count check exists because the root of the new computation is found by looking for the single instruction that has an external output.  When there are two, it isn't obvious which should be the root of the new graph.

I presume this code could be used to outline instructions like Send/Outfeed, which are not dependencies of the root but cannot be pruned.  If that were not true, the final instruction would necessarily be the output, and you wouldn't need the output check.  But since it is true, the check is needed.
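That root-finding check can be sketched on a toy graph representation (nodes as dicts with an "operands" list; my sketch, not the XLA code):

```python
def find_root(nodes, subset):
    """Return the unique node in `subset` that has a user outside it,
    raising the outliner's error when there is more than one."""
    outside = [m for m in nodes if m not in subset]
    roots = [n for n in subset
             if any(n in m["operands"] for m in outside)]
    if len(roots) > 1:
        raise ValueError("The subcomputation to outline has multiple outputs")
    return roots[0]  # IndexError if the subset has no external output at all
```

On the const+add example from earlier in the thread, outlining {const, add} finds two roots (both are used by sub) and raises, while outlining {add} alone succeeds.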

Since fusion and the outliner each provide features the other does not, I suspect they cannot be reconciled.  I will create my own version of the outliner which doesn't support Send/Outfeed-type instructions, and so can remove instructions by user count as the fusion code does.

Cheers

David




Martin Z

Nov 23, 2017, 10:49:07 PM
to XLA development
Thanks for sharing. If I understand correctly, Call and Fusion are similar in terms of executing the fused operations, so could you please give me any hints on the performance advantages of Fusion compared with not fusing?

I learned from the source code that if several operations in an HloComputation (CompOld) can be fused, a new Fusion instruction is added to CompOld and the fused operations are removed from CompOld. Then a new HloComputation (CompNew) is created, consisting of the fused operations. The added Fusion instruction in CompOld has a pointer to CompNew. When it comes to the backend, the LLVM IR is emitted independently for both CompOld and CompNew. So what are the performance advantages of doing fusion?

On Sunday, August 13, 2017 at 7:17:58 PM UTC+10, dav...@graphcore.ai wrote:

dav...@graphcore.ai

Nov 28, 2017, 6:39:00 AM
to XLA development
I don't really know much about the LLVM backend.  From what I understand, the fusion mechanism is for combining operations at a low level, which helps direct the LLVM compiler to produce more optimized code.

The call mechanism is more like a traditional call operation: a subcomputation is called, the caller's parameters are passed into the subcomputation, and its output is taken and put back into the main graph.

I suspect that the performance advantages are backend dependent.

Again, I cannot speak for the LLVM backends, but on the Graphcore backend we only use the call mechanism.  Some subcomputations are called in the normal sense described above, but some subcomputations have well-known names and are therefore translated into specific operations that the backend supports (gradient-calculating convolutions, a sigmoid op, a truncated normal distribution random op, etc.).
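A sketch of that dispatch (the table contents and function name are hypothetical; it assumes the uniquifier only appends '.' plus a suffix to the well-known name, as noted earlier in the thread):

```python
# Hypothetical table of well-known sub-computation names -> backend ops.
KNOWN_OPS = {"sigmoid": "SigmoidOp", "truncated_normal": "TruncatedNormalOp"}

def lower_call(computation_name):
    # Strip the uniquifier suffix, then look the tag up; anything unknown
    # is lowered as an ordinary call.
    tag = computation_name.split('.', 1)[0]
    return KNOWN_OPS.get(tag, "GenericCall")
```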


