Hi XLA devs,
I successfully replayed a clustered subgraph from ResNet50 using XLA on both a hardware accelerator and an NVIDIA GPU with the same input tensors.
To bring the accelerator into XLA, I just hacked the LLVM IR emission process to make sure the emitted LLVM IR is suitable for the accelerator. This means all the HLO passes are the same as on the NVIDIA GPU path.
I used numpy.allclose() with a tolerance of 1e-5 to compare the output tensors (the _Retval nodes) from these two devices, and 184 out of 270 output tensors failed the comparison.
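For reference, the comparison I ran looks roughly like this (the sample arrays and the way the per-device outputs are collected are placeholders for my actual setup, and I pass 1e-5 as the absolute tolerance):

```python
import numpy as np

# Placeholder data: in my real run these are the _Retval tensors
# collected from the GPU and the accelerator, in the same order.
outputs_gpu = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
outputs_accel = [np.array([1.0, 2.0, 3.0000001]), np.array([4.0, 5.1])]

# Indices of output tensors that differ beyond the tolerance.
failed = [
    i for i, (a, b) in enumerate(zip(outputs_gpu, outputs_accel))
    if not np.allclose(a, b, atol=1e-5)
]
print(f"{len(failed)} failed out of {len(outputs_gpu)} output tensors")
```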
My question is: how can I nail down the first thunk (kernel/custom-call) that introduced the difference? More generally, is there any way to dump the compute result of a thunk?
PS: I am using tensorflow-r1.15.