Hi XLA team and friends,
I am using tf.function with XLA compilation enabled to build ML models. The models essentially perform calculations on a 2D input tensor whose shape can be as large as (100000, 15).
I created two models that use different calculation methods. In the first method, I loop over each of the 100000 rows of the input tensor, and the loop body performs mathematical operations on that row. In the second method, I remove the loop and operate column-wise, performing the same mathematical operations on entire columns at once.
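To make the comparison concrete, here is a minimal sketch of the two styles. The actual math in my models is different; `row_math` below is just a hypothetical stand-in, and the shape is reduced for a quick check:

```python
import tensorflow as tf

# Hypothetical stand-in for the real per-row math.
def row_math(row):
    return tf.reduce_sum(tf.square(row) + tf.sin(row))

# Method 1: iterate over rows (tf.map_fn traces to a loop in XLA).
@tf.function(jit_compile=True)
def per_row(x):
    return tf.map_fn(row_math, x)

# Method 2: the same math applied to the whole tensor at once,
# reducing across columns instead of looping over rows.
@tf.function(jit_compile=True)
def vectorized(x):
    return tf.reduce_sum(tf.square(x) + tf.sin(x), axis=1)

x = tf.random.uniform((1000, 15))  # smaller than (100000, 15) for a quick check
close = tf.reduce_all(tf.abs(per_row(x) - vectorized(x)) < 1e-4)
print(bool(close))
```

Both functions compute the same result up to floating-point rounding; only the iteration structure differs.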
When testing the two models on GPU, the second method is much faster than the first, as expected. On CPU, however, the result is reversed: the first method is about 1.3x faster than the second, which I did not expect.
When I profile the execution of the two models on CPU with TensorBoard, I get almost identical results, as shown below:
How can I find the root cause of the performance difference between the two models?
Best,
Simon