Understanding performance difference of two calculation methods on CPU


Simon King

Oct 29, 2021, 4:18:50 AM
to XLA development
Hi XLA team and friends,

I am using tf.function to build ML models with XLA compilation enabled. The models basically perform calculations on a 2D input tensor whose shape could be (100000, 15).

I created two models with different calculation methods. In the first method, a loop iterates over each of the input tensor's 100000 rows, and the loop body performs mathematical operations on each row. In the second method, I remove the loop and perform the same mathematical operations column-wise on the whole tensor.
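To make the contrast concrete, here is a minimal NumPy sketch of the two shapes of computation. The actual math in the models is not shown in the thread, so the `x * 2.0 + 1.0` arithmetic below is only a placeholder, and a smaller row count is used for brevity:

```python
import numpy as np

# Placeholder input; the real model uses shape (100000, 15).
x = np.random.default_rng(0).random((1000, 15))

# Method 1: loop over rows, operating on one (15,) slice per iteration,
# analogous to a tf.while loop body over the first axis.
def row_loop(x):
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] * 2.0 + 1.0  # placeholder math, not the real ops
    return out

# Method 2: one vectorized (column-wise) expression over the whole tensor,
# with no per-row loop at all.
def column_wise(x):
    return x * 2.0 + 1.0

# Both methods compute the same result; only the execution shape differs.
assert np.allclose(row_loop(x), column_wise(x))
```

The point of the sketch is that both methods are mathematically identical; the performance question is entirely about how the compiler and hardware execute a loop of small operations versus one large fused operation.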

When testing the two models on GPU, the second method is much faster than the first. On CPU, however, the result flips: the first method is about 1.3 times faster than the second, which is contrary to my expectation.

When I profile the two models' executions on CPU with TensorBoard, I get almost identical results, as shown below:
[attached image: Untitled.png]

How can I find the root cause of the performance difference between the two models?

Best,
Simon

Sean Moriarity

Oct 29, 2021, 12:38:36 PM
to XLA development
As far as I understand, this is pretty much expected.

Assuming you are using something such as `tf.while`, XLA while loops are driven from the CPU, so control has to return to the CPU after every iteration. They are always going to be more expensive than kernels generated from regular operations. You can see the implementation of the GPU while thunk here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/while_thunk.cc#L45

As for the discrepancy on CPU, it would be hard to say without seeing what operations you are doing.

- Sean

Simon King

Oct 29, 2021, 2:29:31 PM
to XLA development

Hi Sean, 

Thanks for your insights on model performance on GPU.

I am very interested in the performance difference on CPU. The model basically uses six types of ops: tf.math.add, tf.math.multiply, tf.gather, tf.while, tf.bitwise.bitwise_xor and tf.reshape. Inside the tf.while loop body, the model first does algebraic operations and then gathers tensor elements.
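To make that op mix concrete, here is a hedged NumPy analogue of a gather-plus-xor loop body, once as a per-row loop and once as a single batched expression. The index pattern and arithmetic below are invented placeholders for illustration, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder integer input and per-row gather indices.
x = rng.integers(0, 2**16, size=(1000, 15), dtype=np.int64)
idx = rng.integers(0, 15, size=(1000, 15))

# Loop version: gather then xor on one row per iteration,
# analogous to tf.gather and tf.bitwise.bitwise_xor in a tf.while body.
def per_row(x, idx):
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i][idx[i]] ^ x[i]
    return out

# Vectorized version: one batched gather via take_along_axis, then one xor.
def batched(x, idx):
    return np.take_along_axis(x, idx, axis=1) ^ x

# Same result either way; only the number of dispatched operations differs.
assert np.array_equal(per_row(x, idx), batched(x, idx))
```

If the real loop body is similarly dominated by gathers on short 15-element rows, the vectorized version trades many tiny operations for a few large, memory-bound ones, which can plausibly favor either method depending on the backend.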

How could I investigate the difference between the two methods on CPU?