Hi TFLite team and friends,
I built a model based on tf.function, and its runtime inference performance is very important for us. While using TensorBoard to profile the model's inference time on GPU, I found that the element-wise add and multiply operations are performed on the CPU instead of the GPU, which causes significant overhead. Here is the code for the element-wise operations in the model:
state = tf.gather(input_text, indices=indices, axis=1)
state = tf.math.add(state, byte_indices)
box = tf.gather(box, indices=state)
# more tf.gather and bit-wise operations
To run the model on GPU, I used:
strategy = tf.distribute.MirroredStrategy()
With TensorBoard, I got the profile UI as below. The op state = tf.math.add(state, byte_indices) runs on the CPU rather than the GPU, which causes unexpected memory-copy overhead.
Is there any way to force the element-wise operations, especially add and multiply, to run on the GPU?
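In case it helps with reproducing this, here is a minimal sketch of how the placement could be checked outside the profiler, by requesting a device explicitly and logging where each op lands. The `lookup` wrapper, the tensor shapes, and the values are made up for illustration and are not from my real model; the sketch falls back to the CPU when no GPU is visible, and I have not confirmed that it changes where the add actually runs.

```python
import tensorflow as tf

# Log the device every op is assigned to; call this before any op runs.
tf.debugging.set_log_device_placement(True)
# Let TensorFlow fall back to another device if an op has no kernel
# on the requested one, instead of raising an error.
tf.config.set_soft_device_placement(True)

# Use the first GPU if one is visible, otherwise stay on the CPU.
gpus = tf.config.list_physical_devices("GPU")
device = "/GPU:0" if gpus else "/CPU:0"


@tf.function
def lookup(input_text, indices, byte_indices, box):
    # Hypothetical stand-in for the gather/add/gather chain in the question.
    with tf.device(device):
        state = tf.gather(input_text, indices=indices, axis=1)
        state = tf.math.add(state, byte_indices)
        return tf.gather(box, indices=state)


# Illustrative inputs only (assumed int32, as index tensors usually are).
input_text = tf.reshape(tf.range(16), [2, 8])
indices = tf.constant([0, 2, 4, 6])
byte_indices = tf.constant([0, 16, 32, 48])
box = tf.range(255, -1, -1)  # box[i] == 255 - i

result = lookup(input_text, indices, byte_indices, box)
print(result.device)  # device holding the final output tensor
```

The placement log lists the device chosen for each op, so the add can be compared against the surrounding gathers without opening the profiler.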