I would like to benchmark individual layers in TensorFlow. In PyTorch this can be done with CUDA events, which ensure proper synchronization with the GPU, and I am wondering whether TensorFlow has an equivalent. I am currently using the naive time.time() method in TensorFlow, but the results look wrong: when benchmarking a Conv2D layer, I get roughly equal times for batch sizes {1, 64, 128, 256}. When benchmarking the same layer in PyTorch with CUDA events, however, the runtime grows with the batch size, which matches my intuition (runtime increases while throughput improves). I suspect TensorFlow launches the GPU kernel asynchronously, so time.time() only captures the dispatch overhead. Sketches of both measurements are below.
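A minimal sketch of the PyTorch measurement, assuming a Conv2d layer and random input (the layer and tensor shapes here are just placeholders):

```python
import torch

conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(128, 3, 224, 224, device="cuda")

# Warm-up so one-time CUDA initialization is not measured
for _ in range(10):
    conv(x)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
conv(x)
end.record()

# Block until all queued GPU work has finished, then read the elapsed time
torch.cuda.synchronize()
print(f"elapsed: {start.elapsed_time(end):.3f} ms")
```

And the naive time.time() approach I am currently using in TensorFlow, for comparison (shapes again illustrative):

```python
import time
import tensorflow as tf

layer = tf.keras.layers.Conv2D(64, 3, padding="same")
x = tf.random.normal((128, 224, 224, 3))

layer(x)  # warm-up / build

t0 = time.time()
y = layer(x)
t1 = time.time()
# The GPU kernel may still be running here; forcing a host copy with
# y.numpy() would block, but that also times the device-to-host transfer.
print(f"naive: {(t1 - t0) * 1000:.3f} ms")
```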
What is the best lightweight way to programmatically profile snippets of TensorFlow GPU code? Running the TF profiler and then inspecting the output in the browser is not ideal.