This is for applications that do benefit from GPU/TPU acceleration, but whose individual kernels (even when fused by XLA) are small enough that the latencies between those kernels matter significantly (in my particular apps, those seem to be in the range of 1-5us).
On top of that, there are latencies between XLA clusters (in my particular apps, those seem to be in the range of 10-100us).
Basically, I'd simply like to know what is happening during those gaps for a particular app, and which functions account for how much of that latency.
Personally, I also find that profiling specific apps is a better way of learning about the design choices and inner workings of a framework than reading documentation or code without context, but that's just a byproduct.
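
To make the kind of measurement concrete: given kernel start/end timestamps exported from some profiler trace, the gaps between consecutive kernels can be computed roughly as below. This is only a sketch under assumptions; the `KernelEvent` record, the `inter_kernel_gaps` helper, and the sample timings are all made up for illustration and are not part of any TensorFlow or XLA API.

```python
from typing import List, NamedTuple, Tuple


class KernelEvent(NamedTuple):
    """Hypothetical record for one kernel execution on a single stream."""
    name: str
    start_us: float
    end_us: float


def inter_kernel_gaps(events: List[KernelEvent]) -> List[Tuple[str, str, float]]:
    """Return (previous kernel, next kernel, idle gap in us) for consecutive kernels."""
    ordered = sorted(events, key=lambda e: e.start_us)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        # Clamp at zero in case two kernels overlap in the trace.
        gaps.append((prev.name, nxt.name, max(0.0, nxt.start_us - prev.end_us)))
    return gaps


# Made-up timings, for illustration only.
sample = [
    KernelEvent("fusion_1", 0.0, 12.0),
    KernelEvent("fusion_2", 15.0, 20.0),
    KernelEvent("fusion_3", 22.0, 30.0),
]
gaps = inter_kernel_gaps(sample)
total_gap_us = sum(gap for _, _, gap in gaps)
busy_us = sum(e.end_us - e.start_us for e in sample)
print(gaps, total_gap_us, busy_us)
```

This only shows how large the gaps are relative to the time spent in kernels. It says nothing about which host-side functions run during those gaps, which is the part I am actually asking about.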