Profiling library overhead

Matthias Jouanneaux DE

Dec 1, 2020, 7:22:49 AM
to XLA development
Hi,

I'm fairly new to XLA.
I realize that there are profilers for the operations themselves, but how do you profile end-to-end, including library overheads (such as the compilation phases, but also the stream executor)?

Thanks,
Matt

George Karpenkov

Dec 1, 2020, 1:51:53 PM
to Matthias Jouanneaux DE, XLA development
Hi Matt,

I think this depends on what exactly you are trying to do. Get an overall profiling picture? Get timings for a particular operation? Create benchmarks?

George

Matthias Jouanneaux DE

Dec 2, 2020, 6:35:30 AM
to XLA development
Hi George,

Best would be an overall profiling picture, something similar to what you would get with callgrind (which I've tried, but haven't gotten to work with my TensorFlow app so far).
Also, I'm not particularly interested in the operations themselves; I'd like to see what happens between two operations.

Thanks,
Matt

George Karpenkov

Dec 2, 2020, 1:22:36 PM
to Matthias Jouanneaux DE, XLA development
But that would make almost no sense for, e.g., running on a GPU, where most of the time on the host (as recorded by the profiler) would simply be spent waiting.

What part are you trying to optimize?

Matthias Jouanneaux DE

Dec 2, 2020, 2:55:36 PM
to XLA development
This is for applications that do benefit from GPU/TPU acceleration, but whose individual kernels (even after XLA fusion) are small enough that the latencies between those kernels matter significantly (in my particular apps, those seem to be in the range of 1-5 us).
Plus, there are latencies between XLA clusters (in my particular apps, those seem to be in the range of 10-100 us).
Basically, I'd simply like to know what's happening there for a particular app, and which functions account for how much of these latencies.
Personally, I also find that profiling specific apps is a better way of learning about a framework's design choices and inner workings than reading documentation or code without context, but that's just a byproduct.

Chris Leary

Dec 2, 2020, 3:53:54 PM
to Matthias Jouanneaux DE, XLA development
Usually for this I believe we tend to use the "TraceMe" instrumentation points (which can capture labels, and I think optionally stack traces as well) and look at them in the xprof profiling tool.
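
From Python, the same kind of event can be emitted with tf.profiler.experimental.Trace, which I believe is a thin wrapper over the C++ TraceMe; a minimal sketch, assuming TF 2.x (the 'my_step' label and the step_num kwarg are illustrative):

import tensorflow as tf

# Each Trace context manager emits one labeled TraceMe event on the host
# timeline; it is a cheap no-op unless a profiler session is active.
def run_step(step):
    with tf.profiler.experimental.Trace('my_step', step_num=step):
        pass  # ... the host-side work you want bracketed in the trace ...

for step in range(3):
    run_step(step)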


I'm not sure how much of xprof is fully open-sourced at this point (I think there was a concerted effort to open-source more of it, last I heard, which was a while back), but the instrumentation points / hooks could still potentially be interesting?

- Chris Leary

Chris Leary

Dec 2, 2020, 3:55:13 PM
to Matthias Jouanneaux DE, Jose Baiocchi Paredes, XLA development
(xprof can also correlate against libcupti launch events, which I think don't cause much extra synchronization and are helpful to see inline with the host instrumentation points; adding +Jose Baiocchi Paredes for the expert info!)
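
On the capture side, how much host- and device-side instrumentation gets recorded can be tuned when starting the profiler; a minimal sketch, assuming TF 2.3+ (the logdir is illustrative, and on GPU the device tracer is the CUPTI-based one):

import tensorflow as tf

options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=3,    # most verbose host (TraceMe) instrumentation
    python_tracer_level=0,  # skip Python call tracing to reduce overhead
    device_tracer_level=1,  # device event capture (CUPTI on GPU)
)
tf.profiler.experimental.start('/tmp/tb_logdir', options=options)
# ... run the region of interest ...
tf.profiler.experimental.stop()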

George Karpenkov

Dec 2, 2020, 5:57:51 PM
to Chris Leary, Matthias Jouanneaux DE, Jose Baiocchi Paredes, XLA development
xprof should be fully open-sourced as a part of tensorboard.

@Matthias: Chris is right, TensorBoard profiler traces are the most useful for this kind of debugging (I believe the NVIDIA profiler was actually able to see those annotations as well?)
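
For reference, a typical capture around an XLA-compiled function might look like this (a minimal sketch, assuming TF 2.4, where XLA compilation is requested via experimental_compile; the logdir and the function are illustrative):

import tensorflow as tf

@tf.function(experimental_compile=True)  # renamed jit_compile in later TF
def step(x):
    return tf.math.reduce_sum(x * x)

x = tf.random.normal([1024])
step(x)  # warm-up run, so XLA compilation happens before the capture

tf.profiler.experimental.start('/tmp/tb_logdir')
for _ in range(100):
    step(x)  # gaps between kernels/clusters show up on the host timeline
tf.profiler.experimental.stop()

# Then: tensorboard --logdir /tmp/tb_logdir  (Profile tab -> trace_viewer)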

Matthias Jouanneaux DE

Dec 4, 2020, 9:51:15 AM
to XLA development
Thank you all for the suggestions!
I'm very new to xprof / the TensorBoard profiler, but I was able to use it and I'm currently investigating further.
