Profiling library overhead

Matthias Jouanneaux DE

Dec 1, 2020, 7:22:49 AM

to XLA development

Hi,

I'm fairly new to XLA.

I realize that there are profilers for individual operations, but how do you profile end-to-end, including library overheads (such as the compilation phases, but also overhead from the stream executor)?

Thanks,

Matt

George Karpenkov

Dec 1, 2020, 1:51:53 PM

to Matthias Jouanneaux DE, XLA development

Hi Matt,

I think this depends on what exactly you are trying to do. Get an overall profiler picture? Get timings for a particular operation? Create benchmarks?

George

--
You received this message because you are subscribed to the Google Groups "XLA development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xla-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/xla-dev/74bf7387-570e-4044-97f2-986c5ad1dfacn%40googlegroups.com.

Matthias Jouanneaux DE

Dec 2, 2020, 6:35:30 AM

to XLA development

Hi George,

An overall profiler picture would be best: something similar to what you would get with callgrind (which I've tried, but haven't gotten to work with my TensorFlow app so far).

Also, I'm not particularly interested in the operations themselves, I'd like to see what happens in between two operations.

Thanks,

Matt

George Karpenkov

Dec 2, 2020, 1:22:36 PM

to Matthias Jouanneaux DE, XLA development

But that would make almost no sense for e.g. running on GPU, where most of the time on the host (as recorded by the profiler) would be simply waiting.

What part are you trying to optimize?

Matthias Jouanneaux DE

Dec 2, 2020, 2:55:36 PM

to XLA development

This is for applications that do benefit from GPU/TPU acceleration, but whose individual kernels (even when fused by XLA) are small enough that the latencies between those kernels matter significantly (in my particular apps, these seem to be in the 1-5 us range).

On top of that, there are latencies between XLA clusters (in my particular apps, these seem to be in the 10-100 us range).

Basically, I'd simply like to know what's happening there for a particular app and which functions are taking how much of these latencies.

Personally, I also find that profiling specific apps is a better way of learning about the design choices and inner workings of a framework than reading documentation or code without context, but that's just a byproduct.
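
To make the quantity in question concrete, here is a minimal sketch of how the idle time between kernels could be computed once kernel start/end timestamps have been exported from some profiler. The event tuples and names below (`fusion_1`, etc.) are hypothetical illustration data, not output from XLA or any specific tool, and the sketch assumes all events are in microseconds on one common timeline.

```python
def idle_gaps(events):
    """Return (previous_kernel, next_kernel, gap_us) for every idle gap.

    `events` is a list of (name, start_us, end_us) tuples (hypothetical format).
    Overlapping kernels produce no gap; only strictly positive idle time is reported.
    """
    ordered = sorted(events, key=lambda e: e[1])  # sort by start time
    gaps = []
    if not ordered:
        return gaps
    last_name, _, last_end = ordered[0]
    for name, start, end in ordered[1:]:
        if start > last_end:
            gaps.append((last_name, name, start - last_end))
        if end > last_end:
            # Track the kernel that finishes latest so far.
            last_name, last_end = name, end
    return gaps


# Hypothetical kernel intervals, in microseconds.
events = [
    ("fusion_1", 0.0, 12.0),
    ("fusion_2", 15.0, 20.0),
    ("fusion_3", 24.0, 30.0),
]
gaps = idle_gaps(events)
total_gap_us = sum(gap for _, _, gap in gaps)
total_busy_us = sum(end - start for _, start, end in events)
```

Comparing `total_gap_us` against `total_busy_us` gives a rough idea of how much of the timeline is spent between kernels rather than inside them. It does not say which host functions account for the gaps; that needs host-side instrumentation.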

Chris Leary

Dec 2, 2020, 3:53:54 PM

to Matthias Jouanneaux DE, XLA development

Usually for this I believe we tend to use the "TraceMe" instrumentation points (which can capture labels, and I think also optionally stack traces) and look at them in the xprof profiling tool.

See, for example: https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc;l=521?q=traceme%20file:xla&ss=tensorflow

I'm not sure how much of xprof is fully open sourced at this point (I think there was a concerted effort to open source more of it last time I heard, which was a while back), but the instrumentation points / hooks could still be interesting potentially?

- Chris Leary
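
The link above points at a C++ TraceMe call site inside XLA. From Python, a similar host-side annotation can be sketched with `tf.profiler.experimental.Trace`, which is part of the TensorFlow 2.x profiler API. The helper names `trace_region` and `run_steps` and the label `host_step` are hypothetical, and the fallback to a no-op context manager is only an assumption that lets the sketch run where TensorFlow is not installed.

```python
import contextlib

try:
    import tensorflow as tf
except ImportError:  # assumption: allow the sketch to run without TensorFlow
    tf = None


def trace_region(name, **metadata):
    """Return a context manager that labels a host-side region for the profiler.

    Hypothetical helper: uses tf.profiler.experimental.Trace when TensorFlow is
    available, otherwise a no-op context manager.
    """
    if tf is None:
        return contextlib.nullcontext()
    return tf.profiler.experimental.Trace(name, **metadata)


def run_steps(step_fn, num_steps):
    """Call step_fn(step) num_steps times, wrapping each call in a named trace region."""
    results = []
    for step in range(num_steps):
        with trace_region("host_step", step_num=step):
            results.append(step_fn(step))
    return results


# Stand-in workload; a real app would launch its XLA-compiled computation here.
outputs = run_steps(lambda step: step * 2, 3)
```

The annotations are only recorded while a profiling session is active, for example between `tf.profiler.experimental.start(logdir)` and `tf.profiler.experimental.stop()`. The resulting trace can then be opened in the TensorBoard profiler.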

Chris Leary

Dec 2, 2020, 3:55:13 PM

to Matthias Jouanneaux DE, Jose Baiocchi Paredes, XLA development

(xprof can also correlate against libcupti launch events which I think don't cause much extra synchronization and are helpful to see inline with the host instrumentation points, adding +Jose Baiocchi Paredes for the expert info!)

George Karpenkov

Dec 2, 2020, 5:57:51 PM

to Chris Leary, Matthias Jouanneaux DE, Jose Baiocchi Paredes, XLA development

xprof should be fully open-sourced as a part of tensorboard.

@Matthias: Chris is right, TensorBoard profiler traces are the most useful for this kind of debugging (I believe the NVIDIA profiler was actually able to see those annotations as well?)

Matthias Jouanneaux DE

Dec 4, 2020, 9:51:15 AM

to XLA development

Thank you all for the suggestions!

I'm very new to xprof/tensorboard profiler, but I was able to use it and I'm currently investigating further.
