Huge measuring overhead for some cuda:::metric* events (CUPTI Metric API)


Frank Winkler

Sep 3, 2020, 9:17:52 AM
to ptools-perfapi

Hi all,

During some PAPI measurements of Kokkos kernels on V100 GPUs on Summit using the PAPI-Kokkos connector, I discovered a huge measurement overhead for some cuda:::metric* events, in particular "cuda:::metric:inst_integer" and "cuda:::metric:flop_count_dp".

Is this issue already known?

To exclude PAPI as the cause, I checked the mentioned events in a small CUPTI example which is part of PAPI's cuda component:


The following shows the GPU processing time for a simple reduce kernel:

./cudaTest_cupti_only inst_integer
GPU Processing time (includes setup and memcpy overheads): 4.036579 (ms)

./cudaTest_cupti_only flop_count_dp
GPU Processing time (includes setup and memcpy overheads): 3.899511 (ms)

./cudaTest_cupti_only inst_executed
GPU Processing time (includes setup and memcpy overheads): 1.022937 (ms)

When measuring events from the CUPTI Event API, such as "cuda:::event:inst_executed", the kernel takes about 1 ms. With "cuda:::metric:inst_integer" (CUPTI Metric API) it takes about four times as long.

I can reproduce this behavior for the Kokkos Kernel "sparse_spgemm" with large input matrices such as "nd24k" using the PAPI-Kokkos connector.
In this example the measurement takes 18 s with "cuda:::event:inst_executed" and 33 s with "cuda:::metric:flop_count_dp".
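
For reference, the slowdown factors implied by the numbers above can be worked out with a short Python sketch. The timings are copied from the measurements in this post; the function and variable names are only for illustration and are not part of PAPI or CUPTI.

```python
# Timings copied from the measurements quoted in this post.
# All names below are illustrative only, not part of PAPI or CUPTI.

def slowdown(baseline, measured):
    """Return how many times longer `measured` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return measured / baseline

# cudaTest_cupti_only reduce kernel, GPU processing time in ms
reduce_ms = {
    "inst_executed": 1.022937,   # CUPTI Event API, used as baseline
    "inst_integer": 4.036579,    # CUPTI Metric API
    "flop_count_dp": 3.899511,   # CUPTI Metric API
}

# Kokkos sparse_spgemm with nd24k, measuring time in s
spgemm_s = {
    "inst_executed": 18.0,       # CUPTI Event API, used as baseline
    "flop_count_dp": 33.0,       # CUPTI Metric API
}

def report(label, timings, baseline_key="inst_executed"):
    """Print and return the slowdown of every entry relative to the baseline."""
    base = timings[baseline_key]
    factors = {}
    for name, value in timings.items():
        if name == baseline_key:
            continue
        factors[name] = slowdown(base, value)
        print(f"{label}: {name} is {factors[name]:.2f}x slower than {baseline_key}")
    return factors

reduce_factors = report("reduce", reduce_ms)
spgemm_factors = report("sparse_spgemm", spgemm_s)
```

With these inputs the factors come out at roughly 3.9x and 3.8x for the reduce kernel and about 1.8x for sparse_spgemm.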

According to the CUPTI documentation, the profiling overhead for software instrumented events can be enormous, see: https://docs.nvidia.com/cupti/Cupti/r_main.html#r_overhead_profiling

I think "cuda:::metric:flop_count_dp" falls into this category.

I am just wondering whether anyone has encountered this problem before...

Thanks,
Frank

Kevin Huck

Sep 3, 2020, 9:48:32 AM
to Frank Winkler, ptools-perfapi
Frank -

I think what you are seeing is overhead from the CUPTI metric/event measurement.  The current CUPTI metric/event measurement (deprecated in compute 7.5 devices) has notoriously high overhead.  Compute 7.0 introduced the new PerfWorks API in CUPTI with supposedly lower overhead, but most open source tools (including PAPI, I believe) haven’t ported to the new API.  We (the TAU team) have been investigating the PerfWorks API, but haven’t finished our port either.

Thanks -
Kevin



Frank Winkler

Sep 3, 2020, 10:13:20 AM
to Kevin Huck, ptools-perfapi
Thanks for the hint!

Best,
Frank