Hi all,
During some PAPI measurements of Kokkos kernels on V100 GPUs on Summit, using the PAPI-Kokkos connector, I discovered a huge measurement overhead for some cuda:::metric* events, in particular "cuda:::metric:inst_integer" and "cuda:::metric:flop_count_dp".
Is this issue already known?
To exclude PAPI as the cause, I checked the mentioned events in a small CUPTI-only example which is part of PAPI's cuda component.
The following shows the GPU processing time for a simple reduce kernel:
./cudaTest_cupti_only inst_integer
GPU Processing time (includes setup and memcpy overheads): 4.036579 (ms)
./cudaTest_cupti_only flop_count_dp
GPU Processing time (includes setup and memcpy overheads): 3.899511 (ms)
./cudaTest_cupti_only inst_executed
GPU Processing time (includes setup and memcpy overheads): 1.022937 (ms)
When measuring an event from the CUPTI Event API such as "cuda:::event:inst_executed", the run takes about 1 ms. With "cuda:::metric:inst_integer" (CUPTI Metric API) it takes about 4 ms, i.e. roughly four times as long.
I can reproduce this behavior with the Kokkos kernel "sparse_spgemm" on large input matrices such as "nd24k", using the PAPI-Kokkos connector.
In this example, the measurement takes 18 s with "cuda:::event:inst_executed" and 33 s with "cuda:::metric:flop_count_dp".
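For reference, the slowdown factors implied by the numbers above can be recomputed with a short script. This is only a sketch of the arithmetic: the timings are copied from the runs above, while the variable and function names are made up for illustration and are not part of PAPI or CUPTI.

```python
# Timings copied from the runs above; all names here are illustrative only.
cupti_ms = {
    "inst_integer": 4.036579,   # CUPTI Metric API
    "flop_count_dp": 3.899511,  # CUPTI Metric API
    "inst_executed": 1.022937,  # CUPTI Event API (used as baseline)
}
kokkos_s = {
    "cuda:::event:inst_executed": 18.0,   # baseline for sparse_spgemm / nd24k
    "cuda:::metric:flop_count_dp": 33.0,
}


def slowdown(times, baseline):
    """Return each timing divided by the timing of the baseline event."""
    base = times[baseline]
    return {name: t / base for name, t in times.items()}


cupti_ratio = slowdown(cupti_ms, "inst_executed")
kokkos_ratio = slowdown(kokkos_s, "cuda:::event:inst_executed")

# Print one line per event with its slowdown relative to the baseline.
for name, ratio in {**cupti_ratio, **kokkos_ratio}.items():
    print(f"{name}: {ratio:.2f}x")
```

With these inputs, the two metric events come out at roughly 3.9x and 3.8x the baseline in the CUPTI-only test, and at roughly 1.8x in the Kokkos run.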
I think "cuda:::metric:flop_count_dp" falls into the same high-overhead category.
I am just wondering whether anyone has encountered this problem before.
Thanks,
Frank