Re: confusing output for CUDA

Heike Jagode

Jul 29, 2021, 7:06:51 PM
to Kaufmann, Steve, perfap...@icl.utk.edu
Steve,

Thanks for reporting this!
I just added a fix for this to the repo. The papi_component_avail utility should report the correct numbers now.

Please let us know if there are any issues.

Thanks again!
Heike


On Thu, Jul 29, 2021 at 4:31 PM Kaufmann, Steve <steven....@hpe.com> wrote:
Not sure if I am interpreting this correctly, but when I configure PAPI with 'cuda' and run papi_component_avail on a node with an NVIDIA GPU attached, it reports no native events (-1). But if my next command is papi_native_avail, the available events for the GPU are listed. See attached.

How do I reconcile what papi_component_avail says and what papi_native_avail lists?

This is NOT CUDA 11 stuff.

Thanks!

Steve



--
______________________________________
Heike Jagode, Ph.D., Research Asst. Professor
Innovative Computing Laboratory, University of Tennessee Knoxville
http://icl.utk.edu/~jagode/

Heike Jagode

Aug 5, 2021, 7:18:48 PM
to Kaufmann, Steve, perfap...@icl.utk.edu
Hi Steve,

I see that your cuda and nvml components are "disabled". Therefore, they shouldn't be listed as "active components" at all. That is still a pending issue we will look into.

The "-1" issue is fixed for components that are active and not disabled.
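The pending issue described above (components flagged as disabled that still appear under "Active components") can be checked mechanically from the utility's text output. The following is a minimal sketch, assuming the output keeps the layout quoted in this thread ("Name:" lines, with an indented "\-> Disabled:" line directly under a disabled component). The function name `disabled_but_active` and the trimmed `SAMPLE` text are illustrative only and are not part of PAPI.

```python
def disabled_but_active(text):
    """Return {name: reason} for components marked Disabled yet listed as active."""
    disabled = {}      # component name -> reason it was disabled
    active = set()     # names seen under "Active components:"
    section = None
    last_name = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Compiled-in components:"):
            section = "compiled"
        elif stripped.startswith("Active components:"):
            section = "active"
        elif stripped.startswith("Name:"):
            parts = stripped.split()
            last_name = parts[1] if len(parts) > 1 else None
            if section == "active" and last_name:
                active.add(last_name)
        elif (stripped.startswith("\\-> Disabled:")
              and section == "compiled" and last_name):
            # The reason is everything after the first "Disabled:" marker.
            disabled[last_name] = stripped.split("Disabled:", 1)[1].strip()
    return {name: reason for name, reason in disabled.items() if name in active}


# Trimmed sample in the same layout as the listings in this thread.
SAMPLE = r"""
Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
   \-> Disabled: CUDA initialization (cuInit) failed: no CUDA-capable device is detected
Name:   infiniband              Linux Infiniband statistics using the sysfs interface
   \-> Disabled: Infiniband sysfs interface not found

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 233, Preset: 23, Counters: 6

Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
                                Native: -1, Preset: 0, Counters: -1
"""

if __name__ == "__main__":
    for name, reason in disabled_but_active(SAMPLE).items():
        print(f"{name}: disabled but listed as active ({reason})")
```

In practice the text would come from saving the output of papi_component_avail to a file and reading it in. A component that is disabled and correctly absent from the active list (infiniband in the sample) is not reported.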


Before the fix, I was able to reproduce your reported error:

[jagode@b04 bin]$ ./papi_component_avail
....
Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
Name:   nvml                    NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 164, Preset: 56, Counters: 10
                                PMUs supported: ix86arch, perf, perf_raw, hsw_ep

Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
                                Native: 850, Preset: 0, Counters: 112
                                PMUs supported: rapl, hswep_unc_cbo0, hswep_unc_cbo1, hswep_unc_cbo2, hswep_unc_cbo3
                                                hswep_unc_cbo4, hswep_unc_cbo5, hswep_unc_cbo6, hswep_unc_cbo7, hswep_unc_cbo8
                                                hswep_unc_cbo9, hswep_unc_ha0, hswep_unc_ha1, hswep_unc_imc0, hswep_unc_imc1
                                                hswep_unc_imc4, hswep_unc_imc5, hswep_unc_pcu, hswep_unc_qpi0, hswep_unc_qpi1
                                                hswep_unc_ubo, hswep_unc_r2pcie, hswep_unc_r3qpi0, hswep_unc_r3qpi1
                                                hswep_unc_sbo0, hswep_unc_sbo1, hswep_unc_sbo2, hswep_unc_sbo3

Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
                                Native: -1, Preset: 0, Counters: -1

Name:   nvml                    NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)
                                Native: -1, Preset: 0, Counters: -1



After the fix, I get the following:


[jagode@b04 bin]$ ./papi_component_avail
...
Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
Name:   nvml                    NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 164, Preset: 56, Counters: 10
                                PMUs supported: ix86arch, perf, perf_raw, hsw_ep

Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
                                Native: 850, Preset: 0, Counters: 112
                                PMUs supported: rapl, hswep_unc_cbo0, hswep_unc_cbo1, hswep_unc_cbo2, hswep_unc_cbo3
                                                hswep_unc_cbo4, hswep_unc_cbo5, hswep_unc_cbo6, hswep_unc_cbo7, hswep_unc_cbo8
                                                hswep_unc_cbo9, hswep_unc_ha0, hswep_unc_ha1, hswep_unc_imc0, hswep_unc_imc1
                                                hswep_unc_imc4, hswep_unc_imc5, hswep_unc_pcu, hswep_unc_qpi0, hswep_unc_qpi1
                                                hswep_unc_ubo, hswep_unc_r2pcie, hswep_unc_r3qpi0, hswep_unc_r3qpi1
                                                hswep_unc_sbo0, hswep_unc_sbo1, hswep_unc_sbo2, hswep_unc_sbo3

Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
                                Native: 792, Preset: 0, Counters: 792

Name:   nvml                    NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)
                                Native: 72, Preset: 0, Counters: 72
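To compare the "before" and "after" listings above without reading them by eye, the counts under "Active components" can be pulled out with a short script. This is a sketch only: it assumes the "Native: N, Preset: N, Counters: N" line format shown in this thread, and the names `active_counts` and `suspicious` are hypothetical helpers, not PAPI utilities.

```python
import re

# Matches the counts line printed under each active component.
COUNTS = re.compile(
    r"Native:\s*(-?\d+),\s*Preset:\s*(-?\d+),\s*Counters:\s*(-?\d+)"
)


def active_counts(text):
    """Map each active component name to its (native, preset, counters) tuple."""
    counts = {}
    in_active = False
    name = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Active components:"):
            in_active = True
            continue
        if not in_active:
            continue  # ignore the "Compiled-in components" section
        if stripped.startswith("Name:"):
            parts = stripped.split()
            name = parts[1] if len(parts) > 1 else None
            continue
        match = COUNTS.search(stripped)
        if match and name is not None:
            counts[name] = tuple(int(group) for group in match.groups())
    return counts


def suspicious(counts):
    """Names of components whose native-event or counter total is negative."""
    return sorted(
        name
        for name, (native, _preset, counters) in counts.items()
        if native < 0 or counters < 0
    )


# Trimmed sample in the layout of the "before the fix" listing.
SAMPLE = """\
Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 164, Preset: 56, Counters: 10
                                PMUs supported: ix86arch, perf, perf_raw, hsw_ep

Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
                                Native: -1, Preset: 0, Counters: -1
"""

if __name__ == "__main__":
    print(suspicious(active_counts(SAMPLE)))
```

Running the same two functions on the "after the fix" listing should give an empty list, since cuda and nvml then report 792 and 72 native events.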


Thanks,
Heike

--------------------------------------------------------------------------------

On Thu, Aug 5, 2021 at 4:04 PM Kaufmann, Steve <steven....@hpe.com> wrote:
Hi Heike,

I've applied your fix but am still having issues, i.e., the number of events is still -1. I am also noticing that events now require the leading "PMU:::" when the event is not a "perf_event" event. I wonder whether this is related: I looked at the code in papi_internal.c where the ":::" is evaluated, and the logic appears to set up a dummy event using -1's (?). These -1's may be short-circuiting the search for the event name, so that not all components are searched (depending on the order in which they are configured).

Thanks!
Steve

...
Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
   \-> Disabled: No uncore PMUs or events found
Name:   cray_cuda               Nvidia GPU hardware counters
   \-> Disabled: CUDA runtime library unavailable.
Name:   cray_zenl3              Cray AMD Zen Level 3 Cache performance counters
Name:   cray_pm                 Cray Power Management counters
Name:   cray_rapl               Cray RAPL energy measurements
Name:   cray_cassini            HPE Cray Cassini NIC performance counters
   \-> Disabled: CRAYPE_NETWORK_TARGET is 'ofi' and not relevant for the cray_cassini component.
Name:   cray_npu                Cray network interconnect performance counters
   \-> Disabled: CRAYPE_NETWORK_TARGET environment variable contains a value that is not valid
Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
   \-> Disabled: CUDA initialization (cuInit) failed: no CUDA-capable device is detected
Name:   nvml                    NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)
   \-> Disabled: The NVIDIA management library failed to initialize.
Name:   infiniband              Linux Infiniband statistics using the sysfs interface
   \-> Disabled: Infiniband sysfs interface not found

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 233, Preset: 23, Counters: 6
                                PMUs supported: perf, perf_raw, amd64_fam19h_zen3

Name:   cray_zenl3              Cray AMD Zen Level 3 Cache performance counters
                                Native: 4, Preset: 0, Counters: 6

Name:   cray_pm                 Cray Power Management counters
                                Native: 14, Preset: 0, Counters: 14

Name:   cray_rapl               Cray RAPL energy measurements
                                Native: 5, Preset: 0, Counters: 5

Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
                                Native: -1, Preset: 0, Counters: -1

Name:   nvml                    NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)
                                Native: -1, Preset: 0, Counters: -1
...
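The hypothesis in Steve's message, that a -1 count could cut an event-name search short so that later components are never examined, can be illustrated with a toy model. Everything below is hypothetical: the `COMPONENTS` table, the event names, and both lookup functions are invented for illustration and do not reproduce the logic in papi_internal.c. The sketch only shows why the order of components would matter if such a short circuit existed.

```python
# Toy model only: this table and these functions are hypothetical and are
# not taken from PAPI. Each entry is (component, native_count, event names).
COMPONENTS = [
    ("perf_event", 2, ["cycles", "instructions"]),
    ("cuda", -1, []),            # disabled component reporting -1 native events
    ("cray_pm", 1, ["energy"]),
]


def lookup_short_circuit(event):
    """Hypothesised faulty search: a negative count aborts the whole scan."""
    for name, native_count, events in COMPONENTS:
        if native_count < 0:
            return None          # components after this one are never searched
        if event in events:
            return name
    return None


def lookup_skip_disabled(event):
    """Alternative: treat a negative count as 'nothing here' and keep going."""
    for name, native_count, events in COMPONENTS:
        if native_count < 0:
            continue
        if event in events:
            return name
    return None


if __name__ == "__main__":
    for event in ("cycles", "energy"):
        print(event, lookup_short_circuit(event), lookup_skip_disabled(event))
```

In this model an event owned by a component configured before the -1 entry is still found, while one owned by a later component is not, which matches the order dependence described in the message. Whether PAPI behaves this way is not confirmed anywhere in this thread.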

From: Heike Jagode <jag...@icl.utk.edu>
Sent: Thursday, July 29, 2021 6:06 PM
To: Kaufmann, Steve <steven....@hpe.com>
Cc: perfap...@icl.utk.edu <perfap...@icl.utk.edu>
Subject: Re: confusing output for CUDA
 