Hi Kevin,
If PAPI is configured (built and installed) with the cuda component, e.g. ./configure --with-components=cuda, but there is no NVIDIA device on the node, then this component will be disabled, and PAPI should work as usual. The same policy applies to all other PAPI components.
I tried to reproduce this on our local machine, and it looks like the following:
(1)
I clone PAPI, then configure, build, and install it on the login node (where there are no GPUs):
./configure --prefix=$PWD/install --with-components="cuda nvml"
make && make install
(2)
I run papi_component_avail to see what components are enabled / disabled:
[jagode@login bin]$ ./papi_component_avail
....
Compiled-in components:
Name: perf_event Linux perf_event CPU counters
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
\-> Disabled: No uncore PMUs or events found
Name: cuda CUDA events and metrics via NVIDIA CuPTI interfaces
\-> Disabled: CUDA initialization (cuInit) failed: no CUDA-capable device is detected
Name: nvml NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)
\-> Disabled: The NVIDIA management library failed to initialize.
Active components:
Name: perf_event Linux perf_event CPU counters
Native: 179, Preset: 65, Counters: 6
PMUs supported: nhm_ex, ix86arch, perf, perf_raw
As you can see, the cuda and nvml components are disabled.
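If you want to check this from a script rather than by eye, here is a minimal sketch. It assumes the output layout shown above, where a "\-> Disabled:" line directly follows the "Name:" line of the affected component. The helper name list_disabled is made up for this example and is not part of PAPI.

```shell
# Hypothetical helper (not part of PAPI): reads papi_component_avail output
# on stdin and prints "<component>: <reason>" for each disabled component.
# Assumption: a "Disabled:" line always follows the "Name:" line it refers to.
list_disabled() {
    awk '
        # Remember the most recent component name (first word after "Name:").
        /Name:/ {
            name = $0
            sub(/^.*Name:[ \t]*/, "", name)
            split(name, f, /[ \t]+/)
            name = f[1]
        }
        # Report the reason given for the component seen last.
        /Disabled:/ {
            reason = $0
            sub(/^.*Disabled:[ \t]*/, "", reason)
            print name ": " reason
        }
    '
}

# Usage: ./papi_component_avail | list_disabled
```

On the login node above, this should print one line each for perf_event_uncore, cuda, and nvml, and nothing on the GPU node.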
(3)
I continue using that PAPI installation to monitor a non-GPU event:
[jagode@login bin]$ ./papi_command_line PAPI_TOT_INS
Successfully added: PAPI_TOT_INS
PAPI_TOT_INS : 200623004
(4)
Now, if I run papi_component_avail (from the same PAPI installation) on a node that has GPUs, then these components become enabled and cuda events can be collected:
[jagode@b04 bin]$ ./papi_component_avail
...
Compiled-in components:
Name: perf_event Linux perf_event CPU counters
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
Name: cuda CUDA events and metrics via NVIDIA CuPTI interfaces
Name: nvml NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)
Active components:
Name: perf_event Linux perf_event CPU counters
Native: 162, Preset: 56, Counters: 10
PMUs supported: ix86arch, perf, perf_raw, hsw_ep
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
Native: 850, Preset: 0, Counters: 112
PMUs supported: rapl, hswep_unc_cbo0, hswep_unc_cbo1, hswep_unc_cbo2, hswep_unc_cbo3
hswep_unc_cbo4, hswep_unc_cbo5, hswep_unc_cbo6, hswep_unc_cbo7, hswep_unc_cbo8
hswep_unc_cbo9, hswep_unc_ha0, hswep_unc_ha1, hswep_unc_imc0, hswep_unc_imc1
hswep_unc_imc4, hswep_unc_imc5, hswep_unc_pcu, hswep_unc_qpi0, hswep_unc_qpi1
hswep_unc_ubo, hswep_unc_r2pcie, hswep_unc_r3qpi0, hswep_unc_r3qpi1
hswep_unc_sbo0, hswep_unc_sbo1, hswep_unc_sbo2, hswep_unc_sbo3
Name: cuda CUDA events and metrics via NVIDIA CuPTI interfaces
Native: 792, Preset: 0, Counters: 792
Name: nvml NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)
Native: 72, Preset: 0, Counters: 72
Since you mention that you built your own PAPI library, can you reconfigure your local PAPI so that it includes the cuda component, and report back whether you are still running into issues? I'm asking because your backtrace shows "_cray_cuda_init_component at components/cray_cuda", and we don't have a cray_cuda component. So I don't know whether there is a difference between the PAPI cuda component and the "cray_cuda" component from your backtrace.
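To see which of the two names a given installation reports, something like the following sketch may help. It assumes that the installation's papi_component_avail lists components in the same "Name: <component> ..." layout as in my output above; I have not verified that for a build that contains cray_cuda. The helper name has_component is made up for this example.

```shell
# Hypothetical helper (not part of PAPI): succeeds if papi_component_avail
# output on stdin lists a component whose name is exactly $1.
# The exact match means "cuda" does not match "cray_cuda".
has_component() {
    awk -v want="$1" '
        /Name:/ {
            n = $0
            sub(/^.*Name:[ \t]*/, "", n)
            split(n, f, /[ \t]+/)
            if (f[1] == want) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage:
#   ./papi_component_avail | has_component cuda      && echo "cuda is compiled in"
#   ./papi_component_avail | has_component cray_cuda && echo "cray_cuda is compiled in"
```

Running both checks against your own build and against the system-provided PAPI would show which library each component name comes from.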
Thanks,
Heike