Help papi cuda component disabled


Buket Benek Gursoy

Mar 17, 2020, 11:27:37 AM
to ptools-...@icl.utk.edu
Dear Sir/Madam,

I am having an issue installing PAPI with the cuda component on our system, which runs CentOS Linux release 7.3.1611 and has 2 Quadro RTX 5000 GPUs.

I tried both as root and as a regular user, but the cuda component is not available after the install. I followed these steps:
$ module load cuda/10.1.243 gcc/8.2.0
$ cd papi
$ git pull 
$ cd src
$ ./configure --prefix=$PAPIDIR --with-components="cuda nvml rapl" 
$ export PAPI_CUDA_ROOT=$CUDADIR
$ make
$ make install

When I check the list of components with papi_component_avail, I get:
$ papi_component_avail
Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
   \-> Disabled: 
Name:   nvml                    NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)
Name:   rapl                    Linux RAPL energy measurements

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 224, Preset: 56, Counters: 10
                                PMUs supported: ix86arch, perf, perf_raw, hsw_ep

Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
                                Native: 761, Preset: 0, Counters: 97
                                PMUs supported: rapl, hswep_unc_cbo0, hswep_unc_cbo1, hswep_unc_cbo2, hswep_unc_cbo3
                                                hswep_unc_cbo4, hswep_unc_cbo5, hswep_unc_cbo6, hswep_unc_cbo7, hswep_unc_ha0
                                                hswep_unc_imc0, hswep_unc_imc1, hswep_unc_imc2, hswep_unc_imc3, hswep_unc_imc4
                                                hswep_unc_pcu, hswep_unc_qpi0, hswep_unc_qpi1, hswep_unc_ubo, hswep_unc_r2pcie
                                                hswep_unc_r3qpi0, hswep_unc_r3qpi1, hswep_unc_sbo0, hswep_unc_sbo1

Name:   nvml                    NVML provides the API for monitoring NVIDIA hardware (power usage, temperature, fan speed, etc)
                                Native: 36, Preset: 0, Counters: 36

Name:   rapl                    Linux RAPL energy measurements
                                Native: 28, Preset: 0, Counters: 28


For your information, profiling support in the nvidia kernel module was enabled and standard users were given access to profiling with:
echo -e "options nvidia "NVreg_RestrictProfilingToAdminUsers=0"" > /etc/modprobe.d/nvidia.conf

I would really appreciate it if you could help me debug this issue.

Regards,
Buket

Buket Benek Gursoy, Ph.D.
Computational Scientist
Irish Centre for High-End Computing  (ICHEC) - www.ichec.ie
7th Floor, The Tower, Trinity Technology & Enterprise Campus,
Grand Canal Quay, Dublin 2, Ireland




Anthony Castaldo

Mar 17, 2020, 1:31:24 PM
to Buket Benek Gursoy, ptools-perfapi
Hi Dr. Gursoy,

I will be looking into this.

I may have questions, or debug code for you to run.

-Tony





--
You received this message because you are subscribed to the Google Groups "ptools-perfapi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ptools-perfap...@icl.utk.edu.
To view this discussion on the web visit https://groups.google.com/a/icl.utk.edu/d/msgid/ptools-perfapi/DBCF5823-8245-4AC5-9535-14FA530EEC8F%40ichec.ie.

Anthony Castaldo

Mar 23, 2020, 10:55:44 AM
to Buket Benek Gursoy, ptools-perfapi
Hi Dr. Gursoy;

I am working on a debug version of the component, but one thing occurred to me: on some systems 'module load cuda' sets $CUDA_DIR, on others $CUDADIR.
Can you double check that your PAPI_CUDA_ROOT is actually pointing at a directory?
'echo $PAPI_CUDA_ROOT'
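
A slightly stricter version of that check, as a minimal sketch: the function name check_cuda_root is hypothetical, and it only tests that the variable is set and names an existing directory, not that CUPTI is actually present under it.

```shell
# Hypothetical helper (not part of PAPI): succeed only if the given path
# is non-empty and names an existing directory.
check_cuda_root() {
    root="$1"
    if [ -z "$root" ]; then
        echo "PAPI_CUDA_ROOT is empty or unset" >&2
        return 1
    fi
    if [ ! -d "$root" ]; then
        echo "PAPI_CUDA_ROOT=$root is not a directory" >&2
        return 1
    fi
    echo "PAPI_CUDA_ROOT=$root looks like a directory"
}

# Usage: check_cuda_root "$PAPI_CUDA_ROOT"
```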

I will send you a replacement for papi/src/components/cuda/linux-cuda.c in a few hours, with instructions.
Right now I can't see anything in the code that would allow the component to be disabled without any
explanation why, and I can't match your particular equipment, so I will have to track it down with debug printfs.

-Tony


 

Anthony Castaldo

Mar 23, 2020, 11:44:32 AM
to Buket Benek Gursoy, ptools-perfapi
Dr. Gursoy,

Attached is linux-cuda.c. Please copy it over papi/src/components/cuda/linux-cuda.c.
Also, from papi/src, 'touch utils/papi_component_avail.c' to ensure it will be recompiled.
Then, from papi/src, execute 'make'; it should recompile linux-cuda.c and rebuild the library and papi_component_avail.

This version contains stderr checkpoints for our component initialization code, basically around every decision point. Some are inside of loops, so a lot can be output.

I was wrong a few minutes ago: I realized we do have some exit points inside macros that wouldn't set a reason for the component to be disabled. That is something we should address in future development.

But for now, looking at the checkpoints should help. These will look something like this:

_cuda_init_component:930 Checkpoint.
_cuda_linkCudaLibraries:286 Checkpoint.
_cuda_add_native_events:470 Checkpoint.

Basically 'routine_name:line# Checkpoint'.

There were 2078 of these when I tested it with a GPU to find. I suggest you redirect them to a file, since they are on stderr, something like:
papi/src/utils/papi_component_avail 2>errors.txt

After it runs, send me the errors.txt file.
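
To see where initialization stopped without reading the whole file, the last checkpoint line can be pulled out. This is a minimal sketch, assuming the checkpoint lines end in 'Checkpoint.' exactly as in the samples above; the function name last_checkpoint is hypothetical.

```shell
# Hypothetical helper: print the last 'routine_name:line# Checkpoint.' line
# found in the given file (for example errors.txt).
last_checkpoint() {
    grep 'Checkpoint\.$' "$1" | tail -n 1
}

# Usage: last_checkpoint errors.txt
```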
Hopefully in this round we can narrow this down to a failure point, and then in a subsequent round with more detailed output figure out more exactly what went wrong.

(After we are done testing, you can restore the original linux-cuda.c by doing 'git checkout linux-cuda.c' from the papi/src/components/cuda/ directory, but I might want to send you a replacement to use that circumvents whatever error we are experiencing.)

Let me know if any of that isn't clear.

-Tony



On Mon, Mar 23, 2020 at 10:09 AM Buket Benek Gursoy <buket....@ichec.ie> wrote:
Dear Tony,

Thank you very much for looking into this.

I manually exported the CUDADIR and PAPIDIR variables before using them. I double checked anyway, and PAPI_CUDA_ROOT is pointing to /ichec/packages/cuda/10.1.243/ on our system.

I will try the new .c file when you send it and follow the instructions given.

Regards,
Buket





linux-cuda.c

Anthony Castaldo

Mar 23, 2020, 2:44:20 PM
to Buket Benek Gursoy, ptools-perfapi
Okay, Buket!

That narrows it down to a line (546), about getting the max domains on the device, which apparently failed.
(It was in a macro). Now I have to go read and try to remember what that means!

Attached find a new linux-cuda.c to replace that one, so we can see exactly what error code gets returned.
There will probably be just two outputs; after deleting all those printf()s, the checkpoint is on line 499.

Same instructions as before; copy it over, touch papi_component_avail, and make.

-Tony

On Mon, Mar 23, 2020 at 1:18 PM Buket Benek Gursoy <buket....@ichec.ie> wrote:
Dear Tony,

Thank you very much for the explanation. 

I attach the error and output file below.

Best Regards,
Buket



<linux-cuda.c>
linux-cuda.c

Anthony Castaldo

Mar 24, 2020, 10:24:50 AM
to Buket Benek Gursoy, ptools-perfapi
Hi Buket,

From the online manual (https://docs.nvidia.com/cupti/Cupti/modules.html) I get:
"An unknown internal error has occurred. Legacy CUPTI Profiling is not supported on devices with Compute Capability 7.5 or higher (Turing+).
Using this error to specify this case and differentiate it from other errors."

Then here (https://developer.nvidia.com/cuda-gpus#compute) we find the Quadro RTX 5000 has a Compute Capability of 7.5.

So bad news, PAPI does not support it.
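
The rule quoted above can be written as a small check. This is a minimal sketch, assuming the compute capability is supplied by hand in major.minor form (for example from the NVIDIA table linked above); the function name legacy_cupti_supported is hypothetical.

```shell
# Hypothetical helper: succeed if a device with the given compute capability
# (must be written as major.minor, e.g. 7.0) is below 7.5, the threshold
# named in the CUPTI documentation quoted above.
legacy_cupti_supported() {
    major="${1%%.*}"   # text before the first dot
    minor="${1#*.}"    # text after the first dot
    if [ "$major" -lt 7 ]; then
        return 0
    fi
    if [ "$major" -eq 7 ] && [ "$minor" -lt 5 ]; then
        return 0
    fi
    return 1
}

# Usage, with the 7.5 value found for the Quadro RTX 5000:
if legacy_cupti_supported 7.5; then
    echo "legacy CUPTI profiling expected to work"
else
    echo "legacy CUPTI profiling not supported"
fi
```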

Unfortunately, we are CUPTI centric in our component, so I'd probably have to learn the new paradigm and write a new NVIDIA component for PAPI to support your GPUs.

-Tony




On Mon, Mar 23, 2020 at 8:02 PM Buket Benek Gursoy <buket....@ichec.ie> wrote:
Great! Thanks a lot. This is the new error:

_cuda_add_native_events:499 Checkpoint.
Line 502 CUPTI_CALL macro '(*cuptiDeviceGetNumEventDomainsPtr) (mydevice->cuDev, &mydevice->maxDomains)' failed with error #00000026='CUPTI_ERROR_LEGACY_PROFILER_NOT_SUPPORTED'.
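
When errors.txt is long, the error name can be extracted directly. This is a minimal sketch, assuming failure lines have exactly the shape shown above ("failed with error #<number>='<NAME>'"); the function name cupti_error_names is hypothetical.

```shell
# Hypothetical helper: print the quoted error name from every
# "failed with error #...='NAME'" line in the given file.
cupti_error_names() {
    sed -n "s/.*failed with error #[0-9A-Fa-f]*='\([A-Z_]*\)'.*/\1/p" "$1"
}

# Usage: cupti_error_names errors.txt
```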

I am also going to search for where this error is coming from. If I find a solution, I will let you know; in the meantime I will wait for suggestions from you.

Regards,
Buket

Anthony Castaldo

Mar 24, 2020, 6:40:23 PM
to Buket Benek Gursoy, ptools-perfapi
Buket,

Yes, sorry. I'm not sure what the schedule is; I have some code to get done by the end of this month, and I don't know what the development priorities are after that.

-Tony

On Tue, Mar 24, 2020 at 4:55 PM Buket Benek Gursoy <buket....@ichec.ie> wrote:
Dear Tony,

Thank you very much for looking into the issue.

This is unfortunate. I assume you won’t be able to prioritise such an implementation soon.

I’ll talk with my team to see if we can source older hardware with an earlier compute capability until we have PAPI with 7.5+ support.

Regards,
Buket


Buket Benek Gursoy

Mar 25, 2020, 4:00:25 PM
to Anthony Castaldo, ptools-perfapi
Dear Tony,

errors.txt
output.txt


Kaufmann, Steve

May 18, 2020, 4:14:34 PM
to Anthony Castaldo, ptools-perfapi
Since the AMD fam17h zen1 and zen2 event tables were split into separate tables, the PAPI presets for zen2 have not been updated and many no longer apply. The papi_events.csv file needs updating so that custom presets for zen2 are defined.

Thanks,
Steve

Vince Weaver

Jul 6, 2020, 12:51:27 PM
to Kaufmann, Steve, Anthony Castaldo, ptools-perfapi
It looks like the zen2 and zen1 events are quite different, so this is not
a quick fix. Do you happen to have an idea of what the new events should
be, especially the cache ones?

Vince

Anthony Danalis

Jul 17, 2020, 12:15:45 PM
to Kaufmann, Steve, ptools-perfapi
I updated the list for Zen2 by removing events that do not exist and adding some that were missing. All changes have been tested on the AMD hardware we have access to, but it would be greatly appreciated if you could test them on your hardware. The changes are currently in this pull request:

thanks,
Anthony

