rocp_sdk component segfaults

Kaufmann, Steve

Apr 29, 2025, 2:44:34 PM
to perfap...@icl.utk.edu, jag...@icl.utk.edu
Testing the rocp_sdk component with /opt/rocm-6.3.0 I get this message and segfault:

E20250429 10:34:57.569977 140232389658560 agent.cpp:969] size of rocprofiler agent struct used by caller is ABI-incompatible with rocprofiler_agent_v0_t in rocprofiler
srun: error: pinoak0005: task 0: Segmentation fault

after setting export PAPI_ROCP_SDK_ROOT=/opt/rocm-6.3.0. Setting it to /opt/rocm-6.4.0 allowed it to work. I also got the message and segfault when setting

export PAPI_ROCP_SDK_LIB=/opt/rocm-6.3.3/lib/librocprofiler-sdk.so.0

The traceback looked like:

(gdb) where
#0  0x000000000045bc75 in papi_rocpsdk::populate_event_list ()
    at components/rocp_sdk/sdk_class.cpp:593
#1  0x000000000045e7a9 in rocprofiler_sdk_init ()
    at components/rocp_sdk/sdk_class.cpp:1112
#2  0x0000000000458db7 in rocp_sdk_init_private ()
    at components/rocp_sdk/rocp_sdk.c:210
#3  0x0000000000459400 in check_n_initialize ()
    at components/rocp_sdk/rocp_sdk.c:450
#4  0x00000000004592a2 in rocp_sdk_ntv_enum_events (event_code=0x7ffe460ac800, 
    modifier=1) at components/rocp_sdk/rocp_sdk.c:382
#5  0x000000000040bf01 in PAPI_enum_cmp_event (EventCode=0x7ffe460ac85c, 
    modifier=1, cidx=1) at papi.c:1958
#6  0x000000000040537b in force_cmp_init (cid=1) at papi_component_avail.c:196
#7  0x000000000040504e in main (argc=1, argv=0x7ffe460ac9e8)
    at papi_component_avail.c:122

This was using the papi_component_avail utility. Same thing for papi_native_avail.

Thanks, Steve



Kaufmann, Steve

Apr 29, 2025, 2:44:38 PM
to perfap...@icl.utk.edu, jag...@icl.utk.edu
Following up on this: I used ROCm 6.4.0 to build PAPI and the component, so if one expected some kind of backward compatibility, it is not there. I have not built with 6.3.0 and tested with 6.3.0 or later at runtime, so I cannot attest to that compatibility.

Thanks, Steve



Kaufmann, Steve

Apr 29, 2025, 2:44:43 PM
to perfap...@icl.utk.edu, jag...@icl.utk.edu
Another follow up on the rocp_sdk component logic.

The "librocprofiler-sdk.so" file currently cannot be found via LD_LIBRARY_PATH using dlopen, even when, for example, "/opt/rocm-6.4.0/lib" appears in LD_LIBRARY_PATH (our modules set this).

In the "obtain_function_pointers" function the env variables for PAPI_ROCP_SDK_LIB/_ROOT specify full paths to find the DSO.

There should be another dlopen call (if the previous two dlopen calls using the full path fail) that just looks for "librocprofiler-sdk.so", something analogous to what is done in the function "load_hsa_sym" to find the hsa DSO file.

This would avoid requiring the user to set the PAPI_ROCP_SDK_* env var to find the rocprofiler* symbols (not ideal).
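
The lookup order proposed above could be sketched roughly as follows. This is only an illustration, not the code in sdk_class.cpp: the helper name open_rocprofiler_sdk, the order in which the two variables are tried, and the "<root>/lib/<soname>" layout are assumptions. The final dlopen call passes a bare file name, which lets the dynamic linker search its default locations, including LD_LIBRARY_PATH.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not PAPI's actual code). Tries, in order:
 *   1. the full path in PAPI_ROCP_SDK_LIB,
 *   2. <PAPI_ROCP_SDK_ROOT>/lib/<soname>,
 *   3. the bare soname, so the dynamic linker's default search
 *      (LD_LIBRARY_PATH, ld.so cache, system dirs) is used.
 * For the rocp_sdk case the caller would pass "librocprofiler-sdk.so".
 * Returns the dlopen handle, or NULL if every attempt fails. */
void *open_rocprofiler_sdk(const char *soname)
{
    char path[4096];
    void *handle = NULL;

    const char *lib = getenv("PAPI_ROCP_SDK_LIB");
    if (lib != NULL)
        handle = dlopen(lib, RTLD_NOW | RTLD_GLOBAL);

    const char *root = getenv("PAPI_ROCP_SDK_ROOT");
    if (handle == NULL && root != NULL) {
        int n = snprintf(path, sizeof(path), "%s/lib/%s", root, soname);
        if (n > 0 && (size_t)n < sizeof(path))
            handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    }

    /* Fallback: no directory component, so dlopen searches default paths. */
    if (handle == NULL)
        handle = dlopen(soname, RTLD_NOW | RTLD_GLOBAL);

    return handle;
}
```

Note that in this sketch a bad value in either variable silently falls through to the default search; whether that is desirable is a separate design question.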

Also, if ROCP_SDK_LIB is a valid env var, it is not checked when looking for the librocprofiler-sdk.so file.

Unless I am missing something ;-)

Steve


Anthony Danalis

Apr 29, 2025, 3:02:08 PM
to Kaufmann, Steve, perfap...@icl.utk.edu, jag...@icl.utk.edu
Steve, thanks for doing all this testing. We really appreciate it.
Regarding rocm-6.3.0, its API/ABI differs from that of the more recent versions, which is why it crashes. 6.3.1 is the first version that partially works (only dispatch mode), and 6.3.2 is the first version that should work fine. I will update the README file to state that explicitly.
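
For readers who want to guard against this in their own scripts or tools, the version thresholds stated above can be encoded in a small check. This is a minimal sketch under stated assumptions: the function name, the enum, and the idea of passing the version as a "major.minor.patch" string are all hypothetical, and nothing here is part of PAPI or ROCprofiler-SDK.

```c
#include <stdio.h>

/* Hypothetical classification of ROCm versions for the rocp_sdk component,
 * based only on the thresholds stated in this thread. */
enum rocm_support {
    ROCM_UNKNOWN       = -1, /* version string could not be parsed */
    ROCM_UNSUPPORTED   = 0,  /* older than 6.3.1 (e.g. 6.3.0)       */
    ROCM_DISPATCH_ONLY = 1,  /* 6.3.1: only dispatch mode works     */
    ROCM_FULL          = 2   /* 6.3.2 and later                     */
};

/* Parses "major.minor.patch" and compares it against the thresholds.
 * Assumes minor and patch are each below 100 so the packed value orders
 * versions correctly. */
enum rocm_support classify_rocm_version(const char *version)
{
    int major, minor, patch;

    if (version == NULL ||
        sscanf(version, "%d.%d.%d", &major, &minor, &patch) != 3)
        return ROCM_UNKNOWN;

    long packed = major * 10000L + minor * 100L + patch;
    if (packed >= 60302L)
        return ROCM_FULL;
    if (packed == 60301L)
        return ROCM_DISPATCH_ONLY;
    return ROCM_UNSUPPORTED;
}
```

With this sketch, "6.3.0" is classified as unsupported and "6.4.0" as fully supported, which matches the behavior reported at the start of the thread.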

Regarding adding a dlopen without a path, you are right, I will add it. Regarding the variable "ROCP_SDK_LIB", why would that variable be set? If the user is willing to set a variable to point us to the path, they could set "PAPI_ROCP_SDK_LIB", which we do check. Maybe I'm the one missing something. :-)

Thanks,
Anthony


--
You received this message because you are subscribed to the Google Groups "perfapi-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to perfapi-deve...@icl.utk.edu.
To view this discussion visit https://groups.google.com/a/icl.utk.edu/d/msgid/perfapi-devel/DM4PR84MB31278AAE9963E91B33A71F2CEC802%40DM4PR84MB3127.NAMPRD84.PROD.OUTLOOK.COM.

Kaufmann, Steve

Apr 29, 2025, 4:11:34 PM
to Anthony Danalis, perfap...@icl.utk.edu, jag...@icl.utk.edu
The only reason I mention "ROCP_SDK_LIB" is that the error message refers to it, which suggested to me that it might be an env var set by some "rocm" module. It is also in the "ROCm" namespace, whereas the PAPI ones are obviously in the PAPI namespace. I also see things like:

roc_profiler.c:    char *pathname = getenv("HSA_TOOLS_LIB");
roc_profiler.c:    const char *rocp_mode = getenv("ROCP_HSA_INTERCEPT");
roc_profiler.c:    char *hsa_tools_lib = getenv("HSA_TOOLS_LIB");
roc_profiler.c:    char *rocp_metrics = getenv("ROCP_METRICS");

which implies that someone is providing env vars in the ROCm namespace. We could scan the rocp source code for "getenv" and see what it expects to be available at runtime. If PAPI does not check for ROCP_SDK_LIB, then it should be removed from the error message.

Thanks, Steve



Kaufmann, Steve

Apr 30, 2025, 11:22:31 AM
to Anthony Danalis, Kaufmann, Steve, perfap...@icl.utk.edu, jag...@icl.utk.edu
Another issue I ran across. When I try to configure for rocm_smi component using rocm 6.4.0 I get:
...
In file included from /opt/rocm/include/rocm_smi/rocm_smi.h:57,
                 from components/rocm_smi/rocs.c:4:
/opt/rocm/include/rocm_smi/kfd_ioctl.h:26:10: fatal error: libdrm/drm.h: No such file or directory
   26 | #include <libdrm/drm.h>
      |          ^~~~~~~~~~~~~~
compilation terminated.
...
But this seems like a ROCm issue - I haven't figured out a way around it. Should papi configure "libdrm" for the build?

Thanks, Steve



Kaufmann, Steve

Apr 30, 2025, 3:26:13 PM
to Anthony Danalis, perfap...@icl.utk.edu, jag...@icl.utk.edu
I "fixed" this by creating a

libdrm/drm.h

(empty) file in the rocm_smi directory and changed the Rules file to look in the components/rocm_smi directory when compiling. The compilation was successful (since no source references anything out of the drm.h file).

This is one "solution" at least to get by. The "libdrm/drm.h" could be anywhere in the papi source tree, and maybe even created on-the-fly out the rocm_smi directory itself, then removed afterwards. We just need that empty (just a touch drm.h) file to satisfy its reference from within rocm's kfd_ioctl.h file (it seems unfair to require DRM to be installed for the build).

Steve



Treece Burgess

May 2, 2025, 9:09:52 AM
to Kaufmann, Steve, Anthony Danalis, perfap...@icl.utk.edu, jag...@icl.utk.edu
Hello Steve,

I do not encounter this issue during compile time when using ROCm 6.4.0 on a system with an MI300A. I also verified that the rocm_smi component was active with the PAPI utilities. I went ahead and looked in /opt/rocm/include/rocm_smi/kfd_ioctl.h and do not see #include <libdrm/drm.h>.

A couple questions:
  • What AMD device are you using?
  • Can you share how you are setting PAPI_ROCMSMI_ROOT and ./configure?
Best wishes,

Treece

Anthony Danalis

May 5, 2025, 5:03:10 PM
to Kaufmann, Steve, perfap...@icl.utk.edu, jag...@icl.utk.edu
Hi Steve,

I created PR#359, which (a) allows dlopen() to check the default paths and (b) removes the reference to ROCP_SDK_LIB. This variable does not come from ROCprofiler-SDK (I have confirmed with AMD); it is the old version of the variable we introduced, which we later changed to PAPI_ROCP_SDK_LIB so that it is in the PAPI namespace, since only we use it.

It would be great if you could test it before we merge it.

Thanks,
Anthony

Kaufmann, Steve

May 5, 2025, 6:18:54 PM
to Anthony Danalis, perfap...@icl.utk.edu, jag...@icl.utk.edu
I'll grab the files changed and try it out tomorrow....Thanks, Steve



Kaufmann, Steve

May 5, 2025, 7:00:41 PM
to Anthony Danalis, perfap...@icl.utk.edu, jag...@icl.utk.edu
Quick testing shows that the DSO is found via LD_LIBRARY_PATH (when the rocm component is loaded). So that's good.

The issue I see is that when PAPI_ROCP_SDK_LIB is set to something bogus, the DSO is still found via LD_LIBRARY_PATH. I don't think this is how it should behave. If a user explicitly sets an env var like this and the file does not exist or is not accessible, then it should be an error and not fall through to some default. Otherwise it may give the user the wrong impression that the requested file was found. So when I set PAPI_ROCP_SDK_LIB to something like "xyz" I see no error and still get component info:
...
PAPI_ROCP_SDK_LIB=xyz srun ./papi_component_avail
...
Name:   rocp_sdk                GPU events and metrics via AMD ROCprofiler-SDK
                                Native: 529, Preset: 0, Counters: 529

PAPI_ROCP_SDK_ROOT behaves more or less correctly:
...
PAPI_ROCP_SDK_ROOT=xyz srun ./papi_component_avail
...
Name:   rocp_sdk                GPU events and metrics via AMD ROCprofiler-SDK
   \-> Disabled: xyz/lib/libhsa-runtime64.so: cannot open shared object file: No such file or directory

I think PAPI_ROCP_SDK_LIB should behave similarly. Thanks for looking into this!
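
The behavior requested here, where an explicitly set variable must either work or produce an error, could look roughly like the sketch below. It is an illustration only: the helper name open_sdk_strict, its signature, and the wording of the error message are assumptions, not the code that ended up in PAPI.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not PAPI's actual code).
 * If PAPI_ROCP_SDK_LIB is set, only that path is tried; a failure is
 * reported in 'err' instead of falling through to the default search.
 * If it is unset, the bare soname is handed to dlopen so the dynamic
 * linker's default paths (including LD_LIBRARY_PATH) are searched. */
void *open_sdk_strict(const char *soname, char *err, size_t errlen)
{
    const char *lib = getenv("PAPI_ROCP_SDK_LIB");
    void *handle;

    if (lib != NULL) {
        handle = dlopen(lib, RTLD_NOW | RTLD_GLOBAL);
        if (handle == NULL)
            snprintf(err, errlen,
                     "Invalid path in PAPI_ROCP_SDK_LIB: %s", lib);
        return handle; /* never fall back when the user was explicit */
    }

    handle = dlopen(soname, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        const char *msg = dlerror();
        snprintf(err, errlen, "%s", msg != NULL ? msg : "dlopen failed");
    }
    return handle;
}
```

The design choice is that an explicit user setting takes precedence over convenience: a typo in the variable surfaces immediately as a disabled component with a clear reason, rather than being masked by a library found elsewhere.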

Steve



Anthony Danalis

May 5, 2025, 9:51:37 PM
to Kaufmann, Steve, perfap...@icl.utk.edu
I pushed an update to the PR to check the validity of PAPI_ROCP_SDK_LIB.

Thanks,
Anthony

Kaufmann, Steve

May 6, 2025, 9:46:50 AM
to Anthony Danalis, perfap...@icl.utk.edu
If I got the latest version of the file, the following still falls through to use the default:

PAPI_ROCP_SDK_LIB=xyz srun ./papi_component_avail

so there are no error messages. I'm making sure I've gotten the latest version of the file. I've attached what I am using.

Thanks, Steve



sdk_class.cpp

Kaufmann, Steve

May 6, 2025, 10:04:14 AM
to Anthony Danalis, Kaufmann, Steve, perfap...@icl.utk.edu
I made the following updates to sdk_class.cpp, which induce a failure when PAPI_ROCP_SDK_ROOT is set to a bad filepath. This is one way to get the behavior I seek, but you can address it however you see fit. Thanks, Steve

201c201,206
<         dllHandle = dlopen(pathname, RTLD_NOW | RTLD_GLOBAL);
---
>         if ((dllHandle = dlopen(pathname, RTLD_NOW | RTLD_GLOBAL)) == nullptr) {
>            std::string err_str = std::string("Invalid RocProfiler SDK library path: ")+pathname;
>            set_error_string(err_str);
>            ret_val = err_str.c_str();
>            goto fn_fail;
> }




sdk_class.cpp

Anthony Danalis

May 6, 2025, 11:22:26 AM
to Kaufmann, Steve, perfap...@icl.utk.edu
Hi Steve,

I added the check you asked for and did some additional cleanup.

Thanks,
Anthony

Kaufmann, Steve

May 6, 2025, 11:43:01 AM
to Anthony Danalis, perfap...@icl.utk.edu
Yup, tests are good. Ship it! Thanks so much for your time. Steve

...
Name:   rocp_sdk                GPU events and metrics via AMD ROCprofiler-SDK
   \-> Disabled: Invalid path in PAPI_ROCP_SDK_LIB: xyz
...



Treece Burgess

May 12, 2025, 3:51:05 PM
to Kaufmann, Steve, perfap...@icl.utk.edu
Hello Steve,

The rocm_smi component does not explicitly request libdrm. We only include rocm_smi.h. The problem you are facing seems to come from a mismatch in packages and distributions, rather than from PAPI. Specifically, the offending include, #include <libdrm/drm.h>, does not appear in $ROCM_ROOT/include/rocm_smi/kfd_ioctl.h on any computers I have checked. However, I saw that the Fedora package libdrm-devel does provide the directory libdrm and the header file drm.h under it. Is installing this package an option for you?

Best wishes,

Treece


Kaufmann, Steve

May 13, 2025, 9:14:38 AM
to Treece Burgess, perfap...@icl.utk.edu
Thanks Treece!

This is unfortunate. Somewhere, at some point, ROCm is including libdrm/drm.h in one of its headers (note that the ROCm header that does so does NOT reference anything from drm.h(!)). I suspect a later version of ROCm 6.4.0 removes that include (or the versions you've noted are older and WILL eventually have a reference to libdrm), but the fact that it does appear in a ROCm 6.4.0 install compromises the build of a PAPI component. While this is not technically a PAPI issue, PAPI could work around it in order to successfully build the rocm_smi component. I'm not sure whether this matches the philosophy of PAPI development or whether you'd want to spend the time.

I guess we'll not be able to configure for the rocm_smi component until upstream ROCm removes the inclusion of this header - it just isn't necessary. I hesitate to pull in yet another package into our build container but we'll have to if the consensus is that this component is important for end-users.

I appreciate your time looking into this! Steve



Kaufmann, Steve

May 13, 2025, 4:38:45 PM
to Treece Burgess, perfap...@icl.utk.edu
A very minor thing, the following files could be added to .gitignore so they don't show up with a git status after a build:

        src/papi_components_config_event_defs.h
        src/papi_cuda_std_event_defs.h

Steve

Daniel Barry

May 13, 2025, 7:43:47 PM
to perfapi-devel, steven....@hpe.com, perfap...@icl.utk.edu, tbur...@icl.utk.edu
Hi Steve,

Thank you for bringing this to our attention. I have created PR #370 to address this.

Please let me know if it works on your end.

Thank you,
Daniel

Kaufmann, Steve

May 14, 2025, 9:38:01 AM
to Daniel Barry, perfapi-devel, tbur...@icl.utk.edu
Good to go. Thanks!


From: Daniel Barry <dba...@vols.utk.edu>
Sent: Tuesday, May 13, 2025 6:43 PM
To: perfapi-devel <perfap...@icl.utk.edu>
Cc: Kaufmann, Steve <steven....@hpe.com>; perfap...@icl.utk.edu <perfap...@icl.utk.edu>; tbur...@icl.utk.edu <tbur...@icl.utk.edu>
Subject: Re: minor change to .gitignore
 

Daniel Barry

May 14, 2025, 3:31:36 PM
to perfapi-devel, steven....@hpe.com, tbur...@icl.utk.edu, Daniel Barry
Hi Steve,

Thank you for testing these changes. As a quick update, the changes I introduced in the PR caused other buggy behavior with some of the 'make' commands.

I wanted to loop back on this. You mentioned that 'git status' shows the files:
- src/papi_components_config_event_defs.h
- src/papi_cuda_std_event_defs.h

In addition to these files, I see other files resulting from a build, such as:
- src/components_config.h
in addition to various object files.

Are the files you listed causing issues or otherwise being treated differently from, say, "src/components_config.h" on your end?

Thank you for your help.
Daniel