Re: ROCM a mess


Anthony Castaldo

Sep 9, 2021, 10:17:18 AM
to Kaufmann, Steve, perfap...@icl.utk.edu, jag...@icl.utk.edu
Steve,

We worked fine on 4.0, then bugs were introduced (by AMD) in 4.1 and 4.2, which were supposed to be fixed in 4.3, but I never got to test with 4.3 before I moved to the Linear Algebra group.

So no, I never did build and test with 4.3.

In 4.1 and 4.2, one of their initialization functions (that PAPI must use) was failing on us with a generic non-specific error (0x1000).
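
For reference, 0x1000 is 4096 in decimal, the same value reported elsewhere in this thread as "hsa_init() failed with error 4096". In the HSA runtime header (hsa.h) that value is HSA_STATUS_ERROR, the generic error status. A minimal sketch for printing a status in both forms so the two log styles can be matched; the helper names are hypothetical and this is not PAPI code:

```c
#include <stdio.h>

/* Subset of the status values from the HSA runtime header hsa.h,
 * copied here so the sketch compiles without the ROCm headers. */
enum {
    SKETCH_HSA_STATUS_SUCCESS = 0x0,
    SKETCH_HSA_STATUS_ERROR   = 0x1000  /* generic error, 4096 decimal */
};

/* Hypothetical helper: true if the status is the generic error. */
static int is_generic_error(unsigned int status)
{
    return status == SKETCH_HSA_STATUS_ERROR;
}

/* Hypothetical helper: write the status as "decimal (0xhex)" into buf.
 * Returns what snprintf returns (the number of characters needed). */
static int format_status(unsigned int status, char *buf, size_t len)
{
    return snprintf(buf, len, "%u (0x%x)", status, status);
}
```

Calling format_status(4096, buf, sizeof buf) fills buf with "4096 (0x1000)".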

PAPI itself doesn't throw any exceptions, so this error sounds like something wrong in AMD's code in 4.3.

PAPI does of course use rocprofiler to get event values. This has been tricky in the past: AMD requires various environment variables to be set for rocprofiler to work correctly, and which variables those are and how to set them has changed on us without warning.
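
Since the required variables have changed between releases, one way to catch a misconfigured environment early is to check for them before initializing anything. A minimal sketch, assuming the caller supplies the list of names; the function name is hypothetical and this thread does not say which variables a given ROCm release needs:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: count how many of the n named environment
 * variables are unset or empty, and report each one on stderr. */
static int count_missing_env(const char *const *names, size_t n)
{
    int missing = 0;
    for (size_t i = 0; i < n; i++) {
        const char *value = getenv(names[i]);
        if (value == NULL || value[0] == '\0') {
            fprintf(stderr, "warning: %s is not set\n", names[i]);
            missing++;
        }
    }
    return missing;
}
```

A caller would pass whatever names its ROCm release documents (for example ROCP_METRICS or HSA_TOOLS_LIB, given here only as assumed examples) and could skip the rocm component when the count is nonzero.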

-Tony

On Thu, Sep 9, 2021 at 8:50 AM Kaufmann, Steve <steven....@hpe.com> wrote:
We are having all sorts of issues with the ROCM component. This is probably more an AMD ROCM/HIP/HSA issue than anything else, but the latest problem appears when building and executing with ROCM 4.3.0. Commands such as papi_component_avail and papi_native_avail dump core when run on an MI60 node with:

terminate called after throwing an instance of 'rocprofiler::util::exception'
  what():  OnLoad(), code objects tracking without intercept mode enabled

Have you been able to build and test with the latest ROCM? Thanks, Steve

Anthony Danalis

Sep 15, 2021, 12:13:57 PM
to Kaufmann, Steve, perfap...@icl.utk.edu
Steve, I can reproduce the following behaviors:

- When /opt/rocm-4.0.0/lib is in my LD_LIBRARY_PATH, then the PAPI
utilities work as expected.
- When /opt/rocm-4.1.1/lib or /opt/rocm-4.2.0/lib is in my
LD_LIBRARY_PATH, the rocm component is disabled and the error
message is: ROCM hsa_init() failed with error 4096.
- When /opt/rocm-4.3.0/lib or /opt/rocm-4.4.0/lib is in my
LD_LIBRARY_PATH, the utilities dump core, just as you
experienced. For the core dump, gdb shows the following
backtrace:
#0  0x0000155554d9c70f in raise () from /lib64/libc.so.6
#1  0x0000155554d86b25 in abort () from /lib64/libc.so.6
#2  0x0000155553d625e3 in ?? () from /cm/local/apps/gcc/9.2.0/lib64/libstdc++.so.6
#3  0x0000155553d6e006 in ?? () from /cm/local/apps/gcc/9.2.0/lib64/libstdc++.so.6
#4  0x0000155553d6e051 in std::terminate() () from /cm/local/apps/gcc/9.2.0/lib64/libstdc++.so.6
#5  0x0000155553d6dffb in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /cm/local/apps/gcc/9.2.0/lib64/libstdc++.so.6
#6  0x0000155554925326 in rocr::AMD::handleException() () from /opt/rocm-4.4.0/lib/libhsa-runtime64.so
#7  0x0000155554922f10 in rocr::HSA::hsa_init() [clone .cold.46] () from /opt/rocm-4.4.0/lib/libhsa-runtime64.so
#8  0x000000000041a4d3 in _rocm_init_private () at components/rocm/linux-rocm.c:733
#9  0x0000000000403244 in PAPI_get_component_info (cidx=cidx@entry=2) at papi.c:1354
#10 0x00000000004029f8 in main (argc=<optimized out>, argv=<optimized out>) at papi_component_avail.c:115

Since PAPI accesses vendor libraries through dlopen (instead of
linking against them at compile/link time), the run-time environment
plays a more important role than the environment at compile time.
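
The dlopen pattern described above can be sketched as follows. This is an illustration under stated assumptions, not PAPI's actual code: the function name resolve_init and the init_fn_t typedef are hypothetical, and the int return type stands in for the hsa_status_t enum.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Assumed shape of an init entry point such as hsa_init():
 * no arguments, integer status returned. */
typedef int (*init_fn_t)(void);

/* Hypothetical helper: load lib_name at run time and look up sym_name.
 * Returns the function pointer, or NULL on failure. On success the
 * library handle is stored in *handle_out; on failure it is NULL. */
static init_fn_t resolve_init(const char *lib_name, const char *sym_name,
                              void **handle_out)
{
    init_fn_t fn = NULL;
    const char *err;
    void *handle;

    *handle_out = NULL;
    /* The dynamic loader searches LD_LIBRARY_PATH here, at run time. */
    handle = dlopen(lib_name, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        err = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n", lib_name,
                err ? err : "unknown error");
        return NULL;
    }

    dlerror(); /* clear any stale error before dlsym */
    /* POSIX idiom for storing a dlsym result in a function pointer. */
    *(void **)(&fn) = dlsym(handle, sym_name);
    if (fn == NULL) {
        err = dlerror();
        fprintf(stderr, "dlsym(%s) failed: %s\n", sym_name,
                err ? err : "symbol not found");
        dlclose(handle);
        return NULL;
    }

    *handle_out = handle;
    return fn;
}
```

With resolve_init("libhsa-runtime64.so", "hsa_init", &handle), the library that actually gets loaded is whichever one the loader finds first, which is why switching the /opt/rocm-*/lib entry in LD_LIBRARY_PATH changes the behavior without rebuilding PAPI. On older glibc versions the program has to be linked with -ldl.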

In summary, rocm-4.0 currently works with PAPI, and we will work with
AMD to address the problems with more recent versions of rocm.



Anthony Danalis

Oct 13, 2021, 4:30:26 PM
to Kaufmann, Steve, perfapi-devel
Steve, the latest commit (629344d) should fix the problem. It would be
great if you could check it in your local environment.

thanks,
Anthony

On Wed, Sep 15, 2021 at 12:20 PM Kaufmann, Steve
<steven....@hpe.com> wrote:
>
> Thanks for the feedback, Anthony! I'll let you know of any further news RE: ROCM's behavior. Steve
>
> ________________________________
> From: Anthony Danalis <adan...@icl.utk.edu>
> Sent: Wednesday, September 15, 2021 11:13 AM
> To: Kaufmann, Steve <steven....@hpe.com>
> Cc: perfap...@icl.utk.edu <perfap...@icl.utk.edu>
> Subject: Re: [perfapi-devel] Re: ROCM a mess