Dear PAPI users,
We have a new ROCm component that we would like you to try out. If you run into any problems, please report them to us.
The component is currently in a separate development branch. You can get the component by cloning the following PAPI fork:
The new component improves on the currently distributed one in several respects. Most notably:
- it supports ROC profiler sampling mode as well as kernel intercept mode (the component currently in master only supports sampling);
- it supports single-thread as well as multi-thread monitoring (the component currently in master only supports single-thread);
- it supports counter sampling through PAPI_overflow (the component currently in master does not support counter sampling).
Once you have obtained the code, you can configure and build PAPI as follows:
$ cd papi/src
$ git checkout 2022.01.11_rocm-rewrite
$ export PAPI_ROCM_ROOT=<ROCM_INSTALL_DIR>
$ ./configure --with-components=rocm && make
The papi/src/components/rocm/tests directory contains a range of tests/examples that show how the component's different features work. Currently, there are five main test categories:
- single-thread monitoring: one thread (the main one typically) creates and manages one event set that contains events from (potentially) all GPUs in the system;
- multi-thread monitoring: the main thread spawns one worker thread per GPU and each of those creates and manages its own event set to monitor only the corresponding GPU;
- single-kernel monitoring: one thread launches only one kernel on a GPU device and monitors it;
- multi-kernel monitoring: one thread launches multiple kernels on a GPU device and monitors them. In this case the final value of the counters is accumulated by the component across the different kernel executions;
- counter sampling: one thread samples counters on a GPU device using the PAPI_overflow mechanism. Note that rocprofiler does not support counter overflow in hardware, so PAPI emulates overflow using timers.
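To make the timer-based emulation mentioned in the last item more concrete, here is a minimal sketch of the general idea: on every timer tick the counter is read in software, and a handler is called once the value has crossed the next threshold. This is only an illustration under my own assumptions, not the component's actual implementation. All names (overflow_emulator, emulator_init, emulator_tick, count_calls) are hypothetical and do not come from the PAPI sources. In a real program the handler is registered through PAPI_overflow and the timer is managed by PAPI.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type for this sketch (not PAPI's handler type). */
typedef void (*overflow_handler)(uint64_t counter_value, void *user_data);

typedef struct {
    uint64_t threshold;       /* notify once per this many counts */
    uint64_t next_trigger;    /* counter value at which to notify next */
    overflow_handler handler; /* user callback */
    void *user_data;          /* passed through to the callback */
} overflow_emulator;

void emulator_init(overflow_emulator *e, uint64_t threshold,
                   overflow_handler handler, void *user_data)
{
    e->threshold = threshold;
    e->next_trigger = threshold;
    e->handler = handler;
    e->user_data = user_data;
}

/* Meant to be called on every timer tick with a freshly read counter value.
 * Returns 1 if the handler was invoked on this tick, 0 otherwise. */
int emulator_tick(overflow_emulator *e, uint64_t value)
{
    if (e->handler == NULL || e->threshold == 0 || value < e->next_trigger)
        return 0;

    e->handler(value, e->user_data);

    /* Move the trigger to the first threshold multiple above the current
     * value, so a tick fires at most once even if the counter jumped over
     * several thresholds between two ticks. */
    e->next_trigger = value - (value % e->threshold) + e->threshold;
    return 1;
}

/* Example handler: counts how many times it has been called. */
void count_calls(uint64_t counter_value, void *user_data)
{
    (void)counter_value;
    (*(int *)user_data)++;
}
```

One consequence of this kind of emulation is that notifications arrive at timer granularity rather than at the exact count where the threshold was crossed, so the value seen by the handler is usually somewhat past the threshold.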
Each of the above tests is provided in different flavours. For example, single-thread monitoring is implemented for both sampling and intercept modes. The user can switch between the two modes through the ROCP_HSA_INTERCEPT environment variable (although the tests do that automatically through setenv). If ROCP_HSA_INTERCEPT is unset (or set to "0"), the component selects sampling mode as the default. If it is set to either "1" or "2", the component selects intercept mode. There are also test counterparts that use the PAPI high-level API (e.g. hl_intercept_single_thread_monitoring.cpp).
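The mode selection rule above can be written down as a small C sketch. It only restates the rule from this email and is not taken from the component's sources. The type and function names (sketch_mode, mode_from_value, mode_from_environment) are hypothetical, and reporting any other value as "unknown" is my assumption, since the rule above does not cover that case.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical names for this sketch; not identifiers from PAPI. */
typedef enum { MODE_SAMPLING, MODE_INTERCEPT, MODE_UNKNOWN } sketch_mode;

/* Map a ROCP_HSA_INTERCEPT value to a mode:
 *   unset (NULL) or "0" -> sampling
 *   "1" or "2"          -> intercept
 * Any other value is reported as MODE_UNKNOWN (an assumption of this
 * sketch; the behaviour for other values is not described above). */
sketch_mode mode_from_value(const char *value)
{
    if (value == NULL || strcmp(value, "0") == 0)
        return MODE_SAMPLING;
    if (strcmp(value, "1") == 0 || strcmp(value, "2") == 0)
        return MODE_INTERCEPT;
    return MODE_UNKNOWN;
}

/* Apply the same rule to the current process environment. */
sketch_mode mode_from_environment(void)
{
    return mode_from_value(getenv("ROCP_HSA_INTERCEPT"));
}
```

A test program that wants a specific mode would set the variable (for example with setenv) before the component is initialized, which is what the provided tests do.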
For the component to work properly, users have to make sure that PAPI_library_init() is called before any HIP/ROCm interface, as this call initializes the rocprofiler environment. If PAPI_library_init() can't be called before HIP/ROCm interfaces, users can still make the component work properly by setting two environment variables before running their application:
$ export HSA_TOOLS_LIB=<ROCM_INSTALL_DIR>/rocprofiler/lib/librocprofiler64.so
$ export ROCP_TOOL_LIB=<PAPI_INSTALL_DIR>/lib/libpapi.so
The first environment variable tells hsa_init() to load librocprofiler as a plugin tool. The second environment variable tells librocprofiler to load libpapi. The ROCm component in PAPI exports an OnLoadToolProp() interface that librocprofiler calls when initialized. This interface initializes the rocprofiler environment (as it would have been done if the user had called PAPI_library_init() first). An example of this initialization mechanism is provided by the hl_intercept_single_thread_monitoring.cpp test.
Sampling and intercept modes have different semantics for counters. Specifically, in sampling mode counters are GPU-wide. This affects the single-thread versus multi-thread monitoring capabilities of the component. For example, in sampling mode it is not possible for different threads to manage event sets that have events from the same GPU device (even if they are different events). In this case PAPI_add_event() will return PAPI_ECNFLT to indicate that the event set can't contain events that conflict with another event set. The user has to wait for the other event set to release the GPU first and then try again.
Intercept mode, on the other hand, does not have the above restriction, as kernels are serialized by the GPU runtime. However, users should expect their application to take longer to run.
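For the sampling-mode conflict described above, the "wait and try again" step could look roughly like the following sketch. To keep it self-contained it does not include papi.h. The status codes SKETCH_OK and SKETCH_ECNFLT are stand-ins with arbitrary values for PAPI_OK and PAPI_ECNFLT, and the callbacks, the fake_gpu type and the retry limit are all hypothetical. In a real program try_add would wrap PAPI_add_event(), and how to wait for the other event set to release the GPU is up to the application.

```c
#include <stddef.h>

/* Stand-in status codes for this sketch only. Real code would compare
 * against PAPI_OK and PAPI_ECNFLT from papi.h instead. */
enum { SKETCH_OK = 0, SKETCH_ECNFLT = -1 };

/* Hypothetical callback that tries to add the event once. */
typedef int (*add_event_fn)(void *ctx);

/* Hypothetical callback that waits between attempts (sleep, yield, ...). */
typedef void (*wait_fn)(void *ctx);

/* Try to add an event up to max_attempts times, waiting between attempts
 * while the GPU is held by another event set. Returns the last status;
 * with max_attempts <= 0 nothing is tried and SKETCH_ECNFLT is returned. */
int add_event_with_retry(add_event_fn try_add, wait_fn wait_for_release,
                         void *ctx, int max_attempts)
{
    int status = SKETCH_ECNFLT;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        status = try_add(ctx);
        if (status != SKETCH_ECNFLT)
            return status; /* success, or an error that retrying won't fix */
        if (attempt + 1 < max_attempts && wait_for_release != NULL)
            wait_for_release(ctx);
    }
    return status;
}

/* Demo stand-in for a GPU held by another event set for a few attempts. */
typedef struct {
    int busy_attempts_left; /* attempts that will still report a conflict */
    int waits;              /* how many times the waiter was called */
} fake_gpu;

int fake_try_add(void *ctx)
{
    fake_gpu *gpu = ctx;
    if (gpu->busy_attempts_left > 0) {
        gpu->busy_attempts_left--;
        return SKETCH_ECNFLT;
    }
    return SKETCH_OK;
}

void fake_wait(void *ctx)
{
    fake_gpu *gpu = ctx;
    gpu->waits++;
}
```

Bounding the number of attempts is a design choice of this sketch, so that a thread does not spin forever if the other event set never releases the GPU.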
If you have any questions or run into any issues, don't hesitate to contact us.
Best,
Giuseppe Congiu