Hi Team,
My application fails with the following error (built with openmpi-5.0.7, ucx-1.18.0, cuda-12.8, gdrcopy-2.5):
Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x14bd8f464160)
==== backtrace (tid:1104544) ====
0 0x000000000006141c ucs_callbackq_cleanup() ???:0
1 0x00000000000615da ucs_callbackq_cleanup() ???:0
2 0x000000000003e6f0 __GI___sigaction() :0
3 0x0000000000159af7 __memcpy_avx_unaligned_erms() :0
4 0x0000000000076b3d ucp_proto_rndv_handle_data() ???:0
5 0x000000000005ef21 ucs_callbackq_add_safe() ???:0
6 0x000000000004a42a ucp_worker_progress() ???:0
7 0x0000000000027ce4 opal_progress() ???:0
8 0x000000000009028f ompi_request_default_wait_any() ???:0
9 0x00000000000d94a2 MPI_Waitany() ???:0
10 0x00000000010c71c7 gmx::PmeCoordinateReceiverGpu::Impl::waitForCoordinatesFromAnyPpRank() ???:0
11 0x00000000010d211c pme_gpu_spread() ???:0
12 0x0000000000f4502e pme_gpu_launch_spread() ???:0
13 0x0000000000f2cf0a gmx_pmeonly() ???:0
14 0x0000000000f9a15c gmx::Mdrunner::mdrunner() ???:0
15 0x000000000040960a gmx::gmx_mdrun() ???:0
16 0x000000000040975d gmx::gmx_mdrun() ???:0
17 0x000000000077d2a3 gmx::CommandLineModuleManager::run() ???:0
18 0x0000000000405f1d main() ???:0
19 0x0000000000029590 __libc_start_call_main() ???:0
20 0x0000000000029640 __libc_start_main_alias_2() :0
21 0x0000000000405fa5 _start() ???:0
=================================
I believe this error is caused by the CUDA gdr_copy path.
For the GPU Direct RDMA feature, Open MPI needs to be built with UCX, and UCX in turn needs to be built with CUDA and gdr_copy. The latest versions of UCX and gdr_copy are 1.18.0 and 2.5 respectively, but Open MPI recommends ucx-1.4, which was released in 2018 (6-7 years ago).
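For reference, the build order described above can be sketched roughly as follows. This is a dry run that only prints the configure commands. It assumes the usual `--with-cuda`, `--with-gdrcopy`, and `--with-ucx` configure options, and every install prefix is a hypothetical placeholder, not the path used on the failing system.

```shell
#!/bin/sh
# Dry-run sketch: prints the configure commands instead of running them.
# All paths are hypothetical placeholders; override them via the environment.
CUDA_HOME=${CUDA_HOME:-/usr/local/cuda}
GDRCOPY_HOME=${GDRCOPY_HOME:-/opt/gdrcopy}
UCX_PREFIX=${UCX_PREFIX:-/opt/ucx}
OMPI_PREFIX=${OMPI_PREFIX:-/opt/openmpi}

ucx_configure_cmd() {
    # Step 1: UCX is built first, against CUDA and gdrcopy.
    echo "./configure --prefix=$UCX_PREFIX --with-cuda=$CUDA_HOME --with-gdrcopy=$GDRCOPY_HOME"
}

ompi_configure_cmd() {
    # Step 2: Open MPI is then built against that UCX install and the same CUDA.
    echo "./configure --prefix=$OMPI_PREFIX --with-ucx=$UCX_PREFIX --with-cuda=$CUDA_HOME"
}

echo "In the UCX source tree:      $(ucx_configure_cmd)"
echo "In the Open MPI source tree: $(ompi_configure_cmd)"
```

After such a build, `ucx_info -d | grep -i gdr` and `ompi_info | grep -i ucx` should show whether the gdr_copy transport and the UCX components were actually compiled in.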
Is Open MPI not tested with the latest versions of UCX, CUDA, and gdr_copy? Do we still have to use ucx-1.4?