Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2. I'm calling collectives like MPI_Allreduce on buffers that have already been shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle.
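For context, here is a stripped-down sketch of the pattern (error checking and the real per-peer handle exchange are omitted, and two ranks on one node are assumed; my actual code is more involved):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Allocate a device buffer and export an IPC handle for it. */
    const size_t n = 1 << 20;
    float *dbuf = NULL;
    cudaMalloc((void **)&dbuf, n * sizeof(float));
    cudaMemset(dbuf, 0, n * sizeof(float));

    cudaIpcMemHandle_t mine, theirs;
    cudaIpcGetMemHandle(&mine, dbuf);

    /* Exchange handles with the peer rank on the same node. */
    int peer = 1 - rank;
    MPI_Sendrecv(&mine, (int)sizeof(mine), MPI_BYTE, peer, 0,
                 &theirs, (int)sizeof(theirs), MPI_BYTE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Map the peer's buffer into this process's address space. */
    float *peer_buf = NULL;
    cudaIpcOpenMemHandle((void **)&peer_buf, theirs,
                         cudaIpcMemLazyEnablePeerAccess);

    /* ... kernels read/write dbuf and peer_buf here ... */

    /* Collective on the already-IPC-shared device buffer; this is where
       the cuIpcGetMemHandle warning shows up for some message sizes. */
    MPI_Allreduce(MPI_IN_PLACE, dbuf, (int)n, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    cudaIpcCloseMemHandle(peer_buf);
    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}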
On these buffers, I get the following warning, and the communication fails for some message sizes:
--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
cuIpcGetMemHandle return value: 1
address: 0x147d54000068
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
If I pass the two MCA parameters that disable OpenMPI's CUDA IPC, everything works.
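Concretely, the working launch looks something like this (the application name is a placeholder, and I believe these are the two smcuda IPC parameters I'm setting):

mpirun --mca btl_smcuda_use_cuda_ipc 0 \
       --mca btl_smcuda_use_cuda_ipc_same_gpu 0 \
       ./my_app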
I'm wondering two things:
1) Is this failure to handle IPC-shared buffers a known issue in OpenMPI 4?
2) When I disable OpenMPI's CUDA IPC with the MCA parameters, does OpenMPI still use GPUDirect RDMA?
Thanks,
Mike Adams