Dear all,
We are hitting the following error when running a simple Open MPI “Hello World” on a single node with UCX 1.16 or newer, combined with Open MPI 5.0.x or some 4.1.x releases (4.1.5 and later):
rcache.c:248 UCX ERROR mmap(size=151552) failed: Cannot allocate memory
pgtable.c:75 Fatal: Failed to allocate page table directory
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)
This is on CentOS 8.10, kernel 4.18, 192 GB RAM, Intel Xeon Gold 6138 (dual-socket Skylake, 40 cores). The failure reproduces only above a threshold of roughly 20 to 24 MPI ranks; runs with fewer than 20 ranks work fine. Older UCX versions on the same system (e.g. 1.12) do not show this issue.
The issue also goes away if we run Open MPI with the ob1 PML (without UCX), or if we keep the UCX PML but disable some of its transports with UCX_TLS=^shm or UCX_TLS=^ib.
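For anyone who wants to try the same comparison, the configurations above can be written out as mpirun command lines. This is only a minimal sketch, not the exact commands from the failing runs: the binary name ./hello_mpi and the rank count are hypothetical placeholders, and it assumes the PML is selected with "--mca pml" and that UCX_TLS is passed through mpirun's "-x" option.

```python
import shlex


def build_commands(binary="./hello_mpi", ranks=32):
    """Return one mpirun argument list per configuration to compare.

    binary and ranks are hypothetical placeholders, not values from the
    original runs.
    """
    base = ["mpirun", "-np", str(ranks)]
    variants = {
        # UCX PML with default transports (the failing case above ~20 ranks)
        "ucx_default": ["--mca", "pml", "ucx"],
        # ob1 PML, i.e. without UCX
        "ob1": ["--mca", "pml", "ob1"],
        # UCX PML with shared-memory transports excluded
        "ucx_no_shm": ["--mca", "pml", "ucx", "-x", "UCX_TLS=^shm"],
        # UCX PML with InfiniBand transports excluded
        "ucx_no_ib": ["--mca", "pml", "ucx", "-x", "UCX_TLS=^ib"],
    }
    return {name: base + opts + [binary] for name, opts in variants.items()}


if __name__ == "__main__":
    # Dry run: only print the command lines, do not launch anything.
    for name, argv in build_commands().items():
        print(f"{name}: {shlex.join(argv)}")
```

The script only prints the command lines, so the rank count can be varied around the 20 to 24 threshold before anything is launched.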
Has anyone seen similar "mmap failed / Failed to allocate page table directory" errors with UCX 1.16 or newer and Open MPI 4.1.x/5.0.x, or is anyone aware of known regressions or configuration pitfalls (e.g. rcache, huge pages, memtype cache, or other UCX/Open MPI memory-related settings)? Are there specific UCX environment variables or OMPI MCA parameters you would recommend trying in order to diagnose this further?
I can provide full ompi_info, ucx_info, build options, and more complete logs if that is helpful.
Many thanks in advance for any hints or suggestions.
Best regards,
Christoph Niethammer
--
Dr.-Ing. Christoph Niethammer
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19
70569 Stuttgart
Tel: ++49(0)711-685-87203
email: christoph....@hlrs.de
https://www.hlrs.de/people/christoph-niethammer