“Cannot allocate memory” / pgtable failure with Open MPI and UCX 1.16 or newer


Christoph Niethammer

4:25 AM (12 hours ago)
to us...@lists.open-mpi.org
Dear all,

We are hitting the following error when running a simple Open MPI “Hello World” program with UCX 1.16 or newer together with Open MPI 5.0.x (and some 4.1.5+ versions) on a single node:

rcache.c:248 UCX ERROR mmap(size=151552) failed: Cannot allocate memory
pgtable.c:75 Fatal: Failed to allocate page table directory
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)

This is on CentOS 8.10, kernel 4.18, with 192 GB RAM and dual-socket Intel Xeon Gold 6138 CPUs (Skylake, 40 cores total). The failure is reproducible only with more than roughly 20-24 MPI ranks; runs with fewer than 20 ranks work fine. Older UCX versions (e.g. 1.12) on the same system do not show this issue.

The issue also goes away if we run Open MPI with the ob1 PML (without UCX), or if we disable some of the transports for the UCX PML with UCX_TLS=^shm or UCX_TLS=^ib.
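For reference, the exact workaround invocations look roughly like the following. The mpirun MCA and -x syntax is standard Open MPI; the binary name ./hello and the rank count are placeholders for our test program:

```shell
# Workaround 1: bypass UCX entirely by forcing the ob1 PML
mpirun --mca pml ob1 -np 40 ./hello

# Workaround 2: keep the UCX PML but exclude the shared-memory
# (or InfiniBand) transport, which avoids the rcache mmap failure
mpirun --mca pml ucx -x UCX_TLS=^shm -np 40 ./hello
mpirun --mca pml ucx -x UCX_TLS=^ib  -np 40 ./hello
```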

Has anyone seen similar "mmap failed / Failed to allocate page table directory" errors with UCX > 1.15 and Open MPI 4.1.x/5.0.x, or is anyone aware of known regressions or configuration pitfalls (e.g. rcache, huge pages, memtype cache, or other UCX/Open MPI memory-related settings)? Are there specific UCX environment variables or Open MPI MCA parameters you would recommend for diagnosing this further?

I can provide full ompi_info, ucx_info, build options, and more complete logs if that is helpful.


Many thanks in advance for any hints or suggestions.


Best regards,
Christoph Niethammer

--

Dr.-Ing. Christoph Niethammer
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19
70569 Stuttgart

Tel: +49 (0)711-685-87203
email: christoph....@hlrs.de
https://www.hlrs.de/people/christoph-niethammer

George Bosilca

11:19 AM (5 hours ago)
to us...@lists.open-mpi.org
It looks like some form of resource exhaustion, possibly exceeding the maximum number of memory mappings a process may hold. What is the value of `vm.max_map_count` on this system? You can obtain it with `sysctl vm.max_map_count` or `cat /proc/sys/vm/max_map_count`.
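If that limit turns out to be the bottleneck, a quick way to inspect it, and to see how many mappings a process currently holds, might look like this. The sysctl name and the /proc paths are standard Linux; the 262144 value below is just a commonly suggested larger limit, not a recommendation specific to UCX:

```shell
# Read the current per-process limit on memory mappings
# (the Linux default is 65530 on most distributions)
cat /proc/sys/vm/max_map_count

# Count the mappings the current shell holds; a UCX rank close to
# the limit would show a count near max_map_count in its maps file
wc -l /proc/$$/maps

# Raising the limit persistently requires root, e.g.:
#   sysctl -w vm.max_map_count=262144
```

Running the `wc -l` line against the PID of one of the failing ranks (instead of `$$`) would show whether the ranks really are approaching the limit.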

  George
