"Cannot allocate memory” / pgtable failure with Open MPI and UCX 1.16 or newer


Christoph Niethammer

Mar 4, 2026, 4:25:11 AM
to us...@lists.open-mpi.org
Dear all,

We are hitting the following error when running a simple Open MPI “Hello World” with UCX 1.16 or newer and Open MPI 5.0.x and some 4.1.5+ versions on a single node:

rcache.c:248 UCX ERROR mmap(size=151552) failed: Cannot allocate memory
pgtable.c:75 Fatal: Failed to allocate page table directory
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)

This is on CentOS 8.10, kernel 4.18, 192 GB RAM, Intel Xeon Gold 6138 (dual-socket Skylake, 40 cores). The failure is reproducible only when using more than 20-24 MPI ranks; fewer than 20 ranks work fine. Older UCX versions on the same system (e.g. 1.12) do not show this issue.

The issue also goes away if we run Open MPI with the ob1 PML (i.e. without UCX), or if we keep the UCX PML but disable some of its transports with UCX_TLS=^shm or UCX_TLS=^ib.

Has anyone seen similar "mmap failed / Failed to allocate page table directory" errors with UCX > 1.15 and Open MPI 4.1.x/5.0.x, or is aware of known regressions or configuration pitfalls (e.g. rcache, huge pages, memtype cache, or other UCX/Open MPI memory-related settings)? Are there specific UCX environment variables or OMPI MCA parameters you would recommend trying to diagnose this further?
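For context, this is the kind of check we run to see whether a rank is close to the per-process mapping limit (a sketch assuming standard Linux /proc; `$$` stands in for the PID of an affected MPI rank):

```shell
# Compare the number of memory mappings of one process against the
# per-process limit. Substitute the PID of a running MPI rank for $$.
pid=$$
nmaps=$(wc -l < /proc/$pid/maps)
limit=$(cat /proc/sys/vm/max_map_count)
echo "mappings: $nmaps / vm.max_map_count: $limit"
```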

I can provide full ompi_info, ucx_info, build options, and more complete logs if that is helpful.


Many thanks in advance for any hints or suggestions.


Best regards,
Christoph Niethammer

--

Dr.-Ing. Christoph Niethammer
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19
70569 Stuttgart

Tel: ++49(0)711-685-87203
email: christoph....@hlrs.de
https://www.hlrs.de/people/christoph-niethammer

George Bosilca

Mar 4, 2026, 11:19:46 AM
to us...@lists.open-mpi.org
It looks like some form of resource exhaustion, possibly exceeding the number of entries in the per-process mmap table. What is the value of `vm.max_map_count` on this system? You can obtain it with `sysctl vm.max_map_count` or `cat /proc/sys/vm/max_map_count`.
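Something along these lines (a sketch; raising the value requires root, and 262144 is just an illustrative number):

```shell
# Read the current per-process mapping limit (always available on Linux):
cat /proc/sys/vm/max_map_count
# To raise it temporarily (as root):
#   sysctl -w vm.max_map_count=262144
# To persist across reboots, add to /etc/sysctl.conf:
#   vm.max_map_count = 262144
```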

  George



Christoph Niethammer

Mar 6, 2026, 5:16:19 AM
to Open MPI Users
Hello George,

thanks for the suggestion.

On our system vm.max_map_count is currently 64k.

I should also mention that memory overcommit is disabled (vm.overcommit_memory = 2).
If we change this setting to allow memory overcommit (vm.overcommit_memory = 0 or 1), the issue disappears.

However, it still looks somewhat surprising that a simple “Hello World” application with ~20 MPI processes
already triggers this behaviour. Given that the system has 192 GB of RAM, it does not seem obvious why startup
would fail due to memory allocation at such a small scale.
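For reference, under strict overcommit the kernel refuses reservations once the system-wide Committed_AS would exceed CommitLimit, so RAM size alone is not the whole picture; a quick check (standard Linux /proc, values illustrative):

```shell
# With vm.overcommit_memory=2, CommitLimit is roughly
#   swap + RAM * vm.overcommit_ratio / 100   (default ratio: 50)
# and allocation fails once Committed_AS would exceed it.
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
cat /proc/sys/vm/overcommit_ratio
```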

Ideally we would prefer to keep memory overcommit disabled, since it helps detect memory issues in user applications
early rather than failing later at runtime.

Is there a way to influence this memory-exhausting behaviour with some settings in Open MPI (or UCX components)?
So far we have also experimented with adjusting the UCX FIFO sizes, since the defaults changed in newer UCX releases.
In particular we tried restoring the older values used in UCX 1.15: UCX_POSIX_FIFO_SIZE=64, UCX_SYSV_FIFO_SIZE=64, UCX_XPMEM_FIFO_SIZE=64.
Unfortunately this did not resolve the issue.

Are there other UCX parameters (e.g. related to shared-memory transports, rcache behaviour, or memtype cache) or
Open MPI MCA parameters that could reduce the number of memory mappings or the amount of virtual memory reserved
during startup?
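A sketch of the kind of startup experiments we have in mind (UCX_LOG_LEVEL and UCX_TLS are standard UCX variables, the FIFO size is the value mentioned above, and ./hello is a placeholder for our test binary):

```shell
export UCX_LOG_LEVEL=debug       # verbose UCX logging during startup
export UCX_POSIX_FIFO_SIZE=64    # pre-1.16 default, as tried above
# Alternatively, drop the shared-memory transports entirely:
#   export UCX_TLS=^shm
# mpirun -np 24 ./hello
```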

Any suggestions for further debugging or configuration options to try would be highly appreciated.

Best regards,
Christoph

George Bosilca

Mar 6, 2026, 10:00:04 AM
to us...@lists.open-mpi.org
Christoph,

As you indicated that disabling UCX makes the issue go away, it seems the memory exhaustion arises from UCX. I have limited knowledge of the UCX internals; when I need to change its behavior I use `ucx_info -c` and then dig into the output. For a better answer I would suggest asking on the UCX GitHub.

Best,
  George.
