Hello George,
thanks for the suggestion.
On our system vm.max_map_count is currently 64k.
I should also mention that memory overcommit is disabled (vm.overcommit_memory = 2).
If we change this setting to allow memory overcommit (vm.overcommit_memory = 0 or 1), the issue disappears.
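For concreteness, these are the two kernel settings involved (the sysctl lines below are illustrative; changing overcommit requires root):

```shell
# Inspect the two settings mentioned above
sysctl vm.max_map_count vm.overcommit_memory

# Temporarily allow overcommit (heuristic mode);
# vm.overcommit_memory=2 restores strict accounting
# sudo sysctl -w vm.overcommit_memory=0
```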
However, it is still surprising that a simple “Hello World” application with ~20 MPI processes
already triggers this behaviour. Given that the system has 192 GB of RAM, it is not obvious why startup
would fail due to memory allocation at such a small scale.
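For reference, one way we can compare a process's actual mapping count against the per-process limit (the shell's own PID below is just a stand-in for an MPI rank):

```shell
# Number of memory mappings of one process vs. the kernel limit.
# $$ is a placeholder -- substitute the PID of a hello-world rank.
pid=$$
wc -l < "/proc/$pid/maps"        # current mapping count for that process
cat /proc/sys/vm/max_map_count   # per-process limit (64k on our system)
```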
Ideally we would prefer to keep memory overcommit disabled, since it helps detect memory issues in user applications
early rather than failing later at runtime.
Is there a way to influence this memory-exhausting behaviour via settings in Open MPI (or UCX components)?
So far we have also experimented with adjusting the UCX FIFO sizes, since the defaults changed in newer UCX releases.
In particular we tried restoring the older values used in UCX 1.15: UCX_POSIX_FIFO_SIZE=64, UCX_SYSV_FIFO_SIZE=64, UCX_XPMEM_FIFO_SIZE=64.
Unfortunately this did not resolve the issue.
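For completeness, this is how we set the restored FIFO sizes (the mpirun line is illustrative; binary name and rank count are placeholders):

```shell
# Restore the UCX 1.15 shared-memory FIFO sizes (values from the text)
export UCX_POSIX_FIFO_SIZE=64
export UCX_SYSV_FIFO_SIZE=64
export UCX_XPMEM_FIFO_SIZE=64

# Illustrative launch; ./hello_world and -np 20 are placeholders
# mpirun -np 20 -x UCX_POSIX_FIFO_SIZE -x UCX_SYSV_FIFO_SIZE \
#        -x UCX_XPMEM_FIFO_SIZE ./hello_world
```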
Are there other UCX parameters (e.g. related to shared-memory transports, rcache behaviour, or memtype cache) or
Open MPI MCA parameters that could reduce the number of memory mappings or the amount of virtual memory reserved
during startup?
Any suggestions for further debugging or configuration options to try would be highly appreciated.
Best regards,
Christoph