Freddie,
We got the code to work by reverting to OPAL memory hooks. Your suggestion was
correct, but I fear some more work is needed. The code runs with this command:
mpirun \
--mca pml_ucx_opal_mem_hooks 1 \
-report-bindings \
pyfr run -b cuda mesh.pyfrm ../config.ini
For details, please read below. Are you running PyFR on Summit? I am not 100%
sure, but I think this may become relevant for you at some point.
I actually build OpenMPI myself. In my build the following transport
layers are enabled:
Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel SCIF: no
Intel TrueScale (PSM): no
Mellanox MXM: yes
Open UCX: yes <- This guy seems to be the culprit.
OpenFabrics Libfabric: no
OpenFabrics Verbs: yes
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: yes
Shared memory/XPMEM: no
TCP: yes
When I build OpenMPI I need to point it to the Mellanox libraries; OMPI needs
to be able to find them to maximise performance on InfiniBand. I validate my
build of OMPI with the Intel MPI Benchmarks (IMB).
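For reference, pointing the build at those libraries typically looks something
like the sketch below. The install prefixes are placeholders (assumptions), not
the actual paths on this system; --with-ucx, --with-mxm and --with-verbs are
the Open MPI 3.1 configure options matching the transports listed above.

```shell
# Hypothetical locations: adjust to wherever the Mellanox stack is installed.
UCX_DIR=/opt/ucx
MXM_DIR=/opt/mellanox/mxm
PREFIX="$HOME/opt/openmpi-3.1.2"

# Flags telling configure where to find UCX and MXM, and to enable verbs.
CONFIGURE_FLAGS="--prefix=$PREFIX --with-ucx=$UCX_DIR --with-mxm=$MXM_DIR --with-verbs"

# Run from the top of the Open MPI source tree; skipped if configure is absent.
if [ -x ./configure ]; then
    ./configure $CONFIGURE_FLAGS
fi
```

After the build, the transport summary printed by configure should show which
of these were actually picked up.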
Now here's how the train of thought on UCX went:
1) As per your suggestion, memory hooks are an issue here.
2) The topmost frame of the gdb backtrace was:
#0 0x00003fff82994740 in ucm_malloc_mmaped_ptr_remove_if_exists (ptr=0x3eff0dd9bdd0) at malloc/malloc_hook.c:153
3) What is ucm? We go into the OpenMPI source tree and look for "ucm_":
openmpi-3.1.2$ grep -r "ucm_" .
ompi/mca/pml/ucx/pml_ucx_component.c:#include <ucm/api/ucm.h>
ompi/mca/pml/ucx/pml_ucx_component.c: ucm_vm_munmap(buf, length);
ompi/mca/pml/ucx/pml_ucx_component.c: ucm_set_external_event(UCM_EVENT_VM_UNMAPPED);
So the UCX component is the only thing that uses it.
4) We list the parameters of the UCX component:
ompi_info --param pml ucx --level 9
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v3.1.2)
MCA pml ucx: ---------------------------------------------------
MCA pml ucx: parameter "pml_ucx_verbose" (current value: "0", data source: default, level: 9 dev/all, type: int)
Verbose level of the UCX component
MCA pml ucx: parameter "pml_ucx_priority" (current value: "51", data source: default, level: 3 user/all, type: int)
Priority of the UCX component
MCA pml ucx: parameter "pml_ucx_num_disconnect" (current value: "1", data source: default, level: 3 user/all, type: int)
How may disconnects go in parallel
MCA pml ucx: parameter "pml_ucx_opal_mem_hooks" (current value: "false", data source: default, level: 3 user/all, type: boo
Use OPAL memory hooks, instead of UCX internal memory hooks
Valid values: 0: f|false|disabled|no|n, 1: t|true|enabled|yes|y
We set the last one to 1 so that OPAL memory hooks are used in place of UCX's
internal ones. The code then seems to work. Elementary?
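As an aside, the same parameter can be supplied without touching the mpirun
line: Open MPI reads any MCA parameter from an environment variable named
OMPI_MCA_<parameter>. A minimal sketch, assuming a bash-like shell; the
ompi_info check only runs if ompi_info is on the PATH:

```shell
# Equivalent of "--mca pml_ucx_opal_mem_hooks 1" on the mpirun command line.
export OMPI_MCA_pml_ucx_opal_mem_hooks=1

# Optional check: show the parameter as Open MPI now sees it.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --param pml ucx --level 9 | grep opal_mem_hooks || true
fi
```

To make it permanent for one user, the line "pml_ucx_opal_mem_hooks = 1" can
instead go into $HOME/.openmpi/mca-params.conf.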
Now I am going to test a few more examples. It's still not clear why this
manifests itself in the 3D cylinder case but not in the 2D examples you
provided. It also baffles me why this works on the login node.
I need to test it with Spectrum MPI too, but DL has an older version of
Spectrum, and I think it may take a while to get the new one on the system.
I also want to test CUDA-aware comms.
Hope this report helps. I think UCX may be important for you in the future so
it would be good to test PyFR with it. It's possible that my old builds of
OpenMPI did not include it and it's why I had a recollection that it all worked
smoothly in the past, but I have this hopeless habit of installing the latest
software whenever I start a new project...