
Bug#984956: openmpi-bin: with mpirun --host <remote>: orte crashes with FORCE-TERMINATE [...] plm_base_launch_support.c

Matthias Maier

Mar 10, 2021, 6:40:04 PM
Package: openmpi-bin
Version: 4.1.0-7
Severity: normal
X-Debbugs-Cc: tamiko...@kyomu.43-1.org

Dear Maintainer,

mpirun crashes when trying to schedule a task on a foreign host:

$ mpirun --host bob hostname
[alice:705956] [[31919,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/odls/base/odls_base_default_fns.c at line 226
[alice:705956] [[31919,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/plm/base/plm_base_launch_support.c at line 552
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[31919,0],0] FORCE-TERMINATE AT (null):1 - error ../../../../../orte/mca/plm/base/plm_base_launch_support.c(553)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

Here, the mpirun command was issued on computer "alice" and "bob" is a foreign
host reachable via ssh.


Steps to reproduce:
===================

I originally encountered this issue on a small cluster that I am currently
setting up, but I was able to reproduce it locally with two lxc containers.
The following should therefore be enough to reproduce the issue:

- use two Debian computers with a local user that can ssh (via pubkey) from
one machine to another

- make sure that no firewall drops packets between the two.

- install openmpi-bin and run

# mpirun --host <remote host ip> hostname
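
The steps above can be sketched as a small shell helper. This is only a
sketch: the function name try_remote_hostname is made up for illustration,
and it assumes that pubkey ssh to the remote host is already set up. It
first checks that non-interactive ssh works, so that an ssh problem is not
mistaken for the ORTE error, and then runs the same mpirun command.

```shell
# Hypothetical helper, not part of openmpi-bin.
# Usage: try_remote_hostname <remote host>
try_remote_hostname() {
    host=$1
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a broken pubkey setup shows up here and not inside mpirun.
    if ! ssh -o BatchMode=yes "$host" true; then
        echo "ssh to $host failed; fix that first" >&2
        return 2
    fi
    # Same command as in the report; on an affected system this is
    # where the ORTE internal error appears.
    mpirun --host "$host" hostname
}
```

A return status of 2 points at ssh; any other failure comes from mpirun itself.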


What was the outcome of this action?
====================================

An internal error in ORTE terminated mpirun.


What outcome did you expect instead?
====================================

mpirun --host bob should print "bob" and succeed (or complain loudly that I am
using it wrongly).



-- System Information:
Debian Release: bullseye/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-3-amd64 (SMP w/64 CPU threads)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages openmpi-bin depends on:
ii libc6 2.31-9
ii libevent-core-2.1-7 2.1.12-stable-1
ii libopenmpi3 4.1.0-7
ii openmpi-common 4.1.0-7
ii openssh-client [ssh-client] 1:8.4p1-4

openmpi-bin recommends no packages.

Versions of packages openmpi-bin suggests:
ii gfortran [fortran-compiler] 4:10.2.1-1

-- no debconf information

Matthias Maier

Mar 12, 2021, 4:40:03 PM
Dear all,

A short update on the problem:

The ORTE slurm plm is unaffected by this issue. So we were able to use
the Debian default openmpi package on the cluster when using slurm to
manage tasks.

Best,
Matthias

sixerjman

Mar 21, 2021, 2:40:03 PM
I am hit by this bug also. Going to try and learn SLURM in the meantime. Is there
any progress? Thank you for your work on this excellent package.

Vassilis Virvilis

May 7, 2021, 10:30:04 AM
Ok, I think I made some headway, but I would welcome some insight from somebody more knowledgeable.

I think the problem is a potential mixup of the internal vs the external pmix library in openmpi.

In my setup the call to `rc = PMIx_Get(&p, key, pinfo, sz, &pval);` at ext3x_client.c:656 fills the pval buffer with 3 entries.
```
(gdb) p ((pmix_info_t*) pval->data.darray->array)[1].value.type
$2 = 56
```
The second entry (index == 1, key pmix.topo2) has value.type = 56, which is outside the range of supported value types. I think the last supported entry is PMIX_REGEX (46) in ./debian/build-gfortran/opal/mca/pmix/pmix3x/pmix/include/pmix_common.h

However, in /usr/lib/x86_64-linux-gnu/pmix2/include/pmix_common.h the list goes further, ending in PMIX_COMPRESSED_BYTE_OBJECT (59), with 56 being PMIX_TOPO.
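
One way to compare the two headers is to look up the same macro in each. The following is only a sketch: pmix_define is a made-up helper name, and it assumes the type codes are plain "#define NAME VALUE" lines, which I have not checked for every PMIx release.

```shell
# Hypothetical helper: print the value of a "#define NAME VALUE" line
# from a header, or fail if the header does not define NAME.
# Usage: pmix_define <header> <macro name>
pmix_define() {
    awk -v name="$2" '
        $1 == "#define" && $2 == name { print $3; found = 1; exit }
        END { exit !found }
    ' "$1"
}

# Paths as quoted above; adjust them to the local build tree.
internal=./debian/build-gfortran/opal/mca/pmix/pmix3x/pmix/include/pmix_common.h
external=/usr/lib/x86_64-linux-gnu/pmix2/include/pmix_common.h

for header in "$internal" "$external"; do
    if [ -r "$header" ]; then
        value=$(pmix_define "$header" PMIX_TOPO) || value="not defined"
        echo "$header: PMIX_TOPO = $value"
    fi
done
```

If the internal header reports "not defined" while the external one reports a value, the two headers disagree about which types exist.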

If I hack a bit:

bill@odin:~/src/openmpi-4.1.0$ diff -u ./debian/build-gfortran/opal/mca/pmix/ext3x/ext3x.c~ ./debian/build-gfortran/opal/mca/pmix/ext3x/ext3x.c
--- ./debian/build-gfortran/opal/mca/pmix/ext3x/ext3x.c~        2021-05-07 11:21:38.000000000 +0300
+++ ./debian/build-gfortran/opal/mca/pmix/ext3x/ext3x.c 2021-05-07 16:51:48.223653488 +0300
@@ -1239,6 +1239,8 @@
         }
         kv->data.envar.separator = v->data.envar.separator;
         break;
+    case 56:
+        break;
     default:
         /* silence warnings */
         rc = OPAL_ERROR;

With that I get further, but then hit the following error:

bill@odin:~/src/openmpi-4.1.0$ mpirun.openmpi -v -v -v --host thor ls
[odin:442452] PACK-OPAL-VALUE: UNSUPPORTED TYPE 0 FOR KEY pmix.topo2
[odin:442452] [[61465,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/odls/base/odls_base_default_fns.c at line 250
[odin:442452] [[61465,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/plm/base/plm_base_launch_support.c at line 552

--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[61465,0],0] FORCE-TERMINATE AT (null):1 - error ../../../../../orte/mca/plm/base/plm_base_launch_support.c(553)


This is something that should be reported to the developers.
--------------------------------------------------------------------------

which is a more reasonable error anyway.

--
Vassilis Virvilis

Lucas Nussbaum

May 11, 2021, 6:30:03 AM
Control: severity -1 serious

Hi,

This breaks OpenMPI in very basic cases, so I'm upgrading the severity
to serious.

Lucas

Lucas Nussbaum

May 11, 2021, 8:10:04 AM
On 11/05/21 at 14:48 +0300, Vassilis Virvilis wrote:
> I believe the problem is that mpirun is built with the internal pmix
> library when there is external available.
>
> bill@odin:~/src/openmpi-4.1.0$ dpkg -l '*pmix*' | grep ^ii
> ii libpmix-dev:amd64 4.0.0-4 amd64 Development files for the
> PMI Exascale library
> ii libpmix2:amd64 4.0.0-4 amd64 Process Management
> Interface (Exascale) library
>
> mpirun is not linked to the external libpmix2
>
> bill@odin:~/src/openmpi-4.1.0$ ldd /usr/bin/mpirun.openmpi
> linux-vdso.so.1 (0x00007fffa1153000)
> libopen-rte.so.40 => /usr/lib/x86_64-linux-gnu/libopen-rte.so.40
> (0x00007f3cdf657000)
> libopen-pal.so.40 => /usr/lib/x86_64-linux-gnu/libopen-pal.so.40
> (0x00007f3cdf5a3000)
> libevent_core-2.1.so.7 =>
> /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x00007f3cdf569000)
> libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3cdf3a4000)
> libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3cdf387000)
> libhwloc.so.15 => /usr/lib/x86_64-linux-gnu/libhwloc.so.15
> (0x00007f3cdf32e000)
> libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
> (0x00007f3cdf30a000)
> libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3cdf304000)
> libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1
> (0x00007f3cdf2ff000)
> libevent_pthreads-2.1.so.7 =>
> /usr/lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x00007f3cdf2fa000)
> /lib64/ld-linux-x86-64.so.2 (0x00007f3cdf72c000)
> libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3cdf1b6000)
> libudev.so.1 => /usr/lib/x86_64-linux-gnu/libudev.so.1
> (0x00007f3cdf18e000)

That's because it is loaded dynamically.

mca_pmix_ext3x.so is linked to libpmix.so.2:

# ldd /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext3x.so
linux-vdso.so.1 (0x00007ffeba72a000)
libopen-pal.so.40 => /lib/x86_64-linux-gnu/libopen-pal.so.40 (0x00007fae76f77000)
libpmix.so.2 => /lib/x86_64-linux-gnu/libpmix.so.2 (0x00007fae76e2c000)
libevent_core-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x00007fae76df2000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fae76dd0000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fae76c0b000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fae76c05000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fae76bfe000)
libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007fae76ba5000)
libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x00007fae76ba0000)
/lib64/ld-linux-x86-64.so.2 (0x00007fae77057000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fae76a5c000)
libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007fae76a34000)
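
The same check can be scripted. A minimal sketch, assuming ldd output in
the format shown above; pmix_libs is a name made up here:

```shell
# Hypothetical helper: read ldd output on stdin and print the path of
# every libpmix the object is linked against. Fails if there is none.
pmix_libs() {
    awk '$1 ~ /^libpmix\.so/ && $2 == "=>" { print $3; found = 1 }
         END { exit !found }'
}

# mpirun itself shows no libpmix; the dynamically loaded MCA component does.
component=/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext3x.so
if [ -e "$component" ]; then
    ldd "$component" | pmix_libs || echo "no libpmix linked into $component"
fi
```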

> There is this https://github.com/open-mpi/ompi/issues/8335 but it looks
> applied in Debian.
>
> I am trying to find the config.log to understand why it prefers the
> internal version of the pmix library but so far I can't tell.
>
> Can you give me a hint where the config.log lives after a successful build?

It's not available. I rebuilt it locally, and got:

configure:13919: checking if user requested internal PMIx support(/usr/lib/x86_64-linux-gnu/pmix2)
configure:13932: result: no
configure:13985: checking for pmix.h in /usr/lib/x86_64-linux-gnu/pmix2
configure:13993: result: not found
configure:13995: checking for pmix.h in /usr/lib/x86_64-linux-gnu/pmix2/include
configure:13999: result: found
configure:14048: checking libpmix.* in /usr/lib/x86_64-linux-gnu/pmix2/lib64
configure:14056: result: not found
configure:14058: checking libpmix.* in /usr/lib/x86_64-linux-gnu/pmix2/lib
configure:14062: result: found
configure:14081: checking PMIx version
configure:14091: result: version file found
configure:14099: checking version 4x
configure:14117: gcc -E -I/usr/lib/x86_64-linux-gnu/pmix2/include -Wdate-time -D_FORTIFY_SOURCE=2 conftest.c
configure:14117: $? = 0
configure:14118: result: found
configure:14305: checking PMIx version to be used
configure:14308: result: external(4x)

(which looks OK)
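
To pull just that decision out of a config.log, something like the
following should do. It is a sketch that assumes the log wording shown
above; pmix_choice is a made-up name.

```shell
# Hypothetical helper: print configure's answer to
# "checking PMIx version to be used" from a config.log.
# Usage: pmix_choice <config.log>
pmix_choice() {
    awk '
        # The "result:" line directly follows the "checking" line.
        want { sub(/^.*result: /, ""); print; found = 1; exit }
        /checking PMIx version to be used/ { want = 1 }
        END { exit !found }
    ' "$1"
}
```

For the excerpt above, `pmix_choice config.log` prints `external(4x)`.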

Lucas

Lucas Nussbaum

May 11, 2021, 8:10:04 AM
On 07/05/21 at 17:24 +0300, Vassilis Virvilis wrote:
> The second value (index == 1) has value.type = 56 (pmix.topo2) which is
> outside the range of supported value types. I think the last entry is
> PMIX_REGEX 46 at
> ./debian/build-gfortran/opal/mca/pmix/pmix3x/pmix/include/pmix_common.h
>
> However in /usr/lib/x86_64-linux-gnu/pmix2/include/pmix_common.h the list
> goes further ending in PMIX_COMPRESSED_BYTE_OBJECT 59 with 56 being
> PMIX_TOPO

I tried to ensure that pmix_common.h (inside the sources) was unused
during build, and added an #error inside it. The build succeeded, which
seems to confirm that it is not used...

Lucas

Vassilis Virvilis

May 11, 2021, 10:50:03 AM
On Tue, 11 May 2021 14:04:56 +0200 Lucas Nussbaum <lu...@debian.org> wrote:
> That's because it is loaded dynamically.
> mca_pmix_ext3x.so is linked to libpmix.so.2:

Aah, that's a great hint. Thanks.

The output is the same as yours.


> It's not available. I rebuilt it locally, and got:

Yes, I also rebuilt it locally. I was looking for configure.log. My bad.

I found it at ./debian/build-gfortran/config.log and it is the same as yours. Looks ok.

Vassilis Virvilis

May 12, 2021, 6:00:03 AM
> > However in /usr/lib/x86_64-linux-gnu/pmix2/include/pmix_common.h the list
> > goes further ending in PMIX_COMPRESSED_BYTE_OBJECT 59 with 56 being
> > PMIX_TOPO
>
> I tried to ensure that pmix_common.h (inside the sources) was unused
> during build, and added an #error inside it. The build succeeded, which
> seems to confirm that it is not used...

Ok I don't get it.

mca_pmix_ext3x.so is in ./debian/build-gfortran/opal/mca/pmix/ext3x/.libs/mca_pmix_ext3x.so in the local build.

So I guess this must be either
 a) the internal pmix library - or -
 b) a client wrapper that utilizes the pmix library

It links dynamically to libpmix.so, as you said, so it must be option b).

But then again, it does not have support for value types above 46.

So I am confused. If you could elaborate on this I would be grateful.

  Vassilis

Lucas Nussbaum

May 16, 2021, 1:40:03 AM
Hi Alastair,

Thanks a lot for fixing this.

Unfortunately, I noticed that the upload to unstable was built against
ucx 1.10.1~rc1-1, so both need to migrate to testing.

Did you already start discussions with the release team? I did not find
an unblock request.

Lucas

Lucas Nussbaum

May 26, 2021, 3:00:04 AM
Hi Alastair,

Any news on that?

If you have no time to resolve the issue, please tell me. I can maybe
find some time on my side.

Lucas

On 16/05/21 at 07:25 +0100, Alastair McKinstry wrote:
> Hi Lucas
>
> Yikes.
>
> No, I wanted to wait and check if there were any issues before issuing an
> unblock request.
>
> Alastair
> --
> Alastair McKinstry, <alas...@sceal.ie>, <mcki...@debian.org>, https://diaspora.sceal.ie/u/amckinstry
> Misentropy: doubting that the Universe is becoming more disordered.
>
>

Alastair McKinstry

May 27, 2021, 10:30:04 AM
Ok, openmpi and a redone ucx (to avoid 1.10.1~rc1) are uploaded, and the unblock request is sent.

Alastair

Samuel Thibault

Sep 7, 2021, 3:50:04 AM
Hello,

The openmpi version in unstable does not contain the patch included in
4.1.0-9, and thus this bug (Bug#984956) is still preventing openmpi from
migrating (as well as rdeps such as starpu).

Samuel