OpenMPI, Slurm & portability


victor sv

Jul 3, 2017, 5:07:41 AM7/3/17
to singu...@lbl.gov
Dear Singularity team,

first of all, thanks for the great work with Singularity. It looks amazing!

Sorry if this topic is a duplicate, and for the length of the email, but I want to share my experience with Singularity and OpenMPI compatibility, and also ask some questions.

I've been reading a lot about OpenMPI and Singularity compatibility because we are trying to find a generic way to run OpenMPI applications within Singularity containers. It was not so clear (to me) from the documentation, forums and mailing lists, which is why we performed an empirical OpenMPI compatibility study.

We ran these comparisons in CESGA FinisTerrae II cluster (https://www.cesga.es/en/infraestructuras/computacion/FinisTerrae2).

We used several versions of OpenMPI, choosing the versions already installed on the cluster:

- openmpi/1.10.2
- openmpi/2.0.0
- openmpi/2.0.1
- openmpi/2.0.2
- openmpi/2.1.1

We created Singularity images containing the same versions of OpenMPI, along with the basic OpenMPI ring example. The bootstrap definition file template we used is shared below:

```
BootStrap: docker
From: ubuntu:16.04
IncludeCmd: yes

%post
        sed -i 's/main/main restricted universe/g' /etc/apt/sources.list
        apt-get update
        apt-get install -y bash git wget build-essential gcc time libc6-dev libgcc-5-dev
        apt-get install -y dapl2-utils libdapl-dev libdapl2 libibverbs1 librdmacm1 libcxgb3-1 libipathverbs1 libmlx4-1 libmlx5-1 libmthca1 libnes1 libpmi0 libpmi0-dev libslurm29 libslurm-dev

        ##Install OpenMPI
        cd /tmp
        wget 'https://www.open-mpi.org/software/ompi/vX.X/downloads/openmpi-X.X.X.tar.gz' -O openmpi-X.X.X.tar.gz
        tar -xzf openmpi-X.X.X.tar.gz
        mkdir -p /tmp/openmpi-X.X.X/build
        cd /tmp/openmpi-X.X.X/build
        ../configure --enable-shared --enable-mpi-thread-multiple --with-verbs --enable-mpirun-prefix-by-default --with-hwloc --disable-dlopen --with-pmi --prefix=/usr
        make all install

        # Install ring
        cd /tmp
        wget https://raw.githubusercontent.com/open-mpi/ompi/master/examples/ring_c.c
        mpicc ring_c.c -o /usr/bin/ring
```

Once the containers were created, we ran the ring app with mpirun on 2 cores of 2 different nodes, mixing all possible combinations of those OpenMPI versions inside and outside the container.
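For reference, a minimal sketch of this kind of host-side hybrid launch (the image name, core/node counts and the X.X.X placeholder are illustrative, not our exact command line):

```shell
# Hypothetical host-side launch: mpirun starts one singularity process per
# rank, so the contained OpenMPI runtime must be able to wire up with the
# host-side mpirun. One rank per node, two nodes.
mpirun -np 2 --map-by ppr:1:node \
    singularity exec openmpi-X.X.X.img /usr/bin/ring
```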

The results obtained show that we need the same version of OpenMPI inside and outside the container to successfully run the contained application in parallel with mpirun.

Is this the expected behaviour or am I missing something?

Will this be the expected behaviour in the future (with future versions of OpenMPI)?

Currently, we have slurm 14.11.10-Bull.1.0 installed as job scheduler at FinisTerrae II. We found the following tip/trick to use srun as process manager:

http://singularity.lbl.gov/tutorial-gpu-drivers-open-mpi-mtls

In order to run any Singularity image containing OpenMPI applications using Slurm, we adapted it to our infrastructure and checked the same test cases, running them with srun. It seems to be working properly (no real-world applications have been tested yet).
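A minimal sketch of such a direct launch, assuming Slurm's PMI2 plugin is available on the host (the image name is a placeholder):

```shell
# Hypothetical direct launch: srun is the process manager, so no host-side
# mpirun is needed; the launcher talks to the contained OpenMPI via the
# PMI interface. Two tasks across two nodes.
srun -N 2 -n 2 --mpi=pmi2 \
    singularity exec openmpi-X.X.X.img /usr/bin/ring
```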

What do you think about this strategy?
Can you confirm that it provides portability of singularity images containing OpenMPI applications?

I think this strategy is similar to the one you are following with the "--nv" option for Nvidia drivers.

Why not do the same with MPI, PMI, libibverbs, etc.?

Thanks in advance and congrats again for your great work!

Víctor.

Gregory M. Kurtzer

Jul 9, 2017, 3:19:08 PM7/9/17
to singu...@lbl.gov, Ralph Castain
Hi Victor,

Sorry for the latency, I'm on email overload.

Open MPI uses PMI to communicate both inside and outside of the container. Ralph Castain (on this list, but possibly not monitoring actively) is leading the PMI effort and he is an active Open MPI developer. We have had several talks about how to achieve "hetero-versionistic" compatibility through the PMI handshake. I was under the impression that PMI now supports that, as long as you are running an equal or newer version on the host (outside the container). Also, I don't know in what version of PMI this feature was introduced, nor do I know what version of Open MPI includes that compatibility.

I have CC'ed Ralph, and hopefully he will be able to offer some suggestions.

Regarding your question about supporting the MPI libraries in the same manner as we handle the Nvidia libraries: that would be hard. Nvidia specifically builds their libraries to be as generally compatible as possible (e.g. the same libraries/binaries work on a large array of Linux distributions). Most people do not build host libraries in a manner that is as generally compatible as Nvidia's.

Hope that helps!

Greg



--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity+unsubscribe@lbl.gov.



--
Gregory M. Kurtzer
CEO, SingularityWare, LLC.
Senior Architect, RStor
Computational Science Advisor, Lawrence Berkeley National Laboratory

Gregory M. Kurtzer

Jul 9, 2017, 5:45:32 PM7/9/17
to singu...@lbl.gov, Ralph Castain
Hiya Victor, et al., 

I didn't realize this, but Ralph had to drop off of the Singularity list. Hopefully we will get him back again, as he is a fantastic resource for all OMPI questions and always a great source of information and ideas (poke, poke, Ralph!). Ralph did send me this in response to the previous email, hoping it helps to explain things:


On Sun, Jul 9, 2017 at 2:22 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
...
You are welcome to forward the following to the list:

As Greg said, we have been concerned about this since we started looking at Singularity support. Just for clarity, the version of PMI OMPI uses is PMIx (https://pmix.github.io/pmix/). While our plan from the beginning was to support cross-versions specifically to address this problem, we fell behind on its implementation due to priorities. We just committed the code to the PMIx repo in the last week, and it won’t be released into production for a few months while we shake it down.

I fear it will be impossible to get the OMPI 1.10 series to work with anything other than itself as it pre-dates PMIx.

The OMPI 2.0 and 2.1 series should work across each other as they both include PMIx 1.x. However, you probably will need to configure the 2.1 series with --disable-pmix-dstore as there was an unintended compatibility break there (the shared memory store was added during the PMIx 1.x series and we didn’t catch the compatibility break it introduced).

Looking into the future, OMPI 3.0 is about to be released. It includes PMIx 2.0, which isn’t backwards compatible at this time, and so it won’t cross-version with OMPI 2.x “out-of-the-box”. We haven’t tested this, but one thing you could try is to build all three OMPI versions against the same PMIx external library (you would probably have to experiment a bit with PMIx versions to see which works across the different OMPI versions as the glue between the two also changed a bit). This will ensure that the shared memory store in PMIx is compatible across the versions, and things should work since OMPI doesn’t care how the data is moved across the host-container boundary.
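To make the suggestion above concrete, a hedged sketch of what configuring each OMPI series against one shared external PMIx might look like (the install prefixes are placeholders, and exact flags vary by OMPI version):

```shell
# Build each OMPI series (2.0, 2.1, 3.0) against the SAME external PMIx
# installation, so the PMIx shared memory store is compatible across the
# host-container boundary. /opt/pmix and /opt/libevent are placeholders.
./configure --prefix=/usr \
    --with-pmix=/opt/pmix \
    --with-libevent=/opt/libevent \
    --enable-mpirun-prefix-by-default
make -j"$(nproc)" all install
```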

As I said, we will be adding cross-version support to the PMIx release series soon, without changing the API, that will ensure support across all PMIx versions starting with v1.2. Thus, you could (once that happens) build OMPI 2.0, 2.1, and 3.0 against the new PMIx release (probably PMIx v2.1.0) and the resulting containers would be future-proof as OMPI moves ahead. The RMs plan to follow that path as well, so you should be in good shape once this is done if you prefer to “direct launch” your containers (e.g., “srun ./mycontainer” under SLURM).

Sorry if that is all confusing - we sometimes get lost in the numbering schemes between OMPI and PMIx ourselves. Feel free to contact me directly, or on the OMPI or PMIx mailing lists, if you have more questions or encounter problems. We definitely want to make this work.

Ralph




--
Gregory M. Kurtzer
CEO, SingularityWare, LLC.
Senior Architect, RStor
Computational Science Advisor, Lawrence Berkeley National Laboratory

victor sv

Jul 11, 2017, 9:03:22 AM7/11/17
to singu...@lbl.gov, Ralph Castain
Hi Greg and Ralph,

Thank you for your precise and detailed answers.

Just to confirm and sum up some conclusions (if I understood correctly):

 - OpenMPI process-management compatibility depends on PMIx.
 - Complete OpenMPI (and also Slurm) backward/forward compatibility will (hopefully) come in the future by means of PMIx 2.1.
 - Nowadays, compatibility exists within OpenMPI 2.X if we compile it with default PMIx (1.X) support.
 - OpenMPI 2.1 must be compiled with --disable-pmix-dstore due to a compatibility break.
 - OpenMPI 1.X does not support PMIx, so we can ignore it in this thread.

Am I right?

I'm interested in performing the tests you propose. I will try to build all three OMPI versions (2.0, 2.1 and 3.0) against the same external PMIx library to check compatibility. Which PMIx version (1.2.0, 1.2.1 or 1.2.2) do you recommend as a starting point?

I will report these results to this thread ASAP.

On the other hand, although we are planning to add support for PMIx, unfortunately our Slurm version (14.11.10-Bull.1.0) does not support it yet.

The second strategy we are testing to get compatibility between OpenMPI inside and outside a Singularity container relies on replacing the OpenMPI libraries inside the container with the host's library hierarchy.

This approach rests upon the assumption that OpenMPI symbols and data structures are compatible across several versions of OpenMPI, at least when combining releases that share the same major version.

Although empirical tests of this approach seem to work properly with some tests, benchmarks and real apps, I'm afraid of getting unexpected errors/warnings (segfaults, data errors, etc.) in the future.
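For concreteness, a hypothetical sketch of this library-replacement strategy (the paths and image name are placeholders, not our actual setup):

```shell
# Bind the host's OpenMPI installation into the container and point the
# dynamic loader at it, shadowing the container's own OpenMPI libraries.
# This only "works" while host and container libraries stay ABI-compatible.
singularity exec \
    --bind /opt/openmpi-host:/opt/openmpi-host \
    openmpi.img \
    env LD_LIBRARY_PATH=/opt/openmpi-host/lib /usr/bin/ring
```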

What do you think about this approach?

Can you confirm that OpenMPI is compatible in this way?

Finally, I think this thread could be very interesting for other users too, and I would like to keep it alive with your help.

Thank you again for your support!

BR,
Víctor


Gregory M. Kurtzer

Jul 11, 2017, 10:56:01 AM7/11/17
to singu...@lbl.gov, Ralph Castain
Hi Victor,

I will let Ralph comment on the OMPI versions and compatibilities, but using the host MPI libraries within a container is dangerous for the reason you mention. If you are running containers that are ABI-compatible with the host, then things *might* work as expected. But this breaks container portability, and goes against the principles of containment.

We do, however, do exactly this for the Nvidia driver libraries, but... Nvidia builds these libraries with careful attention to ABI compatibility, such that these binary libraries are indeed reasonably portable across containers.

The only way to do this portably is by using a launcher on the host, outside the container, to spin up the container and launch the MPI within. PMIx is a fantastic approach to solving this.

Hope that helps!

Greg





victor sv

Jul 12, 2017, 7:50:25 AM7/12/17
to singu...@lbl.gov, Ralph Castain
Hi Greg and Ralph,

Yes, Greg, I agree with you that the mentioned strategy could be dangerous and goes against the principles of containment.

Sorry for the basic question... but what do you mean by ABI-compatible containers? Which components of the container environment are involved in this ABI compatibility?

If we talk about libc or the kernel itself, as you say in your web page, "If you require kernel dependent features, a container platform is probably not the right solution for you."

If we focus on OpenMPI ABI compatibility, I figure that the variables involved could be (1) the compiler (vendor) and (2) the OpenMPI library itself.

Am I right, or am I missing other variables?

An interesting project called ABI Tracker has performed an OpenMPI ABI compatibility study, which you can see at the following link:

https://abi-laboratory.pro/tracker/timeline/openmpi/

I think that, at least for OpenMPI 2.X, although it is a dangerous approach, the ABI compatibility seems reasonable.

What do you think?

BR,
Víctor.

Gregory M. Kurtzer

Jul 13, 2017, 7:34:57 PM7/13/17
to singu...@lbl.gov, Ralph Castain
Hi Victor,

The area of ABI compatibility I am referring to is the container's underlying library stack. Meaning that if you link in the libraries compiled on the host, and the container you want to run is newer than what is installed on the host (or potentially vice versa), you may end up with a conflict between the binary and the library.

This is what Nvidia has mitigated by building their libraries on a very recent toolchain; thus the libraries are backwards compatible with older binaries.

Does that make sense?

Greg




victor sv

Aug 25, 2017, 4:46:22 AM8/25/17
to singu...@lbl.gov, us...@lists.open-mpi.org, Ralph Castain
Dear Singularity & OpenMPI teams, Greg and Ralph,

going back to Ralph Castain's response to this thread:

https://groups.google.com/a/lbl.gov/forum/#!topic/singularity/lQ6sWCWhIWY

In order to get portability of Singularity images containing distributed OpenMPI applications, he suggested mixing several OpenMPI versions with several external PMIx versions, to check interoperability across versions while using the Singularity MPI hybrid approach (see his response in the thread).

I did some experiments, and I would like to share my results with you and discuss the conclusions.

First of all, I'm going to describe the environment (some scripts are attached).
  • I performed this test on the CESGA FinisTerrae II cluster (https://www.cesga.es/en/infraestructuras/computacion/FinisTerrae2).
  • The compiler used was GCC 6.3.0, and I had to compile some external dependencies to be linked from PMIx or OpenMPI:
    • hwloc/1.11.5
    • libevent/2.0.22
  • PMIx versions used in these experiments:
    • 1.2.1
    • 1.2.2
    • 2.0.0
  • I configured PMIx with the following options:
    • ./configure --with-hwloc= --with-munge-libdir= --with-platform=optimized --with-libevent=
  • OpenMPI versions used in these experiments:
    • 2.0.X
    • 2.1.1
    • 3.0.0_rcX
  • I configured OpenMPI with the following options:
    • ./configure --with-hwloc= --enable-shared --with-slurm --enable-mpi-thread-multiple --with-verbs-libdir= --enable-mpirun-prefix-by-default --disable-dlopen --with-pmix= --with-libevent= --with-knem
    • Version 2.1.1 was also compiled with the flag --disable-pmix-dstore
  • I used the well-known "ring" OpenMPI application.
  • I used mpirun as the process manager.

What I expected from Ralph's previous response was full cross-version compatibility using any OpenMPI >= 2.0.0 linked against PMIx 1.2.X, both inside the container and on the host.

In general, my results were not as good as expected, but promising.
  • The worst thing: my results show that OpenMPI 2.X versions need exactly the same version of OpenMPI inside & outside the container, but I can mix PMIx 1.2.1 and 1.2.2.
  • The best thing: if OpenMPI 3.0.0_rc3 is present inside or outside the container, it seems to work when mixing any other OpenMPI >= 2.X version, and also when mixing PMIx 1.2.1 and 1.2.2. Some notes* on this result:
    • OpenMPI 2.0.0 with PMIx 1.2.2 (inside & outside the container) never worked.
    • After getting the expected output from the "ring" app, I randomly get a SEGFAULT if OpenMPI 3.0.0_rcX is involved.
  • As Ralph said, PMIx 1.2.X and 2.0.X are not interoperable.
  • I was not able to compile OpenMPI 2.1.0 with an external PMIx.

I can conclude that PMIx 1.2.1 and 1.2.2 are interoperable, but only OpenMPI 3.0.0_rc3 can work*, in general, with other versions of OpenMPI (> 2).

Going back again to Ralph Castain's mail to this thread, I would expect full support for interoperability between different PMIx versions (> 1.2) through PMIx > 2.1 (not yet released).

Some questions about these experiments and conclusions:

  • What do you think about these results? Do you have any suggestions? Am I missing something?
  • Are these results aligned with your expectations?
  • I know that PMIx 2.1 is being developed, but is any version already available to check? How can I get it?
  • Is the SEGFAULT I get with OpenMPI 3.0.0_rcX something already tracked?

Hope this is helpful!

BR,

Víctor





container_bootstrap.def
host_install.sh

Christophe Trophime

Mar 13, 2019, 4:23:35 PM3/13/19
to singularity
Hi,
I would like to know if there is any news on this subject.
Is "having the same OpenMPI version inside and outside of the container" still a requirement?

Best
C

Gregory M. Kurtzer

Mar 14, 2019, 10:38:06 PM3/14/19
to singularity
There was a talk on this earlier this week at the Singularity User's Group in San Diego. Ralph Castain spoke about PMIx, which is what you should check out. Videos of the presentations, including Ralph's, will be available soon, so stay tuned for them!

Thanks!



--
Gregory M. Kurtzer
CEO, Sylabs Inc.

victor sv

Mar 15, 2019, 3:57:16 AM3/15/19
to singu...@lbl.gov
Hi Christophe,

I don't know if there is anything brand new in this regard from the last few months.

What I remember is that cross-version compatibility relies on OpenMPI and also PMIx. The following link illustrates the PMIx compatibility matrix:

It is important to remark on the OpenMPI/PMIx support:
  - OpenMPI <= 1.X:          supports PMI, but not PMIx => container and host OpenMPI/PMI versions must exactly match
  - 2.X <= OpenMPI < 3.X: supports PMIx 1.X
  - 3.X <= OpenMPI < 4.X: supports PMIx 1.X and 2.X
  - OpenMPI >= 4.X:          also supports PMIx 3.X

The general rule is: if the host OpenMPI is linked with one of the latest PMIx versions, and your container supports PMIx (see the PMIx compatibility matrix for more details), it will be compatible.

This thread contains more info: https://github.com/pmix/pmix/issues/556
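As a quick sanity check of the rule above (assuming `ompi_info` is on the PATH in both environments; the image name is a placeholder), one can compare what each OpenMPI build reports about its PMIx support:

```shell
# Compare the PMIx-related components of the host and container OpenMPI
# builds; mismatched major versions hint at wire-up problems.
ompi_info | grep -i pmix
singularity exec myimage.sif ompi_info | grep -i pmix
```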

Am I right, Greg?

I really want to see the video presentations!

Best,
Víctor

Christophe Trophime

Mar 15, 2019, 6:27:11 AM3/15/19
to singu...@lbl.gov
Hi,
thanks for the info. I'm also eager to see the video presentations.

Still I have a question.
I've played around with Debian/Buster: Singularity 3.0.3 with OpenMPI 3.1.3 and PMIx 3.1.2 (from system packages).
As expected, a Singularity container with Xenial (OpenMPI 1.10.2) is not working.
But neither is a Singularity container with Cosmic (OpenMPI 2.1.1, with or without PMIx 2.1).

Following Victor's comments and the reference doc on PMIx, I thought that it should have worked for the Cosmic image with PMIx 2.1...
Am I right?

Next, for the Xenial and Cosmic images I can still run mpirun within the containers.
Is it safe to do that or not?
Note that I'm planning to use an SMP machine with Debian/Stretch in that case.

Thanks for your explanations and/or comments
Best
C.

victor sv

Mar 15, 2019, 6:48:10 AM3/15/19
to singu...@lbl.gov
Hi Chistophe,

first, (I'm not sure, but) I think PMIx 3 is only supported by OpenMPI 4. I did not check the combination you are referring to, so I cannot give you more details. Please check the exact versions (including the patch version) of OpenMPI/PMIx against the PMIx compatibility matrix.

On the other hand, the hybrid approach is for distributed multi-node jobs. If you are going to run single-node jobs, you can safely run `mpirun` from inside the container.
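For instance, a minimal single-node sketch (the image name and rank count are placeholders):

```shell
# Single-node case: mpirun runs entirely inside the container, so no
# host/container wire-up (and no version matching) is required.
singularity exec openmpi.sif mpirun -np 4 /usr/bin/ring
```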

Hope it helps!
Víctor

Shenglong Wang

Mar 15, 2019, 9:58:11 AM3/15/19
to singu...@lbl.gov, Shenglong Wang
For InfiniBand, are there requirements for OFED versions?

Best,
Shenglong

victor sv

Mar 15, 2019, 11:10:07 AM3/15/19
to singu...@lbl.gov, Shenglong Wang
Hi Shenglong,

in my (limited) experience, any InfiniBand driver inside the container works.

Hope it helps,
Víctor

Priedhorsky, Reid

Mar 15, 2019, 1:27:31 PM3/15/19
to singu...@lbl.gov

> Does "Having same OpenMPI version inside and outside of the container" is still a requirement?

Have you tried launching with the host workload manager, e.g. with “srun”?

With a properly built OpenMPI, under Charliecloud, this works fine and completely removes the need for a compatible OpenMPI on the host, or even any at all. Charliecloud does nothing special for this, so I assume it should work in Singularity too.

Charliecloud source code contains our example OpenMPI build; again, I expect this to transfer over to Singularity without much trouble:
https://github.com/hpc/charliecloud/blob/master/test/Dockerfile.debian9
https://github.com/hpc/charliecloud/blob/master/test/Dockerfile.openmpi
https://github.com/hpc/charliecloud/tree/master/examples/mpi/mpihello

Re. performance, we are working on some comprehensive performance evaluations, and so far on our OPA clusters, Charliecloud, Singularity, Shifter, and bare metal all have essentially the same performance.

HTH,
Reid

Gregory M. Kurtzer

Mar 15, 2019, 2:46:14 PM3/15/19
to singularity
Thanks for the info Reid! Is your resource manager using PMIx to do the launching through `srun`? If so, we do indeed support this too. 

Also, we just hired another developer specifically slated to work further on MPI compatibility and general parallel container execution wire-up, for both compatibility and massive scalability. This will of course all be open source, so I assume other projects can also leverage our investment in this area.

Greg


v

Mar 15, 2019, 2:55:09 PM3/15/19
to singu...@lbl.gov
If it helps: I recently went through needing to compile MPI with the --with-pmix flag for another tool (PETSc); here is some discussion around that.

Vanessa Villamia Sochat
Stanford University '16

victor sv

Mar 18, 2019, 4:47:27 AM3/18/19
to singu...@lbl.gov
Hi Reid,

thanks for your insights. Nice way of building an OpenMPI container.

Anyway, I cannot understand why you don't need OpenMPI/PMI[x] compatibility between the host and the container... Which is the MPI execution model when using Charliecloud? Is it hybrid, as in Singularity?

If you run with a resource manager, you are going to need (at least) the process-management layer on the host side (e.g. PMI[x]).

Can you give more details?

Best regards,
Víctor

Priedhorsky, Reid

Mar 18, 2019, 11:42:30 AM3/18/19
to singu...@lbl.gov

On Mar 18, 2019, at 2:47 AM, victor sv <vict...@gmail.com> wrote:

Anyway, I cannot understand why you don't need OpenMPI/PMI[x] compatibility between the host and the container ... which is the MPI execution model while using CharliCloud? is hybrid as in Singularity?

If you run with a resources manager you are going to need (at least) the process manager layer on the host side (e.g. PMI[x])

You do need something on the host side. At Los Alamos, that’s PMI2 built into Slurm, so that’s what we test well, though the examples also build with PMIx.

What you don’t need is a compatible OpenMPI on the host. That’s where the version compatibility charts come in, and that’s not even the whole story; build options and possibly other things can affect compatibility too. PMI2/PMIx is a much looser coupling and can (in principle at least) be provided by lots of different things.
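A hedged sketch of what such a build might look like (the PMI paths are placeholders for wherever Slurm installs its PMI2 headers and libraries; the Charliecloud Dockerfiles show a complete, tested recipe):

```shell
# Build the container's OpenMPI against Slurm's PMI2 rather than against a
# host MPI; at run time, `srun --mpi=pmi2` on the host does the wire-up.
./configure --prefix=/usr \
    --with-pmi=/usr \
    --with-pmi-libdir=/usr/lib64
make -j"$(nproc)" install
```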

Again, I’d expect none of this is specific to any given runtime.

HTH,
Reid

victor sv

Mar 18, 2019, 12:45:00 PM3/18/19
to singu...@lbl.gov
Ok, Thanks for the quick explanation Reid,

I always talk about OpenMPI/PMIx compatibility because in my tests I used (OpenMPI's) `mpirun` as the launcher, but as you say in your response, only the process-management layer is involved in the compatibility issue.

Best regards,
Víctor.
