Singularity and MPI implementations


Taras Shapovalov

May 13, 2016, 11:12:49 AM
to singularity
Hey guys,

I've heard many times that Singularity has nice support in Open MPI 2.0, but could someone describe how exactly this integration affects the execution of an MPI application? Older Open MPI and MPICH work in SAPP as well, so I don't really get what Open MPI 2.0 brings us.

Moreover, I see MPI support in Singularity positioned as one of the features that is implemented better than in Shifter (correct?). But Shifter also allows running MPI apps; at least I see Cray runs MPICH in Shifter's chroot (not sure about Open MPI though). Could you please explain the difference (if any) between running, say, MPICH with Singularity vs. running it in Shifter (from an HPC perspective, of course)?

Best regards,

Taras

Ralph Castain

May 13, 2016, 11:56:43 AM
to singu...@lbl.gov
There are some significant differences between Shifter and Singularity, but I’ll let Greg address those.

The OMPI support in 2.0 isn't complete - we will be updating it in 2.0.1. What it does is automatically detect that this is a Singularity container, set up the required paths to Singularity on the remote nodes, expand the container only once on each node, and then execute the specified number of copies. This makes it a little easier to use (you don't have to worry about paths), and somewhat faster to launch the job. You won't notice the launch performance difference at small scale, but you will at larger scales.

 A container, of course, is just a static envelope surrounding your dynamic application, and so you can avoid the IO node bottleneck that we’ve seen on large clusters when apps call dl_open. Back in the RoadRunner days, we quantified and published the results of those studies - bottom line was that you had to avoid dl_open at scale. Without a container, your only choice was to build your apps static, which meant you took a memory footprint hit.
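For context, the dl_open calls discussed here are the standard dlopen(3) mechanism for loading shared libraries at runtime. A minimal sketch of such a call, using Python's ctypes (which wraps dlopen under the hood); the library name `libm.so.6` is an assumption for a glibc Linux system:

```python
import ctypes

# ctypes.CDLL calls dlopen(3) under the hood. Inside a container, this
# path lookup is satisfied by the local image rather than by thousands
# of ranks hammering a shared filesystem's metadata server.
libm = ctypes.CDLL("libm.so.6")  # glibc math library (Linux assumption)

# Declare the signature of cos(3) and call it through the loaded handle.
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # cos(0) = 1.0
```

At scale, every rank performing lookups like this against shared storage is exactly the bottleneck Ralph describes; the container turns those into local reads.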

With a container and the OMPI support, you get the best of both worlds. Your container “envelope” comes down as a single blob (one per node), thus alleviating the bottleneck. We then expand it out, spawn the local procs - and they see shared libraries! Keeps the footprint down while improving the launch speed.

Singularity also automatically detects that this is an OMPI application you are putting in the container, and ensures that all the required OMPI libraries are included. Thus, the result is a complete "package" that can run on a remote node that doesn't even have OMPI installed. So you could, for example, use SLURM or some other resource manager to directly launch the container onto nodes without OMPI installed.
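The dependency detection described above is conceptually what `ldd` reports for a binary. A hedged sketch of collecting a program's shared-library dependencies (an illustration of the idea only, not Singularity's actual packaging code; assumes a Linux system with `ldd` available):

```python
import subprocess

def shared_library_deps(binary):
    """Return the shared libraries a dynamically linked binary needs,
    as reported by ldd. Illustrative only: Singularity's real packaging
    logic is more involved than parsing ldd output."""
    out = subprocess.run(["ldd", binary], capture_output=True, text=True)
    deps = []
    for line in out.stdout.splitlines():
        parts = line.strip().split()
        if not parts:
            continue
        # Typical ldd line: "libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)"
        if parts[0].startswith("lib"):
            deps.append(parts[0])
    return deps

print(shared_library_deps("/bin/sh"))
```

Bundling everything such a scan finds (plus the transitive closure) is what makes the container runnable on a node with no MPI install of its own.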

In either case, we preserve the ability to use shared memory transports, so the application performance is unaffected.

HTH
Ralph

--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity...@lbl.gov.

Gregory M. Kurtzer

May 13, 2016, 12:13:39 PM
to singularity
Hi Taras!


On Fri, May 13, 2016 at 8:12 AM, Taras Shapovalov <shapov...@gmail.com> wrote:
Hey guys,

I've heard many times that Singularity has nice support in Open MPI 2.0, but could someone describe how exactly this integration affects the execution of an MPI application? Older Open MPI and MPICH work in SAPP as well, so I don't really get what Open MPI 2.0 brings us.

I believe the answer to your question depends on the definition of "support". I will explain... There are generally two ways to implement MPI from within a container:

1. The entire MPI stack lives completely within the container. This is the model I've heard about most for integrating MPI and containers, but it has some serious drawbacks. For instance, it implies that the MPI is built with the necessary features, dependencies, and tunings for where the container is *going* to run (not where it was created). This significantly impacts the portability and potential performance of the container. Additionally, this approach implies that one container knows how to reach another container, both in terms of address resolution and port (e.g. the appropriate daemons must be listening within the container ... sshd?).

2. The MPI is split partially between the host and the container. This is the preferable approach in that the MPI within the container does not have to be built specifically for a target host or resource. It also alleviates much of the networking complexity and fits a much more typical workflow, and thus scheduler integration. But it isn't the easiest to do properly. This is where the Open MPI 2.x integration comes in (though the support is not limited to Open MPI; for example, while I haven't tested them personally, I'm aware of MPICH derivatives also working properly with Singularity in this model).
 

Moreover, I see MPI support in Singularity positioned as one of the features that is implemented better than in Shifter (correct?). But Shifter also allows running MPI apps; at least I see Cray runs MPICH in Shifter's chroot (not sure about Open MPI though). Could you please explain the difference (if any) between running, say, MPICH with Singularity vs. running it in Shifter (from an HPC perspective, of course)?

I will not speak definitively about Shifter on this topic, but I do believe you are correct in that it runs the Cray MPI inside the container. I also believe Shifter has built-in mechanisms to mitigate some of the issues I mentioned above, and it works for the use-case Shifter was developed for. Singularity's use-case is somewhat different. Unlike Shifter, the images are not built on a particular HPC resource for that HPC resource. In Singularity, the images are considered the vector of portability, so they are of a different format. Additionally, Shifter is designed around integration with the scheduler, while Singularity is focused on standard command line usage. Interestingly enough, one aspect that makes Singularity so easy to adopt with MPI is that it follows standard command line principles (e.g. "mpirun -np X singularity exec ~/Centos-5.img mpi_application_inside_container" [1] works exactly as you would expect, as long as the image file is reachable from all nodes).

Hope that helps!


[1]: This example is with Singularity v2.x which is still in development and due to release in the coming weeks.

--
Gregory M. Kurtzer
High Performance Computing Services (HPCS)
University of California
Lawrence Berkeley National Laboratory
One Cyclotron Road, Berkeley, CA 94720

Taras Shapovalov

May 13, 2016, 12:53:38 PM
to singu...@lbl.gov
Hi Ralph and Gregory,

Thank you both for such detailed answers! I see your replies complement each other. I am still a bit confused about the whole picture, though, so could you confirm that I have understood these ideas correctly:

1. All MPI implementations should work with Singularity containers by default (maybe not as optimally as they could, but they should always start and finish correctly). Actually, I recently tested MPICH+Singularity with several workload managers and it worked fine (I did not benchmark it against Open MPI). I did not manage to make Singularity+MPI work in LSF, but that is a different story that deserves a separate thread.

2. An MPI process calls dl_open, so the more MPI processes start on a node, the more times dl_open is called. Open MPI 2.0.1 somehow solves this magically (I don't get how) and dl_open is called only once per node. Other MPI implementations and older Open MPI are not Singularity-aware, so they will still call dl_open each time an MPI process spawns.

3. The dl_open issue affects only process start time and does not affect process execution, so at small scale with long-running processes there is no difference between Open MPI 2.0.1 and older Open MPI versions (or other MPI implementations).

4. When a sapp is built, Singularity detects Open MPI (even older than 2.0.1, right?) and resolves all dependencies automatically, adding all files to the sapp. But with, say, MVAPICH2 the dependencies are not resolved automatically, so the user has to add some things manually.

5. Apart from solving the dl_open issue, Open MPI 2.0.1 does some splitting between the host and the container, which allows the user/admin to not optimize Open MPI for a target platform. I really don't get how Singularity does this, but I get the problem. Could you explain specifically what Singularity or Open MPI 2.0.1 does for that?

Best regards,

Taras

Ralph Castain

May 13, 2016, 1:10:46 PM
to singu...@lbl.gov
On May 13, 2016, at 9:52 AM, Taras Shapovalov <shapov...@gmail.com> wrote:

Hi Ralph and Gregory,

Thank you both for such detailed answers! I see your replies complement each other. I am still a bit confused about the whole picture, though, so could you confirm that I have understood these ideas correctly:

1. All MPI implementations should work with Singularity containers by default (maybe not as optimally as they could, but they should always start and finish correctly). Actually, I recently tested MPICH+Singularity with several workload managers and it worked fine (I did not benchmark it against Open MPI). I did not manage to make Singularity+MPI work in LSF, but that is a different story that deserves a separate thread.

Correct - the LSF issue is likely a problem of getting the required setup info passed by LSF


2. An MPI process calls dl_open, so the more MPI processes start on a node, the more times dl_open is called. Open MPI 2.0.1 somehow solves this magically (I don't get how) and dl_open is called only once per node. Other MPI implementations and older Open MPI are not Singularity-aware, so they will still call dl_open each time an MPI process spawns.

Not exactly. Singularity will solve the dl_open problem by itself. What the container does is wrap all the dl_open libraries into the container, and so all dl_open calls by the app are locally resolved. Thus, you automatically resolve the IO node bottleneck scaling issue.

What OMPI adds is that it pulls the container only once/node. Other mpiexec implementations will pull the container again for every local process. So if you have 100 procs/node, OMPI will result in 100x fewer “pulls” thru that IO node.
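To put rough numbers on that difference (a hypothetical sketch, not measured data): with a once-per-node pull, the number of container transfers through the IO node scales with nodes, not with total ranks:

```python
def container_pulls(nodes, procs_per_node, once_per_node=True):
    """Count container transfers through the IO node.

    once_per_node=True models the OMPI 2.x behavior described above;
    False models an mpiexec that pulls the image for every local process.
    """
    return nodes if once_per_node else nodes * procs_per_node

# 100 nodes x 100 procs/node: 100 pulls vs 10,000 pulls (100x fewer).
print(container_pulls(100, 100, once_per_node=True))   # 100
print(container_pulls(100, 100, once_per_node=False))  # 10000
```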


3. The dl_open issue affects only process start time and does not affect process execution, so at small scale with long-running processes there is no difference between Open MPI 2.0.1 and older Open MPI versions (or other MPI implementations).

Correct


4. When a sapp is built, Singularity detects Open MPI (even older than 2.0.1, right?) and resolves all dependencies automatically, adding all files to the sapp. But with, say, MVAPICH2 the dependencies are not resolved automatically, so the user has to add some things manually.

Correct


5. Apart from solving the dl_open issue, Open MPI 2.0.1 does some splitting between the host and the container, which allows the user/admin to not optimize Open MPI for a target platform. I really don't get how Singularity does this, but I get the problem. Could you explain specifically what Singularity or Open MPI 2.0.1 does for that?

When running under mpiexec with Singularity, OMPI's local daemon on each node actually runs outside of the containers. We then fork/exec the container itself, and the container is defined so it auto-executes the application process. This allows us to minimize the services overhead, keeping all services outside of your container (and thus shared across all containers).

Other approaches have the daemon -inside- the container, and you get one daemon for each container - and thus, one daemon for each local application. So you get a higher overhead and therefore lower performance.

HTH
Ralph

Gregory M. Kurtzer

May 13, 2016, 2:18:54 PM
to singularity
On Fri, May 13, 2016 at 10:10 AM, Ralph Castain <r...@open-mpi.org> wrote:

On May 13, 2016, at 9:52 AM, Taras Shapovalov <shapov...@gmail.com> wrote:

Hi Ralph and Gregory,

Thank you both for such detailed answers! I see your replies complement each other. I am still a bit confused about the whole picture, though, so could you confirm that I have understood these ideas correctly:

1. All MPI implementations should work with Singularity containers by default (maybe not as optimally as they could, but they should always start and finish correctly). Actually, I recently tested MPICH+Singularity with several workload managers and it worked fine (I did not benchmark it against Open MPI). I did not manage to make Singularity+MPI work in LSF, but that is a different story that deserves a separate thread.

I wouldn't necessarily say they would all work by default. For example, some namespaces may need to be disabled in order to get proper shared memory IO performance. But if you have tested this, that is great news and I'd love to hear more about your findings!
 

Correct - the LSF issue is likely a problem of getting the required setup info passed by LSF


2. An MPI process calls dl_open, so the more MPI processes start on a node, the more times dl_open is called. Open MPI 2.0.1 somehow solves this magically (I don't get how) and dl_open is called only once per node. Other MPI implementations and older Open MPI are not Singularity-aware, so they will still call dl_open each time an MPI process spawns.

Not exactly. Singularity will solve the dl_open problem by itself. What the container does is wrap all the dl_open libraries into the container, and so all dl_open calls by the app are locally resolved. Thus, you automatically resolve the IO node bottleneck scaling issue.

What OMPI adds is that it pulls the container only once/node. Other mpiexec implementations will pull the container again for every local process. So if you have 100 procs/node, OMPI will result in 100x fewer “pulls” thru that IO node.

Yep, and additionally I want to make sure we keep Singularity v1 and v2 features separate. Version 2 has several huge benefits (including this one) over v1, but it is a departure from using SAPPs (it now uses images).
 


3. The dl_open issue affects only process start time and does not affect process execution, so at small scale with long-running processes there is no difference between Open MPI 2.0.1 and older Open MPI versions (or other MPI implementations).

Correct

Correct, just keep in mind that start times at massive scale have been reported by several large centers to approach 30 minutes. During those 30 minutes, it basically looks like a distributed denial-of-service attack on the file system metadata server, killing file system performance for the rest of the system.
 


4. When a sapp is built, Singularity detects Open MPI (even older than 2.0.1, right?) and resolves all dependencies automatically, adding all files to the sapp. But with, say, MVAPICH2 the dependencies are not resolved automatically, so the user has to add some things manually.

Correct

And in v2, this will be handled either by an RPM installation of Open MPI, or by 'singularity exec --writable /path/to/Container.img make install'.


5. Apart from solving the dl_open issue, Open MPI 2.0.1 does some splitting between the host and the container, which allows the user/admin to not optimize Open MPI for a target platform. I really don't get how Singularity does this, but I get the problem. Could you explain specifically what Singularity or Open MPI 2.0.1 does for that?

When running under mpiexec with Singularity, OMPI's local daemon on each node actually runs outside of the containers. We then fork/exec the container itself, and the container is defined so it auto-executes the application process. This allows us to minimize the services overhead, keeping all services outside of your container (and thus shared across all containers).

Other approaches have the daemon -inside- the container, and you get one daemon for each container - and thus, one daemon for each local application. So you get a higher overhead and therefore lower performance.

Maybe this will help articulate it... I have described the MPI/Singularity invocation pathway as follows (hopefully it is reasonably correct and doesn't cause Ralph to kick me). Consider the command:

$ mpirun -np X singularity exec ~/Centos-7.img mpi_program

1. From the shell (or resource manager), mpirun gets called
2. mpirun forks and execs the orted daemon
3. The orted process creates the PMI
4. Orted forks as many children as the number of processes per node requested
5. The orted children exec the original command passed to mpirun (Singularity)
6. Each Singularity execs the command passed to it inside the given container
7. Each MPI program links in the dynamic Open MPI libraries (ldd)
8. The Open MPI libraries continue to open the non-ldd shared libraries (dlopen)
9. The Open MPI libraries connect back to the original orted via PMI
10. All non-shared-memory communication occurs through the PMI and then to local interfaces (e.g. InfiniBand)
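Steps 2 through 6 above are the classic Unix fork/exec pattern. A generic sketch of that launch shape (with `/bin/true` standing in for the real `singularity exec ...` command line, purely for illustration):

```python
import os

def launch_local_procs(argv, nprocs):
    """Fork nprocs children and have each exec the given command,
    roughly as orted does for its local ranks (steps 4-5 above)."""
    pids = []
    for rank in range(nprocs):
        pid = os.fork()
        if pid == 0:
            # Child: replace this process image with the command, just
            # as orted's children exec the singularity command line.
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # only reached if exec itself fails
        pids.append(pid)
    # Parent: wait for all children, collecting their exit statuses.
    return [os.waitpid(p, 0)[1] for p in pids]

# Placeholder command; in the real flow this would be something like
# ["singularity", "exec", "~/Centos-7.img", "mpi_program"].
statuses = launch_local_procs(["/bin/true"], 4)
print(all(os.WIFEXITED(s) and os.WEXITSTATUS(s) == 0 for s in statuses))
```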

While the workflow from the MPI perspective seems simpler with the daemon inside the container, it is much more complicated from the system perspective, because each orted process must also know about the other hosts, be able to communicate with them, and mitigate the performance factors of host/resource/interconnect tuning.

Hope that helps!


 

HTH
Ralph



Best regards,

Taras



Taras Shapovalov

May 16, 2016, 4:17:49 AM
to singu...@lbl.gov
Hi guys,

Thanks for the great answers! Now it is more or less clear how it works. Just to be absolutely sure, can you please also confirm these statements (gathered from your answers):

1. Ralph's answer mentions mpiexec, but Gregory's answer is about mpirun. So everything discussed here applies to both utilities included in the Open MPI distribution.

2. Running Open MPI processes in a single container is implemented only in Singularity v2. In v1, each Open MPI process will still be executed in a different container.

3. Let's compare two scenarios: Singularity runs its child processes in a single container, versus each child running in a separate container. The dlopen optimization happens in the first scenario because the opened library is loaded into memory once per Singularity container, and dlopen magically returns the same handle for each child process inside the container, which should be faster. Or is there some other low-level optimization in the first scenario regarding dlopen?

Best regards,

Taras

/T

Gregory M. Kurtzer

May 16, 2016, 8:54:10 AM
to singularity
On Mon, May 16, 2016 at 1:17 AM, Taras Shapovalov <shapov...@gmail.com> wrote:
Hi guys,

Thanks for the great answers! Now it is more or less clear how it works. Just to be absolutely sure, can you please also confirm these statements (gathered from your answers):

1. Ralph's answer mentions mpiexec, but Gregory's answer is about mpirun. So everything discussed here applies to both utilities included in the Open MPI distribution.

Ralph can speak definitively here, but I believe my answer applies to both.
 

2. Running Open MPI processes in a single container is implemented only in Singularity v2. In v1, each Open MPI process will still be executed in a different container.

For technical Q&A we should probably use the word namespaces in addition to containers; I'll explain.

Singularity v1 will cache the container on each node, so processes within a node will share the container cache but operate in some different namespaces (the specific namespaces are somewhat application/necessity dependent).

Singularity v2 has no need to cache the container, but it does need to bind it to a loop device. This happens once per node, but again there is no cache so all nodes are sharing the same container image and also operate in some separate namespaces (again dependent on need).
 

3. Let's compare two scenarios: Singularity runs its child processes in a single container, versus each child running in a separate container. The dlopen optimization happens in the first scenario because the opened library is loaded into memory once per Singularity container, and dlopen magically returns the same handle for each child process inside the container, which should be faster. Or is there some other low-level optimization in the first scenario regarding dlopen?

I am not sure I follow completely, but if you are asking what I think you're asking... Singularity v2 will optimize all calls to open() (including dlopen()) within the container, because everything within the container exists within a single image (there is no need to make additional metadata requests for files that exist within the container image). Additionally, there is no launch penalty because there is no need to cache the image. On average, launch time with this method is about 0.020s on my test system, and writes/changes never require a rebuild.

With Singularity v1, files are pulled out of the container archive (SAPP) and spilled out to storage. If the storage is local to the nodes, then calls to open(), and thus the required metadata, will not go to shared storage. By default the container is cached to shared storage (unless launching a SAPP file directly through Open MPI). Launch time for v1 is about 0.050s after the image has been cached, and caching the image usually takes anywhere from 0.5s upwards, depending on image size (I've seen as much as 10 seconds in my testing).

Hopefully that helps!

Ralph Castain

May 16, 2016, 9:05:07 AM
to singu...@lbl.gov
On May 16, 2016, at 5:54 AM, Gregory M. Kurtzer <gmku...@lbl.gov> wrote:



On Mon, May 16, 2016 at 1:17 AM, Taras Shapovalov <shapov...@gmail.com> wrote:
Hi guys,

Thanks for the great answers! Now it is more or less clear how it works. Just to be absolutely sure, can you please also confirm these statements (gathered from your answers):

1. Ralph's answer mentions mpiexec, but Gregory's answer is about mpirun. So everything discussed here applies to both utilities included in the Open MPI distribution.

Ralph can speak definitively here, but I believe my answer applies to both.

The two names are for the identical binary - in the MPI world, folks use both names interchangeably.

Taras Shapovalov

May 16, 2016, 9:40:39 AM
to singu...@lbl.gov
Thank you guys,

You've really helped me understand how it works with MPI!
No more questions, at least for now.

Best regards,

Taras

Dave Love

May 16, 2016, 10:55:47 AM
to singu...@lbl.gov
Apologies for tagging this onto the end of the thread, but I wasn't
subscribed earlier, so I couldn't reply citing a more appropriate message.

I haven't had a chance to check how the OMPI stuff is done, but it
sounds wrong if it's specific to singularity rather than specializing
something more general. It's the sort of thing the batch resource
manager might have an interest in, too. (I actually thought there was
already some sort of staging in ORTE, but I can't see anything now.) Is
it actually specific, and if so, does it need to be?

[I'm sure the issue is worth bothering about, but I can see hints in
talking about the effective DoS on _the_ metadata server and from all
the nodes/ranks doing the same thing.]

Gregory M. Kurtzer

May 16, 2016, 11:32:21 AM
to singu...@lbl.gov
There is nothing about the OMPI integration that is specific to Singularity aside from runtime optimizations (like the staging you mentioned). That code can currently be found in the OMPI v2.(1/0.1) development branch. That staging is not necessary with Singularity v2, but there are some other optimizations that Ralph and I have been brainstorming.

With that said, what makes Singularity an appealing container system for MPI integration is its workflow and container bootstrapping model, which make it easily compatible with things like MPI's PMI, among other things like X11.



Sent from my iPhone