Singularity instances for running multiple container processes per host

Mike

Apr 10, 2019, 11:14:07 AM4/10/19
to singularity
Hi,

I believe I have discovered an interesting use case for Singularity instances.

If you run multiple "singularity exec xxx.sif cmd" commands on the same host (typical e.g. for single-threaded MPI jobs on multi-CPU systems), each invocation gets its own mount, and file system contents are buffered separately for each mount. If the processes access essentially the same data in the container, they compete for buffer cache space to hold multiple copies of identical data, possibly amounting to a significant portion of the entire memory and thereby reducing the memory effectively available to the processes' address space.

One can easily demonstrate this behavior by preparing e.g. a ubuntu:latest container which has e.g. a 1GB file in its /data directory and then starting two or more containers on the same host. I could reproduce this for Singularity 2.6 and 3.1, and with kernels 3.10, 4.18, and 5.0.7.
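For completeness, such a test image can be built from a definition file along these lines (a sketch only; the file names are just the ones used in this thread, and /dev/urandom is chosen so the 1GB file is not trivially compressible), e.g. with "sudo singularity build xxx.sif test.def":

```
Bootstrap: docker
From: ubuntu:latest

%post
    mkdir -p /data
    # 1 GB test file; urandom avoids trivially compressible content
    dd if=/dev/urandom of=/data/file1GB bs=1M count=1024
```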

Demonstration of multiple buffer cache allocation

For the example given below, I used Singularity 3.1.1 on a virtual host running Ubuntu 18.04 LTS; kernel = 4.18. I monitored buffer cache usage with top(1).

Start two separate shell sessions:   singularity shell xxx.sif

From another window, empty the buffer cache:   sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

While monitoring buffer cache usage, issue in each Singularity session:
cp /data/file1GB /dev/null
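Instead of eyeballing top(1), the cache figure can also be read from /proc/meminfo; a small helper (a sketch; the name cache_mb is mine) to run on the host before and after each cp. Note it sums Buffers + Cached, so it roughly, but not exactly, matches top's buff/cache column, which also includes SReclaimable:

```shell
#!/bin/sh
# Print the current page cache size in MiB, read from /proc/meminfo.
# Run on the host before and after each cp to compute the deltas below.
cache_mb() {
    awk '/^(Cached|Buffers):/ { kb += $2 }
         END { printf "%.1f\n", kb / 1024 }' /proc/meminfo
}
cache_mb
```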

Buffer cache usage (MiB):
before:               209.363 buff/cache         (a)
after first cp:      3393.145 buff/cache   delta (a,1) = 3184
after second cp:     5458.289 buff/cache   delta (1,2) = 2065
after termination:   1236.008 buff/cache   delta (a,b) = 1027


Noteworthy observations:
It appears that we need approx 1GB to cache the relevant portion of the SIF file in the host context, and roughly another 2GB of cache per invocation for buffering the data within the container.

Reducing buffer cache usage by Singularity instances

Repeating the same experiment with Singularity instances...

singularity instance start xxx.sif c.i

In both sessions:
singularity shell instance://c.i

Drop buffer cache
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

In each session as above:
cp /data/file1GB /dev/null

before:               199.441 buff/cache         (a)
after first cp:      3386.508 buff/cache   delta (a,1) = 3187
after second cp:     3386.965 buff/cache   delta (1,2) <    1
after termination:   1246.730 buff/cache   delta (a,b) = 1047


As expected, the buffer cache is shared between all Singularity processes because they are running in the same namespace.

Still, I wonder why reading a 1GB file uses 2GB to cache file system contents inside the container.

Practical considerations

To take advantage of buffer cache sharing in MPI jobs, the Singularity instances must be set up before, and terminated after, running the actual MPI program. I have set up a proof-of-concept using SGE:

module load singularity/2.x openmpi/1.8.5
singularity=$(which singularity)   # absolute path, so it survives qrsh -V
cwd=$(/bin/pwd)

# start one instance on each host allocated to the job
awk '{print $1}' $PE_HOSTFILE |  ###    sort | uniq |
   parallel -k -j0 "qrsh -inherit -nostdin -V {} \
   $singularity instance.start $cwd/insttest.simg it.$JOB_ID"

# all ranks attach to the per-host instance and share its mount
mpirun -np $NSLOTS $singularity run instance://it.$JOB_ID  myprogram myargs

# tear the instances down again
awk '{print $1}' $PE_HOSTFILE |  ###   sort | uniq |
   parallel -k -j0 "qrsh -inherit -nostdin -V {} \
   $singularity instance.stop it.$JOB_ID"



For SGE+OpenMPI, each host occurs only once in the host file; for other implementations, add the "sort|uniq" portion which is commented out above.

This appears to work fine, but it is not exactly what you would want to explain to nontechnical users, so it needs to be wrapped in a script.

I am still looking for a solution for job arrays, where multiple executions on individual nodes may overlap in time. To avoid race conditions, some form of counting semaphore is required to coordinate creation and destruction of Singularity instances between independently running shell scripts.
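One possible shape for such a semaphore, sketched under my own assumptions (all paths, the function names, and the contents of START_CMD/STOP_CMD are placeholders for the real per-node paths and singularity commands): flock(1) over a per-node count file serializes the start/stop decision between independent job scripts, so only the first script starts the instance and only the last one stops it.

```shell
#!/bin/sh
# Sketch only: reference-counted instance management with flock(1).
LOCK="${LOCK:-/tmp/insttest.$USER.lock}"
COUNT="${COUNT:-/tmp/insttest.$USER.count}"
START_CMD="${START_CMD:-singularity instance start insttest.sif shared}"
STOP_CMD="${STOP_CMD:-singularity instance stop shared}"

# Increment the per-node reference count; the first job script on the
# node starts the shared instance.
acquire_instance() {
    (
        flock -x 9                              # serialize with other scripts
        n=$(cat "$COUNT" 2>/dev/null || echo 0)
        if [ "$n" -eq 0 ]; then $START_CMD; fi
        echo $((n + 1)) > "$COUNT"
    ) 9> "$LOCK"
}

# Decrement the count; the last job script to leave stops the instance.
release_instance() {
    (
        flock -x 9
        n=$(( $(cat "$COUNT") - 1 ))
        echo "$n" > "$COUNT"
        if [ "$n" -eq 0 ]; then $STOP_CMD; fi
    ) 9> "$LOCK"
}

# Per job script:
#   acquire_instance
#   singularity exec instance://shared myprogram myargs
#   release_instance
```

The subshell-with-redirection idiom keeps the lock held only while the count is inspected and updated, so the MPI program itself runs unlocked.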

Any comments / suggestions? Is this considered to be best practice?

Best regards,  Michael

Thomas Hartmann

Apr 10, 2019, 11:46:06 AM4/10/19
to singu...@lbl.gov
Hi Mike,

that sounds interesting - I wonder how the behaviour might look with
explicit bind mounts to the file system?

We recently ran into an odd behaviour with Docker containers and NFS
mounts, where we bound paths from the NFS mounts into the container
namespace (and learnt a bit about how the kernel treats bind mounts
compared to file system mounts).

The thing was that even after unmounting the file system as root, the
*fs mount as such* still persisted, i.e., no mount was visible to root,
but from the container context one could still read/write the
fs/namespace and everything got synced (cross-checked with Wireshark
that NFS traffic was indeed ongoing while the fs was 'unmounted' for
root).

So we learnt along the way that while the 'original' mount was gone,
the bind mount into the container kept the fs mounted (it appeared to
me a bit like hard links with inodes...)

I documented a bit in
https://confluence.desy.de/display/~hartmath/Containers%2C+file+systems+and+bind+mount+oddities

It would be interesting to see how the mountinfo of the processes in
each of the containers looks (assuming that the behaviour is the same
for block devices as for NFS)?
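A quick way to look, sketched here (field positions per proc(5)): print the mount ID of the root mount as seen by the calling process. Run inside each container session, separate exec/shell invocations should report different IDs, while shells joined to the same instance should report the same one.

```shell
# Print the mount ID (field 1) of the mount whose mount point (field 5)
# is "/", as seen from the calling process's mount namespace.
awk '$5 == "/" { print "root mount ID:", $1 }' /proc/self/mountinfo
```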

Cheers,
Thomas
