Hi All,
In Singularity 2 (I'm testing with 2.6.0), if you 'shell' or 'exec' into a running instance that was started in a custom network namespace (created with the 'ip netns add' command, not with the -n|--net Singularity flag), the process(es) spawned will also use that custom network namespace. This is highly beneficial: each Singularity container instance can use its own 'ip netns' managed custom network namespace, and users of each container end up in that custom network namespace by default when running 'shell' or 'exec' against the instance, regardless of the network namespace of the user's own shell.
In Singularity 3 (I'm testing with 3.0.0-alpha.1-144-g345371f, git clone from 08/31/2018), this behavior appears to have changed. If an instance is started in a custom network namespace (created with the 'ip netns add' command, not with the -n|--net Singularity flag), the initial processes start in the custom network namespace, but future process(es) spawned by running 'shell' or 'exec' against that instance end up associated with the root (default) network namespace of the host system instead. Since non-root users cannot use 'ip netns exec' or 'nsenter' to explicitly start a process in a particular namespace, non-root 'shell' and 'exec' interactions with a Singularity 3 instance result in processes that essentially "escape" the custom network namespace and are instead exposed to the root network namespace and everything it contains (network interfaces, iptables rules, routing tables, etc.)...
Are there plans to incorporate Singularity 2's behavior described above into Singularity 3?
Although I realize that the -n|--net flag can be passed to 'instance start' to cause Singularity 3 to generate a new network namespace, using 'ip netns add' instead not only creates what is essentially a "named" custom network namespace (e.g. /var/run/netns/name-of-custom-network-namespace) but also makes it possible to use namespace-specific configuration files in /etc (see http://man7.org/linux/man-pages/man8/ip-netns.8.html ).
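To illustrate those namespace-specific configuration files, here is a minimal sketch of the ip-netns(8) behavior. It must be run as root, and the namespace name 'demo-ns' and the nameserver address are made-up illustrative values (not part of the setup described in this post):

```shell
#!/bin/bash
# Sketch only -- requires root and iproute2. 'demo-ns' and 10.0.0.53
# are illustrative values, not part of the setup described above.
if [ "$(id -u)" -ne 0 ]; then
  echo "run as root" >&2
  exit 0
fi
ip netns add demo-ns
mkdir -p /etc/netns/demo-ns
echo "nameserver 10.0.0.53" > /etc/netns/demo-ns/resolv.conf
# ip-netns(8) bind-mounts /etc/netns/demo-ns/resolv.conf over
# /etc/resolv.conf for processes started inside the namespace:
ip netns exec demo-ns cat /etc/resolv.conf
# Clean up.
ip netns delete demo-ns
rm -rf /etc/netns/demo-ns
```

Any process started with 'ip netns exec demo-ns ...' sees the override, because ip-netns(8) bind-mounts each file under /etc/netns/demo-ns/ over the corresponding file in /etc.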
Below I've included two sections ('Examples' and 'Relevant Code') to provide examples and references to relevant code, both with an excruciating (but hopefully helpful) amount of detail. ;)
If you have any questions or need additional information, please let me know.
Thanks!
Sean
Examples
Singularity 2
As an example, consider the following script being run as root when a host is started:
#!/bin/bash
ip netns add container1-hostname
ip netns exec container1-hostname su -c "singularity instance.start /home/nonrootuser/centos.simg container1" - nonrootuser
The script above creates a new custom network namespace named 'container1-hostname' and starts a Singularity instance named 'container1' (running as nonrootuser) inside that custom network namespace. The result in Singularity 2 is a single process for the instance ('singularity-instance: nonrootuser [container1]') that is using the 'container1-hostname' custom network namespace:
[root@singularity2 ~]# lsns -t net -o NS,PATH,TYPE,NPROCS,PPID,PID,USER,UID,COMMAND
NS PATH TYPE NPROCS PPID PID USER UID COMMAND
4026531956 /proc/1/ns/net net 94 0 1 root 0 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026532163 /proc/2695/ns/net net 1 1 2695 nonrootuser 1000 singularity-instance: nonrootuser [container1]
[root@singularity2 ~]# ls -i /var/run/netns/container1-hostname
4026532163 /var/run/netns/container1-hostname
In this example, 4026531956 is the inode containing the root network namespace used by default for all processes running on the host system (94 processes in this example) and 4026532163 is the inode containing our custom network namespace. Now let's say nonrootuser shells into the running instance that was started as their user when the host system booted:
[nonrootuser@singularity2 ~]$ singularity instance.list
DAEMON NAME PID CONTAINER IMAGE
container1 2695 /home/nonrootuser/centos.simg
[nonrootuser@singularity2 ~]$ singularity shell instance://container1
Singularity: Invoking an interactive shell within container...
Singularity centos.simg:~> readlink /proc/self/ns/net
net:[4026532163]
The resulting bash process for the Singularity shell ends up with our correct custom network namespace (the network namespace file descriptor for its process refers to inode 4026532163, which we saw earlier as the inode containing our 'container1-hostname' custom network namespace in this example). The same behavior applies to exec:
[nonrootuser@singularity2 ~]$ singularity exec instance://container1 readlink /proc/self/ns/net
net:[4026532163]
This is excellent since it means that all processes resulting from nonrootuser's interaction with that instance will use our custom network namespace!
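As an aside, the inode comparison used above can be wrapped in a tiny helper. This is just a sketch of the technique (the 'same_netns' function name is mine, not anything from Singularity): two processes share a network namespace exactly when their /proc/<pid>/ns/net symlinks resolve to the same 'net:[inode]' target.

```shell
#!/bin/bash
# Hypothetical helper (not part of Singularity): compare the targets
# of two processes' /proc/<pid>/ns/net symlinks.
same_netns() {
  [ "$(readlink /proc/"$1"/ns/net)" = "$(readlink /proc/"$2"/ns/net)" ]
}

# A child of this shell inherits its network namespace, so the two
# readlink targets match:
sleep 30 &
child=$!
if same_netns "$$" "$child"; then
  echo "PID $$ and PID $child share a network namespace"
else
  echo "PID $$ and PID $child are in different network namespaces"
fi
kill "$child" 2>/dev/null
```

This works unprivileged for your own processes, which is exactly the situation nonrootuser is in when checking where a 'shell' or 'exec' process landed.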
Singularity 3
Now let's look at the different behavior in Singularity 3. Consider the following script (identical to the script in the Singularity 2 example except for using 'instance start' instead of 'instance.start') being run as root when a host is started:
#!/bin/bash
ip netns add container1-hostname
ip netns exec container1-hostname su -c "singularity instance start /home/nonrootuser/centos.simg container1" - nonrootuser
The script above creates a new custom network namespace named 'container1-hostname' and starts a Singularity instance named 'container1' (running as nonrootuser) inside that custom network namespace. The result in Singularity 3 is two processes for the instance (parent process 'Singularity instance: nonrootuser [container1]' and child process 'sinit') that are both using the 'container1-hostname' custom network namespace:
[root@singularity3 ~]# lsns -t net -o NS,PATH,TYPE,NPROCS,PPID,PID,USER,UID,COMMAND
NS PATH TYPE NPROCS PPID PID USER UID COMMAND
4026531956 /proc/1/ns/net net 99 0 1 root 0 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026532163 /proc/2676/ns/net net 2 1 2676 nonrootuser 1000 Singularity instance: nonrootuser [container1]
[root@singularity3 ~]# ls -i /var/run/netns/container1-hostname
4026532163 /var/run/netns/container1-hostname
In this example, 4026531956 is the inode containing the root network namespace used by default for all processes running on the host system (99 processes in this example) and 4026532163 is the inode containing our custom network namespace. Now let's say nonrootuser shells into the running instance that was started as their user when the host system booted:
[nonrootuser@singularity3 ~]$ singularity instance list
INSTANCE NAME PID IMAGE
container1 2677 /home/nonrootuser/centos.simg
[nonrootuser@singularity3 ~]$ singularity shell instance://container1
Singularity :~> readlink /proc/self/ns/net
net:[4026531956]
Instead of the resulting bash process for the Singularity shell ending up in our correct custom network namespace like it did in Singularity 2, it ends up in the root network namespace (the network namespace file descriptor for its process refers to inode 4026531956, which we saw earlier as the inode containing the root/default network namespace in this example). :( The same behavior unfortunately applies to exec:
[nonrootuser@singularity3 ~]$ singularity exec instance://container1 readlink /proc/self/ns/net
net:[4026531956]
The result is that all processes started by a user connecting to that instance essentially "escape" the custom network namespace and are instead exposed to the root network namespace and everything it contains (network interfaces, iptables rules, routing tables, etc)...
Relevant Code
Singularity 2
In Singularity 2, when 'singularity exec instance://instancename command' is run, my current understanding is that the following occurs:
* /usr/bin/singularity (or whatever its path may be on your system) - 'exec' is passed as $SINGULARITY_COMMAND to the line 'exec $SINGULARITY_libexecdir/singularity/cli/$SINGULARITY_COMMAND.exec "$@"'.
* /usr/libexec/singularity/cli/exec.exec (or whatever its path may be on your system) - The line '. "$SINGULARITY_libexecdir/singularity/image-handler.sh"' runs image-handler.sh.
* /usr/libexec/singularity/image-handler.sh (or whatever its path may be on your system) - 'instance://instancename' being passed to exec results in the line '. "$SINGULARITY_libexecdir/singularity/handlers/image-instance.sh"' being run.
* /usr/libexec/singularity/handlers/image-instance.sh (or whatever its path may be on your system) - If an instance by that name is indeed running, the line 'SINGULARITY_DAEMON_JOIN=1' is run and SINGULARITY_DAEMON_JOIN is exported as a bash environment variable.
* /usr/libexec/singularity/cli/exec.exec (or whatever its path may be on your system) - The line 'exec "$SINGULARITY_libexecdir/singularity/bin/action-suid" "$@" <&0' is run if SINGULARITY_NOSUID isn't set.
* /usr/libexec/singularity/bin/action-suid (or whatever its path may be on your system) - Regardless of whether or not DAEMON_JOIN was set in the "Singularity registry" (i.e. if 'SINGULARITY_DAEMON_JOIN=1' was run in image-instance.sh), 'singularity_runtime_ns(SR_NS_ALL);' is run.
* /usr/lib64/singularity/libsingularity-runtime.so.1 (or whatever its path may be on your system) - In function definition 'int singularity_runtime_ns(unsigned int flags)', the conditional 'if ( singularity_registry_get("DAEMON_JOIN") )' returns true since DAEMON_JOIN was set to 1 in the "Singularity registry" ('SINGULARITY_DAEMON_JOIN=1' was run in image-instance.sh). As a result, the line 'return(_singularity_runtime_ns_join(flags));' is run (with 'flags' being an integer representation of SR_NS_ALL at this point).
* /usr/lib64/singularity/libsingularity-runtime.so.1 (or whatever its path may be on your system) - In function definition 'int _singularity_runtime_ns_join(unsigned int flags)', conditional 'if ( flags & SR_NS_NET )' returns true since SR_NS_NET is part of SR_NS_ALL (passed earlier) and the line 'retval += _singularity_runtime_ns_net_join();' is run.
* /usr/lib64/singularity/libsingularity-runtime.so.1 (or whatever its path may be on your system) - In function definition 'int _singularity_runtime_ns_net_join(void)', the file descriptor for the running instance's network namespace is opened and 'setns(net_fd, CLONE_NEWNET)' is run as part of a conditional. The setns system call (see http://man7.org/linux/man-pages/man2/setns.2.html ) associates the calling thread (in our case, the exec process) with the namespace referred to by the file descriptor it is passed. In this case, the file descriptor for the running instance's network namespace is passed in along with 'CLONE_NEWNET' (used for network namespaces) as the namespace type. The end result is that the process created from running 'singularity exec instance://instancename command' is assigned to whatever network namespace the running instance's process is in. Cool! :)
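As an aside, the effect of _singularity_runtime_ns_net_join() can be approximated from a root shell with nsenter(1), which performs the same open-then-setns-then-exec sequence on a namespace file. This is a sketch only; it assumes root, and 'demo-ns' is an illustrative namespace name rather than anything from the examples in this post:

```shell
#!/bin/bash
# Sketch only -- requires root. nsenter opens the namespace file,
# calls setns(fd, CLONE_NEWNET), and then execs the command, which is
# essentially what _singularity_runtime_ns_net_join() does with the
# running instance's network namespace file descriptor.
if [ "$(id -u)" -ne 0 ]; then
  echo "run as root" >&2
  exit 0
fi
ip netns add demo-ns
# The readlink target should report the same inode that 'ls -i' shows
# for the named namespace file:
nsenter --net=/var/run/netns/demo-ns readlink /proc/self/ns/net
ls -i /var/run/netns/demo-ns
# Clean up.
ip netns delete demo-ns
```

Of course, nsenter requires privilege, which is exactly why non-root users depend on Singularity performing this join on their behalf.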
Singularity 3
I haven't yet had time to examine the code execution steps for the same example command ('singularity exec instance://instancename command') in Singularity 3, but I have determined that it does indeed have code that supports a 'shell' or 'exec' process being assigned to a custom network namespace of a running instance. The issue is that the aforementioned behavior (assignment of a 'shell' or 'exec' process to a running instance's custom network namespace) only occurs if that instance was started with the -n|--net Singularity flag (which of course instructs Singularity to create its own custom network namespace). If the instance was started without that flag from a custom network namespace that was instead created with 'ip netns add', 'shell' and 'exec' processes run against an instance end up in the root network namespace instead of the desired custom network namespace.
One fundamental difference is that the JSON configuration for the running instance located at /var/run/singularity/instances/${USER}/nameofinstance.json only includes 'network' in its list of namespaces if that instance was started with the -n|--net Singularity flag. If that instance was started without that flag from a custom network namespace that was instead created with 'ip netns add', 'network' is not included in the list of namespaces...
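A quick way to see what a running instance recorded is to decode that JSON configuration and look for a 'network' entry. This is a sketch that assumes an instance named 'container1' exists and that jq is installed; the path follows the convention above:

```shell
#!/bin/bash
# Sketch: decode the instance's base64-encoded config and test whether
# a 'network' namespace entry was recorded. The path and instance name
# follow the examples in this post.
cfg="/var/run/singularity/instances/${USER}/container1.json"
if [ ! -f "$cfg" ]; then
  echo "no such instance config: $cfg" >&2
  exit 0
fi
# jq -r emits the .config string without surrounding quotes; jq -e
# sets a non-zero exit status when the select() produces no output.
if jq -r '.config' "$cfg" | base64 --decode \
    | jq -e '.engineConfig.ociConfig.linux.namespaces[]
             | select(.type == "network")' >/dev/null; then
  echo "network namespace recorded in config"
else
  echo "no network namespace in config"
fi
```

Run against the two instances below, this reports a network namespace only for the one started with -n|--net.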
For example, consider the following script being run as root:
#!/bin/bash
ip netns add container1-hostname
ip netns exec container1-hostname su -c "singularity instance start /home/nonrootuser/centos.simg container1" - nonrootuser
su -c "singularity instance start --net /home/nonrootuser/centos.simg container2" - nonrootuser
The result is two Singularity instances, container1 and container2. container1 was started inside of the 'container1-hostname' custom network namespace that was created with the 'ip netns add' command and the two resulting initial processes (parent process 'Singularity instance: nonrootuser [container1]' and child process 'sinit') are both using that namespace. container2 was instead started with the -n|--net Singularity flag to instruct Singularity to create its own custom network namespace. container2 also starts with two initial processes (parent process 'Singularity instance: nonrootuser [container2]' and child process 'sinit') but only the 'sinit' process ends up using the custom network namespace that Singularity created. The parent process 'Singularity instance: nonrootuser [container2]' ends up in the root (default) network namespace instead of the custom network namespace. Consider the following four processes:
nonrootuser 1157 Singularity instance: nonrootuser [container1]
nonrootuser 1158 \_ sinit
nonrootuser 1205 Singularity instance: nonrootuser [container2]
nonrootuser 1206 \_ sinit
PIDs 1157 and 1158 are the initial parent and child processes for container1 and PIDs 1205 and 1206 are the initial parent and child processes for container2. Let's confirm which network namespaces those processes are using:
[root@singularity3 ~]# lsns -t net -o NS,PATH,TYPE,NPROCS,PPID,PID,USER,UID,COMMAND
NS PATH TYPE NPROCS PPID PID USER UID COMMAND
4026531956 /proc/1/ns/net net 99 0 1 root 0 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026532163 /proc/1157/ns/net net 2 1 1157 nonrootuser 1000 Singularity instance: nonrootuser [container1]
4026532242 /proc/1206/ns/net net 1 1205 1206 nonrootuser 1000 sinit
In this example, 4026531956 is the inode containing the root network namespace used by default for all processes running on the host system (99 processes in this example), 4026532163 is the inode containing the 'container1-hostname' custom network namespace we created with 'ip netns add', and 4026532242 is the inode containing the custom network namespace that Singularity created when the container2 instance was started with the --net flag.
[root@singularity3 ~]# ls -i /var/run/netns/container1-hostname
4026532163 /var/run/netns/container1-hostname
[root@singularity3 ~]# readlink /proc/1157/ns/net
net:[4026532163]
[root@singularity3 ~]# readlink /proc/1158/ns/net
net:[4026532163]
[root@singularity3 ~]# readlink /proc/1205/ns/net
net:[4026531956]
[root@singularity3 ~]# readlink /proc/1206/ns/net
net:[4026532242]
Now let's compare the JSON configuration of both running instances, specifically the values from '.engineConfig.ociConfig.linux.namespaces':
[nonrootuser@singularity3 ~]$ cat /var/run/singularity/instances/nonrootuser/container1.json | jq '.config' | perl -pi -e "s/\"//g" | base64 --decode | jq '.engineConfig.ociConfig.linux.namespaces'
[
{
"type": "pid",
"path": "/proc/1158/ns/pid"
},
{
"type": "ipc",
"path": "/proc/1158/ns/ipc"
},
{
"type": "mount",
"path": "/proc/1158/ns/mnt"
}
]
[nonrootuser@singularity3 ~]$ cat /var/run/singularity/instances/nonrootuser/container2.json | jq '.config' | perl -pi -e "s/\"//g" | base64 --decode | jq '.engineConfig.ociConfig.linux.namespaces'
[
{
"type": "network",
"path": "/proc/1206/ns/net"
},
{
"type": "pid",
"path": "/proc/1206/ns/pid"
},
{
"type": "ipc",
"path": "/proc/1206/ns/ipc"
},
{
"type": "mount",
"path": "/proc/1206/ns/mnt"
}
]
As you can see, the 'network' namespace only appears in the JSON configuration for a running instance if that instance was started with the -n|--net Singularity flag. Now let's test and see what network namespace ends up being assigned to 'exec' processes run against both of those running instances:
[nonrootuser@singularity3 ~]$ singularity exec instance://container1 readlink /proc/self/ns/net
net:[4026531956]
[nonrootuser@singularity3 ~]$ singularity exec instance://container2 readlink /proc/self/ns/net
net:[4026532242]
'exec' (and 'shell') processes run against container1 end up in the root (default) network namespace of the host system instead of the desired 'container1-hostname' custom network namespace we created with 'ip netns add' earlier, even though both of the container1 processes are in that custom namespace! 'exec' (and 'shell') processes run against container2 end up in the custom network namespace created by Singularity.
It would be highly beneficial if Singularity 3 could be updated such that 'shell' and 'exec' processes run against an instance automatically receive the network namespace of the instance itself. That would be consistent with the behavior of Singularity 2 and would prevent 'shell' and 'exec' processes run against an instance from "escaping" a custom network namespace and ending up with completely different interfaces, iptables rules, routing tables, etc than anticipated.